In April 2020, Neo4j released the new Graph Data Science (GDS) library. Besides containing over 30 graph algorithms, the GDS library allows your algorithms to scale up to (literally) billions of nodes and edges.
In this post I’ll describe how to get started with the GDS library in Python. The referenced code samples can be found on GitHub. In three steps, we’ll walk through a typical GDS journey, from building a knowledge graph to graph feature engineering.
Why graph data science matters
A couple of months ago I attended a great talk by the Neo4j data science team, who summarized the power of graph data science in two sentences:
- More often than not, adding more data does not make your predictive models smarter.
- The secret is to use hidden relationships in data you already have - graph features!
Looking at your data as a graph is a prerequisite for finding these hidden insights. Only then can you use graph data science to answer questions that were previously unanswerable. Think about:
- Smarter classification of your customers based on community detection algorithms.
- Better identification of similar products with node similarity algorithms.
- Discovering influential people in a network with node centrality algorithms.
From a machine learning perspective, it comes down to smarter feature engineering. Graph features have the potential to hold much stronger predictive power than discrete variables. A smart ML pipeline should make use of both discrete variables and graph features to get the best results.
A graph data science journey in three steps
Using the Graph of Thrones dataset, we’re going to create a model that predicts who will die in the next book of the Game of Thrones series. Let’s get started!
1. Building your knowledge graph
First, you’ll need to shape your data into a knowledge graph - a network of interconnected elements. There’s a ton of great material out there on building a knowledge graph from your data. I recently wrote a blog post on building a Slack knowledge graph, so that might be a good starting point.
For this demo, I’m using a preloaded graph from the Neo4j Sandbox, which you can use for free for three days to experiment. The data science sandbox holds a graph of people, houses, battles and cultures from George R.R. Martin’s most famous novel series, A Song of Ice and Fire. Here’s a simplified version of the data model for this graph:
It is widely known that George R.R. Martin has a tendency to kill off our most beloved characters. But how can we predict who’s going to be killed off next? Are characters with a wide reach of interactions more or less likely to die? That’s where graph algorithms enter the room.
2. Running graph algorithms
To use Neo4j from Python, you’ll need a Python driver. You have two main options:
- pip install neo4j - the official Neo4j Python driver.
- pip install py2neo - an alternative Python driver.
You can use either for simple applications without any problems. Just make sure that the driver works with the correct Neo4j version. (My sandbox uses 3.5.11.)
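As a rough sketch of what talking to the sandbox with the official driver can look like: the helper names `connect` and `run_query` below are my own, and the connection details are placeholders you would copy from your sandbox page. The driver import is kept inside `connect` so the query helper can be tried out without a database.

```python
def connect(uri, user, password):
    """Open a driver using the official `neo4j` package (pip install neo4j)."""
    # Imported lazily so run_query() below can be used and tested without it.
    from neo4j import GraphDatabase
    return GraphDatabase.driver(uri, auth=(user, password))


def run_query(driver, query, **params):
    """Run one Cypher query and return its rows as plain dictionaries."""
    with driver.session() as session:
        result = session.run(query, **params)
        # Consume the result while the session is still open.
        return [record.data() for record in result]


# Example usage (placeholders, not real credentials):
# driver = connect("bolt://<sandbox-ip>:<bolt-port>", "neo4j", "<password>")
# print(run_query(driver, "MATCH (n) RETURN count(n) AS nodes"))
```

Returning plain dictionaries keeps the rest of a notebook independent of the driver you picked.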
For my experiments I’ll be using a Jupyter notebook called the ‘GDS starter kit’. You can find it here.
Creating graph projections
An important part of running graph algorithms is selecting the right input graph. The GDS library allows you to make graph projections: in-memory copies of the parts of your graph you want to run your algorithms on. Graph projections serve two main purposes:
- they give you flexibility in which (sub)graph you want to run your algorithms on.
- they allow for super-fast algorithm executions (by being in-memory).
In our Game of Thrones example, we want to identify important characters based on their interactions. We’ll create the following graph projection:
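A projection along these lines could be created as sketched here. The projection name, the `Person` label and the `INTERACTS` relationship type are assumptions about the sandbox data model, so check them against your own graph. The procedure is called `gds.graph.create` in GDS 1.x; later GDS releases renamed it to `gds.graph.project`.

```python
# Hypothetical projection name; pick whatever suits your project.
PROJECTION_NAME = "got-interactions"

# Assumed data model: (:Person)-[:INTERACTS {weight}]-(:Person).
CREATE_PROJECTION = """
CALL gds.graph.create(
  $name,
  'Person',
  {INTERACTS: {orientation: 'UNDIRECTED', properties: 'weight'}}
)
"""


def create_projection(session, name=PROJECTION_NAME):
    """Ask GDS to load the interaction sub-graph into memory under `name`."""
    return session.run(CREATE_PROJECTION, name=name)
```

Here `session` is any object with a driver-style `run(query, **params)` method, such as a session from the official driver.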
We’ve now brought our original graph down to a smaller graph that only holds the interactions, as well as their weights (how frequently they interact).
Running the algorithms
Now that we have the input graph loaded, we can run the algorithm of our choice. The PageRank algorithm will rank nodes not only based on the number of interactions, but also on transitive influence - the importance of people they interact with. Using the PageRank algorithm is simple:
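A minimal sketch of such a call in write mode, assuming a projection named `got-interactions` and a result property called `pageRank` (both names are illustrative):

```python
# Assumed names: the 'got-interactions' projection and the 'pageRank' property.
RUN_PAGERANK = """
CALL gds.pageRank.write($graph, {
  relationshipWeightProperty: 'weight',
  writeProperty: 'pageRank'
})
"""


def run_pagerank(session, graph_name="got-interactions"):
    """Run weighted PageRank on the projection and write scores to the database."""
    return session.run(RUN_PAGERANK, graph=graph_name)
```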
Many things are happening here:
- We select our previously generated graph projection.
- We set the algorithm to use the weight property of our graph projection.
- The PageRank algorithm is executed.
- Results are written back into our original Neo4j database. Each person node now has a new property that holds its PageRank score.
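To get a feel for what the algorithm computes, here is a textbook power-iteration version of weighted PageRank in plain Python. It is an illustration only, not how GDS implements it, and the toy node names are made up. Scores are normalized to sum to 1, which may differ from the scale GDS reports.

```python
def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank on an undirected, weighted edge list.

    edges: iterable of (node_a, node_b, weight) with positive weights.
    Returns a dict node -> score; the scores sum to 1.
    """
    # Interactions have no direction, so store every edge both ways.
    neighbours = {}
    for a, b, w in edges:
        for src, dst in ((a, b), (b, a)):
            out = neighbours.setdefault(src, {})
            out[dst] = out.get(dst, 0.0) + w

    nodes = list(neighbours)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small base score (the "random jump") ...
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        # ... plus a share of each neighbour's score, proportional to weight.
        for src, out in neighbours.items():
            total = sum(out.values())
            for dst, w in out.items():
                new_scores[dst] += damping * scores[src] * w / total
        scores = new_scores
    return scores


# Toy graph: "hub" talks to everyone, most often to "a".
toy_scores = weighted_pagerank([("hub", "a", 3), ("hub", "b", 1), ("hub", "c", 1)])
```

In this toy graph the hub ends up with the highest score, and `a` outranks `b` and `c` because its interaction with the hub carries more weight.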
In practice, you’ll likely be answering a number of questions on your graph. Common follow-up questions could be: What communities of people frequently interact? What characters are similar based on their interactions? Which characters are likely to form new interactions? A smart predictive model may use several of these answers to predict results.
Validation and visualization
Next up, we inspect the results of our algorithms. With some analytical Cypher queries, you can inspect the results from your Neo4j browser. It is much more interesting to visualize your results using Bloom or other specialized visualization tools. A great example: Michael Hunger built this visualization on the same graph, in which a node’s size represents its PageRank and its color represents its community.
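For the Cypher route, a quick sanity check is to list the highest-ranked characters. This sketch assumes that people are stored as `Person` nodes with a `name` property and that the scores were written to a `pageRank` property; adjust the names to your graph.

```python
# Assumed names: Person label, name and pageRank properties.
TOP_CHARACTERS = """
MATCH (p:Person)
WHERE p.pageRank IS NOT NULL
RETURN p.name AS name, p.pageRank AS pageRank
ORDER BY pageRank DESC
LIMIT $limit
"""


def top_characters(session, limit=10):
    """Return (name, pageRank) pairs for the highest-ranked characters."""
    result = session.run(TOP_CHARACTERS, limit=limit)
    return [(record["name"], record["pageRank"]) for record in result]
```

The same query can be pasted into the Neo4j browser by replacing `$limit` with a number.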
3. Graph features as part of your ML pipelines
Now that we’ve discovered new graph features (important characters, communities, …) it’s time to put them into practice.
Recall that we’re building a model that predicts which character will die in the upcoming book. In our model, we use pageRank as one of the features for that prediction. A sample row of the input data:
| Samwell Tarly | 22 | male | westeros | Night’s Watch | 3.380056 | False |
The graph data science starter kit contains a worked-out example that embeds pageRank as a predictive feature in a Random Forest classifier.
To validate the impact of our new feature, we compare two models:
- A Random Forest Classifier that uses X, containing only the original features.
- A Random Forest Classifier that uses X2, containing both the original features and the graph features.
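The difference between the two inputs can be sketched in plain Python. The column names (`age`, `is_male`, `is_dead`) are assumptions and the second row is a made-up placeholder; only the Samwell Tarly values come from the sample row shown earlier.

```python
ORIGINAL_FEATURES = ["age", "is_male"]   # hypothetical column names
GRAPH_FEATURES = ["pageRank"]            # the feature produced by GDS


def build_matrix(rows, columns):
    """Turn a list of row dictionaries into a numeric feature matrix."""
    return [[float(row[col]) for col in columns] for row in rows]


rows = [
    {"name": "Samwell Tarly", "age": 22, "is_male": 1,
     "pageRank": 3.380056, "is_dead": False},
    # Placeholder row, invented purely to make the example run.
    {"name": "placeholder", "age": 40, "is_male": 0,
     "pageRank": 0.15, "is_dead": True},
]

X = build_matrix(rows, ORIGINAL_FEATURES)                    # original features only
X2 = build_matrix(rows, ORIGINAL_FEATURES + GRAPH_FEATURES)  # plus graph features
y = [int(row["is_dead"]) for row in rows]                    # prediction target
```

Both matrices share the same target `y`, so any score difference between the two classifiers can be attributed to the extra graph column.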
Conclusion: we find that adding our new feature increases the ROC AUC score of the model by roughly 8.5% - we’ve just improved our model without adding any new data!
Disclaimer - the amount of training data for this model was tiny (~300 people!), so the choice of train/test split makes a big difference. To get great measurable results, remember that graph features really start to shine with larger interconnected datasets. I highly recommend testing out your own models with a variety of graph features such as centrality, communities and more.
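If the ROC AUC score is new to you: it is the probability that a randomly chosen positive example (here, a dead character) gets a higher predicted score than a randomly chosen negative one. The pairwise version below is a minimal sketch of that definition on toy numbers, not the code used in the starter kit; in practice you would normally rely on a library implementation.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    labels: truthy for positive examples; scores: predicted scores.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why even a modest increase on a small dataset is worth a second look with a different train/test split.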
We’ve taken our first steps into the world of graph data science - from building a knowledge graph to graph feature engineering. Keep in mind that there are many more things to explore and investigate to become a true graph data science pro.
Where to go from here
- The online graph data science training provides a great starting point for learning graph data science.
- The documentation for the GDS library is extremely detailed and helpful when you’re getting started.
- The graph data science sandbox goes into more detail on the types of algorithms supported, as well as different ways to create graph projections.
- Want to become an expert? Neo4j also offers dedicated graph data science trainings and bootcamps for teams. Get in touch for more info.