Quickly create a 100k Neo4j graph data model with Cypher only

We want to run some test queries on an existing graph model but have no sample data at hand and also no input files (CSV,GraphML) that would provide it.

Why not create quickly it on our own just using cypher. First I thought about using Cypher to generate CSV files and loading them back, but it is much easier.

The domain is simple (:User)-[:OWN]→(:Product) but good enough for collaborative filtering or demographic analysis.

Nodes: Users and Products

Let’s start with Users, we create 100k of them in one go:

We create an array of names and go over a range of 100k with the FOREACH clause, taking the counter as id and a name from the array.

WITH ["Andres","Wes","Rik","Mark","Peter","Kenny","Michael","Stefan","Max","Chris"] AS names
FOREACH (r IN range(0,100000) | CREATE (:User {id:r, name:names[r % size(names)]+" "+r}));

This finishes quickly, and tells us how many ndoes, labels and properties were created.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100001
Properties set: 200002
Labels added: 100001
5788 ms

Same for products. As names I just used a few of my shiny geek things.

with ["Mac","iPhone","Das Keyboard","Kymera Wand","HyperJuice Battery","Peachy Printer","HexaAirBot","AR-Drone","Sonic Screwdriver","Zentable","PowerUp"] as names
foreach (r in range(0,50) | create (:Product {id:r, name:names[r % size(names)]+" "+r}));

Please note that I only created 50 products. I initially started with 3000 but then the cross product between users and products to sample relationships from grows really large (300M) which is not pulled through so quickly. So I decided to stick with a cross product of 5M which is good enough for our purposes.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 51
Properties set: 102
Labels added: 51
46 ms

Relationships: OWN

The general idea is to create the cross product between users and products and sample a percentage of that to create the relationships. For sampling we use rand, for the cross product MATCH of two independent labels.

My first approach didn’t really work as the WHERE clause belongs to the MATCH and is pulled into the path finding and causes it to sample only users, not user-product pairs. So for one user that was selected all OWN relationships were created. Not what I wanted :)

// don't do this
match (u:User),(p:Product)
where rand() < 0.1
with u,p
limit 50000
merge (u)-[:OWN]->(p);

So we have to detach the WHERE clause from MATCH with a WITH statement that passes on the user, product pairs. We still limit the cross-product results to 5M just as a safeguard in case we have miscalculated the cross product. A rand() < 0.1 samples 10% of the total amount, which is in our case 500k combinations. With those we then can create relationships with CREATE which is faster and doesn’t check for duplicates.

match (u:User),(p:Product)
with u,p
limit 5000000
where rand() < 0.1
create (u)-[:OWN]->(p);
+-------------------+
| No data returned. |
+-------------------+
Relationships created: 509898
11684 ms

We could also use MERGE which makes sure that at most one relationship between two nodes exists. If we use MERGE we should limit the amount of nodes that is created in one execution to avoid exponential time build-up. If we introduce this limit, we also have to move the window of node-pairs to be considered by the percentage of rels we create. A limit of 100k is 1/5 of the total of 500k relationships, so we have to advance the total window also by 20% of 5M, i.e. 1M

match (u:User),(p:Product)
with u,p
// increase skip value from 0 to 4M in 1M steps
skip 1000000
limit 5000000
where rand() < 0.1
with u,p
limit 100000
merge (u)-[:OWN]->(p);

Which results in.

+-------------------+
| No data returned. |
+-------------------+
Relationships created: 100000
51428 ms

If you have more memory for your Neo4j server than my 4G heap, you can also merge larger segments of relationships in a single transaction (200k or more).

We also create an index for :User and product.

create index on :User(id);
create index on :Product(id);

Now we can run some of the test-queries we wanted to check:

Find similar users that own the same stuff that I do.

match (u:User {id:1})-[:OWN]->()<-[:OWN]-(other)
return other.name,count(*)
order by count(*) desc
limit 5;

+--------------------------+
| other.name    | count(*) |
+--------------------------+
| "Peter 23404" | 6        |
| "Peter 26754" | 5        |
| "Mark 35223"  | 5        |
| "Peter 19614" | 5        |
| "Chris 23959" | 5        |
+--------------------------+
5 rows
145 ms

Collaborative filtering - product suggestions

match (u:User {id:3})-[:OWN]->()<-[:OWN]-(other)-[:OWN]->(p)
return p.name,count(*)
order by count(*) desc
limit 5;

+------------------------------------+
| p.name                  | count(*) |
+------------------------------------+
| "HyperJuice Battery 37" | 2894     |
| "Zentable 9"            | 2872     |
| "Kymera Wand 3"         | 2865     |
| "Zentable 31"           | 2863     |
| "Das Keyboard 35"       | 2847     |
+------------------------------------+
5 rows
410 ms