Quickly create a 100k Neo4j graph data model with Cypher only
We want to run some test queries on an existing graph model but have no sample data at hand and also no input files (CSV, GraphML) that would provide it.
Why not quickly create it on our own, using just Cypher? At first I thought about using Cypher to generate CSV files and loading them back in, but generating the data directly is much easier.
The domain is simple, (:User)-[:OWN]->(:Product), but good enough for collaborative filtering or demographic analysis.
Nodes: Users and Products
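Let’s start with users. We create 100k of them in one go:
We create an array of names and iterate over a range of 100k with the FOREACH clause, taking the counter as the id and picking a name from the array. Note that range(0,100000) includes both end points, so we actually get 100001 users.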
WITH ["Andres","Wes","Rik","Mark","Peter","Kenny","Michael","Stefan","Max","Chris"] AS names
FOREACH (r IN range(0,100000) | CREATE (:User {id:r, name:names[r % size(names)]+" "+r}));
This finishes quickly and tells us how many nodes, labels and properties were created.
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100001
Properties set: 200002
Labels added: 100001
5788 ms
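To make the naming scheme concrete, here is a small plain-Python sketch (not Cypher, and not part of the original setup) of what the FOREACH computes. It assumes only that Cypher's range(0,100000) includes both end points, which matches the 100001 nodes reported above. The helper name make_users is made up for this illustration.

```python
NAMES = ["Andres", "Wes", "Rik", "Mark", "Peter", "Kenny",
         "Michael", "Stefan", "Max", "Chris"]

def make_users(last_id):
    """Mimic FOREACH (r IN range(0,last_id) | CREATE (:User {id:r, name:...}))."""
    # Cypher's range(a, b) includes b, so Python needs last_id + 1.
    return [{"id": r, "name": NAMES[r % len(NAMES)] + " " + str(r)}
            for r in range(0, last_id + 1)]

users = make_users(100000)
print(len(users))            # 100001
print(users[0]["name"])      # Andres 0
print(users[23404]["name"])  # Peter 23404
```

The modulo simply cycles through the ten names, so the last digit of the id decides which name a user gets.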
Same for products. As names I just used a few of my shiny geek things.
with ["Mac","iPhone","Das Keyboard","Kymera Wand","HyperJuice Battery","Peachy Printer","HexaAirBot","AR-Drone","Sonic Screwdriver","Zentable","PowerUp"] as names
foreach (r in range(0,50) | create (:Product {id:r, name:names[r % size(names)]+" "+r}));
Please note that I created only about 50 products (51, to be exact, since the range is inclusive). I initially started with 3000, but then the cross product between users and products, from which we sample the relationships, grows really large (300M pairs) and takes a long time to process. So I decided to stick with a cross product of about 5M pairs, which is good enough for our purposes.
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 51
Properties set: 102
Labels added: 51
46 ms
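The sizes mentioned above are easy to double-check. The following plain-Python sketch (not Cypher) just redoes the arithmetic, assuming inclusive ranges as in the Cypher statements. The helper name inclusive_count is hypothetical.

```python
def inclusive_count(first, last):
    """Number of values in a range that includes both end points."""
    return last - first + 1

users = inclusive_count(0, 100000)   # range(0,100000) in Cypher
products = inclusive_count(0, 50)    # range(0,50) in Cypher
pairs = users * products

print(users, products)   # 100001 51
print(pairs)             # 5100051

# Rough size of the cross product with 3000 products instead.
pairs_3000 = users * 3000
print(pairs_3000)        # 300003000
```

So the "5M" cross product is really about 5.1M pairs, and 3000 products would push it to roughly 300M.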
Relationships: OWN
The general idea is to create the cross product between users and products and sample a percentage of that to create the relationships. For sampling we use rand(), for the cross product a MATCH on two independent labels.
My first approach didn’t really work, because the WHERE clause belongs to the MATCH and is pulled into the path finding, which causes it to sample only users, not user-product pairs. So for each user that was selected, all OWN relationships were created. Not what I wanted :)
// don't do this
match (u:User),(p:Product)
where rand() < 0.1
with u,p
limit 50000
merge (u)-[:OWN]->(p);
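To illustrate the difference, here is a plain-Python sketch (not Cypher) of the two sampling behaviours as described in this post. It models my description, not the Neo4j planner itself, and uses a smaller, made-up data size of 1000 users and 50 products. All function names are hypothetical.

```python
import random

def sample_per_user(users, products, p, rng):
    """One random draw per user: a selected user gets every product."""
    return [(u, prod) for u in users if rng.random() < p for prod in products]

def sample_per_pair(users, products, p, rng):
    """One random draw per (user, product) pair."""
    return [(u, prod) for u in users for prod in products if rng.random() < p]

def owned_counts(pairs):
    """Count how many products each user ends up owning."""
    counts = {}
    for u, _ in pairs:
        counts[u] = counts.get(u, 0) + 1
    return counts

rng = random.Random(42)  # fixed seed so the run is repeatable
users, products = range(1000), range(50)

per_user = owned_counts(sample_per_user(users, products, 0.1, rng))
per_pair = owned_counts(sample_per_pair(users, products, 0.1, rng))

# Per-user sampling: few users, each owning all 50 products.
print(len(per_user), set(per_user.values()))
# Per-pair sampling: almost every user owns a handful of products.
print(len(per_pair), max(per_pair.values()))
```

Both variants produce roughly the same number of pairs, but only the per-pair variant spreads ownership across users.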
So we have to detach the WHERE clause from the MATCH with a WITH clause that passes on the user-product pairs. We still limit the cross-product results to 5M, just as a safeguard in case we have miscalculated the cross product.
A rand() < 0.1 samples 10% of the total, which in our case is about 500k combinations. From those we can then create relationships with CREATE, which is faster and doesn’t check for duplicates.
match (u:User),(p:Product)
with u,p
limit 5000000
where rand() < 0.1
create (u)-[:OWN]->(p);
+-------------------+
| No data returned. |
+-------------------+
Relationships created: 509898
11684 ms
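As a quick sanity check on that number, here is a plain-Python sketch (not Cypher) that treats each pair as an independent 10% coin flip, which is my assumption about how rand() < 0.1 behaves. It compares the reported 509898 relationships with what we would expect from 5M pairs and from the full cross product of 100001 × 51 pairs. The helper name expected_sample is made up for this illustration.

```python
import math

def expected_sample(n_pairs, p):
    """Mean and standard deviation of a binomial sample of n_pairs at rate p."""
    mean = n_pairs * p
    std = math.sqrt(n_pairs * p * (1 - p))
    return mean, std

observed = 509898  # "Relationships created" from the run above

for n_pairs in (5_000_000, 100001 * 51):
    mean, std = expected_sample(n_pairs, 0.1)
    # Distance of the observed count from the mean, in standard deviations.
    z = (observed - mean) / std
    print(n_pairs, round(mean), round(std), round(z, 1))
```

The reported count sits much closer to 10% of the full cross product than to 10% of 5M. One possible reading is that the limit did not cut off any pairs before sampling in this run, but I have not verified how this Neo4j version evaluates LIMIT and WHERE within a WITH, so treat that as a guess.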
We could also use MERGE, which makes sure that at most one relationship between two nodes exists.
If we use MERGE, we should limit the number of relationships created in one execution to avoid exponential time build-up.
If we introduce this limit, we also have to move the window of node pairs under consideration in proportion to the share of relationships we create.
A limit of 100k is 1/5 of the total of 500k relationships, so we have to advance the window by 20% of 5M, i.e. 1M, per run:
match (u:User),(p:Product)
with u,p
// increase skip value from 0 to 4M in 1M steps
skip 1000000
limit 5000000
where rand() < 0.1
with u,p
limit 100000
merge (u)-[:OWN]->(p);
This results in:
+-------------------+
| No data returned. |
+-------------------+
Relationships created: 100000
51428 ms
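The skip values for the individual runs follow directly from the numbers above. This plain-Python sketch (not Cypher) just spells out the arithmetic. The function name skip_values and its parameters are hypothetical, and the windows are only approximate because the sampling is random.

```python
def skip_values(total_pairs, sample_rate, batch_size):
    """SKIP value for each run, so that every window of candidate pairs
    is expected to yield about batch_size sampled relationships."""
    # Each batch consumes roughly batch_size / sample_rate candidate pairs.
    window = round(batch_size / sample_rate)
    return list(range(0, total_pairs, window))

print(skip_values(5_000_000, 0.1, 100_000))
# [0, 1000000, 2000000, 3000000, 4000000]
```

With the 200k batches mentioned below, the same calculation would give a window of 2M and skip values of 0, 2M and 4M.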
If you have more memory for your Neo4j server than my 4 GB heap, you can also merge larger segments of relationships in a single transaction (200k or more).
We also create an index on the id property for both :User and :Product.
create index on :User(id);
create index on :Product(id);
Now we can run some of the test-queries we wanted to check:
Find similar users that own the same stuff that I do.
match (u:User {id:1})-[:OWN]->()<-[:OWN]-(other)
return other.name,count(*)
order by count(*) desc
limit 5;
+--------------------------+
| other.name    | count(*) |
+--------------------------+
| "Peter 23404" | 6        |
| "Peter 26754" | 5        |
| "Mark 35223"  | 5        |
| "Peter 19614" | 5        |
| "Chris 23959" | 5        |
+--------------------------+
5 rows
145 ms
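For readers who want to see what this query counts, here is a plain-Python sketch (not Cypher) on a tiny, made-up ownership map. It counts, for one user, how many owned products each other user shares. The data and the function name similar_users are purely illustrative.

```python
from collections import Counter

# Hypothetical toy data: user -> set of owned products.
owns = {
    "u1": {"Mac", "iPhone", "AR-Drone"},
    "u2": {"Mac", "iPhone"},
    "u3": {"AR-Drone"},
    "u4": {"Zentable"},
}

def similar_users(owns, me, limit=5):
    """Rank other users by the number of products they share with `me`."""
    counts = Counter()
    for product in owns[me]:
        for other, products in owns.items():
            if other != me and product in products:
                counts[other] += 1  # one matched path per shared product
    return counts.most_common(limit)

print(similar_users(owns, "u1"))  # [('u2', 2), ('u3', 1)]
```

Each shared product corresponds to one matched path in the query, which is why count(*) works as a similarity score.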
Collaborative filtering - product suggestions
match (u:User {id:3})-[:OWN]->()<-[:OWN]-(other)-[:OWN]->(p)
return p.name,count(*)
order by count(*) desc
limit 5;
+------------------------------------+
| p.name                  | count(*) |
+------------------------------------+
| "HyperJuice Battery 37" | 2894     |
| "Zentable 9"            | 2872     |
| "Kymera Wand 3"         | 2865     |
| "Zentable 31"           | 2863     |
| "Das Keyboard 35"       | 2847     |
+------------------------------------+
5 rows
410 ms
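The same idea, one hop further, can be sketched in plain Python (not Cypher) on a tiny, made-up ownership map. For every user who shares a product with me, it counts the other products that user owns. The data, the function name suggestions, and the exclude_owned flag are my own additions for illustration. As far as I can tell, the query above also counts products the user already owns, so the flag shows one way to filter those out.

```python
from collections import Counter

# Hypothetical toy data: user -> set of owned products.
owns = {
    "u1": {"Mac", "iPhone"},
    "u2": {"Mac", "iPhone", "AR-Drone"},
    "u3": {"Mac", "AR-Drone", "Zentable"},
    "u4": {"PowerUp"},
}

def suggestions(owns, me, limit=5, exclude_owned=False):
    """Rank products owned by users who share at least one product with `me`."""
    counts = Counter()
    for shared in owns[me]:
        for other, products in owns.items():
            if other == me or shared not in products:
                continue
            for p in products:
                # The product reached last must differ from the shared one.
                if p == shared:
                    continue
                if exclude_owned and p in owns[me]:
                    continue
                counts[p] += 1
    return counts.most_common(limit)

print(suggestions(owns, "u1", exclude_owned=True))
# [('AR-Drone', 3), ('Zentable', 1)]
```

Without the flag, "Mac" and "iPhone" also show up in the result for u1, even though u1 already owns them.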