Sampling A Neo4j Database
After reading the interesting blog post of my colleague Rik van Bruggen on "Media, Politics and Graphs" I thought it would be really cool to render it as a GrapGist. Especially, as he already shared all the queries as a GitHub Gist.
Unfortunately the dataset was a bit large for a sensible GraphGist representation, so I thought about means of extracting a smaller sample of his raw data that he made available (see his blog post for the link).
Considering my last blog post on creating data from sampling a cross product, this should be much easier. We know we want to have all nodes with the labels PARTY
, SHOW
and GENDER
in our graph as well as a sample of GUEST
nodes with their relationships.
The first part is easy:
MATCH (n)
WHERE n:PARTY OR n:SHOW OR n:GENDER
RETURN n;
The second part uses something that was not helpful in my last exploration, namely that random sampling when applied directly to a match, is used to filter the first node-pattern in the match and then still traverse all relationships/paths emanating from that node.
MATCH(n:GUEST)-[r]->()
WHERE rand() < 0.1
RETURN n,r;
The number you compare rand()
to is the percentage you want to get back, in this example 10%.
Now I have two nice queries, that can get me the data, how can I bring them together? With UNION ALL
MATCH (n)
WHERE n:PARTY OR n:SHOW OR n:GENDER
RETURN n, null as r
UNION ALL
MATCH(n:GUEST)-[r]->()
WHERE rand() < 0.1
RETURN n,r;
And where do I get the Cypher statements from, that I can use to populate my GraphGist database setup? Fortunately my dump
command made it into the Neo4j-Shell, so that we can just run it on the command-line and redirect the output into a file:
bin/neo4j-shell -path talkshow/graph.db \ -c 'dump MATCH (n) WHERE n:PARTY OR n:SHOW OR n:GENDER RETURN n, null as r UNION ALL MATCH(n:GUEST)-[r]->() WHERE rand() < 0.1 RETURN n,r;' \ > talkshow/sample.cql
Don’t forget the semicolon at the end! Looking at sample.cql
we see something like:
begin
create (_0:`SHOW` {`Modularity Name`:"B&vD", `id`:"B&vD", `label`:"B&vD", `modularity_class`:3, `weighted outdegree`:0.000000})
create (_1:`SHOW` {`Modularity Name`:"P&W", `id`:"P&W", `label`:"P&W", `modularity_class`:4, `weighted outdegree`:0.000000})
create (_2:`SHOW` {`Modularity Name`:"DWDD", `id`:"DWDD", `label`:"DWDD", `modularity_class`:5, `weighted outdegree`:0.000000})
...
...
create _509-[:`VISITED` {`quantity`:1}]->_5
create _509-[:`VISITED` {`quantity`:1}]->_2
create _509-[:`VISITED` {`quantity`:1}]->_1
create _509-[:`VISITED` {`quantity`:1}]->_0
;
commit
Which we can now use to populate our database for our GraphGist, and here it is in all its beauty - GraphGist: "Media, Politics and Graphs". But actually I chose not to use Rik’s GitHub Gist with the queries, but to copy the nice text and pictures from his blog post into the GraphGist.
You might notice that some of the parties go without connections. That would need some tweaking of the sampling which I leave as exercise for you.
Have fun
Michael