Day 5: Neo4j & Graphipedia

I'm taking a slight step backwards today. Last night, I remembered Neo4j, a database that stores data as a graph of nodes and relationships instead of the tables used by more common relational (SQL) databases.

Neo4j has a number of "shortest path" algorithms for finding the shortest path between two nodes in a graph. We're going to load the entirety of Wikipedia into a Neo4j database, with articles as nodes and the links between articles as edges, so we can use those functions.
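To make the "articles as nodes, links as edges" idea concrete, here is a toy sketch of a shortest-path search in plain Python. It is not how Neo4j implements its algorithms; it is just a breadth-first search over a small, made-up link table (the `links` dictionary and the `shortest_path` helper are invented for illustration).

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search over a dict mapping an article title to the titles it links to.

    Returns the list of titles from start to goal, or None if goal is unreachable.
    """
    if start == goal:
        return [start]
    previous = {start: None}  # remembers which article we arrived from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, []):
            if neighbor in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbor] = current
            if neighbor == goal:
                # Walk backwards from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Made-up sample data: each key is an article, each value is its outgoing links.
links = {
    "Neo4j": ["Graph database", "Java (programming language)"],
    "Graph database": ["Graph theory"],
    "Java (programming language)": ["Sun Microsystems"],
    "Graph theory": ["Leonhard Euler"],
}

print(shortest_path(links, "Neo4j", "Leonhard Euler"))
```

Links are treated as one-directional here, so a path from "Leonhard Euler" back to "Neo4j" would not be found in this sample. With millions of articles, a dictionary like this stops being practical, which is where a graph database comes in.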

And again, someone has done all the hard work to make this possible: mirkonasato/graphipedia.

How To

  1. Install Neo4j Desktop
  2. Clone noppaz/graphipedia (the original version has not been updated to Neo4j v4, but this fork has been)
  3. Install Maven with Homebrew: brew install maven
  4. Get confused because java doesn't seem to exist even though it's installed: java -version
  5. Symlink the openjdk that Homebrew installed alongside maven so that macOS can find it: sudo ln -sfn /opt/homebrew/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
  6. Check to make sure java exists now: java -version
  7. cd into the cloned project and build it with mvn package
  8. Download the pages-articles-multistream.xml.bz2 dump from https://dumps.wikimedia.your.org/enwiki/latest/
  9. Run the import script, pointing it at the downloaded file (adjust the filename to match your download): ./import.sh ~/Downloads/enwiki-latest-pages-articles.xml.bz2 ./output
  10. Wait...
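
Since the import takes a long time, it may be worth checking that the downloaded dump is readable before starting step 9. Below is a minimal sketch, assuming the dump is bz2-compressed XML in which each article title sits on its own `<title>...</title>` line. The `first_titles` helper is hypothetical and is not part of graphipedia.

```python
import bz2


def first_titles(dump_path, limit=5):
    """Return up to `limit` article titles from the start of a bz2-compressed XML dump.

    Reads the file as a stream, so the whole dump is never decompressed to disk
    or loaded into memory.
    """
    titles = []
    open_tag, close_tag = "<title>", "</title>"
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip()
            if line.startswith(open_tag) and line.endswith(close_tag):
                titles.append(line[len(open_tag):-len(close_tag)])
                if len(titles) >= limit:
                    break  # stop early; we only want a quick peek
    return titles
```

Calling `first_titles` with the path to your download should return a handful of titles almost immediately. An exception or an empty list suggests the download is incomplete or not the file you expected.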