I recently stumbled upon Six Degrees of Wikipedia. This is a really neat website with which you can find the shortest path between any two Wikipedia articles.
The idea is really interesting and it’s a great excuse to mess around with graphs.
In more technical terms, the network of pages and the links between them is an unweighted directed graph, with the vertices representing pages and the edges representing the links themselves.
Since the graph is unweighted, the simplest and most efficient algorithm we can use to find the shortest path between two given nodes is Breadth-First Search (or BFS for short).
First things first, we’re going to need data to play with.
Let’s build a graph!
We need two things: a list of pages, and a list of internal links.
Luckily, Wikipedia has a very well documented database schema, so it’s easy to figure out what we need: the page table and the pagelinks table.
The English Wikipedia has a crazy amount of data, so we’re going to make things easier for ourselves by downloading the data for simple.wikipedia.org here. We’ll need to download simplewiki-latest-page.sql.gz and simplewiki-latest-pagelinks.sql.gz and import those into an SQL database.
Once we’ve done this, let’s look at the data we have. In our case, the number of vertices will be equal to the number of pages in the Article namespace.
mysql> SELECT COUNT(*) FROM page WHERE page_namespace = 0;
+----------+
| COUNT(*) |
+----------+
|   185365 |
+----------+
1 row in set (0.35 sec)
The number of edges will be equal to the number of pagelinks from and to Article pages.
mysql> SELECT COUNT(*) FROM pagelinks WHERE pl_from_namespace = 0 AND pl_namespace = 0;
+----------+
| COUNT(*) |
+----------+
|  6634554 |
+----------+
1 row in set (3.44 sec)
185k vertices and 6.6 million edges, that’s a pretty big graph to play with!
The data model isn’t optimal for what we want to do, though. To represent a graph, we need a list of vertices and, for each of those vertices, an adjacency list.
I’ll be using a very simple checkErr function throughout this article, so let’s go ahead and define it right now:
func checkErr(e error) {
if e != nil {
panic(e)
}
}
First, we need some kind of Vertex type, and a way to add neighbors.
type Vertex struct {
Title string
Id int
Neighbors []*Vertex
}
func (from *Vertex) AddNeighbour(to *Vertex) {
if to != nil {
from.Neighbors = append(from.Neighbors, to)
}
}
func NewVertex(id int, title string) *Vertex {
neighbors := make([]*Vertex, 0)
return &Vertex{Id: id, Title: title, Neighbors: neighbors}
}
Next, we need a function to build the graph:
func buildGraph() map[string]*Vertex {
db, err := sql.Open("mysql", "someuser:@/wikilinks")
checkErr(err)
defer db.Close()
vertices := make(map[string]*Vertex)
// pagelinks uses IDs for the source page and Titles for the destination page
// We're going to need to maintain a mapping to avoid hitting the DB each time when building the Edges
idToTitle := make(map[int]string)
buildVertices(db, vertices, idToTitle)
buildEdges(db, vertices, idToTitle)
return vertices
}
Now, as an astute reader, you’ll have noticed that we haven’t actually done anything yet!
First, let’s build a list of all the vertices in the graph. We want to ask the database for the id and title fields of every single entry in the page table where the namespace is Article (i.e. with namespace ID 0).
For each of those entries, we need to:
* populate our idToTitle mapping that we’ll be using later
* instantiate a Vertex object and add it to our vertices map, using title as the key.
func buildVertices(db *sql.DB, vertices map[string]*Vertex, idToTitle map[int]string) {
rows, err := db.Query("SELECT page_id, page_title FROM page WHERE page_namespace = 0")
checkErr(err)
for rows.Next() {
var id int
var title string
err := rows.Scan(&id, &title)
checkErr(err)
idToTitle[id] = title
vertices[title] = NewVertex(id, title)
}
fmt.Println("Done building vertices")
}
Pretty straightforward. Now let’s move on to building the edges. Each row of the pagelinks table defines an edge from the page with ID pl_from to the page titled pl_title.
The indexes on the pagelinks table aren’t optimized for our use case. Instead of putting the namespace conditions in the SQL query, we’re going to iterate over all the records and filter out the non-Article links in our code.
func buildEdges(db *sql.DB, vertices map[string]*Vertex, idToTitle map[int]string) {
rows, err := db.Query("SELECT pl_from, pl_title, pl_namespace, pl_from_namespace FROM pagelinks")
checkErr(err)
for rows.Next() {
var fromId int
var toTitle string
var toNamespace int
var fromNamespace int
err := rows.Scan(&fromId, &toTitle, &toNamespace, &fromNamespace)
checkErr(err)
// Filter the links between two articles
if fromNamespace == 0 && toNamespace == 0 {
// Find the origin and destination vertices, and build the edge
if fromTitle, found := idToTitle[fromId]; found {
vertices[fromTitle].AddNeighbour(vertices[toTitle])
}
}
}
fmt.Println("Done building edges")
}
Hitting the database and building this whole graph is pretty expensive. On my machine (MacBook Pro with i7 2.5GHz, 16GB of RAM and the default brew MySQL setup), it takes around 20 seconds.
We don’t need to do all this work every time, though. We can serialize the graph once we’ve built it, persist it on disk (in a file) and deserialize it on subsequent runs of our program.
You may be familiar with the Marshal module in Ruby or Python’s pickle. The idea is simple yet powerful: you can take any data structure and serialize it. From there, you can send it over the wire to another program, dump it to disk for later use, or anything else.
Go provides the gob package. Let’s have a look at the documentation:
A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types. Pointers are not transmitted, but the things they point to are transmitted; that is, the values are flattened. Nil pointers are not permitted, as they have no value. Recursive types work fine, but recursive values (data with cycles) are problematic. This may change.
Graphs are a prime example of cyclical data structures, so this isn’t going to work.
After doing a bit of research, I came across the sereal package, which supports cyclical data structures.
Let’s modify our buildGraph function. If we find serialized data, we’ll go ahead and deserialize that. Otherwise, we’ll build the graph from the database, serialize it and dump it to disk.
const DATA_FILE = "graph.dat"
func buildGraph() map[string]*Vertex {
// A file with serialized data exists
if _, err := os.Stat(DATA_FILE); err == nil {
var vertices map[string]*Vertex
// Read the file's contents
data, err := ioutil.ReadFile(DATA_FILE)
checkErr(err)
// Deserialize the data into a graph
err = sereal.Unmarshal(data, &vertices)
checkErr(err)
return vertices
} else { // No file is found, let's load this from the DB
db, err := sql.Open("mysql", "someuser:@/wikilinks")
checkErr(err)
defer db.Close()
vertices := make(map[string]*Vertex)
idToTitle := make(map[int]string)
buildVertices(db, vertices, idToTitle)
buildEdges(db, vertices, idToTitle)
// The tree is built, serialize the data
data, err := sereal.Marshal(vertices)
checkErr(err)
// Dump the serialized data to disk
err = ioutil.WriteFile(DATA_FILE, data, 0600)
checkErr(err)
return vertices
}
}
With this change, building the graph the first time takes a tiny bit longer due to serialization and disk write. However, subsequent runs of the program are ~10 times faster, taking just under 2 seconds.
Breadth-First Search
Now, for the actual algorithm!
The way BFS works is rather simple. As hinted by the name, you visit a graph in layers. First, you’ll visit every vertex 1 edge away from the start point, then every vertex 2 edges away, and so on and so forth.
In practical terms, you need a queue of the nodes you need to visit, and a way to keep track of the nodes you’ve already visited (either using a set, or a flag on the node themselves).
From the start node, push all the adjacent nodes onto the queue. Then process the queue one item at a time: if the item is the node we’re looking for, great, we’re done. If not, append every yet-unvisited node in its adjacency list to the queue, and move on to the next item.
This is a bit abstract, let’s see if we can visualize it better.
BFS illustrated
The Wikipedia database is a bit big for an illustrative example, so we’re going to work with a graph that has 5 vertices and 6 edges:
We’ll use A as our starting point, and E as our target.
There are 3 possible paths we could take:
A -> B -> E
A -> C -> B -> E
A -> C -> D -> E
To start with, we’ll put A in the queue.
Visited: []
Queue: [A]
Then, we start the process by visiting A. Since A != E, we push A’s neighbors (B and C) into the queue.
Visited: [A]
Queue: [B, C]
Next, we visit B. B != E, so we push E onto the queue.
Visited: [A, B]
Queue: [C, E]
Next, we visit C. C != E, and we know we already visited B, so we only push D onto the queue.
Visited: [A, B, C]
Queue: [E, D]
Finally, we visit E, which is our destination, and we’re done!
Let’s put that into code:
func (from *Vertex) BFS(to *Vertex) {
visited := make(map[int]struct{})
queue := make([]*Vertex, 0, 1)
// Push the source node to the queue and mark it as visited
queue = append(queue, from)
visited[from.Id] = struct{}{}
// Process until we either:
// * have no items left to visit (empty queue)
// * have found the target
for len(queue) > 0 {
// Pop the first element from the queue
v := queue[0]
queue = queue[1:]
if v == to {
fmt.Printf("Target %s found!\n", v.Title)
return
}
// Add the yet-unvisited neighbors to the queue
for _, n := range v.Neighbors {
if _, seen := visited[n.Id]; !seen {
queue = append(queue, n)
visited[n.Id] = struct{}{}
}
}
}
}
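As a sanity check, we can run this search over the five-vertex example from the illustration. The snippet below repeats the Vertex type and the same BFS logic (returning a bool instead of printing) so it compiles on its own; the edge set is inferred from the paths listed earlier:

```go
package main

import "fmt"

type Vertex struct {
	Title     string
	Id        int
	Neighbors []*Vertex
}

func (from *Vertex) AddNeighbour(to *Vertex) {
	if to != nil {
		from.Neighbors = append(from.Neighbors, to)
	}
}

// BFS reports whether `to` is reachable from `from`.
func (from *Vertex) BFS(to *Vertex) bool {
	visited := map[int]struct{}{from.Id: {}}
	queue := []*Vertex{from}
	for len(queue) > 0 {
		v := queue[0]
		queue = queue[1:]
		if v == to {
			return true
		}
		for _, n := range v.Neighbors {
			if _, seen := visited[n.Id]; !seen {
				visited[n.Id] = struct{}{}
				queue = append(queue, n)
			}
		}
	}
	return false
}

func main() {
	a := &Vertex{Title: "A", Id: 1}
	b := &Vertex{Title: "B", Id: 2}
	c := &Vertex{Title: "C", Id: 3}
	d := &Vertex{Title: "D", Id: 4}
	e := &Vertex{Title: "E", Id: 5}
	// The six edges, inferred from the paths listed in the illustration
	a.AddNeighbour(b)
	a.AddNeighbour(c)
	b.AddNeighbour(e)
	c.AddNeighbour(b)
	c.AddNeighbour(d)
	d.AddNeighbour(e)
	fmt.Println(a.BFS(e)) // reachable
	fmt.Println(e.BFS(a)) // not reachable: all edges point away from A
}
```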
Reconstructing the shortest path
The current version of our algorithm only tells us when the destination has been reached, but not how to reach it. That’s not very helpful, so let’s make a few changes to address that.
The path from the source to the destination is a list of edges.
If we know that when we pushed E into the queue we were visiting B, and when we pushed B into the queue we were visiting A, we have reconstructed the whole path!
Since we already maintain a map of all the nodes we visit, we can set the “origin” vertex as the value in the map instead of an empty struct.
func (from *Vertex) BFS(to *Vertex) {
// Track which node we visit, and which node we were coming from when visiting it
visited := make(map[int]*Vertex)
queue := make([]*Vertex, 0, 1)
// Push the source node to the queue
queue = append(queue, from)
// Process until we either:
// * have no items left to visit (empty queue)
// * have found the target
for len(queue) > 0 {
// Pop the first element from the queue
v := queue[0]
queue = queue[1:]
if v == to {
path := constructPath(visited, to, from)
fmt.Println(strings.Join(path, " -> "))
return
}
// Add the yet-unvisited neighbors to the queue
for _, n := range v.Neighbors {
if _, seen := visited[n.Id]; !seen {
queue = append(queue, n)
visited[n.Id] = v
}
}
}
}
To reconstruct the path, we use the visited map and backtrack until we get back to the root node. This gives us the path backwards, so we need to reverse the resulting list.
func constructPath(visited map[int]*Vertex, target *Vertex, from *Vertex) []string {
path := make([]string, 0)
// Add the target as part of the path
path = append(path, target.Title)
// Move back until we find source vertex
curr, present := visited[target.Id]
for ; present && curr != from; curr, present = visited[curr.Id] {
path = append(path, curr.Title)
}
// Push the source vertex onto the path to get the full picture
path = append(path, from.Title)
// Reverse the path to get it the right way around
for i := len(path)/2 - 1; i >= 0; i-- {
j := len(path) - 1 - i
path[i], path[j] = path[j], path[i]
}
return path
}
Finally, we just need a main function, and a small helper to call our BFS function.
func findPath(graph map[string]*Vertex, from string, to string) {
f, fromFound := graph[from]
t, toFound := graph[to]
if fromFound && toFound {
f.BFS(t)
}
}
func main() {
graph := buildGraph()
findPath(graph, os.Args[1], os.Args[2])
}
And that’s all we need! With everything combined, you’ll be able to give the title of two Wikipedia articles and get the shortest path between them:
$> ./pathfinder Australia Great_Red_Spot
Australia -> Sun -> Jupiter -> Great_Red_Spot
This works, awesome!
Before we conclude this article, let’s discuss the complexity of the BFS algorithm.
Complexity
In the worst case scenario, every vertex will be pushed onto the queue. For each of those vertices, every edge will be followed.
If we call V the number of vertices and E the number of edges, this gives us the following complexity: O(V+E).
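One way to see that bound concretely is to count the work done during a full traversal of the example graph from earlier: every vertex is dequeued once, and every outgoing edge is inspected once. A small sketch (the edge set is inferred from the paths we listed):

```go
package main

import "fmt"

// traversalCost runs a BFS over an adjacency list and counts the work done:
// one unit per vertex dequeued, one unit per edge inspected.
func traversalCost(adj map[string][]string, start string) (visits, edgeChecks int) {
	visited := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) > 0 {
		v := queue[0]
		queue = queue[1:]
		visits++
		for _, n := range adj[v] {
			edgeChecks++
			if !visited[n] {
				visited[n] = true
				queue = append(queue, n)
			}
		}
	}
	return visits, edgeChecks
}

func main() {
	// The example graph from the illustration: 5 vertices, 6 edges
	adj := map[string][]string{
		"A": {"B", "C"},
		"B": {"E"},
		"C": {"B", "D"},
		"D": {"E"},
		"E": {},
	}
	visits, edgeChecks := traversalCost(adj, "A")
	fmt.Printf("%d vertex visits, %d edge checks\n", visits, edgeChecks) // 5 vertex visits, 6 edge checks
}
```

Since every vertex here is reachable from A, the traversal does exactly V units of vertex work and E units of edge work, which is the O(V+E) bound hit exactly.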
Conclusion
We’ve done quite a lot! The initial research into graphs and how Wikipedia structures their articles and the links between them taught us quite a lot. We saw a quick example of how to query a MySQL database with Go.
From there, we’ve built a fairly large graph and found a way to serialize this complex datastructure to disk for later use.
Finally, we looked at how BFS works and implemented it, allowing us to find the shortest path from one article to another.
That’s a lot of effort invested, so it would be a shame to stop here. In a follow-up article, we’ll try to answer a few more questions:
- Can we iterate on our simple, iterative BFS and make it concurrent?
- What’s the greatest distance between two articles?
- Are there any pairs of articles that cannot be connected?
Notes:
1 - If you want to play with this by yourself, you can get the full code here.