
I recently stumbled upon Six Degrees of Wikipedia. This is a really neat website that lets you find the shortest path between any two Wikipedia articles.

The idea is really interesting and it’s a great excuse to mess around with graphs.

In more technical terms, the network of pages and the links between them is an unweighted directed graph, with the vertices representing pages and the edges representing the links themselves.

Since the graph is unweighted, the simplest and most efficient algorithm we can use to find the shortest path between two given nodes is Breadth-First Search (or BFS for short).

First things first, we're going to need some data to play with.

Let’s build a graph!

We need two things: a list of pages, and a list of internal links.

Luckily, Wikipedia's database layout is very well documented, so it's easy to figure out what we need: the page table and the pagelinks table. The English Wikipedia has a crazy amount of data, so we're going to make things easier for ourselves by downloading the data for simple.wikipedia.org here. We'll need to download simplewiki-latest-page.sql.gz and simplewiki-latest-pagelinks.sql.gz and import them into a MySQL database.

Once we’ve done this, let’s look at the data we have. In our case, the number of vertices will be equal to the number of pages in the Article namespace.

mysql> SELECT COUNT(*) FROM page WHERE page_namespace = 0;
+----------+
| COUNT(*) |
+----------+
|   185365 |
+----------+
1 row in set (0.35 sec)

The number of edges will be equal to the number of pagelinks from and to Article pages.

mysql> SELECT COUNT(*) FROM pagelinks WHERE pl_from_namespace = 0 AND pl_namespace = 0;
+----------+
| COUNT(*) |
+----------+
|  6634554 |
+----------+
1 row in set (3.44 sec)

185k vertices and 6.6 million edges, that’s a pretty big graph to play with!

The data model isn't optimal for what we want to do though. To represent a graph, we need a list of vertices, and for each of those vertices an adjacency list.

I’ll be using a very simple checkErr function throughout this article, so let’s go ahead and define it right now:

func checkErr(e error) {
    if e != nil {
        panic(e)
    }
}

First, we need some kind of Vertex type, and a way to add neighbors.

type Vertex struct {
    Title     string
    Id        int
    Neighbors []*Vertex
}

func (from *Vertex) AddNeighbour(to *Vertex) {
    if to != nil {
        from.Neighbors = append(from.Neighbors, to)
    }
}

func NewVertex(id int, title string) *Vertex {
    neighbors := make([]*Vertex, 0)
    return &Vertex{Id: id, Title: title, Neighbors: neighbors}
}
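
Before wiring this up to the database, here's a tiny, made-up example of how these pieces fit together (the page titles here are invented for illustration):

func tinyExample() {
    // Two made-up pages: "Earth" links to "Moon"
    earth := NewVertex(1, "Earth")
    moon := NewVertex(2, "Moon")
    earth.AddNeighbour(moon)

    fmt.Println(earth.Neighbors[0].Title) // prints "Moon"
}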

Next, we need a function to build the graph:

func buildGraph() map[string]*Vertex {
    db, err := sql.Open("mysql", "someuser:@/wikilinks")
    checkErr(err)
    defer db.Close()

    vertices := make(map[string]*Vertex)
    // pagelinks uses IDs for the source page and Titles for the destination page
    // We're going to need to maintain a mapping to avoid hitting the DB each time when building the Edges
    idToTitle := make(map[int]string)
    buildVertices(db, vertices, idToTitle)
    buildEdges(db, vertices, idToTitle)

    return vertices
}

Now, as an astute reader, you’ll have noticed that we haven’t actually done anything yet!

First, let's build a list of all the vertices in the graph. We want to ask the database for the id and title fields of every single entry in the page table where the namespace is Article (i.e. with namespace ID 0). For each of those entries, we need to:

  • populate the idToTitle mapping that we'll be using later
  • instantiate a Vertex object and add it to our vertices map, using the title as the key

func buildVertices(db *sql.DB, vertices map[string]*Vertex, idToTitle map[int]string) {
    rows, err := db.Query("SELECT page_id, page_title FROM page WHERE page_namespace = 0")
    checkErr(err)

    for rows.Next() {
        var id int
        var title string

        err := rows.Scan(&id, &title)
        checkErr(err)

        idToTitle[id] = title
        vertices[title] = NewVertex(id, title)
    }

    fmt.Println("Done building vertices")
}

Pretty straightforward. Now let's move on to building the edges. Looking at the pagelinks table, each row describes an edge from the page with ID pl_from to the page titled pl_title.

The indexes on the pagelinks table aren't optimized for our use case. Instead of putting the namespace conditions in the SQL query, we're going to iterate over all the records and filter out the non-Article links in our code.

func buildEdges(db *sql.DB, vertices map[string]*Vertex, idToTitle map[int]string) {
    rows, err := db.Query("SELECT pl_from, pl_title, pl_namespace, pl_from_namespace FROM pagelinks")
    checkErr(err)

    for rows.Next() {
        var fromId int
        var toTitle string
        var toNamespace int
        var fromNamespace int

        err := rows.Scan(&fromId, &toTitle, &toNamespace, &fromNamespace)
        checkErr(err)
        // Only keep the links between two Article pages
        if fromNamespace == 0 && toNamespace == 0 {
            // Find the origin and destination vertices, and build the edge
            if fromTitle, found := idToTitle[fromId]; found {
                vertices[fromTitle].AddNeighbour(vertices[toTitle])
            }
        }
    }
    fmt.Println("Done building edges")
}

Hitting the database and building this whole graph is pretty expensive. On my machine (MacBook Pro with an i7 2.5GHz, 16GB of RAM and the default brew MySQL setup), it takes around 20 seconds. We don't need to do all this work every time though. We can serialize the graph once we've built it, persist it to disk (in a file) and deserialize it on subsequent runs of our program.

You may be familiar with Ruby's Marshal module or Python's pickle. The idea is simple yet powerful: you can take any data structure and serialize it. From there, you can send it over the wire to another program, dump it to disk for later use, or anything else.

Go provides the gob package. Let’s have a look at the documentation:

A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types. Pointers are not transmitted, but the things they point to are transmitted; that is, the values are flattened. Nil pointers are not permitted, as they have no value. Recursive types work fine, but recursive values (data with cycles) are problematic. This may change.
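
To get a feel for how this works, here's a minimal sketch of a gob round trip on a flat, cycle-free struct (this snippet is mine, not part of the original program; it uses the standard bytes and encoding/gob packages):

type flatPage struct {
    Id    int
    Title string
}

func gobRoundTrip() {
    // An in-memory buffer to hold the serialized bytes
    var buf bytes.Buffer

    // Serialize a value into the buffer
    err := gob.NewEncoder(&buf).Encode(flatPage{Id: 42, Title: "Jupiter"})
    checkErr(err)

    // Deserialize it back into a fresh variable
    var decoded flatPage
    err = gob.NewDecoder(&buf).Decode(&decoded)
    checkErr(err)

    fmt.Println(decoded.Title) // prints "Jupiter"
}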

Graphs are a prime example of cyclical data structures, so this isn’t going to work.

After doing a bit of research, I came across the sereal package, which supports cyclical data structures.
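
As a quick sanity check (this snippet is mine, and it assumes sereal restores shared references the way the rest of this article relies on), a cyclic structure should survive a round trip through the same sereal.Marshal and sereal.Unmarshal calls we're about to use:

func cyclicRoundTrip() {
    // Two vertices linking to each other: the smallest possible cycle
    a := NewVertex(1, "A")
    b := NewVertex(2, "B")
    a.AddNeighbour(b)
    b.AddNeighbour(a)

    // Serialize the cyclic structure...
    data, err := sereal.Marshal(map[string]*Vertex{"A": a, "B": b})
    checkErr(err)

    // ...and deserialize it back
    var decoded map[string]*Vertex
    err = sereal.Unmarshal(data, &decoded)
    checkErr(err)

    fmt.Println(decoded["A"].Neighbors[0].Title) // prints "B"
}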

Let's modify our buildGraph function. If we find serialized data on disk, we'll go ahead and deserialize that. Otherwise, we'll build the graph from the database, serialize it and dump it to disk.

const DATA_FILE = "graph.dat"

func buildGraph() map[string]*Vertex {
    // A file with serialized data exists
    if _, err := os.Stat(DATA_FILE); err == nil {
        var vertices map[string]*Vertex

        // Read the file's contents
        data, err := ioutil.ReadFile(DATA_FILE)
        checkErr(err)

        // Deserialize the data into a graph
        err = sereal.Unmarshal(data, &vertices)
        checkErr(err)

        return vertices
    } else { // No file is found, let's load this from the DB
        db, err := sql.Open("mysql", "someuser:@/wikilinks")
        checkErr(err)
        defer db.Close()

        vertices := make(map[string]*Vertex)
        idToTitle := make(map[int]string)
        buildVertices(db, vertices, idToTitle)

        buildEdges(db, vertices, idToTitle)

        // The graph is built, serialize the data
        data, err := sereal.Marshal(vertices)
        checkErr(err)
        // Dump the serialized data to disk
        err = ioutil.WriteFile(DATA_FILE, data, 0600)
        checkErr(err)

        return vertices
    }
}

With this change, building the graph the first time takes a tiny bit longer due to serialization and disk write. However, subsequent runs of the program are ~10 times faster, taking just under 2 seconds.


Now, for the actual algorithm!

The way BFS works is rather simple. As hinted by the name, you visit a graph in layers. First, you’ll visit every vertex 1 edge away from the start point, then every vertex 2 edges away, and so on and so forth.

In practical terms, you need a queue of the nodes you still have to visit, and a way to keep track of the nodes you've already visited (either using a set, or a flag on the nodes themselves).

From the start node, push all the adjacent nodes onto the queue. Then process the items in the queue one by one. If the current item is the node we're looking for, great, we're done. If not, append each yet-unvisited node in its adjacency list to the queue, and move on to the next item.

This is a bit abstract, let’s see if we can visualize it better.

BFS illustrated

The Wikipedia database is a bit big for an illustrative example, so we’re going to work with a graph that has 5 vertices and 6 edges:

[Figure: the example graph, with 5 vertices and 6 edges]

We'll use A as our starting point, and E as our target. There are 3 possible paths we could take:

A -> B -> E
A -> C -> B -> E
A -> C -> D -> E

To start with, we’ll put A in the queue.

Visited:   []
Queue:     [A]

Then, we start the process by visiting A. Since A != E, we push A’s neighbors (B and C) into the queue.

Visited:   [A]
Queue:     [B, C]

Next, we visit B. B != E, so we push E onto the queue.

Visited:   [A, B]
Queue:     [C, E]

Next, we visit C. C != E, and we know we already visited B, so we only push D onto the queue.

Visited:   [A, B, C]
Queue:     [E, D]

Finally, we visit E, which is our destination, and we're done!

Let’s put that into code:

func (from *Vertex) BFS(to *Vertex) {
    visited := make(map[int]struct{})
    queue := make([]*Vertex, 0, 1)

    // Push the source node to the queue
    queue = append(queue, from)

    // Process until we either:
    // * have no items left to visit (empty queue)
    // * have found the target
    for len(queue) > 0 {
        // Pop the first element from the queue
        v := queue[0]
        queue = queue[1:]

        if v == to {
            fmt.Printf("Target %s found!\n", v.Title)
            return
        }

        // Add the yet-unvisited neighbors to the queue
        for _, n := range v.Neighbors {
            if _, seen := visited[n.Id]; !seen {
                queue = append(queue, n)
                visited[n.Id] = struct{}{}
            }
        }
    }
}
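
If you want to try the algorithm out on something smaller than Wikipedia first, here's a hypothetical sketch that wires up the toy graph from the walkthrough by hand (the exact edge set is my reading of the figure above):

func toyExample() {
    a := NewVertex(1, "A")
    b := NewVertex(2, "B")
    c := NewVertex(3, "C")
    d := NewVertex(4, "D")
    e := NewVertex(5, "E")

    // Assumed edges: A->B, A->C, B->E, C->B, C->D, D->E
    a.AddNeighbour(b)
    a.AddNeighbour(c)
    b.AddNeighbour(e)
    c.AddNeighbour(b)
    c.AddNeighbour(d)
    d.AddNeighbour(e)

    a.BFS(e) // prints "Target E found!"
}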

Reconstructing the shortest path

The current version of our algorithm only tells us when the destination has been reached, but not how to reach it. That's not very helpful, so let's make a few changes to address that.

The path from the source to the destination is a list of edges. If we know that when we pushed E into the queue we were visiting B, and when we pushed B into the queue we were visiting A, we have reconstructed the whole path!

Since we already maintain a map of all the nodes we visit, we can set the “origin” vertex as the value in the map instead of an empty struct.

func (from *Vertex) BFS(to *Vertex) {
    // Track which node we visit, and which node we were coming from when visiting it
    visited := make(map[int]*Vertex)
    queue := make([]*Vertex, 0, 1)

    // Push the source node to the queue
    queue = append(queue, from)

    // Process until we either:
    // * have no items left to visit (empty queue)
    // * have found the target
    for len(queue) > 0 {
        // Pop the first element from the queue
        v := queue[0]
        queue = queue[1:]

        if v == to {
            // Reconstruct the path we took (constructPath is defined below) and print it
            fmt.Println(strings.Join(constructPath(visited, to, from), " -> "))
            return
        }

        // Add the yet-unvisited neighbors to the queue
        for _, n := range v.Neighbors {
            if _, seen := visited[n.Id]; !seen {
                queue = append(queue, n)
                visited[n.Id] = v
            }
        }
    }
}

To reconstruct the path, we use the visited map and backtrack until we get back to the root node. This gives us the path backwards, so we need to reverse the resulting list.

func constructPath(visited map[int]*Vertex, target *Vertex, from *Vertex) []string {
    path := make([]string, 0)

    // Add the target as part of the path
    path = append(path, target.Title)

    // Move back until we find source vertex
    curr, present := visited[target.Id]
    for ; present && curr != from; curr, present = visited[curr.Id] {
        path = append(path, curr.Title)
    }

    // Push the source vertex onto the path to get the full picture
    path = append(path, from.Title)

    // Reverse the path to get it the right way around
    for i := len(path)/2 - 1; i >= 0; i-- {
        j := len(path) - 1 - i
        path[i], path[j] = path[j], path[i]
    }
    return path
}

Finally, we just need a main, and a small helper to call our BFS function.

func findPath(graph map[string]*Vertex, from string, to string) {
    f, fromFound := graph[from]
    t, toFound := graph[to]

    if fromFound && toFound {
        f.BFS(t)
    }
}

func main() {
    graph := buildGraph()

    findPath(graph, os.Args[1], os.Args[2])
}

And that's all we need! With everything combined, you'll be able to give the titles of two Wikipedia articles and get the shortest path between them:

$> ./pathfinder Australia Great_Red_Spot
Australia -> Sun -> Jupiter -> Great_Red_Spot

This works, awesome!

Before we conclude this article, let’s discuss the complexity of the BFS algorithm.

Complexity

In the worst case scenario, every vertex will be pushed onto the queue exactly once, and for each of those vertices, every outgoing edge will be followed once.

If we call V the number of vertices and E the number of edges, this gives us the following complexity: O(V+E). For our graph, that's roughly 185,000 + 6,600,000 ≈ 6.8 million operations in the worst case.

Conclusion

We've done quite a lot! We dug into graphs and into how Wikipedia structures its articles and the links between them, and we saw a quick example of how to query a MySQL database with Go.

From there, we built a fairly large graph and found a way to serialize this complex data structure to disk for later use.

Finally, we looked at how BFS works and implemented it, allowing us to find the shortest path from one article to another.

That's a lot of effort invested, so it would be a shame to stop here. In a follow-up article, we'll try to answer a few more questions:

  • Can we iterate on our simple, iterative BFS and make it concurrent?
  • What's the largest distance between two articles?
  • Are there any pairs of articles that cannot be connected?

Notes:

1 - If you want to play with this by yourself, you can get the full code here.
