What is Google’s PageRank algorithm?

We use graph theory to find the importance of webpages: the Random Surfer Model, pseudocode, and a running time analysis of the PageRank algorithm

S. W.
6 min read · Dec 12, 2020

The classic PageRank algorithm is what Google used to rank the importance of websites.

0. Motivation

So the idea is this: imagine that I am a Google user, and I want to look something up on the search engine (say, Stephen Hawking). There will be millions of websites on the Internet that contain the keyword Stephen Hawking. How does Google know which of them is the most relevant, or which website to put at the top of the search results?

If you try this out yourself and type in any keyword, you will see that most likely, Wikipedia is going to be a top search result, which is accurate because Wikipedia is a very relevant and important page.

But the question is, how did Google determine this?

One way we could do it is by looking at how many times a keyword is repeated on a page, but that’s obviously not a good idea, because I can just create a webpage and repeat a keyword a million times, which would make me a top search result even though my page is obviously not a very important page.

I. Formulating the problem

Sergey Brin and Larry Page

So the founders of Google, Sergey Brin and Larry Page (hence the name of the algorithm “PageRank”), came up with this idea:

The importance of a webpage is determined by how many other webpages have a link referencing it, and those other pages have to be important as well, aka, highly referenced.

Notice that we cannot drop the condition that the other pages have to be important, because again, I could just create a million pages referencing one page, and that page would look super important, even though my million other pages are not important and were created just to promote that one page. So if you have a page that is referenced by other important, aka highly-referenced, webpages, that page is going to be important.

II. The Random Surfer Model

Another way of looking at the problem is to think of yourself as a web surfer who lands on a random page, clicks the hyperlinks on that page one after another, and sees where this chain of hyperlinks takes you. You have a higher chance of encountering the important webpages, aka the highly-referenced ones, because so many other pages are pointing to them.

III. Solving the problem with graph theory

Now that we have a clear understanding of what we are trying to solve, we want to formulate this as a graph problem and solve it with techniques from graph theory.

Think of the whole Internet as a graph in which all the websites are nodes. We create a directed edge from u to v if there is a hyperlink from u to v.

As an example, here is a very small model of the Internet with just 4 websites, A, B, C, D. There will be a directed edge from, say, A to B if there’s a web link on page A pointing to page B.

A basic model of the internet with just 4 websites
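If you want to follow along in code, a graph like this can be stored as a simple adjacency list. The specific links below are my own assumption for illustration (they are not necessarily the ones in the figure), and the same representation is reused in the sketches later on:

```python
# A tiny model of the Internet: each key is a page, and its value is the
# list of pages it links to (directed edges). The specific links here are
# assumed for illustration only.
web_graph = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],        # page B links to C
    "C": ["A", "D"],   # page C links to A and D
    "D": [],           # page D has no outgoing links (a "dangling" page)
}
```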

Now that we have our graph, the question is, how do we calculate the actual PageRank?

Firstly, we want the rank of a page to be a probabilistic value between 0 and 1. And obviously the output we are looking for is the rank of all webpages. If we consider the random surfer model, the rank of a webpage will be the probability of a web surfer encountering the page given a large number of trials, so we should expect the sum of ranks of all pages to be 1.

Here is a formula to calculate the rank of a page v:
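With d denoting the jump factor discussed below, L(u) the number of outgoing links of page u, D the set of pages with no outgoing links, and N the total number of pages, the update takes roughly the following form (a sketch of the standard formulation, matching the explanation that follows):

$$
PR(v) = \frac{d}{N} + (1 - d)\left(\sum_{u \to v} \frac{PR(u)}{L(u)} \;+\; \sum_{w \in D} \frac{PR(w)}{N}\right)
$$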

Okay sounds good, but what does all of this mean?

To begin with, let’s take a look at the first sum. Like I mentioned earlier, for all the websites uᵢ pointing to v, we take their ranks and divide each by how many websites that uᵢ points to. This should be easy to understand, because if a page points to a lot of other pages, the importance of each of those references should decrease.

The second sum is about redistributing the probabilities of the nodes that do not have any outgoing links. Here, the assumption is that if we land on a website without any outgoing links, we will have to quit the current page and type in the web address of any other website at random. And that’s why the sum is divided by N: we want to uniformly redistribute those probabilities to all other websites.

The factor d is called the jump factor or the teleportation factor. Its first use is that it models the real-world behavior of a user quitting the chain of hyperlinks and navigating to any page at random. Secondly, it evens out the ranks of pages with a larger number of reference links versus those with fewer. Observe that if the jump factor is closer to 1, the ranks of all webpages are going to be more uniform, whereas if it’s closer to 0, there’s going to be more disparity between the websites.
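To make this concrete, here is a one-iteration computation of the rank of page C in the small example graph sketched earlier (the links A→B, A→C, B→C, C→A, C→D are assumed, with D dangling), starting every page at 1/4 and taking d = 0.15:

$$
PR(C) = \frac{0.15}{4} + 0.85\left(\frac{1/4}{2} + \frac{1/4}{1} + \frac{1/4}{4}\right) = 0.0375 + 0.85 \cdot 0.4375 \approx 0.41
$$

The contributions from A and B make up the first sum, the 1/4 divided by 4 is D’s redistributed dangling rank from the second sum, and the 0.15/4 is the teleportation share.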

IV. Pseudocode

Here are a few lines of simple pseudocode for readers who want a hands-on experience with the algorithm. Here, PageRank(G, n) takes in a graph G and n, which specifies the number of iterations. Notice that we initialize the rank of all pages to be 1/N, where N is the total number of all pages:
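Since the pseudocode is easier to digest when you can actually run it, here is a small Python sketch of PageRank(G, n) that follows the formula above. The adjacency-list representation and the default jump factor d = 0.15 are my own assumptions:

```python
def pagerank(G, n, d=0.15):
    """PageRank of every page after n iterations of the update rule above.

    G maps each page to the list of pages it links to (adjacency list).
    d is the jump (teleportation) factor; 0.15 is an assumed default.
    """
    N = len(G)
    ranks = {page: 1.0 / N for page in G}  # initialize every rank to 1/N

    for _ in range(n):
        # Total rank currently held by "dangling" pages (no outgoing links);
        # it gets redistributed uniformly across all N pages.
        dangling = sum(ranks[u] for u, out in G.items() if not out)

        # Every page starts with the teleportation share plus its slice of
        # the dangling mass; link contributions are added next.
        new_ranks = {page: d / N + (1 - d) * dangling / N for page in G}

        # Each page u passes its rank equally along its outgoing links.
        for u, out_links in G.items():
            for v in out_links:
                new_ranks[v] += (1 - d) * ranks[u] / len(out_links)

        ranks = new_ranks

    return ranks


# Example: ranks of the 4-page graph from earlier after 20 iterations.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}, 20))
```

Note that each iteration touches every node once and every outgoing edge once, which is exactly the per-iteration cost discussed in the next section.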

V. Running time analysis

For a graph with N nodes and M directed edges, running n iterations of this algorithm gives a running time of O(n*(N + M)). This is because in every iteration, we have to traverse every node and all edges that come out of that node, which means that for each node i = 1, …, N, there are jᵢ steps to perform, where the sum of all jᵢ is M. This holds regardless of whether the graph is bi-directional or complete, because both cases are already captured by having M as the number of all directed edges.
