InsideGoogle

part of the Blog News Channel

The Original PageRank Model

Paul Montgomery linked to this slide of the original PageRank equation (which has likely been altered over the years to something far more complicated). I was wondering if anyone wanted to explain it to me, as simplistic as that request seems.

I don’t know what it is, but I’m feeling a lot of brain inferiority towards Google lately. Perhaps its time to look into developing some “Googley” skills. It can be so hard sometimes to determine if you lack Google skills, or Google brains, and I think most people hope it’s only the former.

Anyway, there are some excellent websites and papers out there that do justice of explaining the theory behind PageRank, but I’m looking for one that breaks down the equation itself, not the theory, and explains each portion.

November 28th, 2005 Posted by Nathan Weinberg | General | 13 comments



Hosting sponsored by GoDaddy

13 Comments »

  1. I suppose the “d” stands for the “damn I wish I’d thought of that” factor, which is basically saying how original a page is — the more original the better. Hey, I’m no math expert (as if you couldn’t tell). :)

    Really I think those are variables which need the actual full paper to be usable (i.e. to be explained), but I could be wrong. I think it just means “the value of a page is calculated by looking at incoming links of all other pages and their value itself, in recursion of up to some level.”

    In case you don’t get an answer here maybe Google Answers is worth a shot… I know they have some math experts over there.
    http://answers.google.com

    Comment by Philipp Lenssen | November 28, 2005

  2. It’s quite odd a formula, it’s not factorized, variable ‘d’ appears twice while its only purpose is to give weight the sum. Other than that, since other members of the equations can be just about anything, you can infer anything you want from that : it’s crap, it’s wonderful, it’s good enough.

    L(i,j) means that there is a matrix the data is projected onto.

    Iterative method means f(n) = f(n-1) + g(n-1) which means it’s possible to compute next step by storing just the previous step instead of having to store the entire history. Iterative methods is only a matter of computation performance. It says nothing to page ranking stability as a measure.

    Comment by anon | November 28, 2005

  3. There’s a good description of the original Google implementation and PageRank equation up at:

    http://www-db.stanford.edu/~backrub/google.html

    The jist is that the weight of a page is the sum of the weights of page i divided by the number of links on page i, for all i that link to the page in question. The d parameter is a damping factor, or informally, the probability that a user would jump to another random page (in the random-surfer web model).

    Comment by Tim | November 28, 2005

  4. It’s worth noting that yes, PageRank is a recursive definition, which is why they point out that it’s calculated by performing this computation iteratively. You’d set the initial weight of all pages to some arbitrary number, usually 1, and then in each iteration the weights get percolated around based on the web’s link structure. Neat!

    Comment by Tim | November 28, 2005

  5. Roughly speaking, we have the following :

    W(j) is the pagerank of page j

    n(i) is the number of links on page i

    l(i,j) is equal to 1 if page i has a link to page j, and 0 otherwise

    d is a parameter (strictly between 0 and 1)

    and so the equation gives the pagerank of page i in terms of a (weighted) sum of the pageranks of all pages that link to page i ; this means that we have to solve a system of (linear) equations in order to find the actual numerical values of the pageranks.

    Comment by Francois | November 28, 2005

  6. http://www.webworkshop.net/pagerank.html

    Comment by RichB | November 28, 2005

  7. […] Gracias a varias páginas hoy hemos podido saber la documentación original que Page creó para el Pagerank que más adelante se aplicaría a Google… ¿Qué queda de él? Poca cosa viendo la teoría… […]

    Pingback by » Así era el Pagerank original | November 28, 2005

  8. The maths behind the equasion itself is really pretty simple (however there are some tricky parts).

    For background, here are some things you probably need to know first:

    * Each page i has a rank W(i), and a number of links to other pages n(i).
    * In a really basic version of pagerank. W(j) is equal to the sum of all the pageranks of the pages that link to it, divided by the number of links on those pages.
    * The parts after the big E (sigma) is called summation, or ,a series notation. In programming terms it is like a for loop that calculates a result each iteration and adds that result to a total. The mathematic value of this part is simple that total value.

    Wikipedia’s article on PageRank gives a pretty good idea of it.

    hope this helps

    Comment by sidd | November 28, 2005

  9. argh, can somone fix my link? :)

    Comment by sidd | November 28, 2005

  10. Sidd, your link is fixed.

    I think my biggest problem is understanding the portion with the sigma. I get the damping factor, and the summation portion is the basic theory (PageRank divided by number of links), so as far as I can determine, without the sigma, the formula is:

    damping factor * originating page PageRank/originating page # of links = PageRank of target page

    So, what does the middle portion do? Unless part of what I just said is the middle portion. Which is why I’m asking. As I said, the theory I got a long time ago, but I’m trying to understand the mathematical guts.

    One question: If PageRank is divided equally as link juice for outgoing PageRank, isn’t that efficient but inaccurate? Not all links on a page are created equal, and conceivably a crawler could rank outgoing links on a scale from most important (a link that is the purpose of a blog post) to least important (navigational links, advertisements) and give higher outgoing PageRank to the more crucial links. I’m relatively sure Google’s current model does exactly that, and thus required a slightly different equation (backed up by a highly complex ranking model). If it doesn’t do that, hwo significantly would you say that hurts the accuracy of the PageRank model?

    Comment by Nathan Weinberg | November 29, 2005

  11. Nathan, I think it’s not trivial for Google to know which links on a page are “important.” I think they would have to apply a rendering mechanism and then analyze white-space, center positions, and so on. One easy but anything but flawless approach would be to assume “links which come first in the HTML are more valuable” — however you can just think of typical Navigation-Left-Content-Right table layouts where the exact opposite is true. I’m using a CSS layout so indeed my important links come first, and I’m also using “nofollow” to de-value some of my own navigation, but that’s just me and Google can’t rely on such specifics of “correctly implemented” sites… they must analyze the mass of all sites.

    Comment by Philipp Lenssen | November 29, 2005

  12. nathan: The sigma part of the equasion IS the summation part. Everything /after/ that sigma is summed together and scaled by d.

    The declarations around the sigma symbol define the sequence of values for i which should be used to make the sum. So, given N webpages, it starts at the first one (i=1) and continues for every page excluding ones which are the same page as the one we are working out (links to yourself dont give you a better rank), hense (i != j). When it has performed the following equasion on every site i up to the last one N, it adds them all together and scales it by d.

    Looking around, i cant seem to find a definite, or even most likely answer for what ‘d’ is in reality. As Tim said, it is a measure of the likelyhood of a user randomly clicking links or going to new URL’s. This is apparantly used to solve the problem of webpages that do not have external links on them, thus keeping the graph “fully connected”. So basically, the effect of it is to scale the rank down by some value, but precisely how that value is picked I cannot tell you.

    Btw, it is worth keeping in mind that, this equasion is probably a simplified version of the original, which has probably changed many times since. So while working it out is a useful exercize, it doesnt really reflect how google works today.

    Regarding different links having more real life importance than others, there is always a chance that google have implemented something to address this, but i think it would pose less of a problem than you might think. Remember that page rank works on the individual pages within a website, so while a site’s front page may have a high rank, individual articles will have lower ones.
    Additionally, there are big links and small links, but the sheer scale of the web would probably do enough to resolve that sort of anomoly.
    Eg. someone with a large pagerank might have a tiny link to some unknown website on his main page, but this isnt enough to affect the fact that noone else links to it.

    Also, keep in mind that this equasion simply works out the pagerank, and there are undoubtably many other methods used to to determine the relevance of each site to each particular query.

    Comment by sidd | November 29, 2005

  13. Google PageRank carefully explained and what you can do with it - written by top SEO experts.

    http://www.pagerank-prediction.com/

    Comment by Pagerank prediction | January 24, 2006

Leave a comment