Google's PageRank and Beyond

Google's PageRank and Beyond: The Science of Search Engine Rankings

Amy N. Langville
Carl D. Meyer
Copyright Date: 2006
Pages: 240
    Book Description:

    Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions.

    The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research.

    The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text.

    Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.

    • Many illustrative examples and entertaining asides
    • MATLAB code
    • Accessible and informal style
    • Complete and self-contained section for mathematics review

    eISBN: 978-1-4008-3032-9
    Subjects: Mathematics, Technology

Table of Contents

  1. Front Matter
    (pp. i-iv)
  2. Table of Contents
    (pp. v-viii)
  3. Preface
    (pp. ix-x)
  4. Chapter One Introduction to Web Search Engines
    (pp. 1-14)

    Today we have museums for everything—the museum of baseball, of baseball players, of crazed fans of baseball players, museums for world wars, national battles, legal fights, and family feuds. While there’s no shortage of museums, we have yet to find a museum dedicated to this book’s field, a museum of information retrieval and its history. Of course, there are related museums, such as the Library Museum in Boras, Sweden, but none concentrating on information retrieval. Information retrieval¹ is the process of searching within a document collection for a particular information need (called a query). Although dominated by recent...

  5. Chapter Two Crawling, Indexing, and Query Processing
    (pp. 15-24)

    Spiders are the building blocks of search engines. Decisions about the design of the crawler and the capabilities of its spiders affect the design of the other modules, such as the indexing and query processing modules.

    So in this chapter, we begin our description of the basic components of a web search engine with the crawler and its spiders. We purposely exclude one component, the ranking component, since it is the focus of this book and is covered in the remaining chapters. The goals and challenges of web crawlers are introduced in section 2.1, and a simple program for crawling...

  6. Chapter Three Ranking Webpages by Popularity
    (pp. 25-30)

    Nobody wants to be picked last for teams in gym class. Likewise, nobody wants their webpage to appear last in the list of relevant pages for a search query. As a result, many grown-ups transfer their high school wishes to be the “Most Popular” to their webpages. The remainder of this book is about the popularity contests that search engines hold for webpages. Specifically, it’s about the popularity score, which is combined with the traditional content score of section 2.3 to rank retrieved pages by relevance. By 1998, the traditional content score was buckling under the Web’s massive size and the...

  7. Chapter Four The Mathematics of Google’s PageRank
    (pp. 31-46)

    The famous and colorful mathematician Paul Erdős (1913–96) talked about The Great Book, a make-believe book in which he imagined God kept the world’s most elegant and beautiful proofs. In 2002, Graham Farmelo of London’s Science Museum edited and contributed to a similar book, a book of beautiful equations. It Must Be Beautiful: Great Equations of Modern Science [73] is a collection of 11 essays about the greatest scientific equations, equations like $E = hf$ and $E = mc^2$. The contributing authors were invited to give their answers to the tough question of what makes an equation great. One author, Frank Wilczek, included...

  8. Chapter Five Parameters in the PageRank Model
    (pp. 47-56)

    My grandfather, William H. Langville, Sr., loved fiddling with projects in his basement workshop. Down there he had a production process for making his own shad darts for fishing. He poured lead into a special mold, let it cool, then applied bright paints. He manufactured those darts by the dozens, which was good because on each fishing trip my brothers, cousins, and I always lost at least three each to trees, underwater boots, poor knot-tying, and of course, really big, sharp-toothed fish. Grandpop kept meticulous fishing records of where, when, how many, and which type of fish he caught each...

  9. Chapter Six The Sensitivity of PageRank
    (pp. 57-70)

    Psychologists say that a person’s sensitivities give insights into the personality. They say sensitivity to name-calling might indicate a maligned childhood. Sensitivity to injury, a pampered, spoiled upbringing; a short fuse with the boss, anger toward parents, and so on. It seems the same is true for the PageRank model. The sensitivities of the PageRank model reveal quite a bit about the popularity scores it produces. For example, when α gets very close to 1 (its upper bound), it seems to really get PageRank’s goat. In this chapter, we explain exactly how PageRank reacts to changes like this.

    In fact, the...

  10. Chapter Seven The PageRank Problem as a Linear System
    (pp. 71-74)

    Abraham Lincoln, in his humorous, self-deprecating style, said “If I were two-faced, would I be wearing this one?” Honest Abe wasn’t, but the PageRank problem is two-faced. There’s the eigenvector face it was given by its parents, Brin and Page, at birth, and there’s the linear system face, which can be arrived at with a little cosmetic surgery in the form of algebraic manipulation. Because Brin and Page originally conceived of the PageRank problem as an eigenvector problem (find the dominant eigenvector for the Google matrix), the eigenvector face has received much more press and fanfare. However, the normalized eigenvector...
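
    The "cosmetic surgery" that turns the eigenvector face into the linear system face can be sketched in a few lines of algebra. The notation here is an assumption, since this excerpt does not define it: the Google matrix is taken to have the standard form $\mathbf{G} = \alpha \mathbf{S} + (1-\alpha)\,\mathbf{e}\mathbf{v}^T$, where $\mathbf{S}$ is a row-stochastic link matrix, $\mathbf{e}$ is the column vector of all ones, $\mathbf{v}^T$ is a probability (teleportation) vector, and $0 < \alpha < 1$.

```latex
% Assumed notation (not defined in the excerpt):
%   G = \alpha S + (1-\alpha) e v^T,  with  v^T e = 1  and  0 < \alpha < 1.
\begin{align*}
\pi^T \mathbf{G} &= \pi^T, \qquad \pi^T \mathbf{e} = 1
  && \text{(eigenvector face, normalized)} \\
\alpha\,\pi^T \mathbf{S} + (1-\alpha)\,(\pi^T \mathbf{e})\,\mathbf{v}^T &= \pi^T
  && \text{(substitute the form of } \mathbf{G}\text{)} \\
\alpha\,\pi^T \mathbf{S} + (1-\alpha)\,\mathbf{v}^T &= \pi^T
  && \text{(use } \pi^T \mathbf{e} = 1\text{)} \\
\pi^T (\mathbf{I} - \alpha \mathbf{S}) &= (1-\alpha)\,\mathbf{v}^T
  && \text{(linear system face)}
\end{align*}
```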

  11. Chapter Eight Issues in Large-Scale Implementation of PageRank
    (pp. 75-88)

    That’s a funny quote, but of course, for us the question is: if you put the right (in our case, arbitrary) figures into the PageRank machine, do you get the right answers out? Simple enough to answer. Just check that, for any input $\pi^{(0)T}$, the output satisfies $\pi^{(k)T}\mathbf{G} = \pi^{(k)T}$ up to some tolerance. However, when the problem size grows dramatically, crazy things can happen and simple questions aren’t so simple. It’s hard to even put numbers into the machine, it’s hard to make the machine start running, and it’s hard to know whether you have the right answer.

    We’ve all had...
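
    The check described above, that the output satisfies $\pi^{(k)T}\mathbf{G} = \pi^{(k)T}$ up to some tolerance for any starting vector, can be tried on a toy problem. What follows is a minimal sketch and not the book's code: the three-page link matrix, the value α = 0.85, the uniform teleportation term, and the function names `power_method` and `residual_norm` are all illustrative assumptions. It uses dense pure-Python lists, which is exactly what stops being feasible at Web scale.

```python
def power_method(G, pi0, tol=1e-10, max_iter=1000):
    """Iterate pi^T <- pi^T G until successive iterates differ by < tol (1-norm)."""
    pi = list(pi0)
    n = len(pi)
    for k in range(1, max_iter + 1):
        # Row vector times matrix: entry j is sum_i pi[i] * G[i][j].
        nxt = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, pi))
        pi = nxt
        if change < tol:
            return pi, k
    return pi, max_iter


def residual_norm(G, pi):
    """Return the 1-norm of pi^T G - pi^T, the 'right answer' check."""
    n = len(pi)
    out = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
    return sum(abs(a - b) for a, b in zip(out, pi))


# Hypothetical 3-page web: each row of S sums to 1 (assumed, for illustration).
alpha = 0.85
S = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
n = len(S)

# Assumed Google matrix: alpha * S plus uniform teleportation.
G = [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

# Two different (arbitrary) starting vectors should lead to the same answer.
pi_a, iters_a = power_method(G, [1.0, 0.0, 0.0])
pi_b, iters_b = power_method(G, [1 / 3, 1 / 3, 1 / 3])

print("residual from start a:", residual_norm(G, pi_a))
print("residual from start b:", residual_norm(G, pi_b))
```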

  12. Chapter Nine Accelerating the Computation of PageRank
    (pp. 89-98)

    People have a natural fascination with speed. Look around; articles abound on NASCAR and the world’s fastest couple—Marion Jones and Tim Montgomery—speedboat racing and speed dating, fast food and the Concorde jet. So the interest in speeding up the computation of PageRank seems natural, but actually it’s essential because the PageRank computation by the standard power method takes days to converge. And the Web is growing rapidly, so days could turn into weeks if new methods aren’t discovered.

    Because the classical power method is known for its slow convergence, researchers immediately looked to other solution methods. However, the...

  13. Chapter Ten Updating the PageRank Vector
    (pp. 99-114)

    Every month a famous dance takes place on the Web. While there have been famous dances throughout modern history—the Macarena, the Mambo #5, the Chicken Dance—this dance is the first to have a profound impact on the search community. Every month search engine optimizers (SEOs) watch the Google Dance carefully, anxious to see if any steps have changed. Sometimes the modifications are easy to roll with, other times they cause a stir.

    The Google Dance is the nickname given to Google’s monthly updating of its rankings. We begin with some statistics that emphasize the need for updating rankings frequently....

  14. Chapter Eleven The HITS Method for Ranking Webpages
    (pp. 115-130)

    If you’re a sports fan, you’ve seen those “— is Life” t-shirts, where the blank is filled in by a sport like football, soccer, cheerleading, fishing, etc. After reading the first ten chapters of this book, you might be ready to declare “Google is Life.” But your mom probably told you long ago that “there’s more to life than sports.” And there’s more to search than Google. In fact, there’s Teoma, and Alexa, and A9, to name a few. The next few chapters are devoted to search beyond Google. This chapter focuses specifically on one algorithm, HITS, the algorithm that...

  15. Chapter Twelve Other Link Methods for Ranking Webpages
    (pp. 131-138)

    The previous chapters dealt with the major ranking algorithms of PageRank and HITS in depth, but there are other minor players in the ranking game. This chapter provides a brief introduction to the ranking alternatives.

    In 1998, one could rank the popularity of webpages using either the PageRank or the HITS algorithm. In 2000, SALSA [114] sashayed into the game. SALSA, an acronym for Stochastic Approach to Link Structure Analysis, was developed by Ronny Lempel and Shlomo Moran and incorporated ideas from both HITS and PageRank to create yet another ranking of webpages. Like HITS, SALSA creates both hub and...

  16. Chapter Thirteen The Future of Web Information Retrieval
    (pp. 139-148)

    Web search is a young research field with great room for growth. In this chapter, we survey possible directions for future research, pausing along the way for some storytelling.

    The story, the Ghosts of Search, might not be too outlandish. In fact, it was inspired by a recent weblog posting. On May 24, 2003, Jeremy Zawodny declared PageRank dead. He claimed the algorithm was no longer useful because bloggers and SEOs had learned too much about it and had, in effect, changed the nature of the Web. Since PageRank is based on an optimistic assumption that all links are conceived...

  17. Chapter Fourteen Resources for Web Information Retrieval
    (pp. 149-152)

    If you’re a student or a researcher new to the field, you’ll find these resources helpful for getting started. The datasets are small and manageable, the code simple, and the algorithms run quickly.

    There are several small web graphs that are available for download. The table below provides details.

    Most of these webpages also contain other graphs that are similar in size and source. For example, Panayiotis Tsaparas hosts a nice webpage (website 4) that contains more graphs (and some C code).

    Website 1:˜tsap/experiments/datasets /index.html

    Website 2:˜tsap/experiments/datasets /index.html

    Website 3:˜tsap/experiments/datasets /index.html

    Website 4:˜tsap/experiments/datasets /index.html

    On page 17, we...

  18. Chapter Fifteen The Mathematics Guide
    (pp. 153-200)

    Appreciating the subtleties of PageRank, HITS, and other ranking schemes requires knowledge of some mathematical concepts. In particular, it’s necessary to understand some aspects of linear algebra, discrete Markov chains, and graph theory. Rather than presenting a comprehensive survey of these areas, our purpose here is to touch on only the most relevant topics that arise in the mathematical analysis of Web search concepts. Technical proofs are generally omitted.

    The common ground is linear algebra, so this is where we start. The reader who wants more detail or simply wants to review elementary linear algebra to an extent greater than...

  19. Chapter Sixteen Glossary
    (pp. 201-206)
  20. Bibliography
    (pp. 207-218)
  21. Index
    (pp. 219-224)