
NoFollow

Introduction

Where did links with 'rel="NoFollow"' come from?

There are over one trillion (1,000,000,000,000) unique URLs on the World Wide Web, and their number is growing at an accelerated rate of several billion pages per day.[1] Even though Google has hundreds of thousands of servers for indexing the web, distributed over a few dozen data centers worth billions of dollars,[2] its resources are limited.[3] In addition, Google needs to save space for storing personal information about the behavior of its users across the many services that it offers: Google Search, Gmail, Image, Scholar, Books, News, Base, Translate, Maps, Picasa, Blogger, YouTube, Orkut, etc.,[4] so that it can provide users with better service, better predict which ads users are going to click on, and for other disclosed and undisclosed reasons.[5][6][7]

Google's founder Larry Page is a great visionary and a kind man who, unlike Tesla,[8] understands that great ideas are not enough and that one must accept present economic reality.[9][10] He and Sergey managed to hire an army of world-class researchers and engineers who work very long hours trying to provide you and me with the best possible online experience[11] in finding the information we need, no matter how obscure and in what language,[12] while at the same time making as much money as they can from those few rare clicks we place on Google Ads. From the backbone PageRank, word formatting, proximity, and a few other algorithms, which handled a few dozen million web pages well,[13] Google's algorithms have become amazingly more complex and include specialized web, news, blog, book, code, and other searches, monitoring of individual page changes, detection of duplicate content, semantic analysis of content, clustering techniques for machine learning, monitoring of changes in linking patterns, user-behavior-dependent ranking, traffic analysis, historical data, word stemming, context-based queries, latent semantic indexing, social network analysis, relying on informational entropy (among other things) to send pages to the supplemental index,[14] etc., etc., etc. Behind each of these 'buzz words' there are dozens of papers and patents.[15] Thousands of person-years of engineering. And then what happened?
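For readers who want to see the backbone algorithm in action, here is a minimal power-iteration sketch of PageRank on a toy graph of four pages. The damping factor of 0.85 and the tiny link graph are my own illustrative assumptions, not Google's production setup.

```python
# Minimal PageRank power-iteration sketch on a toy link graph.
# The damping factor (0.85) and the four-page graph are illustrative
# assumptions only -- not Google's actual implementation.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share     # each outlink passes an equal share
        rank = new_rank
    return rank

toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
print(pagerank(toy_web))   # page C, with the most inbound links, accumulates the highest rank
```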

The Web became overwhelming. It simply is not possible to index the whole web on a frequent basis. Nor is it possible to search all the relevant results in the index and return the optimal results to the user in less than a second. Engineering is a tradeoff. You cannot always provide impatient internet users with perfect results.

"... We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones. (...) To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation. ..." (Google Founders, 1998) [16]

-- [underline emphasis is mine; keep it in mind when reading the next page]
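To make the quoted tradeoff concrete, here is a small sketch of that two-tier lookup with an early-termination cap. Only the 40,000 cap comes from the paper; the plain-list data layout and function names are my own assumptions, not the actual barrel format.

```python
# Sketch of the two-tier index with an early-termination cap, as described
# in the quoted passage. The data layout (plain Python lists of doc IDs)
# is an illustrative assumption.

MATCH_CAP = 40_000   # cap from the 1998 paper; stop collecting once reached

def search(query_term, short_barrels, full_barrels):
    """short_barrels: term -> docs with title/anchor hits (small, high quality).
       full_barrels:  term -> all docs containing the term (large)."""
    matches = []
    seen = set()
    # 1. Check the small, high-quality barrels first.
    for doc in short_barrels.get(query_term, []):
        matches.append(doc)
        seen.add(doc)
        if len(matches) >= MATCH_CAP:
            return matches            # possibly sub-optimal results, but fast
    # 2. Only fall back to the large barrels if there were not enough matches.
    for doc in full_barrels.get(query_term, []):
        if doc in seen:
            continue                  # skip documents already found above
        matches.append(doc)
        if len(matches) >= MATCH_CAP:
            break                     # early termination, as in the quote
    return matches
```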

In addition, due to the overwhelming number of links (over a trillion, as mentioned above), many of which are duplicates or spam, and many of which are not accessed due to informational entropy (as mentioned above), Google had to decide not to index a part of the World Wide Web, AND not to always access yet another part of the Web in the 'supplemental' index (link provided above). (*read the note below) However, instead of hiring someone competent to work on selecting the best pages and fighting link spam,[17] Google gave this very responsible task of selecting which part of the Web should be ignored to the wrong person.

An infamous Google employee, Matt Cutts,[18] due to his short-sightedness in dealing with web spam, decides to mess with something as beautiful as the World Wide Web, which was created by a person not unlike Larry Page, another great visionary, Tim Berners-Lee.[19][20]

"The intention in the design of the web was that normal links should simply be references, with no implied meaning."[21]

Either someone higher up at Google suggested this as a way to cut down on the zillions of links in the PageRank matrix, OR, being too lazy to "invent" reasonable relation attribute values such as, for example, advertisement, comment, editorial, navigation, user-submitted, etc. (which Google's algorithms could treat in various undisclosed ways), Matt Cutts simply decides to impose a universal "nofollow" relation on the web by persuading the major web players to adopt it.[22] If 'nofollow' were used only on spam or user-submitted links, there would be no problem, but precisely because of the 'universality' of its definition, this value is used on all kinds of links.[23] Continue reading about the consequences at NoFollow Reciprocity.
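To see what that 'universality' costs, here is a small sketch of how a ranking pipeline could treat links under the more descriptive rel values suggested above, versus a single all-or-nothing 'nofollow'. The value names other than 'nofollow' are the hypothetical alternatives listed above, and the weights are numbers I made up purely for illustration.

```python
# Sketch contrasting a single all-or-nothing "nofollow" with the more
# descriptive rel values suggested above. The weights are made-up
# illustrative numbers, not anything Google has published.

REL_WEIGHTS = {
    None:             1.0,   # plain editorial link, full credit
    "editorial":      1.0,
    "navigation":     0.5,
    "comment":        0.3,   # user-submitted, partially trusted
    "user-submitted": 0.3,
    "advertisement":  0.0,   # paid link, no ranking credit
    "nofollow":       0.0,   # the universal value: everything lumped together
}

def link_weight(rel_value):
    """Return how much ranking credit a link passes, given its rel value."""
    return REL_WEIGHTS.get(rel_value, 1.0)

# With granular values, a blog comment still passes *some* credit:
print(link_weight("comment"))        # 0.3
# With the universal "nofollow", the same link passes none at all:
print(link_weight("nofollow"))       # 0.0
```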

________________________
*Note: My opinion is that Google doesn't have storage problems, as back in 1998 they could download, index, and sort a few million pages a day ("...using four machines, the whole process of sorting takes about 24 hours.") and certainly today, ten years later, they have at least a thousand times that capacity, which means that Google CAN index EVERY new page on the web and refresh the index of the more important pages that already exist on the web. In addition, Google does not delete things, which further indicates they don't have storage problems. Moreover, storage and analysis of web and user-behavior data across all Google services takes lots of resources, for example one petabyte (1,000,000,000,000,000 bytes) every 72 minutes! So why then a supplemental index? Why 'nofollow'? Because searching for relevant results through the indexed pages takes time, and Google's primary focus is speed and quality (and lately also money); therefore, only a part of the index can be searched through in a limited time. Another point: while the Web was smaller, Google had to find the best results to make users happy; today the Web is huge and there are many good results, so the focus now is on providing 'good enough', not the best. It is an engineering trade-off.
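A quick back-of-envelope check of the capacity argument in this note: the 1998 rate and the thousand-fold multiplier come from the note itself, the "several billion new pages per day" figure from the top of the article, and the exact numbers below are rough assumptions.

```python
# Back-of-envelope check of the note's capacity argument.
# Figures: ~a few million pages/day indexed in 1998 (from the quoted paper),
# an assumed ~1000x capacity increase by 2008, and the "several billion new
# pages per day" growth figure cited at the top of the article.

pages_per_day_1998 = 3e6          # "a few million pages a day"
capacity_multiplier = 1000        # "at least a thousand times that capacity"
new_pages_per_day_2008 = 3e9      # "several billion pages per day"

estimated_capacity_2008 = pages_per_day_1998 * capacity_multiplier
print(estimated_capacity_2008 >= new_pages_per_day_2008)   # True: indexing can keep up
```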

References:

  1. Google Blog: We knew the web was big...
  2. Royal Pingdom: Map of all Google data center locations
  3. SEO Theory: Google's supplemental index is still biting your ass
  4. Google: More Google Products
  5. www.box.net: Analyst day slides
  6. Electronic Frontier Foundation: Google Cuts IP Log Retention to Nine Months
  7. Thread Watch: Ex-CIA Agent States Google is 'In Bed With' the CIA
  8. YouTube: Nikola Tesla: The Missing Secrets
  9. Pimm - Partial immortalization: Google's Larry Page at the AAAS meeting: entrepreneurship and unlocking in science
  10. YouTube: Larry Page speaks at the AAAS
  11. Google Research: Papers Written by Googlers
  12. Google Mission: Company Overview
  13. Stanford InfoLab: The Anatomy of a Large-Scale Hypertextual Web Search Engine
  14. UIUC/Qiaozhu Mei: Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff?
  15. Google Patent Search: Google's patents
  16. See 13.
  17. Google Scholar Search: Link Spam
  18. Matt Cutts: Matt Cutts
  19. World Wide Web Consortium: Tim Berners-Lee
  20. Times Online: Google could be superseded, says web inventor
  21. World Wide Web Consortium: The Implications of Links -- Axioms of Web architecture
  22. Internet News: Search Leaders, Bloggers Band to Fight Comment Spam
  23. Andy Beard: Nofollow Killed Google Social Graph API 3 Years Ago

© 2008, Nofollow by Lazar

