And then there were four
Google is one of about four search engines that matter. There are many more than four engines, but only about four have the technology to crawl much of the web on a regular basis. As of July 2003, Yahoo owned Overture, Alltheweb, AltaVista, and Inktomi, and finally dumped Google in February 2004. Everything needed to turn Yahoo into a major search engine was now under Yahoo's roof. It is still possible that Yahoo will shoot themselves in the foot with all of this firepower -- monetizing everything appears to be high on their agenda. But so far, after only a year, Yahoo has shown that their main index search results are on a par with Google's. This is true despite the fact that Yahoo has infiltrated some pay-per-click links into the main index. One reason for Yahoo's success is that Google's main index, though free from paid results, has declined considerably since early 2003. Amazingly, there is on average only a 20 percent overlap between Yahoo's first 100 results and Google's first 100 results for the same search -- and still, Yahoo is just as good as Google. These days there is so little room at the top of the search results heap that any combination of algorithms will produce acceptable results. The main difference now is in the depth of the crawl.
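To make the 20 percent figure concrete, here is a minimal sketch of one way such an overlap could be measured. How the figure above was actually computed is not stated, so the definition used here (shared URLs divided by the size of the shorter list) is an assumption, and the function name and example URLs are hypothetical.

```python
def overlap_fraction(results_a, results_b, depth=100):
    """Fraction of URLs shared by the first `depth` results of two engines.

    Assumed definition: shared URLs divided by the size of the smaller
    of the two truncated result lists.
    """
    top_a = set(results_a[:depth])
    top_b = set(results_b[:depth])
    if not top_a or not top_b:
        return 0.0
    return len(top_a & top_b) / min(len(top_a), len(top_b))


# Synthetic data only: two made-up result lists that share 20 URLs.
engine_a = [f"http://example.com/page{i}" for i in range(100)]
engine_b = [f"http://example.com/page{i}" for i in range(80, 180)]

overlap = overlap_fraction(engine_a, engine_b)
```

With these synthetic lists the two engines agree on only 20 of 100 URLs, yet nothing in that number says which list is better; it only says the lists differ.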
Microsoft recently developed their own engine because they found themselves squeezed between the advertising engine of Overture and the search engine Inktomi -- both of which became Yahoo property. In 2003 Microsoft began experimenting with their own crawler. Their new engine was launched in early 2005. If Microsoft puts their greed on a back burner for a few years, by doing deep crawls and presenting a clean interface, they could do to Google what they did to Netscape. There is no "secret sauce" at Google -- we now believe it was all hype from the very beginning. (To the extent that there ever was a secret sauce, the recipe is now known by countless ecommerce spammers, which makes it a liability rather than an asset.) Thousands of engineers in hundreds of companies know how to design search engines. The only real questions are whether you can commit the resources for a deep, consistent crawl of the web, and how aggressively you want to use your search engine to make money.
That gives us Google, Yahoo, and Microsoft. The last one worth watching is Teoma/AskJeeves. Their search technology is good, and they seem serious about expanding their crawl. It remains to be seen how deeply and consistently they will be able to crawl websites with thousands of pages.
Google is easily top dog. They provide about 75 percent of the external referrals for most websites. There is no point in putting up a website apart from Google. It's do or die with Google. If we're all very lucky, one of the other three will soon offer some serious competition. If we're not lucky, we will be uploading our websites to Google's servers by then, much like the bloggers do at blogger.com (which was bought by Google in 2003). It would mean the end of the web as we know it.
It is worthwhile to understand the pressures that the average, independent webmaster is under. And given that Google is so dominant, it's important to understand the pressures that are being brought to bear on Google, Inc. It does not take too much imagination to recognize that there's a struggle going on for the soul of the web, and the focal point of this struggle is Google itself.
At one level, it's a struggle for advertising revenue. The pundits look at only this level, and are unanimous that the only advertising model on the web with any sort of future is one where little ads appear after being triggered by keyword searches, or by the non-ad content of a web page. For example, a search for Google Watch may show some ads on the right side of the screen for wrist watches. While the technique doesn't work for this example, it often serves its purpose. There is only so much pixeled real estate that the average user can be expected to survey for a given search. Today up to half of each screen is dedicated to paid ads on Google, as compared to the ad-free original Google. Everyone wants a piece of this new wave in web advertising, and Google is making a lot of money.
Unfortunately, early evidence suggests that Yahoo is less interested in pure search algorithms than in acquiring market share in a pay-for-placement and/or pay-for-inclusion revenue stream. The same may be true for Microsoft. Even Google, dazzled by the sudden income from advertising, must be wondering why they go to all that trouble and expense to crawl the noncommercial sector. Those public-sector sites, such as the org, edu and gov domains, do not provide direct income, even though the web would be unattractive without them. All the excitement over a revived online ad market, pushed by pundits hoping for another dot-com gold rush, is beginning to look like the days when AltaVista decided that portals were the Next Big Thing. That notion caused AltaVista to lose interest in improving their crawling and searching -- which is how Google succeeded in the first place.
There has been almost no interest in establishing search engines that specialize in public-sector websites. Where is the Library of Congress? Where are the millions of dollars doled out by the Ford Foundation? How about the United Nations? Why can't some enlightened European entity pick up the slack? Everyone is asleep, while the Internet is getting spammed to death.
At another level, it's a struggle over who will have the predominant influence over the massive amounts of user data that Google collects. In the past, discussions about privacy issues and the web have been about consumer protection. That continues to be of interest, but since 9/11 there is a new threat to privacy -- the federal government. Google has not shown any inclination to declare for the rights of its users across the globe, as opposed to the rights of the spies in Washington who would love to have access to Google's user data.
Much of the struggle at this new level is unarticulated. For one thing, the spies in Washington don't talk about it. Congress has given them new powers, without debating the issues. Google, Inc. itself never comments about things that matter. The struggle recognized by Google Watch has to do with the clash of real forces, but right now all we can say is that this struggle could potentially manifest itself in Google's boardroom.
The privacy struggle, which includes both the old issue of consumer protection and this new issue of government surveillance, means that the question of how Google treats the data it collects from users becomes critical. Given that Google is so central to the web, whatever attitude it takes toward privacy has massive implications for the rest of the web in general, and for other search engines in particular.
Call it class warfare, if you like. Because that brings up the other major gripe that Google Watch has with Google. That's the PageRank problem -- the fact that Google's primary ranking algorithm has less to do with the quality of web pages than it has to do with the "power popularity" of web pages. Their approach to ranking is anti-democratic, in that already-powerful pages are mathematically granted extra power to anoint other pages as powerful.
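The "power popularity" effect can be illustrated with a minimal sketch of the textbook PageRank iteration. This is not Google's production code, which is not public. The damping value of 0.85, the handling of pages with no outbound links, and the tiny example graph with its page names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    `damping` and the dangling-page rule are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outbound links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "hub" is linked by four pages, "e" by none.
# "anointed" and "ordinary" each have exactly one inbound link.
graph = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"],
    "hub": ["anointed"],
    "e": ["ordinary"],
    "anointed": [], "ordinary": [],
}
ranks = pagerank(graph)
```

In this toy graph "anointed" and "ordinary" have the same number of inbound links, but "anointed" ends up with the higher score because its single link comes from the already well-linked "hub". Nothing in the computation looks at the content of either page.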
It's not that we believe Google is evil. What we believe is that Google, Inc. is at a fork in the road, and they have some big decisions to make. This Google Watch site is trying to articulate and publicize the situation at Google, and encourage more scrutiny of their operations. By doing this, we hope to play a small part in maintaining the web as an information tool that is more useful for the masses than it is for the elites.
That's why we and over 500 others nominated Google for a Big Brother award in 2003. The nine points we raised in connection with this nomination necessarily focused on privacy issues:
1. Google's immortal cookie:
Google was the first search engine to use a cookie that expires in 2038. This was at a time when federal websites were prohibited from using persistent cookies altogether. Now it's years later, and immortal cookies are commonplace among search engines; Google set the standard because no one bothered to challenge them. This cookie places a unique ID number on your hard disk. Anytime you land on a Google page, you get a Google cookie if you don't already have one. If you have one, they read and record your unique ID number.
2. Google records everything they can:
For all searches they record the cookie ID, your Internet IP address, the time and date, your search terms, and your browser configuration. Increasingly, Google is customizing results based on your IP number. This is referred to in the industry as "IP delivery based on geolocation."
3. Google retains all data indefinitely:
Google has no data retention policies. There is evidence that they are able to easily access all the user information they collect and save.
4. Google won't say why they need this data:
Inquiries to Google about their privacy policies are ignored. When the New York Times (2002-11-28) asked Sergey Brin whether Google ever gets subpoenaed for this information, he had no comment.
5. Google hires spooks:
Matt Cutts, a key Google engineer, used to work for the National Security Agency. Google wants to hire more people with security clearances, so that they can peddle their corporate assets to the spooks in Washington.
6. Google's toolbar is spyware:
With the advanced features enabled, Google's free toolbar for Explorer phones home with every page you surf, and yes, it reads your cookie too. Their privacy policy confesses this, but that's only because Alexa lost a class-action lawsuit when their toolbar did the same thing, and their privacy policy failed to explain this. Worse yet, Google's toolbar updates to new versions quietly, and without asking. This means that if you have the toolbar installed, Google essentially has complete access to your hard disk every time you connect to Google (which is many times a day). Most software vendors, and even Microsoft, ask if you'd like an updated version. But not Google. Any software that updates automatically presents a massive security risk.
7. Google's cache copy is illegal:
Judging from Ninth Circuit precedent on the application of U.S. copyright laws to the Internet, Google's cache copy appears to be illegal. The only way a webmaster can avoid having his site cached on Google is to put a "noarchive" meta in the header of every page on his site. Surfers like the cache, but webmasters don't. Many webmasters have deleted questionable material from their sites, only to discover later that the problem pages live merrily on in Google's cache. The cache copy should be "opt-in" for webmasters, not "opt-out."
8. Google is not your friend:
By now Google enjoys a 75 percent monopoly for all external referrals to most websites. Webmasters cannot avoid seeking Google's approval these days, assuming they want to increase traffic to their site. If they try to take advantage of some of the known weaknesses in Google's semi-secret algorithms, they may find themselves penalized by Google, and their traffic disappears. There are no detailed, published standards issued by Google, and there is no appeal process for penalized sites. Google is completely unaccountable. Most of the time Google doesn't even answer email from webmasters.
9. Google is a privacy time bomb:
With 200 million searches per day, most from outside the U.S., Google amounts to a privacy disaster waiting to happen. Those newly-commissioned data-mining bureaucrats in Washington can only dream about the sort of slick efficiency that Google has already achieved.
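Point 7 above says a webmaster must put a "noarchive" meta in the header of every page to stay out of the cache. For webmasters who want to audit their own pages, here is a minimal sketch that checks whether a page carries that opt-out. It assumes the usual form of the tag, a meta element named "robots" (or "googlebot") whose content includes "noarchive"; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class NoArchiveDetector(HTMLParser):
    """Sets `noarchive` to True if a robots meta tag opts out of caching."""

    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        # The content attribute may list several comma-separated directives.
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if name in ("robots", "googlebot") and "noarchive" in directives:
            self.noarchive = True


def opts_out_of_cache(html):
    """Return True if the page contains a noarchive robots meta tag."""
    detector = NoArchiveDetector()
    detector.feed(html)
    return detector.noarchive


protected = '<html><head><meta name="robots" content="noindex, NOARCHIVE"></head></html>'
unprotected = "<html><head><title>No opt-out here</title></head></html>"
```

Running `opts_out_of_cache` over every page of a site shows how much work the opt-out model puts on the webmaster: any page that lacks the tag is fair game for the cache.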