Internet commentary
Keywords: Search engine, Google, GoogleWhacking, Image search
Abstract: The Google search facility is described, including its means of searching catalogues and locating images, and the pastime of GoogleWhacking.
The Google search engine
Exploration of the Internet is greatly aided by the use of search engines, and a number of them have been mentioned in past Commentaries. The Google search engine has aroused much interest in the last few years and seems set to become the search engine of usual first choice.
It is necessary to say "usual" since another engine may give more comprehensive coverage in some particular subject area, and the usefulness depends on how well the engine suits the methods of an individual user. It is mentioned, for example, in an Internet Tourbus (13:11:01) that another engine called AllTheWeb has been claimed to exceed by a factor of six the coverage of current news offered by Google. Google has at least the distinction of being accepted as the standard of comparison. It is claimed that AllTheWeb indexes news stories from over 3,000 online sources.
The addresses for these engines are easily remembered, as www.alltheweb.com and www.google.com.
Strictly, as pointed out in another Tourbus (3:4:02), Google embodies a directory as well as a search engine. That is to say, it gives access to a human-compiled list of websites as well as to the results of automatic web-crawling, where the human compilers are specialists in their own areas. The powerful search facility offered by Yahoo, at <www.yahoo.com>, is purely a directory, whereas Google combines the two.
The Google facility is impressively powerful simply as a people-finder. I was pleased to find myself listed, when I inserted the words "Alex Andrew" (not in quotes). The findings were a reference to my biographical note on the website of the UK Cybernetics Society, and one to my own home page, though the latter with an out-of-date (CompuServe) address.
The opportunity is offered of having an entry corrected. The Google directory is also a Netscape Open Directory Project, and this can be accessed to request alterations. The correction needed here, however, was to the search engine rather than the directory and an opportunity is offered within the Google site to indicate new URLs, though with no guarantee that they will be added. I used this facility to indicate my new home page address and about an hour later found it to be listed, along with the old one. In the response to a query the old version came much earlier in the listing, which was slightly unfortunate. The situation will probably alter as other sites are found to contain links to the new version of mine.
Statistics
The facility abounds with impressive statistics. In inserting my name I did not use quotes, which would have narrowed the search by having the name treated as a phrase rather than as a pair of key words. I also clicked the general "search" button and not the "I'm Feeling Lucky" one, which displays only the first match found. The effect was to display the two relevant items already mentioned, as well as references to two other people who have "Alex Andrew" as part of longer names, and also another Alex Andrew who is a dancer. However, an impressive statistic is that the ten results that were displayed were the most relevant of about 1,050,000 matches, found in 0.18 s.
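The difference between treating a query as a phrase and as a set of key words can be illustrated with a minimal sketch. It is not a description of how Google itself matches queries; the function names and the three sample texts are hypothetical, chosen only to show why the unquoted form returns more matches.

```python
import re


def tokens(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_keywords(query, text):
    """True if every query word occurs somewhere in the text, in any order."""
    words = set(tokens(text))
    return all(q in words for q in tokens(query))


def matches_phrase(query, text):
    """True if the query words occur in the text adjacently and in order."""
    q = tokens(query)
    t = tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


# Hypothetical page texts, for illustration only.
sample_pages = [
    "Alex Andrew is a dancer",
    "Andrew met Alex at the congress",
    "Biographical note: Alex M. Andrew",
]

keyword_hits = [p for p in sample_pages if matches_keywords("Alex Andrew", p)]
phrase_hits = [p for p in sample_pages if matches_phrase("Alex Andrew", p)]
```

Here all three texts satisfy the key-word search, but only the first satisfies the phrase search, since the other two have the words in a different order or separated by an initial.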
At the time of writing (April 2002) the home page of the Google system makes the claim that its web search extends to 2,073,418,204 web pages. It also hosts a number of discussion groups, and has archives of these over 20 years containing over 700 million messages. The number of images that can be searched (see below) is given as 330,000,000. The amount of stored information is unimaginable.
Cache
The search found my home page with CompuServe although I have allowed this to lapse and now have the home page of http://pages.britishlibrary.net/alexandrew. The reference to the CompuServe homepage had the addition of the word "cached" in brackets, indicating that a copy of the page was stored and could be accessed within the Google system, along with subsidiary pages linked to it. The latter include the whole of an outdated version of the WOSC site. The accessibility of cached items is a valuable feature of the system, since it means that information can often be retrieved even when the relevant website is no longer maintained.
Operating principles
The success of a search facility depends on its capability in satisfying users' requirements. This in turn depends on appropriate software and hardware. The originators of Google have been innovative in both of these areas. The hardware consists of a great many low-cost standard PCs linked together. The software implements a system called PageRank developed by Larry Page and Sergey Brin, the founders of the Google company, at Stanford University.
PageRank assigns to each web page a measure of "value" or "importance". A link from page A to page B is interpreted as a vote, by page A, for page B. However, the importance is based on more than a simple count of links to the page, since votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
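The voting idea can be made concrete with a small sketch. This is a simplified textbook-style iteration and not Google's production method; the damping value of 0.85, the number of iterations, the handling of pages with no outgoing links, and the page names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate importance scores for a small link graph.

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its vote evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: X and Y each receive exactly one link, but X's
# comes from H, which is itself linked to by three other pages.
example_links = {
    "P1": ["H"],
    "P2": ["H"],
    "P3": ["H"],
    "H": ["X"],
    "O": ["Y"],
}
ranks = pagerank(example_links)
```

In this example X and Y have the same number of incoming links, yet X ends with the higher score because its single vote comes from a page that is itself well regarded.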
Google takes account of these measures of importance each time it conducts a search. The effective measures are, however, modified according to the degree of match to the query. The exact way of doing this is not explained in the notes on the website, but the methods are described as "sophisticated" and it is easy to believe that they are. Anyhow, the overall result has proved satisfactory to users and is the main basis of Google's success.
Catalogue listing
A slightly surprising offshoot of the Google project is a listing of goods available from a set of catalogues, at http://catalogs.google.com. This is described in another Tourbus (19:2:02) in which Patrick Crispen declares himself amazed and mystified by the intricacy of the indexing giving access to the information. He quotes the number of catalogues as over 1,700, but at the time of writing this the number has grown to over 2,200.
The catalogues are placed under 12 general headings, but in addition the pages appear to have been thoroughly scanned. Patrick Crispen tells us that he is unusually tall and has some difficulty finding clothing to fit, but when he entered the keywords "large" and "tall" in this facility he was guided to appropriate outfitters. He declares himself mystified as to how it has been achieved.
Google images
The enormous collection of 330,000,000 images is accessed using keywords exactly as for a web search. The images are taken from web pages, and I was able to verify, as soon as my revised home page had been included, that images of WOSC people had also become accessible.
The imaging facility is discussed in yet another Tourbus (7:8:01), where it is pointed out that the images are not of the quality that would be expected for reproduction in print, and also that they are likely to be subject to copyright restrictions. A further criticism made there is that the images were not filtered to be family-friendly. My recent inspection of the site showed that a facility for such filtering is now available and operates as a default, though with the user given the option of turning it off.
GoogleWhacking
The ultimate confirmation that a facility of this kind has become part of the scene is when games are to be played on it. In yet another Tourbus (5:2:02) Bob Rankin makes the comment that: "If the Internet has succeeded at nothing else, it has been a boon for those who truly have too much time on their hands." The aim in GoogleWhacking is to find a combination of key words for which Google returns only one result. He quotes the example of "cosmological tollbooth".
I have to report that when I inserted these keywords the number of returns was no less than six, but at least four of them had been inspired by the Tourbus item. Rather to my surprise, a great many more returns were evoked by my alternative of "cosmos turnpike", mainly because the word "turnpike" occurs in many American postal addresses.
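The aim of the game can be sketched against a toy collection of page texts. The pages and the function name below are hypothetical, and the real pastime is of course played against Google's own index rather than a local list.

```python
import re
from itertools import combinations


def page_words(text):
    """Return the set of lower-case words in a page text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def find_whacks(pages):
    """Return word pairs that occur together in exactly one page."""
    word_sets = [page_words(p) for p in pages]
    vocabulary = sorted(set().union(*word_sets))
    whacks = []
    for first, second in combinations(vocabulary, 2):
        hits = sum(1 for words in word_sets if first in words and second in words)
        if hits == 1:
            whacks.append((first, second))
    return whacks


# Hypothetical page texts, for illustration only.
toy_pages = [
    "cosmological tollbooth",
    "tollbooth on the turnpike",
    "cosmological constant",
    "turnpike tollbooth",
]
whacks = find_whacks(toy_pages)
```

In this toy collection the pair "cosmological" and "tollbooth" qualifies, since only one page contains both words, whereas "tollbooth" and "turnpike" does not, since two pages contain both. This mirrors the way a common word such as "turnpike" quickly multiplies the number of returns.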
Finance
An obvious problem in providing a free facility of this sort is finding the money to pay for it. Google is supported by a certain amount of advertising. The people who run it are adamant, however, that the operation of the system is independent of any financial input. Advertising material may appear alongside the search outputs, but there is no possibility of bribery influencing the search results.
This aspect was mentioned in a chat I had with Marvin Minsky when I was in America for the recent WOSC Congress. The alternative to advertising would be a small charge levied on users of the service, but a suitable mechanism for this does not exist. A charge of just a few cents per transaction might be sufficient. For such small amounts the overheads of credit cards are such that they do not provide any solution. The alternative of requiring users to register and to submit to billing is also clumsy and would deter potential users. There seems to be no ideal solution at present.
Enthusiasm for Google should not obscure the fact that there are many other powerful search facilities that compete, but Google is certainly one of the leaders.
Alex M. Andrew
