How Does the Google Search Engine Work?
What better tool to learn about it than Google itself? Over time the word ‘google’ has become a synonym for searching. It won’t be long before Webster adds the word “googling” as a verb: the action of searching for (almost any) stuff on the World Wide Web (WWW). So how does it all work? How does Google bring back such accurate search results? Let me explain with a bit of history about search engines.
Search engine technology had humble beginnings. Things were maintained manually, due to the limited number of web servers in the dinosaur age of computers. As per Wikipedia, Tim Berners-Lee was the poor guy who kept updating the list on the CERN web server.
Alan Emtage (apparently an Archie comics fan, or a bad speller), a student at McGill University in Montreal, created the very first tool for searching the (pre-web) Internet in 1990. It was called Archie ("archive" without the "v"). The program downloaded the directory listings of all the files located on public anonymous FTP sites, creating a searchable database of file names. The Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) and Jughead (Jonzy’s Universal Gopher Hierarchy Excavation And Display) search tools came next; they searched the file names and titles stored in Gopher index systems.
One of the web’s first search engines, Aliweb, appeared in November 1993. The site depended on website administrators notifying it that their site had an index file in a particular format. A practical approach for the time, but not exactly a smart one compared to what we have today.
Since I would like to write more about Google’s search mechanism, I will skip a few generations and talk directly about the origin of the search concept Google is based on.
The Perl-based World Wide Web Wanderer, developed in June 1993 by Matthew Gray (MIT), was probably the first web robot. JumpStation (December 1993) used a web robot to find web pages and build its index. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine: crawling, indexing, and searching. WebCrawler was one of the first “full text” crawler-based search engines, which essentially let users search for any word in any web page.
Niranjan, Infoseek, Lycos, AltaVista, Open Text, Web Index, Magellan, Excite, SAPO, Dogpile, Inktomi, HotBot, Ask Jeeves, Northern Light, Yandex, AlltheWeb, GenieKnows, Naver, Teoma, Vivisimo, Baidu, Exalead, Info.com and Yahoo! Search all popped up between 1994 and 2004. Many of them still did not fully use full-text searching technology and relied instead on more conventional web directories.
Google had a rather humble beginning too. Its developers, Larry Page and Sergey Brin, built the Google search technology while working on their Stanford research project. They struck some distribution partnerships with AOL and Yahoo, and pitched the technology to their friend and Yahoo! founder David Filo, unsuccessfully (that’s what I read).
Google only got a profitable business model with its AdWords advertising program in February of 2002; the company then quickly grew to be worth over 100 billion dollars by the end of 2005!!
So, to get to the more technical side of the discussion: how does the sucker work?!
Google uses automated programs called spiders or crawlers, just like most search engines. Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. What sets Google apart is how it ranks search results, which in turn determines the order Google displays results on its search engine results page (SERP). Google uses a trademarked algorithm called PageRank, which assigns each Web page a relevancy score.
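To make the ranking idea a bit more concrete, here is a minimal sketch of the link-based scoring that PageRank is publicly described as doing: a page's score depends on the scores of the pages linking to it. This is a simplified textbook version, not Google's actual ranking code. The function name `pagerank`, the damping factor of 0.85, the iteration count and the toy `.example` sites are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated score passing (power iteration).

    links: dict mapping each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not Google's settings.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the score spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: most pages link to c.example.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy web, c.example ends up on top because three pages link to it, and d.example ends up last because nothing links to it.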
Google Search has three working parts:
- A web crawler (GoogleBot) that finds and fetches web pages.
- An indexer that sorts every word on each web page the crawler provides and stores the resulting index of words in a huge database.
- A query processor, which compares the search query to the index and pulls out the pages that it considers most relevant.
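The three parts above can be sketched in a few lines. This is only a toy model under stated assumptions: the "crawler" is replaced by a hard-coded dictionary of already-fetched pages (`FETCHED_PAGES`, with made-up example.com URLs), and the names `tokenize`, `build_index` and `search` are hypothetical helpers, not anything Google publishes.

```python
import re
from collections import defaultdict

# Stand-in for the crawler's output: URL -> page text (made-up pages).
FETCHED_PAGES = {
    "http://example.com/a": "Search engines crawl the web",
    "http://example.com/b": "An index maps words to web pages",
    "http://example.com/c": "Crawlers fetch pages for the index",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexer: map every word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Query processor: return the URLs that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


index = build_index(FETCHED_PAGES)
print(search(index, "web pages"))
print(search(index, "index"))
```

A real query processor would also rank the matching pages; this sketch only finds them.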
Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server’s name into an address, Google had its own DNS, in order to keep delays to a minimum.
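As a rough illustration of that setup, the sketch below pairs a tiny URL server with a name resolver that remembers its answers, so each host is looked up only once. Caching is my stand-in for the idea of keeping DNS delays low; it is not a description of how Google's own DNS worked. `UrlServer`, `CachingResolver` and the `FAKE_DNS` table are hypothetical names, and the lookups are faked so the example runs without a network.

```python
from collections import deque
from urllib.parse import urlparse


class UrlServer:
    """Hands out URLs to spiders one at a time, never the same URL twice."""

    def __init__(self, seeds):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


class CachingResolver:
    """Resolves host names through `lookup`, remembering earlier answers."""

    def __init__(self, lookup):
        self._lookup = lookup  # any function mapping host name -> address
        self._cache = {}
        self.lookups = 0       # how many times the slow lookup actually ran

    def resolve(self, url):
        host = urlparse(url).hostname
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._lookup(host)
        return self._cache[host]


# Made-up name table standing in for real DNS queries.
FAKE_DNS = {"example.com": "192.0.2.1", "example.org": "192.0.2.2"}

server = UrlServer([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
    "http://example.com/a",  # duplicate, will be ignored
])
resolver = CachingResolver(FAKE_DNS.__getitem__)

fetched = []
url = server.next_url()
while url is not None:
    fetched.append((url, resolver.resolve(url)))
    url = server.next_url()

print(len(fetched), "URLs handed out,", resolver.lookups, "name lookups")
```

Outside of a toy example, a real lookup function such as Python's `socket.gethostbyname` could be passed in place of the fake table.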
When the Google spider looked at an HTML page, it took note of two things:
- The words within the page
- Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Now we know why the search results are so accurate!
I will try to append some more data to this page.
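Here is a small sketch of what recording both the words and where they were found might look like. The stop-word list comes from the description above; the field names, the weights in `FIELD_WEIGHTS` (a title hit counting three times as much as a body hit) and the helper names `index_page` and `score` are my own assumptions for illustration, not Google's real values.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}               # articles left out of the index
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}   # assumed, illustrative weights


def index_page(index, url, fields):
    """Record each significant word with the field and position it was found at.

    fields: dict such as {"title": "...", "body": "..."}.
    """
    for field, text in fields.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue
            index[word].append((url, field, position))


def score(index, word):
    """Add up field weights per URL, so title hits count more than body hits."""
    totals = defaultdict(float)
    for url, field, _position in index.get(word.lower(), []):
        totals[url] += FIELD_WEIGHTS.get(field, 1.0)
    return dict(totals)


index = defaultdict(list)
index_page(index, "http://example.com/a",
           {"title": "The Search Engine", "body": "A crawler feeds the engine"})
index_page(index, "http://example.com/b",
           {"title": "Cooking", "body": "An engine of the kitchen"})

print(score(index, "engine"))
print("the" in index)
```

The first page scores higher for "engine" because the word appears in its title as well as its body, while "the" never makes it into the index at all.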
References:
1) http://computer.howstuffworks.com/search-engine1.htm
2) http://www.googleguide.com/google_works.html
3) http://computer.howstuffworks.com/google1.htm
4) http://www.seobook.com/relevancy/
5) http://en.wikipedia.org/wiki/Web_search_engine
6) http://infolab.stanford.edu/~backrub/google.html