Follow the Drinking Gourd:
Search Engine Analysis
PLEASE NOTE THAT THIS MATERIAL DATES FROM 2007.
Most users don't give their search engine any more thought than their car engine: enter a search phrase in Google, Yahoo or Bing, hit return, and within a second view relevant hits culled from billions of Web documents. But it turns out that how search engines work can make a substantial difference in the results they present.
This web site presented an interesting scenario. First, there was virtually no new research on Follow the Drinking Gourd in the previous twelve years. Second, this site represents the most comprehensive research published on the song. Third, some web pages that ranked highly on a "follow the drinking gourd" search were removed in February, 2007 after this site launched. All of which makes for a test case that sheds light on search engine performance on several different parameters:
And finally, just how buggy are Google's search results? (Very!)
For those who want a basic grounding, I will start with an overview of how search engines work. (Some of this commentary draws on a helpful article, "Think Like a Search Engineer" by Eric Enge.) Search mavens may prefer to jump ahead to the "First Appearance" section.
Search engines use "spiders", also known as software agents, robots or 'bots, to crawl the web and find new documents. For example, the first page to link to my home page (even before it launched) was http://ugrrquilt.hartcottagequilts.com/rr9.htm. So as Google, Yahoo and MSN re-indexed that site and examined that page for new content, their spiders would have found the new link and attempted to follow it. Any attempts before January 17, 2007 yielded only a "coming soon" page. But after this site launched, the next time a spider visited, it would have found a real home page, which linked to every main page in my site (via a Table of Contents.)
The engines next must analyze a site's contents. According to "GoogleBot", Google's robot, this site is about:
In addition to semantic content, the engines look at many other factors within a given site to evaluate relevance. For example, any words appearing in headings, page descriptors and early in a document are typically more heavily weighted.
Sometimes a site's URL can be a clue to its content – this one's certainly is. Engines may also analyze how many times a given term appears, both in the target page, and in the entire website. For example, my home page mentions "drinking gourd" nine times, and as I write this on March 10, 2007 the entire site uses it almost 200 times. (There is also a concept of "keyword density". As its name implies, this technique not only counts the number of times a keyword appears but weights these occurrences by measuring the amount of other content, too.)
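To make the counting and "keyword density" ideas above concrete, here is a minimal Python sketch. The density formula (words belonging to phrase occurrences divided by total words) is an assumption chosen for illustration; the engines do not publish their formulas, and the function name and sample text are hypothetical.

```python
import re

def keyword_density(text: str, phrase: str) -> tuple[int, float]:
    """Return (occurrence count, density) for a phrase in a text.

    Density is an assumed definition: the share of all words in the
    text that belong to an occurrence of the phrase.
    """
    # Lowercase and split into simple word tokens, ignoring punctuation.
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = re.findall(r"[a-z0-9']+", phrase.lower())
    if not words or not target:
        return 0, 0.0
    n = len(target)
    # Slide a window of n words across the text and count exact matches.
    count = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return count, count * n / len(words)

# Made-up sample text, not taken from the site.
sample = "Follow the drinking gourd. The drinking gourd is the Big Dipper."
count, density = keyword_density(sample, "drinking gourd")
print(count, round(density, 3))
```

A raw count rewards long pages; dividing by the total word count is one simple way to weigh occurrences against the amount of other content.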
But search engines cannot rely on these "intra-site" factors alone. If they did, a site that mentioned "free music" 1,000 times would automatically be rated more highly than a site that mentioned it "only" 900 times. So engines also look at how often other sites link to a given site, and look at the rankings and quality of the linking site, too. Result: a link from the New York Times or other major site can dramatically boost a given page's search ranking.
Google in particular is well-known for using inbound links to evaluate the quality of a page (see here). Perhaps the best print analogy is the practice of determining the most influential publications in a given field by ranking the most commonly cited works. But it takes a long time to determine these print citation rankings, and it can take a relatively long time (in web terms) to generate inbound links to a site. So the more an engine such as Google relies on external validation (in the form of inbound links), the more we might expect it to lag behind any other engine that favors internal content analysis in ranking new content highly.
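The notion that a link counts for more when it comes from a well-linked page can be sketched as a simplified, PageRank-style iteration. This is only an illustration under stated assumptions: the damping value, the iteration count, and the tiny link graph are all hypothetical, and no engine's production ranking is this simple.

```python
def link_scores(links: dict[str, list[str]], damping: float = 0.85,
                iterations: int = 50) -> dict[str, float]:
    """Score pages by inbound links, weighting each link by its source's score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its score over all pages.
                for p in pages:
                    new[p] += damping * scores[page] / n
        scores = new
    return scores

# Hypothetical graph: "news" has three inbound links, "blog4" has none.
graph = {
    "blog1": ["news"], "blog2": ["news"], "blog3": ["news"],
    "news": ["site_a"],
    "blog4": ["site_b"],
}
scores = link_scores(graph)
print(scores["site_a"] > scores["site_b"])
```

Both site_a and site_b have exactly one inbound link, but the link from the heavily cited "news" page is worth more, which mirrors the New York Times example above.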
As another external quality measure, the search engines monitor which sites are selected by users from results lists (the "click-through" rate), reckoning that more relevant sites will be selected more often. These sites are then usually promoted up the search results rankings.
Last, in most cases, a search engine can't know beforehand exactly what a user wants from a search string alone. Some of those searching on "follow the drinking gourd" will want information on a book by that name. Others will want the lyrics to the song. Yet another group will want to know about the recordings or songbooks that include it. Some searchers are seeking explanations of the song, while teachers may want lesson plans and other pedagogic materials. So search engines aim to present a blend of different results for their leading choices.
This site went live on January 17, 2007. Neither Google nor Yahoo found it before February 5th. (I unfortunately didn't check MSN.) All three search engines found it by February 5th. As you can see below, Yahoo and MSN immediately placed the site in the top ten search results for "follow the drinking gourd" (without the quotes.) It would take a mighty motivated Google searcher to find this site when first listed, since it languished in the mid-300s. Google moved the site into the top 50 about a week later. It bounced between the 25th and 50th places for over two months. For reasons that elude me, this site then jumped into the Google top ten in late April, a full two and a half months after Yahoo and MSN put it there. (See here for more on Ranking Volatility and the Google Rollercoaster!) The weekly rankings are presented below. Although Google now ranks the site Number One, score this one for Yahoo and MSN.
Weekly Search Engine Rankings for followthedrinkinggourd.org
See the complete chart here.
(1) For some weeks, Yahoo ranked both the site's home page and this Search Engine Analysis page.
As of late April, MSN was indexing all of this site's 30 pages. Yahoo and Google missed one page apiece. I uploaded a sitemap file on March 30th, 2007, which seems to have helped Yahoo's 'bot find five more pages and dramatically improve their coverage. Please note that some of the pages Google reports caching are not fully searchable. (See here for the details.)
Percent of Pages Cached for followthedrinkinggourd.org
Caching statistics can help assess the "freshness" of a search engine's results. The more often a search engine crawls a site, the more current its database – and presumably the better its search suggestions. Given the huge number of documents on the web, different pages in a website are usually "crawled" on different schedules. Key pages, such as a home page, are usually visited much more frequently.
Home Page Cache Lag for followthedrinkinggourd.org
Since February 24th, Yahoo has averaged only one day's lag time on this site's home page. Google visits about twice a week, and MSN about once a week. (Home page cache details here.)
Calculating how often the search engines are refreshing all the other pages in this site highlights another real difference. A fair longitudinal test would have to run longer than this site has been up. Still, the three-month results show pronounced differences among the engines. On average, Google lags the other two engines significantly. (Though to be fair, Google's longest delays are on the "Notes" pages. See the details here.)
Since Google's approach relies more heavily on inbound links, if this cache lag extends to other websites, it could mean that their performance when presenting new materials suffers a double whammy! For example, the NASA page that formerly ranked first in a Google "follow the drinking gourd" search linked to this site on February 6th, 2007. Yahoo and MSN captured the link very quickly and so could factor it into their ratings almost immediately. It took GoogleBot five weeks to revisit the NASA page. Score this one for MSN. (Detailed chart here.)
On February 8th, the three highest ranking pages on Google for "follow the drinking gourd" (without quotes) were hosted by the Pocantico Hills elementary school in Sleepy Hollow, New York, the Madison, Wisconsin Metropolitan school district and the National Security Agency. By the end of February, all three pages had been pulled. This table tracks how quickly the three major search engines delisted these documents upon their removal from the Web.
Yahoo excelled, taking roughly a week to delist the pages once they were removed from the web. Google took an extra week. As of the end of June MSN had yet to delist any of the three, even though the system has "known" for quite some time that these pages were gone. (For example, the Madison cached page was removed way back in early March.) MSN's performance in clearing away outdated listings can only be characterized as abysmal. Score this one for Yahoo.
(See the details here.)
I have identified three different types of Google bugs. In increasing order of user impact, they are: mishandling punctuation, excessively volatile search results, and documents that are supposedly cached but actually are not fully searchable. First, the punctuation.
A Panda Walks Into a Bar...
Google searches for phrases that span punctuation marks, such as exclamation points, ellipses and en-dashes, often fall apart. On this site, I quote Professor Kate Larson on Harriet Tubman, "The slave holders never knew, nor could they have imagined...that a tiny five foot tall disabled slave woman could have been one of the UGRR conductors they were so concerned with." Google does not find any results for "imagined... that a tiny five foot tall" nor for "imagined that a tiny five foot tall" (both in quotes), but will find the latter search without the quotes.
Writing about runaways in Jeanette Winter's book, Follow the Drinking Gourd, I note that "one of their hosts is an urban black family – an accurate reference to the major role that blacks played in the Underground Railroad." Again, Google muffs a search for "urban black family – an accurate reference" and "urban black family an accurate reference" (both in quotes) and again, will find the latter search minus the quotes.
But my favorite example comes from a sentence in Search Engine Watch praising Yahoo! On May 2nd, 2007 Eric Enge wrote, "On another note, I really like the way Yahoo! is playing this game right now." (Note the exclamation point, part of Yahoo!'s official name, but mishandled by Google.) You may search on any phrase in this sentence (in quotes) up to and including "Yahoo!" and it will be found; include "Yahoo!" and any words to the right in a search and the string will not be found; leave "Yahoo!" off and include only words to the right...and Google will find the document again. See the table below. All the searches described in this section that bamboozle Google prove no problem for Yahoo! and MSN.
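One way a phrase search can survive punctuation is to strip it from both the document and the query before comparing word sequences. The sketch below illustrates that general technique only; it does not describe how Google, Yahoo! or MSN actually tokenize text, and the function names are hypothetical.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase the text and keep only word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

def phrase_found(document: str, phrase: str) -> bool:
    """True if the phrase's words appear consecutively in the document."""
    doc, query = tokens(document), tokens(phrase)
    n = len(query)
    return n > 0 and any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

# The test sentence quoted above.
sentence = ("On another note, I really like the way Yahoo! is playing "
            "this game right now.")
print(phrase_found(sentence, "like the way Yahoo! is playing"))
print(phrase_found(sentence, "like the way Yahoo is playing"))
```

Because the exclamation point is discarded on both sides, the phrase matches with or without it, and the same normalization handles ellipses and dashes.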
Analysis of Google Searches on a Test Sentence
On April 6, 2007, this site ranked 26th for a Google search on "follow the drinking gourd" (without the quotes.) The next afternoon, we were 44th. The next evening, we were 29th. These are fairly volatile shifts for an obscure corner of the web, one where the other relevant pages had changed very little (if at all) for some time. I decided to compare the rankings on April 6th with those the afternoon of April 7th.
The average among the top 50 sites was a small change in rank of just two points. This site's 18-place swing was twice that of any other site in the top 50. (See the details here.)
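The comparison described above amounts to taking two ranking snapshots and measuring how far each site moved. Here is a minimal sketch of that arithmetic. Only this site's move from 26th to 44th comes from the text; the other hosts and ranks are made-up placeholders, not the actual April 2007 data.

```python
def rank_changes(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Absolute change in rank for every site present in both snapshots."""
    return {site: abs(after[site] - before[site]) for site in before if site in after}

# Placeholder snapshots: only the first entry reflects figures from the text.
april_6 = {"followthedrinkinggourd.org": 26, "example-a.test": 1,
           "example-b.test": 2, "example-c.test": 10}
april_7 = {"followthedrinkinggourd.org": 44, "example-a.test": 1,
           "example-b.test": 3, "example-c.test": 12}

changes = rank_changes(april_6, april_7)
mean_change = sum(changes.values()) / len(changes)
biggest_mover = max(changes, key=changes.get)
print(biggest_mover, changes[biggest_mover], mean_change)
```

Comparing one site's swing with the mean across the whole list is what separates an unusually volatile result from ordinary day-to-day drift.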
Two leading search engine commentators valiantly take a crack at explaining search engine volatility here and here. They rightly focus on factors such as when a site's content is first cached, when inbound links to that site were first discovered, changes at competitive sites, and so on. I track all these parameters closely, and still can't account for this overnight "boomerang", or explain why we bounced randomly between 25th and 44th for nearly ten weeks. So I suggest there is an elephant in the room. The Google oscillations noted here (and by others) are manifestations of a search engine that is, at least in part, buggy and unpredictable.
Yahoo's rankings showed much less volatility, though we did see an eleven point drop in mid-March and gradually climbed back up over the ensuing ten days. Somewhat later our Yahoo rating jumped from 8th to 4th place, which is likely attributable to a sitemap steering their 'bot to five more pages that were previously unindexed.
Microsoft is the least changeable of all, as might be expected given the horrible job their site does in removing stale Drinking Gourd pages. (See here.) They could afford to shake their ratings up a bit more!
To track where my content shows up on the Web, I set up a Google Alert with one phrase from each page in this site. (There was nothing unusual or tricky about these phrases; they just had to be unique on the Web so I wouldn't be bombarded with email Alerts.) I searched on these same phrases every few weeks. When I did, I found that a surprising percentage of phrases from pages that had been "cached" were nevertheless not found in a Google search. In other words, it was possible to navigate to a page from this site in the Google cache, select a unique string of text from that page and then search for that same text in Google – and Google would not find it.
As covered by Eric Enge here, the page Interpretation_Over_The_Last_Ten_Years.htm was first cached by Google on January 31st. It contains the phrase, "formed the narrative core of a planetarium show". The page was cached again on March 16th. A search for the test phrase that same day came up empty. The page was found in searches on March 18th and March 31st. On April 1st, Google cached the page again. A search on April 24th failed, but by May 1st, this search worked again! (See here for screen shots and more examples.)
I prepared a chart showing how scattershot the Google search results proved over time. (See it here.) Fully 19 of this site's 30 pages have been affected at some point. By contrast, when either Yahoo or MSN cached pages, their contents were overwhelmingly searchable in their entirety. It would take a thorough analysis, not a small hand sample, to determine how widespread this Google search bug might be. It's very possible that more pages are affected. I consider this a severe reliability problem for Google, and for websites that rely on the Google search box to enable users to find content on their site.
The table below shows how many pages of this site seem entirely searchable, vs. those reported cached by Google that contained search bugs.
Google Pages Reported Cached vs. Pages With No
The next table takes the adjusted Google figure and compares it to Yahoo and MSN, for a more accurate match-up. Score this one for Yahoo and MSN.
Percent of Entirely Searchable Pages
As might be expected, no single search site leads on every factor. Still, I was impressed with both Yahoo and MSN's performance, with the edge going to Yahoo. Both sites do a significantly better job than Google at showcasing new information and keeping site-wide content current. Both also find text in their cached content much more reliably than Google does. Google users should strongly consider supplementing their search results with those from Yahoo or MSN, too.
Search Engine Rankings on Five Performance Parameters for followthedrinkinggourd.org
Copyright 2008 - 2012, Joel Bresler.