There are three common resources for measuring "buzz" through historical mentions of names and phrases on blogs. This article compares the data each one returns for common search terms.
- Nielsen BuzzMetrics BlogPulse. It allows public searches and charting by date going back 180 days, though it does not allow bot access to the data.
- Technorati was conceived five years ago as a general blog search engine. Its data generally goes back 180 days. In January 2006, it introduced chart generation going back 180 days, though general users cannot query by arbitrary dates.
- Google BlogSearch has indexed blog content since 2000 and offers it for public search, but the historical data is spotty.
What about non-blog Internet mentions?
There is still a much wider Internet than the blogosphere — there are comments to blogs, newsgroups, and discussion boards which predate blogs and persist. Does one need to collect this data?
BuzzMetrics offers it through the BrandPulse research service. In 2005, they partnered with the Pew Internet & American Life Project to produce the report Buzz, Blogs, and Beyond about the 2004 election. They analyzed the textual data coming from the campaign press releases, from the traditional media, from blogs, and from non-blog Internet conversations they dubbed "chatter." In their analysis, they found that the correlation between blogs and chatter was the strongest among all combinations.
For most topical research about cultural trends and public affairs, it is probably sufficient to limit one's quantitative analysis to blogs.
How solid is the historical data?
While Google helpfully provides content for any date range going back to January 2000, its blog search is deficient in one respect that generalized web search handles well. In web search, once Google's bots discover a website, they crawl all of its past content and then continue to crawl new content as it appears.
Google's blog indexing does not follow these same rules. Once a blog is discovered, Google does not index the past content at all. (For example, compare the results for "two-way journalism", which was discussed in the blogosphere in 2004: there are 47 web results from many bloggers, but no blog results.) The start date for most blogs in the index, according to Google, is June 2005. Through other research I have been doing, it appears that 90% of Google's index of political blogs is after September 2005.
The main reason for this may be that RSS has been promoted above and beyond sitemaps: thus, there is no standard for discovering all of the historical blog posts in a feed. Had RSS been developed as an API, and not as a static resource, this problem could have been solved long ago. Google has said (in the FAQ linked above) that it does intend to index older content.
How does the data compare?
Of the three services, only BlogPulse is contractually expected to be accurate for its paying customers. Google and Technorati need to be accurate enough for the general (non-paying) public to continue to trust them.
Below, I made a quick list of 34 common cultural terms that are of interest to Americans. I included, as a lark, "my cat", "my dog" (and "my mistake"), which people blog about as commonly as they do movie stars and Presidential candidates. I ran these searches through the blog search tools. The Technorati searches I ran through a bot; the BlogPulse searches I did manually; for Google I ran two searches: one for the last 180 days and one for all time.
The problem with Google's data is that too often, especially with large result sets, the numbers differ wildly with the slightest modification (e.g., sorting "by relevance" or "by date"). Above a million results, Google sometimes reports more results for the last 180 days than for the last 7 years. The numbers below somewhat reflect that frustration: I could not just run a bot to run the searches and be confident in the results. Thus, for my statistical calculations regarding Google, I threw out the top 8 results.
The colors signify how the number in the column compares to its expected value (the math that follows explains the data better). The Google 180 Days column is compared to Google All Time, while the BlogPulse and Technorati columns are compared to Google 180 Days. The first 8 rows are not color-coded due to the stated unreliability of those numbers.
low | below avg | avg | above avg | high |
Term | Google All Time | Google 180 days | BlogPulse 180 days | Technorati |
Yahoo | 278,100,236 | 217,511,873 | 547,523 | 2,976,726 |
MySpace | 177,026,428 | 70,231,470 | 706,176 | 2,337,724 |
Google | 128,857,167 | 153,068,492 | 973,657 | 3,650,829 |
New York | 88,642,573 | 46,866,893 | 1,011,093 | 2,479,844 |
London | 60,659,634 | 119,140,884 | 549,443 | 1,385,677 |
Washington | 51,327,283 | 12,311,657 | 586,817 | 1,442,426 |
Iraq | 42,613,608 | 21,798,307 | 541,114 | 1,197,216 |
Paris | 18,450,859 | 16,961,557 | 560,640 | 1,367,882 |
iPod | 9,645,820 | 2,962,693 | 423,858 | 1,156,779 |
New York Times | 5,315,136 | 1,416,500 | 181,481 | 392,024 |
CNN | 4,641,800 | 1,331,084 | 140,914 | 331,607 |
President Bush | 3,939,481 | 866,293 | 146,497 | 318,412 |
Washington Post | 3,228,702 | 739,772 | 94,375 | 206,258 |
MSNBC | 2,416,537 | 654,801 | 47,866 | 123,432 |
Harvard | 1,702,163 | 436,076 | 80,695 | 184,328 |
Paris Hilton | 1,527,166 | 542,760 | 121,151 | 347,689 |
American Idol | 1,525,322 | 364,965 | 87,870 | 208,668 |
Cheney | 1,707,750 | 303,843 | 84,754 | 172,645 |
Fox News | 1,207,455 | 301,756 | 55,142 | 123,816 |
Stanford | 982,487 | 303,240 | 45,205 | 106,637 |
my dog | 857,070 | 167,420 | 77,240 | X |
Oprah | 811,801 | 281,581 | 49,375 | 136,025 |
Princeton | 583,767 | 212,520 | 27,524 | 68,305 |
Hillary Clinton | 545,811 | 301,467 | 65,916 | 147,603 |
Beyonce | 538,617 | 216,983 | 37,957 | 126,119 |
Yale | 478,986 | 170,964 | 29,972 | 68,802 |
Angelina Jolie | 458,805 | 187,695 | 36,009 | 112,624 |
my cat | 433,014 | 95,874 | 56,063 | X |
Brad Pitt | 408,379 | 164,361 | 37,796 | 108,576 |
Rudy Giuliani | 258,710 | 173,937 | 36,064 | 74,095 |
Michael Moore | 168,369 | 88,308 | 29,219 | 65,847 |
Bill O'Reilly | 136,553 | 39,087 | 8,711 | 18,171 |
Ann Coulter | 118,721 | 31,962 | 11,539 | 23,379 |
my mistake | 136,930 | 33,506 | 14,038 | X |
Analysis
With the cleaner data, I found that one-third of Google blog posts on these subjects have been entered in the last 6 months; this is likely a result of the forward blog discovery described above. The correlation of the 6-month data to the 7-year data is .985. Clearly, the presidential candidates are more popular in the last 6 months. Additionally, people seem to be tiring of blogging about the President and the Vice President (and tired, too, of blogging about their dog and cat). When we remove anybody who has ever wanted to be President, or released a movie about the health care system this past summer, the correlation rises to .993.
By dropping Google/Yahoo/MySpace, we see that BlogPulse and Technorati are .998 correlated to each other. BlogPulse counts .42 posts for every one Technorati post; this is close to the expected ratio of their blog posts added per day: .47 (757k/1.6m). BlogPulse is a little more closely related to Google (r = .963) than Technorati is (r = .957). With Technorati's data, I threw out the "my X" references. Evidently Technorati does not do exact-phrase searching, so searching for "my cat" brings up references to just "cat", a far greater number.
The correlation data tells us that the relationships are linear beyond a doubt. If we want to make any predictions, we also need to know how much the ratios vary from term to term, which the standard deviation tells us.
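For readers who want to check figures like these, here is a minimal sketch of the correlation calculation. It uses only six rows from the table above as an illustrative subset, so it will not reproduce the .998 or .42 figures exactly; the helper name `pearson_r` and the sum-of-counts ratio are my own assumptions about how such numbers might be computed.

```python
from math import sqrt

# (BlogPulse 180-day count, Technorati count) for a few rows of the table.
# Illustrative subset only, not the full set of terms behind the .998 figure.
counts = {
    "New York": (1_011_093, 2_479_844),
    "iPod": (423_858, 1_156_779),
    "CNN": (140_914, 331_607),
    "Harvard": (80_695, 184_328),
    "Oprah": (49_375, 136_025),
    "Ann Coulter": (11_539, 23_379),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


blogpulse = [bp for bp, _ in counts.values()]
technorati = [tr for _, tr in counts.values()]

r = pearson_r(blogpulse, technorati)
# Ratio of total counts; the .42 quoted above may have been derived differently.
ratio = sum(blogpulse) / sum(technorati)
print(f"r = {r:.3f}, BlogPulse posts per Technorati post = {ratio:.2f}")
```

Swapping in the full table (minus the rows excluded above) is just a matter of extending the `counts` dictionary.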
- Technorati to Google posts: average of .462, stdev of .155
- BlogPulse to Google posts: average of .227, stdev of .118
Based on the stated data from BlogPulse and Technorati, and the averages above, Google is likely indexing 3.4 million blog posts a day (though we'd have to add that, based on Technorati's numbers, we have 95% certainty that Google indexes between 2 million and 14.5 million blog posts per day).
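The arithmetic behind the 3.4 million figure can be sketched as follows. This assumes the estimate comes from dividing each service's stated daily volume (the rounded 1.6m and 757k figures quoted earlier) by its average ratio to Google, then averaging the two results; the function name is hypothetical, and the sketch does not attempt to reproduce the 95% interval.

```python
# Stated daily post volumes quoted earlier (rounded figures).
TECHNORATI_POSTS_PER_DAY = 1_600_000
BLOGPULSE_POSTS_PER_DAY = 757_000

# Average ratio of each service's counts to Google's 180-day counts.
TECHNORATI_TO_GOOGLE = 0.462
BLOGPULSE_TO_GOOGLE = 0.227


def implied_google_volume(service_volume, service_to_google_ratio):
    """Scale a service's volume up to Google's scale using the average ratio."""
    return service_volume / service_to_google_ratio


via_technorati = implied_google_volume(TECHNORATI_POSTS_PER_DAY, TECHNORATI_TO_GOOGLE)
via_blogpulse = implied_google_volume(BLOGPULSE_POSTS_PER_DAY, BLOGPULSE_TO_GOOGLE)

# Assumption: the single headline figure is the mean of the two estimates.
google_estimate = (via_technorati + via_blogpulse) / 2

print(f"via Technorati: {via_technorati / 1e6:.2f}M posts/day")
print(f"via BlogPulse:  {via_blogpulse / 1e6:.2f}M posts/day")
print(f"average:        {google_estimate / 1e6:.1f}M posts/day")
```

The two services give estimates within a few percent of each other, which is what makes the averaged figure plausible.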
Search engine counts are known to be flaky. As Search Engine Watch editor Danny Sullivan wrote in 2005, "Search engine counts are never something you should depend on, a topic we've discussed many times before. Still, if you're going to get a count, it's nice if it doesn't seem to change much or simply seem absurd depending on the query you do."
For example, Google may report the number of references to "New York" in the last six months as anything from 20 million to 142 million depending on how one searches. Yet the numbers from BlogPulse and Technorati are well correlated, so we can come up with an estimate for the number of posts on New York: 4.9 million posts, or 27,000 a day. Search the last 24 hours and sort by date, and we get 23 million. Sort by relevance, and we get 21,000, far closer to our projection. (Granted, with the wide standard deviation, it is still a bit of a crapshoot: we have 95% confidence that the number is between 740,000 and 22 million. Still, that's tighter than Google's estimates.)
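The New York projection can be reproduced with the same kind of scaling. This is a sketch under the assumption that the 4.9 million figure is the mean of the two per-service projections over a 180-day window; `project_google_posts` is a hypothetical helper, and the confidence interval is not modeled here.

```python
# Average ratio of each service's counts to Google's 180-day counts.
RATIO_TO_GOOGLE = {"technorati": 0.462, "blogpulse": 0.227}

WINDOW_DAYS = 180


def project_google_posts(service_counts):
    """Average the Google-scale projections implied by each service's count."""
    projections = [
        count / RATIO_TO_GOOGLE[service]
        for service, count in service_counts.items()
    ]
    return sum(projections) / len(projections)


# Counts for "New York" taken from the table above.
new_york = {"technorati": 2_479_844, "blogpulse": 1_011_093}

total_posts = project_google_posts(new_york)
posts_per_day = total_posts / WINDOW_DAYS

print(f"projected posts over {WINDOW_DAYS} days: {total_posts / 1e6:.1f}M")
print(f"projected posts per day: {posts_per_day:,.0f}")
```

The same function can be applied to any other row of the table by passing that term's Technorati and BlogPulse counts.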
Conclusions
Google indexes the most blog posts: over twice the rate of Technorati, and four times more than BlogPulse. It is unknown whether Google has a similar language/national origin distribution as the other services.
For large numbers of blog results (> 1 million), Google's stated estimates are completely unreliable. However, by extrapolating from the BlogPulse and Technorati numbers, and checking against smaller daily search numbers, a relatively accurate number can be supplied.
It is unnecessary to determine which count is the most accurate. For search terms which draw between 100 and 1000 blog posts a day, all of the services statistically correlate to each other. The variance could likely be refined (though I have nothing to compare it to) with more data and more segmentation. With more data we could draw sharper conclusions in this area.
I assume most of the three vendors have run their own internal analyses comparing their metrics with their competitors'. We should all be curious what they found.
The key differentiator in choosing a blog buzz measurement tool is the openness of the querying interface. At the moment, the advantage goes to Google due to the openness of their data. (I am unsure whether BuzzMetrics would grant the same visibility into the historical data for independent researchers.) If Google could improve its numbers and its analytical tools (no doubt it is working on such a thing), it could well provide formidable competition to BlogPulse.
Update, November 14: I spoke to Max Kalehoff, the VP of Marketing at BlogPulse, a month ago, after I wrote this (I spent the rest of the month finishing up the TimesSelect series). Max confirmed for me that their technology is set up in such a way that they only have access to the last 180 days' worth of data (he didn't know whether the prior data was buried in Iron Mountain somewhere). Also, he stressed that BlogPulse takes care to filter spam blogs out of their list, and that they primarily cover English-language blogs. (Note: Kalehoff has since left BlogPulse.)
Further study needs to be done to understand whether Google's 4:1 advantage is due purely to spam and foreign-language blogs.