Comparing Blog Buzz Measures

There are three common resources to measure "buzz" through historical mentions of names/phrases on blogs. This article compares the data available for each from common search terms.

  • Nielsen BuzzMetrics BlogPulse. They allow public searches and charting by date going back 180 days, though they do not allow bot access to the data.
  • Technorati was conceived five years ago as a general blog search engine. The data generally goes back 180 days. In January 2006, they introduced chart generation going back 180 days, though general users cannot query by arbitrary dates.
  • Google BlogSearch has indexed blog content since 2000 and offers it for public search, but the historical data is spotty.

What about non-blog Internet mentions? 

There is still a much wider Internet than the blogosphere -- there are comments on blogs, newsgroups, and discussion boards, which predate blogs and persist. Does one need to collect this data?

BuzzMetrics offers it through the BrandPulse research service. In 2005, they partnered with the Pew Internet & American Life Project to produce the report Buzz, Blogs, and Beyond about the 2004 election. They analyzed the textual data coming from the campaign press releases, from the traditional media, from blogs, and from non-blog Internet conversations they dubbed "chatter." In their analysis, they found that the correlation between blogs and chatter was the strongest among all combinations.

For most topical research about cultural trends and public affairs, it is probably sufficient to limit one's quantitative analysis to blogs.

How solid is the historical data? 

While Google helpfully provides content for any date range going back to January 2000, it is deficient in one respect that is standard in generalized web search. Once Google's bots discover a website, they crawl all of its past content and then continue to crawl new content as it appears.

Google's blog indexing does not follow these same rules. Once a blog is discovered, Google does not index the past content at all. (For example, compare the results for "two-way journalism", which was discussed in the blogosphere in 2004: there are 47 web results from many bloggers, but no blog results.) The start date for most blogs in the index, according to Google, is June 2005. Through other research I have been doing, it appears that 90% of Google's index of political blogs is after September 2005.

The main reason for this may be that RSS has been promoted above and beyond sitemaps: thus, there is no standard for discovering all of the historical blog posts in a feed. Had RSS been developed as an API, and not as a static resource, this problem could have been solved long ago. Google has said (in the FAQ linked above) that it does intend to index older content.

How does the data compare? 

Of the three services, only BlogPulse is contractually expected to be accurate for its paying customers. Google and Technorati need to be accurate enough for the general (non-paying) public to continue to trust them.

Below, I made a quick list of 34 common cultural terms that are of interest to Americans. I included, as a lark, "my cat", "my dog" (and "my mistake"), which people blog about as commonly as they do movie stars and Presidential candidates. I ran these searches through the blog search tools. I ran the Technorati searches through a bot and the BlogPulse searches manually; for Google I ran two searches, one for the last 180 days and one for all time.

The problem with Google's data is that too often, especially with large result sets, the numbers differ wildly with the slightest modification (e.g., searching "by relevance" or "by date"). Above a million results, Google sometimes reports more results in the last 180 days than for the last 7 years. The numbers below somewhat reflect that frustration: I could not just run a bot to run the searches and be confident in the results. Thus for my statistical calculations regarding Google I threw out the top 8 rows.

The colors signify how the number in the column compares to its expected value (there is math afterwards to explain the data better). The Google 180 Days column compares to Google All Time, while the BlogPulse and Technorati columns compare to Google 180 days. The first 8 rows are not color-coded due to the stated unreliability of those numbers.

Color key: low | below avg | avg | above avg | high

Search term       Google All Time  Google 180 days  BlogPulse 180 days  Technorati
Yahoo                 278,100,236      217,511,873             547,523   2,976,726
MySpace               177,026,428       70,231,470             706,176   2,337,724
Google                128,857,167      153,068,492             973,657   3,650,829
New York               88,642,573       46,866,893           1,011,093   2,479,844
London                 60,659,634      119,140,884             549,443   1,385,677
Washington             51,327,283       12,311,657             586,817   1,442,426
Iraq                   42,613,608       21,798,307             541,114   1,197,216
Paris                  18,450,859       16,961,557             560,640   1,367,882
iPod                    9,645,820        2,962,693             423,858   1,156,779
New York Times          5,315,136        1,416,500             181,481     392,024
CNN                     4,641,800        1,331,084             140,914     331,607
President Bush          3,939,481          866,293             146,497     318,412
Washington Post         3,228,702          739,772              94,375     206,258
MSNBC                   2,416,537          654,801              47,866     123,432
Harvard                 1,702,163          436,076              80,695     184,328
Paris Hilton            1,527,166          542,760             121,151     347,689
American Idol           1,525,322          364,965              87,870     208,668
Cheney                  1,707,750          303,843              84,754     172,645
Fox News                1,207,455          301,756              55,142     123,816
Stanford                  982,487          303,240              45,205     106,637
my dog                    857,070          167,420              77,240           X
Oprah                     811,801          281,581              49,375     136,025
Princeton                 583,767          212,520              27,524      68,305
Hillary Clinton           545,811          301,467              65,916     147,603
Beyonce                   538,617          216,983              37,957     126,119
Yale                      478,986          170,964              29,972      68,802
Angelina Jolie            458,805          187,695              36,009     112,624
my cat                    433,014           95,874              56,063           X
Brad Pitt                 408,379          164,361              37,796     108,576
Rudy Giuliani             258,710          173,937              36,064      74,095
Michael Moore             168,369           88,308              29,219      65,847
Bill O'Reilly             136,553           39,087               8,711      18,171
Ann Coulter               118,721           31,962              11,539      23,379
my mistake                136,930           33,506              14,038           X

Analysis

With the cleaner data, I found that one-third of Google blog posts on these subjects have been entered in the last 6 months; this is likely a result of the forward blog discovery described above. The correlation of the 6-month data to the 7-year data is .985. Clearly, the presidential candidates are more popular in the last 6 months. Additionally, people seem to be tiring of blogging about the President and the Vice President (and tired, too, of blogging about their dog and cat). When we remove anybody who has ever wanted to be President, or released a movie about the health care system this past summer, the correlation rises to .993.

By dropping Google/Yahoo/MySpace, we see that BlogPulse and Technorati are .998 correlated to each other. BlogPulse counts .42 posts for every one Technorati post; this is close to the expected ratio of their blog posts added per day: .47 (757k/1.6m). BlogPulse is a little more closely related to Google (r = .963) than Technorati is (r = .957). With Technorati's data, I threw out the "my X" references. Evidently Technorati does not do exact-phrase searching, so searching for "my cat" brings up references to just "cat", a far greater number.
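For anyone who wants to check this kind of figure, here is a minimal sketch of the calculation. It uses only a handful of rows from the table above (an arbitrary subset, chosen to keep the example short), so it will not reproduce the exact values quoted for the full data set. The names `rows` and `pearson` are made up for this illustration.

```python
import math

# (BlogPulse 180 days, Technorati) counts copied from the table above.
# The subset of rows is arbitrary and only illustrative.
rows = {
    "New York": (1_011_093, 2_479_844),
    "London": (549_443, 1_385_677),
    "iPod": (423_858, 1_156_779),
    "CNN": (140_914, 331_607),
    "Harvard": (80_695, 184_328),
    "Oprah": (49_375, 136_025),
    "Yale": (29_972, 68_802),
    "Ann Coulter": (11_539, 23_379),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


blogpulse = [bp for bp, _ in rows.values()]
technorati = [tech for _, tech in rows.values()]

r = pearson(blogpulse, technorati)
# BlogPulse posts counted per Technorati post, over the whole subset
ratio = sum(blogpulse) / sum(technorati)

print(f"r = {r:.3f}, BlogPulse per Technorati post = {ratio:.2f}")
```

Running the same function over every row that is kept (and over the Google columns) is how the other correlations in this section could be recomputed.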

The correlation data tells us that the relationships are linear beyond a doubt. To make any predictions, though, we also need to know how widely the per-term ratios vary, which is what the standard deviation gives us.

  • Technorati to Google posts: average of .462, stdev of .155
  • BlogPulse to Google posts: average of .227, stdev of .118

Based on the stated data from BlogPulse and Technorati, and the averages above, Google is likely indexing 3.4 million blog posts a day (though we'd have to add that, based on Technorati's numbers, we have 95% certainty that Google indexes between 2 million and 14.5 million blog posts per day).
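The 3.4 million figure can be reached with simple division. The sketch below is one reading of that arithmetic, assuming the daily rates quoted earlier (757k for BlogPulse, 1.6m for Technorati) and the two average ratios listed above; averaging the two implied rates is my assumption about how they were combined. The function name `implied_google_rate` is hypothetical.

```python
# Figures quoted in the article
TECHNORATI_POSTS_PER_DAY = 1_600_000
BLOGPULSE_POSTS_PER_DAY = 757_000
TECHNORATI_TO_GOOGLE = 0.462  # average Technorati/Google ratio
BLOGPULSE_TO_GOOGLE = 0.227   # average BlogPulse/Google ratio


def implied_google_rate(service_rate, ratio_to_google):
    """Scale a service's daily post rate up by its average ratio to Google."""
    return service_rate / ratio_to_google


from_technorati = implied_google_rate(TECHNORATI_POSTS_PER_DAY, TECHNORATI_TO_GOOGLE)
from_blogpulse = implied_google_rate(BLOGPULSE_POSTS_PER_DAY, BLOGPULSE_TO_GOOGLE)

# Assumption: the two implied rates are simply averaged
google_rate = (from_technorati + from_blogpulse) / 2

print(f"Implied by Technorati: {from_technorati:,.0f} posts/day")
print(f"Implied by BlogPulse:  {from_blogpulse:,.0f} posts/day")
print(f"Average:               {google_rate:,.0f} posts/day")
```

Both services imply a rate between 3.3 and 3.5 million posts a day, which is why the two sources can be treated as roughly agreeing.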

Search engine counts are known to be flaky. As Search Engine Watch editor Danny Sullivan wrote in 2005, "Search engine counts are never something you should depend on, a topic we've discussed many times before. Still, if you're going to get a count, it's nice if it doesn't seem to change much or simply seem absurd depending on the query you do."

For example, Google may report the number of references to "New York" in the last six months as anything from 20 million to 142 million depending on how one searches. Yet the numbers from BlogPulse and Technorati are well correlated, so we can come up with an estimate for the number of posts on New York: 4.9 million posts, or 27,000 a day. Searching the last 24 hours and sorting by date, we get 23 million. Sorting by relevance, we get 21,000, far closer to our projection. (Granted, with the wide standard deviation, it is still a bit of a crapshoot: we have 95% confidence that the number is between 740,000 and 22 million. Still, that's tighter than Google's estimates.)
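The New York projection follows from the same ratios. Here is a rough sketch, assuming the estimate is the average of the two scaled-up counts (my reading of the arithmetic) and using the 180-day counts for "New York" from the table. The variable names are made up for this example.

```python
# 180-day counts for "New York" from the table
BLOGPULSE_NEW_YORK = 1_011_093
TECHNORATI_NEW_YORK = 2_479_844

# Average ratios of each service's counts to Google's, as listed earlier
BLOGPULSE_TO_GOOGLE = 0.227
TECHNORATI_TO_GOOGLE = 0.462

DAYS = 180

# Scale each service's count up to a Google-sized estimate
from_blogpulse = BLOGPULSE_NEW_YORK / BLOGPULSE_TO_GOOGLE
from_technorati = TECHNORATI_NEW_YORK / TECHNORATI_TO_GOOGLE

# Assumption: the two estimates are averaged to get the projection
projected_posts = (from_blogpulse + from_technorati) / 2
projected_per_day = projected_posts / DAYS

print(f"Projected posts in 180 days: {projected_posts:,.0f}")
print(f"Projected posts per day:     {projected_per_day:,.0f}")
```

Under these assumptions the projection comes out at about 4.9 million posts, or roughly 27,000 a day.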

Conclusions

Google indexes the most blog posts: over twice the rate of Technorati, and four times more than BlogPulse. It is unknown whether Google has a similar language/national origin distribution as the other services.

For large numbers of blog results (> 1 million), Google's stated estimates fit the pattern of being completely unreliable. However, by extrapolating from the BlogPulse and Technorati numbers, and checking against smaller daily search numbers, a relatively accurate number can be supplied.

It is unnecessary to tell which is the most accurate count. For search terms which draw between 100 and 1000 blog posts a day, all of the services statistically correlate to each other. The variance could likely be refined (though I have nothing to compare it to) with more data, and more segmentation. With more data we could draw sharper conclusions in this area.

I assume most of the three vendors have run their own internal analyses comparing their metrics with their competitors'. We should all be curious what they found.

The key differentiator in choosing a blog buzz measurement tool is the openness of the querying interface. At the moment, the advantage goes to Google due to the openness of their data. (I am unsure whether BuzzMetrics would grant the same visibility into the historical data for independent researchers.) If Google could improve its numbers and its analytical tools (no doubt it is working on such a thing), it could well provide formidable competition to BlogPulse.


Update, November 14: I spoke to Max Kalehoff, the VP of Marketing at BlogPulse, a month ago, after I wrote this (I spent the rest of the month finishing up the TimesSelect series). Max confirmed for me that their technology is set up in such a way that they only have access to the last 180 days' worth of data (he didn't know whether the prior data was buried in Iron Mountain somewhere). Also, he stressed that BlogPulse takes care to filter spam blogs out of their list, and they primarily cover English-language blogs. (Note: Kalehoff has since left BlogPulse.)

Further study needs to be done to understand whether Google's 4:1 advantage is due purely to spam and foreign-language blogs.