In the previous section, we suggested that the Times, or any other newspaper, could well offer a premium service that allowed for perks like ad-free viewing and unmoderated discussion posts. Charging for content, on the other hand, has the effect of reducing the visibility to new audiences.
Some observers, without having done any calculations, assumed that TimesSelect, rather than merely reducing the paper’s influence, represented an absolute closure. Jay Rosen, in his PressThink blog, wrote: “Times agrees to drop Times Select, which was a barrier to Google–and the blogosphere–working the right way.” Barbara Quint, in her otherwise excellent article in the trade publication Information Today, similarly remarked: “The opening of content also allows blogs, social networks, and other online sources to link to NYTimes.com articles and draw Web users to the site.”
Both observers are correct in pointing out the increasing role of the new gatekeepers – bloggers and search engines – in guiding readers to the content. We have already demonstrated (Part 1) that TimesSelect was not a real barrier at all to the blogosphere. It hardly stopped bloggers from citing the Times columnists; in fact, the references from bloggers did not drop as much as the direct readership did. Furthermore, any decision to not link to TimesSelect was purely due to the personal biases of a blogger.
Search, on the other hand, is much more important; Vivian Schiller cited it in her numerous interviews.
The ranking of search results in Google involves many more inputs than the effects of bloggers and thus may be more immune to the biases of individuals. But whether something is listed at all – such as the Times archives – had more to do with Google’s policies than it did the Times’s. It was a mistake to think that undoing TimesSelect was the only way to get Google web search to index the NYT archives. This will be examined in depth below.
Granted, there are other search engines: Microsoft Live Search, Yahoo, Ask. Google remains of prime interest since, according to Compete.com’s data, the number of referrals from Google to the Times doubled in the last two years, while the share of referrals from the other search engines has decreased.
Some of the success could be attributed to Google News (Compete’s researcher could not help me on the number). NewsKnife, an analytics project, has calculated that Google News ranked NYT stories first in 2004; second (under ABC News) over 2005 and 2006; the year-to-date rankings for 2007 show the Times as being #1 again. In May, WebProNews reported: “News sites receive between 1.18 and 2 percent of their traffic via Google News according to LeeAnn Prescott of Hitwise.” For the Times, the number is likely at the high end of this range, and perhaps beyond it. Otherwise, our main study is of the standard Google web search.
The Times Meets Google
For many years the Times had been virtually invisible in the Google web search rankings. Adam Penenberg’s 2004 experiment for Wired found that a search for Iraq torture prison Abu Ghraib didn’t bring up a result from nytimes.com until #295 (It’s now #24).
This began changing in February 2005, when The New York Times Company acquired About.com for $410 million. The purchase may have seemed odd to traditional media observers since About was nothing like a news company. It provides online guides to places and topics web searchers might be interested in. For example, search for nutrition in Google. The top four sites include two US Government sites, the Wikipedia article on the topic, and the American Society for Nutrition. The fifth site is the top “.com” on the list – the About Nutrition page. This page is different from the rest – it includes up-to-date editorial content and advertisements: 2 graphical ads, and 8 text ads from Google.
The reason this page is ranked so highly is that About had years of experience in Search Engine Optimization (SEO), the technique of improving search rankings for a given term. Buying About also brought in SEO experts like Marshall Simmonds, now the vice president for search engine marketing for the NYT.
“Marshall’s first step was to allow search engine crawlers to have complete access to everything published by the Times, including archived content dating back to 1981,” Chris Sherman wrote in a June 16, 2006 Search Engine Watch feature.
This is curious: if the search was open then, why were people claiming that the archives would now be accessible? I called up Simmonds to ask if he could clarify. He said that two years ago, nytimes.com opened up the content to the Google News Archive, not the web search. So I did a little test to demonstrate this. If you are looking for a particular quote about Noam Chomsky from January 1, 1980, you won’t find it in the main Google web search, but you will find it in the Google News Archives search. Only by searching for text in the title or abstract will you find the article in the main Google web search.
Google and Registration News Sites
A news site that requires a user to register typically shows a registration screen to an unregistered visitor. Yet it needs to show the real content to a search engine in order to be indexed. The same sort of trick is used by devious webmasters wishing to game their search results; this has been dubbed cloaking. SEO maven Danny Sullivan, the editor of Chris Sherman’s article, expanded on cloaking in a follow-up post to it, noting many examples of legitimate news websites that followed the practice. The confusion was semantic, Sullivan explained: Google’s original policy had been punishing the practice of cloaking, rather than, as he thought would be sensible, a website’s intent to deceive (through cloaking). This seems largely esoteric, but in a followup article on the topic from last March, he added this unique insight: “I did have several off-the-record conversations with Google about this. The main thing that came out that I can report was that Google really felt most users should see what their spiders saw WITHOUT having to register or pay for access.”
This is fully in line with Google’s mission: “organize the world’s information and make it universally accessible and useful.” On the other hand, it’s not necessarily in line with what content providers want.
For registration-based sites, Google offers two possible solutions: first click free and subscription designation. In the former, the publication must offer the whole text of the article free to the reader upon the click from Google’s search results. If the publication can’t accommodate that, Google will designate the publication “(subscription)” in the search results. (Though this rule appears to apply only to Google News.)
The first option is used by publications like The New Republic, but it is so exploitable that one must question the whole premise of it. Want to read an article for free in a recent issue? Enter through TNR’s front door and you get redirected to this abstract at the paywall. But search for the article in Google, and enter from there, you will have your “first click free” to read it.
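To see why the arrangement is so easy to exploit, here is a minimal sketch of how a "first click free" check might work. It is not TNR's or Google's actual implementation; the function name `page_to_serve` and the rule of trusting the HTTP Referer header are assumptions made for illustration.

```python
from urllib.parse import urlparse


def page_to_serve(referer, is_subscriber):
    """Decide which version of an article one request receives.

    Hypothetical logic: subscribers always get the full text, and so
    does anyone whose request claims to arrive from a Google page.
    """
    if is_subscriber:
        return "full"
    # The Referer header may be missing entirely (e.g. a typed-in URL).
    host = urlparse(referer or "").hostname or ""
    if host == "google.com" or host.endswith(".google.com"):
        return "full"      # the "first click free"
    return "abstract"      # everyone else lands at the paywall


# Front-door visitor versus a visitor arriving from a Google result.
front_door = page_to_serve("https://www.tnr.com/", False)
from_google = page_to_serve("https://www.google.com/search?q=article", False)
```

Under these assumptions the only thing separating the paying reader from the free one is where the click appears to come from, and the Referer header is supplied by the reader's own browser.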
Other publications are evidently included in Google’s main index without the “subscription designation.” Search a phrase like “the shoals of semantics” – which heretofore was only mentioned once on the entire web, in a book review in the May 1998 Science. If you are not a subscriber to that journal, you would not see the full article when you click the link. Still, Google does not slap any subscription warning on the result.
Thus it wasn’t a technical matter of Google not being able to index “paywall” content. They have, and they continue to do so. It was Google’s own policy that created another artificial wall to the data. It’s not that people couldn’t search in the Google News archives, it’s just that the additional clicks to get there (versus the profusion of initial web results) greatly reduced the number of people who would search it. On May 16th of this year, Google announced that it would, in fact, break down this wall by introducing their Universal Search approach: “Beginning today, the company will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results.”
The Times could have waited for Universal Search to bring their archives into the main search, or they could have asked for the same treatment that subscription journals were getting. But, in the court of public opinion, if the Times is paper then Google is scissors, and scissors beat paper every time. (The rock may well be the government.) Google would win in the court of blogger opinion because it is free and the Times was not, just as Google trumps the ISPs and the telcos in the public network neutrality debate.
Sam Zell, the real estate billionaire who bought the Tribune Company last spring, had a sense that something was amiss. In comments to a Stanford Law School class, Zell went as far as saying that Google was stealing newspaper content – a claim which endured much derision in the blogosphere. Had he instead said something to the effect of “the current arrangement between Google and the newspaper companies is more one-sided than people think,” he would have signaled that he was on to something.
(The comments from Zell were first reported by the LA Times, which has been owned by the Tribune Company since 2000. Six months later, the original article link is dead; it is only accessible via this abstract from the LA Times’s archives, as served by ProQuest. Incidentally, if we search Google for a phrase from the article – “The title of Zell’s talk” – neither the web search results nor the News Archives brings up archives. Instead, it brings up a copy of the article, pasted into the LATimes Pressmen’s Forum. By comparison, the Ask.com search results bring up the article from several of the Tribune Company’s subsidiaries – many of which have also since been removed. Clearly ProQuest and the Tribune Company still need to have the conversation with Google that Zell said needed to happen six months ago.)
Google may need the Times, but the Times is starting to rely on Google even more. Marshall Simmonds told me that 25% of the traffic to nytimes.com comes from all search engines. The growth numbers for average monthly visits over the last two years (27% according to Nielsen/NetRatings; 133% according to the internal Times numbers – these need to be reconciled) may be due in large part to the existing SEO work. Granted, much of the growth (87%) happened in the first year due to the SEO being unoptimized at the outset.
I tried a number of big items in the news which I had figured the NYT would own given its reporting over the last several years, for better or worse: FBI surveillance, Wen Ho Lee, Iraq war, Judith Miller, housing mortgage crisis, CEO executive pay (this last one has been portrayed by magnificent two-page layouts at some regular frequency in the Sunday Business section). Most of them came up with msnbc.com and cnn.com. For the 2006 voting results, where the Times put together an incredible Flash presentation, they came in at #10. It’s so good that it has no room for ads.
I then tried some NYC-centric terms: central park, US Open tennis, lincoln center, New York opera, the Met, New York pizza, New York hot dog, Atlantic yards, New York Giants, Broadway, Times Square, Bronx Zoo, Harlem, Greenwich Village, New York subway, New York food, MOMA, Grand Central Terminal, Madison Avenue, Lower Manhattan… in none of these is nytimes.com in the top ten search results (New York Magazine, CitySearch, and other niche sites dominate these rankings). Ok, the NYT is #1 for New York Real Estate and #4 for New York Music and #8 for Second Avenue Subway and a recent article snuck up to #8 for Washington Square Park. A search on new york water tunnel 3, which Wikipedia describes as “the largest capital construction project in New York City’s history” comes up #9 – for a theater review of a 1998 one-woman show about the project.
[These numbers above do not prove anything in and of themselves. I put them here in order to repeat Penenberg’s experiment so that a reader in the future can compare results to today.]
Here’s a new experiment. I wondered whether it may happen that somebody subconsciously interested in real estate might run a search on a related topic in the news. I ran a search on Leona Helmsley, the infamous “Queen of Mean” real estate developer who died two months ago. The #4 and #5 results come from the City Room blog. They carry exactly one advertisement, a 3-inch box from Dunkin Donuts informing me that they have a new sandwich for “bacon lovers” (just the association I have for a Jewish grandmother like Helmsley), and an internal ad for the Real Estate section. And that’s it. There’s a lot of blog paraphernalia on the side at the bottom, which crowds out the room for ads.
Returning to Google, I pass many obituaries from the other papers (the Telegraph and the Independent of London, the Washington Post), before link #12 shows me the official Times obituary. This has the box for Dunkin, but also a banner ad for Ford, something from The New York Times Store, Yahoo! HotJobs and Banana Republic, and a few Google ads at the bottom, which helpfully point me to information on living wills. And, as it happens, the article is chunked into three, so I have to click two more pages full of new ads. I see ads for Conrad resorts and British Airways. Even more improbably, a “real” article gets a sponsored “article tools” box (print, single page, shared bookmarking via Digg/Facebook/Newsvine), while the blog post is missing that. The “print” reader encounters a dozen ads; the blog reader, two.
How about Brooke Astor, another New York institution who recently passed away? The City Room blog’s obit is #4, while the official obit is #6 & #7. Phil Rizzuto’s Times obituary is #5; the blog obit may not have made the list because, according to Technorati, only one person linked to it.
I asked Simmonds point-blank whether the blog obit or print obit should come up first in search results. He didn’t think either one would necessarily come up: “We optimize everything equally.” Granted, if a blog is “supposed” to be linked to, more than a regular article (under the norms of blogging), that might result in its getting rewarded with a higher ranking. Though in the case of Helmsley, the numbers of inbound links counted by Technorati are 56 and 24 for the blogs, while the obit had 46.
How important is the placement of search rankings within the top ten? A year ago, AOL released 658,000 records of search results for research purposes (and then admitted their error and pulled the data down). Internet consultant Richard Hearne did an analysis of the data and found that 42% of AOL’s users clicked on the first link. The difference between result #4 and result #12 was a factor of ten.
I asked Simmonds whether he had considered the ad-richness and click-through value in his work. He explained that he hadn’t worked on that at all; that was another department. So I leave these as open questions to the online publishing division of the Times: Does the volume make up for the lower ad content in blogs? Or is it a missed opportunity?
With the access status of the Times archives from 1987 onwards changed to Google’s satisfaction, 1.9 million articles have now been indexed.
Will this indexing of the archives bring in search riches? Many of the past calls for opening the archives were issued without any solid financial projections. In 2005, NYT Digital CEO Martin Nisenholtz told Mark Glaser in the Online Journalism Review: “There’s no analysis to show that Google AdWords gets you anything close to what we make on archives on the Web — never mind all the money we make on the after-market sales. It’s so ridiculous as to be laughable.”
The Times uses more than Google AdWords; they show paid ads on the archive articles. According to the NYTimes.com ad rates, one banner ad and two “Big Ads” (typical on an article page) bring in 10 cents a page view. Making a rough estimate that the average archive article is chunked into two parts, perhaps each archived article could be worth 20 cents in ads. Since a single article used to be $3.95 for non-subscribers, they are now banking on 20 people reading an article for every one who had bought it before. Then again, the second Big Ad isn’t being sold (because, we presume, archive hits have lower click-through rates). So maybe the Times needs 33 readers.
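The back-of-the-envelope arithmetic above can be sketched in a few lines. The 10 cents a page view, two pages per article, and $3.95 sale price come from the estimates in the text; the 6 cents a page view used for the case with the second Big Ad unsold is an assumption, back-derived from the "33 readers" figure rather than taken from any published rate.

```python
def readers_to_break_even(sale_price_cents, ad_cents_per_page, pages_per_article):
    """Free readers needed to match the revenue from one paid article sale."""
    ad_cents_per_article = ad_cents_per_page * pages_per_article
    # Ceiling division on whole cents, so no floating-point rounding.
    return -(-sale_price_cents // ad_cents_per_article)


# All three ads sold: 10 cents a page, two pages, against a $3.95 sale.
all_ads_sold = readers_to_break_even(395, 10, 2)

# Second Big Ad unsold: ASSUMED 6 cents a page (not a published rate).
one_big_ad_unsold = readers_to_break_even(395, 6, 2)

print(all_ads_sold, one_big_ad_unsold)  # prints "20 33"
```

The break-even count is sensitive to the per-page figure: losing four cents a page raises the number of readers needed by about two thirds.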
As for the “aftermarket sales” which Nisenholtz referred to: In 2004, Adam Penenberg reported that LexisNexis alone was bringing in $20 million a year to the Times (other archive clients are Dow Jones’s Factiva, ProQuest, HighBeam, Thomson Gale). Glaser, in his analysis, had added, “While there’s no stipulation in its database contracts for NYTD to keep archives behind a wall, Nisenholtz realizes that making archives free online would erode their value in other places.”
Barbara Quint, in her article on the demise of TimesSelect, diligently followed the demand chain. LexisNexis and ProQuest still believe that they provide a value-added one-stop-shop service; they provide horizontal search across many other publications. But some of their customers, the academic librarians who pay up to $8,000 for access to ProQuest, are wondering whether it is worth it to pay for redundant archives. Quint cited a “lively discussion on the lib license-l list” initiated by Ann Okerson of Yale University. Peter Hirtle, the Intellectual Property Officer at Cornell University Library (speaking for himself) responded: “We should rely on vendors to provide us with access to copyrighted material that is unlikely to be freely available on the web.”
Search Engine Oppression
In the rush to open the archives (and celebrate the opening thereof), there’s one more point that a number of observers missed. A couple of months ago, NYT Public Editor Clark Hoyt mentioned in his column the plight of Allen Kraus. A Google search on his name brought up, as its first link, a Times Topic page with an article from sixteen years ago: “A Welfare Official Denies He Resigned Because of Inquiry.” Kraus’s current web page for his consulting practice was buried in the search results.
Bringing Kraus’s web page to #2 turned out to be pretty easy (I accomplished that with the help of two other people.) But as I was investigating, I noticed that a story about the welfare inquiry gave the names of six women who were arrested. The full article was absent from the Google Web search archives until September 26. So, of the six women arrested sixteen years ago, this is how Google searches on their names turn out: two now see the Times story as the #1 link for their names; two see it in the top ten results; one sees the link as #14, and the other has a fairly common name. Neither the Times nor any other newspaper reported on the outcome of the charges. [I don’t see a pressing reason to link to the results or the names here.]
The count of articles in the Times going back to 1987 containing the word arrested is 55,000, though a count of the articles in the last few days shows that no more than a third were in the context of local arrests. If any of them has a complaint, they should bring it to the Public Editor…
We haven’t answered the question of whether the enhanced searchability will contribute sufficiently to help the NYT sustain 21% online advertising growth. We’ll know in January how the 4th quarter turned out.
Bookending the TimesSelect era were statements of the sort that the paywall was such an abominable idea that it shouldn’t catch on. Other laudable achievements were overlooked. When TimesSelect was announced, Martin Nisenholtz explained to attendees at IDG Syndicate: “For 10 years you’ve been asking for seamless access to the archive, and now we’ve given that to you.” By seamless, I understand Nisenholtz to mean that the old links don’t “rot” – they’re always valid. Compare that to the LA Times, which kills its old links, and perhaps thousands of smaller papers and magazines which have committed to paywalling their archives and haven’t set up the same level of seamless access. Between the Online Publishers Association, the Online News Association, and Google News, this ought to be better coordinated.
For the last part, we’ll return to the hypothesis of whether some of the Op-Ed columnists were taken out of the “conversation” as many had warned. As per our numbers in Part 1, some of the columnists proved to be immune from the effects of the paywall. We’ll seek to answer why.