In recent months, there has been a heightened awareness about privacy risks online. Facebook's Beacon program, which had pulled in information about users' online purchases on partner sites, was roundly criticized before the company backtracked. The Google/Doubleclick merger brought intense attention from the U.S. Congress and privacy groups, but the Federal Trade Commission went ahead and approved the deal this week. The attention that these companies attract is undoubtedly due to their reach and virtual omniscience: Google knows where people surf, and Facebook knows who people are.
The focus on the major companies may crowd out necessary concerns about third-party tracking software on smaller websites and blogs. The well-trafficked blogs have parceled out their on-screen real estate to a variety of content partners. A blog may embed videos from Google's YouTube; it may carry advertisements from Google AdWords, or Amazon, or BlogAds or any other ad network. Many popular bloggers display an affinity badge, whether indicating a publishing network (like Federated Media), or licensing framework (like Creative Commons) or a political cause. Other “web 2.0” add-ons report and retrieve metadata with sites like Technorati, Digg, etc. Lastly, traffic counters are extremely popular; these are used for the express purpose of tracking user navigation. All of these components are hosted on third-party sites. When the first party (a user) accesses the second party (the website), their web browser makes additional calls to these third-party sites.
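To make those additional calls concrete, here is a minimal sketch in Python that lists the outside hosts a browser would contact when loading a page. It is only an illustration: the sample page, the host names (all under reserved example domains) and the helper names `ResourceCollector` and `third_party_hosts` are my own assumptions, and a real browser fetches more kinds of resources than the four tags checked here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceCollector(HTMLParser):
    """Collect URLs of resources a browser typically fetches on page load."""

    # Tag name -> attribute that holds the resource URL (an illustrative subset).
    AUTO_FETCHED = {"script": "src", "img": "src", "iframe": "src", "embed": "src"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.AUTO_FETCHED.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def third_party_hosts(page_html, site_host):
    """Return the sorted host names, other than site_host, that the page pulls from."""
    collector = ResourceCollector()
    collector.feed(page_html)
    hosts = set()
    for url in collector.urls:
        host = urlparse(url).hostname  # None for relative URLs such as /images/x.png
        if host and host != site_host and not host.endswith("." + site_host):
            hosts.add(host)
    return sorted(hosts)


# Hypothetical page: one counter script, one badge, one embedded video,
# plus two resources served by the blog itself.
SAMPLE_PAGE = """
<html><body>
<img src="/images/header.png">
<script src="http://counter.example.net/track.js"></script>
<img src="http://badges.example.org/badge.gif">
<iframe src="http://video.example.com/embed/123"></iframe>
<script src="http://static.blog.example/site.js"></script>
</body></html>
"""

print(third_party_hosts(SAMPLE_PAGE, "blog.example"))
```

Each host in that list receives its own request from the reader's browser, and with it the reader's IP address, whether or not the reader ever notices the badge or counter on the page.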
Few non-institutional blogs supply a privacy policy to their readers. Casual web use is likely governed by an implicit understanding that a commenter posting under a made-up name will not have identifying information such as IP address revealed by the blog publisher. (Though some bloggers may reveal an IP address to spite a misbehaving poster – see Comment Management Responsibility.) In general, a commenter may have some idea that some third-party tool may be logging their page requests, and they might suppose that the publisher would have the good sense not to sign up with sites that have weak privacy policies. But the commenter may well be shocked to find out that their IP data is publicly available to anybody else on the Internet. I will demonstrate below.
An IP address is not sufficient to identify a user, but it can be a helpful clue. It can usually identify the locality the computer is in. The more datasets there are which link IP addresses to behavior, the greater the likelihood of exposure. Most personal email messages contain the originating IP (one well-known exception is Google's Gmail service). Website operators generally know which addresses are logging into their website, and they can often trace these to users if they post or otherwise log in.
I have conducted a small experiment with third-party tracking data. I do not have any formal training in computer security, and I have been called “hacker” probably less than I've been called a “blogger.” But I have done penetration testing on websites before; most of hacking is about knowing where to look.
AN EXPERIMENT
Site Meter and eXTReMe Tracker are two popular website traffic monitoring tools. They encourage publishers to display their icon on their websites, so that readers may get inspired to adopt them on their own sites. (Other trackers, such as Google Analytics, work invisibly.) Extreme Tracker is fairly simple: it tracks the page visit, along with the referring page. SiteMeter is more powerful: it not only captures every page visit, but every click on the page, including those outbound.
These tools also make some amount of information public, which is of interest to advertisers and readers. Yet their default privacy setting lets outsiders view the details as well. Extreme Tracker shows visits and IP addresses; SiteMeter allows a guest to view the list of all the pages accessed during a visit. A user posting a comment on a blog using MovableType leaves a telltale signature: the blog page is reloaded after the comment, and the time of that second page view matches the time that the comment visibly appears on the blog.
On Sunday, I ran an experiment. Would it be possible for me to determine the IP addresses of commenters to another blog? I picked a popular blog run by a law professor and found a post related to some hot-button issue in the presidential primaries. It was getting an extraordinary number of comments, most from outsiders who weren't regular readers of the blog. As such, many of the names were pseudonyms. I took no interest in the substance of the comments, other than to get a rough idea of what side of the argument they were making.
For paying subscribers, such as this blog, SiteMeter provides a list of the last 4,000 site visits; I looked for those visits at or before the time of a comment post, to see if the visitor viewed multiple pages. I then clicked on the details to see whether they had viewed the blog post in question multiple times, and then matched up the times. From Saturday afternoon to Sunday morning, there was rarely a case where more than one person was reading the page at the same time. It took me a couple of hours to trace 10 commenters and determine with a fair degree of confidence what IP address each commenter was using. One posted multiple times under different names. It did not appear that he was trying to be a “sock puppet” (that is, pretending to be separate, autonomous users), but was instead merely using the name field as a continuation of his thoughts. (As a software designer, I never cease to be amazed by the myriad ways in which people choose to use the software!) Somebody reading the thread closely would easily see that it was the same user; in fact, he was later chastised for using multiple handles.
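I did this matching by hand, but the procedure is simple enough to sketch in a few lines of Python. Everything below is hypothetical: the names, the timestamps, the 60-second window and the addresses (drawn from ranges reserved for documentation) are assumptions for illustration, not data from the experiment, and this is not how SiteMeter itself works. The sketch only accepts a match when exactly one visitor loaded the post around the moment a comment appeared.

```python
from datetime import datetime, timedelta


def candidate_visits(comment_time, page_views, window_seconds=60):
    """Return the IPs that loaded the post within the window around comment_time."""
    window = timedelta(seconds=window_seconds)
    return sorted({
        ip for ip, viewed_at in page_views
        if abs(viewed_at - comment_time) <= window
    })


def match_comments(comments, page_views, window_seconds=60):
    """Map each commenter name to an IP, or None when the match is ambiguous."""
    matches = {}
    for name, posted_at in comments.items():
        candidates = candidate_visits(posted_at, page_views, window_seconds)
        # Only trust the match when a single visitor fits the time window.
        matches[name] = candidates[0] if len(candidates) == 1 else None
    return matches


# Hypothetical comment timestamps as displayed on the blog post.
comments = {
    "anon_a": datetime(2000, 1, 1, 21, 15, 5),
    "anon_b": datetime(2000, 1, 1, 22, 40, 30),
    "anon_c": datetime(2000, 1, 1, 23, 5, 0),
}

# Hypothetical (ip, time) page views of that same post from the public log.
page_views = [
    ("192.0.2.10", datetime(2000, 1, 1, 21, 3, 12)),    # first read
    ("192.0.2.10", datetime(2000, 1, 1, 21, 15, 7)),    # reload after commenting
    ("198.51.100.7", datetime(2000, 1, 1, 22, 40, 31)),
    ("203.0.113.5", datetime(2000, 1, 1, 23, 4, 40)),
    ("203.0.113.99", datetime(2000, 1, 1, 23, 5, 20)),
]

matches = match_comments(comments, page_views)
for name, ip in matches.items():
    print(name, ip if ip else "ambiguous")
```

The third commenter comes out ambiguous because two visitors were on the page at once, which is why the quiet overnight hours made the manual version so tractable.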
I emailed the blogger/professor (whom I'll call G) and shared my findings. We had met before and discussed similar issues, so I wasn't coming to him cold. G thanked me for my work, but declined to confirm the data, citing the readers' privacy. I respected his decision, though I pointed out to him that it's his readers who should want to understand the true nature of the potential security breach more than I. (For my part, I have deleted my notes and emails with the matching IP addresses.)
[G also requested that I not give too many identifying details about the blog. I got the sense, over numerous emails this week, that G would inform the blog readers, reconsider the site's privacy setting with SiteMeter, and also comment on his/her blog about the larger issues. Update: Shortly after I posted this, Daniel Solove revealed himself as the blogger.]
EXPLANATION
Fortunately, the exposure is quite limited. Further investigation showed that Site Meter only gives out the full IP address for visits to the sites of premium subscribers (who pay $5.95/month). Curiously, one sees that many of the popular (and profit-seeking) blogs – such as DailyKos and those from Gawker Media – don't pay a dime. Since they don't pay, only the last 100 visits are monitored. (Patrick Ruffini, a political blogger, alleges that this causes popular blogs to inflate their traffic numbers, since during high traffic periods, a busy visitor would be counted multiple times.)
Here's what is seen for a non-paying subscriber site. The last octet (the final 8 bits) of the IP address is obscured. The visitor can still be placed on a particular part of a network.
And here's the full IP address shown for a paying subscriber site. (This has not been culled from G's data, but from another blog which hasn't had any updates since February.)
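The difference between the two displays can be sketched in Python with the standard `ipaddress` module. This is an assumption-laden illustration rather than Site Meter's actual code: the `#` placeholder, the function names and the sample address (from a range reserved for documentation) are all mine.

```python
import ipaddress


def mask_last_octet(ip_text):
    """Hide the final 8 bits of an IPv4 address behind a placeholder."""
    address = ipaddress.IPv4Address(ip_text)  # raises ValueError on bad input
    octets = str(address).split(".")
    return ".".join(octets[:3]) + ".#"


def enclosing_block(ip_text):
    """Return the /24 network that a masked address still reveals."""
    return ipaddress.ip_network(ip_text + "/24", strict=False)


full_address = "192.0.2.77"  # hypothetical visitor
print(mask_last_octet(full_address))
block = enclosing_block(full_address)
print(block, "covers", block.num_addresses, "addresses")
```

Masking the last octet narrows a visitor down to one of 256 possible addresses, which is often enough to name the ISP or institution but not the individual machine.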
G contacted customer support, and was told that they couldn't engineer a mask that would be in effect only for visitors. So they gave G three alternatives: shut off outside viewership of the logs; stop paying for the premium account; or await the next version, which the company has been dropping hints about recently. [G is presently considering these options.] We were stunned by the irony here: pay the company money, and you reduce the effective security of your data.
Given how few of the major blogs appear to be paying, there's probably only a small risk of a third party setting up a scanner project to see what it could scoop up from the exposed data. But there's a larger practical risk. Review the summaries of online harassment cases that I had researched through my work on Civilities. In each case, an aggressor hid behind an anonymous name to verbally harass people. Also, in each case, the aggrieved party wanted to face their harassers in public. They did not think it fair to be publicly attacked without a chance to fight back equally (a moral point rarely considered by free speech absolutists). Would any of these parties be justified in using creative means to discern the IP sources of their harassers? For example, while the Cahill decision in Delaware is celebrated for protecting anonymity online, what few people realize is that the targets of the anonymous harassment, the Cahills, ultimately triumphed. They learned the address of the host computer – and its owner – due to sloppy paperwork handling by the opposing lawyers. They pressed a defamation lawsuit against named defendants, and agreed to a settlement following the admission in court from the woman who had posted the harassing messages.
RECOMMENDATIONS
The libertarian spirit of the Internet has engendered support for anonymous posters and anonymous forums (and they are certainly much more fundamental in non-free societies). But it is the libertarian philosophy which demands that the burden of responsibility fall on individuals. If anonymous speech is desired, then it is the responsibility of blog publishers to understand how to fully protect it. In practice, this libertarian spirit often places greater emphasis on rights over responsibilities. For example, the Electronic Frontier Foundation has published a guide to Blogger Rights, but there's been little effort on the responsibility side.
Personally, I believe in transparency; in general, a community where everyone is identified by their full name promotes more civil discourse. But I believe that transparency of contracts is even more fundamental.
Publishers should understand the contracts they make with their third-party tracking software, and should communicate them to their readers. Site Meter's privacy policy is quite thorough, while Extreme Tracker's privacy policy is threadbare. Site Meter clearly states that it “does not condone, support, or participate in any activities that could potentially gather, retrieve or store any Personal Information about or from internet users.” Whereas SiteMeter forbids any use on pornographic sites, Extreme Tracker has no such restriction. Nothing in its own policy, and nothing technical, stands in the way of Extreme Tracker associating traffic on one site with signed comments left on a completely different site.
Neither company gives much more information about its corporate structure. Site Meter does not list who runs the company on its website. Several other websites name David Smith as SiteMeter's creator (such as this 2002 interview), and he signs his name to blog comments at times. A number for Smith is given for Washington, DC, but the company today has a Los Angeles address. Extreme Tracker is much harder to track down from North America – it's in Amsterdam, and it doesn't list any of its officers, either.
[I emailed Tech Support asking to learn anything about who runs the company; I called the Washington, DC number, but nobody answered. “David Smith” is of course a very common name in America, and is the sort of pseudonym one would choose if one didn't want to bring attention to oneself. I am reserving this space to fill out the details of what I hear back. For a “social media” service, Smith and the Site Meter team keep a remarkably low profile. They were at BlogWorld Expo; I was not.]
Google, by contrast, is a global brand whose every movement is watched. In 2005, Google acquired the Urchin tracking software, and has branded it as Google Analytics. When you sign up, their user agreement is quite thorough about privacy data.
PRIVACY. You will not (and will not allow any third party to) use the Service to track or collect personally identifiable information of Internet users, nor will You (or will You allow any third party to) associate any data gathered from Your website(s) (or such third parties' website(s)) with any personally identifying information from any source as part of Your use (or such third parties' use) of the Service. You will have and abide by an appropriate privacy policy and will comply with all applicable laws relating to the collection of information from visitors to Your websites. You must post a privacy policy and that policy must provide notice of your use of a cookie that collects anonymous traffic data.
I'm experimenting with Site Meter as well as Google Analytics (hence the newly authored privacy policy.) Both services capture the same data; the apparent differentiator is how much of it they capture, and how they present it. But more web publishers should start seeing privacy policies & practices as a key factor in determining which service to use – and vendors should be pressured by the market to make it transparent to those publishers and their readers. I will discuss some possible technical approaches in the next piece, resurrecting some ancient but still-used privacy protocols…