Negotiating for your Social Data

This essay defines the concept of social data within social media software systems and calls on users to recognize that it needs to be made available for public research.

There is an interesting contract that we as individual users have with many of our Internet services. We get free use of their service for organizing our correspondence, our photos, our bookmarks, our subscriptions; they get complete use of the data we generate by using it. And how those services share their data may be informed by their corporate structure. Most services are not run as charities, though increasingly some are. As an example, the online user-contributed encyclopedia Wikipedia is now, with other open reference guides, under the umbrella of the nonprofit Wikimedia Foundation. Judging by their bylaws, they intend to invite users to be members and thus part of the governing process.

By usage data I mean that which is not the “public data” which the service manifestly provides (in Wikipedia’s case, that would be the content of the articles, the history of its edits). Traditionally, this is extremely sensitive to a company, as it may be used for infrastructure planning (how many servers does Google have? Tristan Louis supplied an estimate two years back), or for strategic planning about markets. On the other hand, there is operational data which must be of interest to society at large– I will call this social data. An example query: how many pages does Google censor for local jurisdictions? Seth Finkelstein has been posting his research and observations on Google for the last three years. With Wikipedia we might want to know the number of users who are monitoring anonymous submissions at any hour of the day; we might want to know what the mean elapsed time is for falsehoods to be corrected.

Recognizing Social Data

Once we recognize the concept of social data, we can understand how information service providers can take steps to either share it or obfuscate it for public review. Google does not clearly indicate which of its results are censored, which has made Seth’s research on the company (as well as his pioneering work in censorware description) all the more valuable. Wikipedia, by contrast, has made claims to the openness of its accountability; those became widely known following John Seigenthaler’s discovery of his biography’s errors. Founder Jimmy Wales has explained that one tool for accountability is the Recent Changes page, and that hundreds of people monitor it at any time. Querying with this tool at 3:30pm EST Sunday, when most of the people in Europe and the Americas were awake, I can see that 500 changes have been made in the last five minutes to the main entries (as opposed to the discussion and other tangential pages), or roughly 1.7 edits/second. At 12:40am I see that the pace slowed to 500 changes over 8 minutes, or about 1 edit/second. I could do some more, like separating the named users from the anonymous users, but only through regular expression parsing. And there is little other classification data to help me: what categories the entries are in, the nature of changes (beyond “major” and everything else), the time since the last change. Thus with regard to supplying social data, Wikipedia has a technology architecture that is not yet as open as its public contract suggests.
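The pace figures and the named/anonymous split mentioned above can be worked out roughly as follows. This is only a minimal sketch: it assumes that anonymous editors show up in the listing under a dotted IPv4 address rather than an account name, and the function names and sample editor strings are hypothetical, not drawn from any Wikipedia interface.

```python
import re

# Assumption: an anonymous edit is attributed to a dotted-quad IPv4 address.
IPV4 = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")


def is_anonymous(editor: str) -> bool:
    """Return True if the editor string looks like an IPv4 address."""
    match = IPV4.match(editor.strip())
    return bool(match) and all(int(part) <= 255 for part in match.groups())


def split_editors(editors):
    """Separate a list of editor strings into (named, anonymous)."""
    named = [e for e in editors if not is_anonymous(e)]
    anonymous = [e for e in editors if is_anonymous(e)]
    return named, anonymous


def edits_per_second(changes: int, minutes: float) -> float:
    """Average edit rate over an observation window given in minutes."""
    return changes / (minutes * 60)


# The two observations quoted in the text.
afternoon_rate = edits_per_second(500, 5)
late_night_rate = edits_per_second(500, 8)

# Hypothetical editor strings, for illustration only.
named, anonymous = split_editors(["ExampleUser", "192.0.2.17", "Another_Editor"])
```

Anything beyond this split, such as the category of an entry or the time since its last change, would need data that the listing itself does not supply.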

The social data of Google and Wikipedia is of particular interest as the sites themselves have been subject to increased scrutiny (GoogleWatch, WikipediaWatch). This is because of the perceptions of trust: given a term, one returns links and the other returns a normative article, and people have expectations of finding a starting point for answers, if not the answer. For example, a Google search for “Wikipedia Watch” brings up the website as result #42 (while the Wikipedia article on it comes up second).

What could be deemed social data is open to disagreement. The Federal Government has recently argued, through force of a subpoena, that the typical search terms and ensuing results are of interest to the Justice Department for investigating the probability of a minor stumbling upon pornography. Microsoft and Yahoo quietly complied; Google noisily resisted; Ask Jeeves was not asked (see SearchEngineWatch). On the other hand, there is social data that is willingly released by Internet services. David Sifry of Technorati releases a State of the Blogosphere twice a year with data from his service. Intelliseek provides the BlogPulse tools for analyzing its data. So there are competitive pressures to be open about certain types of data.

A brief definition for “social data” is that it is any data generated from social media software which is not private personal data. It is also data which in general tends to be accessible– mostly. For example, last year I put together the Social Media Scorecard. I picked thirty individuals/publishers and compared the Technorati links and the Bloglines subscriptions alongside what I thought were a couple of crucial metrics– the frequency of posts, and the initial date of the feed. You would not believe how obfuscated these last two statistics were; I had to dig through several formats of blog publishing software to determine what they were. Lack of concise, accessible data adds barriers of time and cost to research.

Seeking Social Data in Bloglines

My particular interests bring me to Bloglines, a service which manages subscriptions to RSS feeds from blogs and other sources. The social implications are not as easy to see as with search data. My interest is in trying to understand whether . I am not a full-time researcher; instead I am a programmer/analyst and I use what I find to direct further investigations or software tools development. If we have set a goal that the blogosphere and RSS-based services should promote a meritocracy of voices, and we find that they don’t, then we might want to consider how to change our tools. I investigated this at length last year in the New Gatekeepers series. This month I have returned to this topic in my research about the sustainability of RSS.

Last week I sent in a request to Bloglines customer service asking if they could provide some validation of a few data points about usage. The auto-generated response promised an answer within two days, but I have not received one since. I figured instead that I might just employ the wget utility to retrieve a bunch of web pages from the site for me. I did this before reading the Terms of Service, which may consider such actions prohibited uses (“collect or store data about other people using the Service”, using an “automated device to monitor or copy any content on the Service.”). In doing so, I may be jeopardizing my status on the service as a reader, and also as a publisher to 90 subscribers. On the other hand, their privacy policy clearly states the purpose of the Bloglines database of subscriptions: “We share this information with you, other end users, and other third parties.” Also I find that Jon Udell, well-read columnist for InfoWorld (4,661 subscribers), not only posted data he scooped up from Bloglines last year, but posted a code sample in python for readers to fetch the data themselves. Granted, he listed specific feeds that he regularly read– but it takes minimal programming skills to retrieve random pages. I didn’t even need to use it since it took me about half a minute to see how the Bloglines data was publicly structured.

In contrast to Wikipedia, Bloglines has an architecture which is not as restrictive as its contract suggests. Its technology clearly favors data transparency. On Bloglines, each feed description page is indicated by a sequential number: Jon Udell’s Radio Blog is #39; Civilities is #514161. Bloglines helpfully provides a page of the “newest blogs”– more helpful for researchers than users, since the 1.5MB page chokes browsers. The sequence numbers go from 4294077 to 4301942, which accounts for 7,866 slots, but there are only 3,432 entries within. Therefore, a little over half of the slots were duds, perhaps the result of malformed RSS. Other information systems are designed not to make it easy to guess account numbers: given a random 16-digit number, only 1 in 10 is a valid bank or credit card number, and of those perhaps only 1 in 10 million actually points to a valid cardholder.
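The slot arithmetic above, and the "1 in 10" figure for card numbers, can be checked with a short sketch. Two assumptions are mine rather than the essay's: that the sequence range is inclusive at both ends, and that the validity test alluded to is the Luhn checksum commonly used on payment card numbers. The function names are hypothetical.

```python
def slot_stats(first_id: int, last_id: int, entries: int):
    """Return (slots, empty_slots, empty_share) for an inclusive ID range."""
    slots = last_id - first_id + 1
    empty = slots - entries
    return slots, empty, empty / slots


def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Figures quoted in the text for the "newest blogs" page.
slots, empty, empty_share = slot_stats(4294077, 4301942, 3432)

# For any 15-digit prefix, exactly one of the ten possible final digits
# passes the check, which is where "1 in 10" comes from under this assumption.
prefix = "411111111111111"
valid_endings = [d for d in range(10) if luhn_valid(prefix + str(d))]
```

A sequential scheme has no such check built in: every number in range is a candidate page, and in this sample a bit under half of them resolved to an actual entry.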

Any person– competitor, investor, hacker– can download the Bloglines feed description pages and “scrape” this data from the HTML and use it for private purposes, possibly before Bloglines even notices. Given that I do not have the resources of the InfoWorld Media Group to support me, I have set stringent guidelines for my use of the data, in the hopes that Bloglines will not see fit to terminate my account. Downloading the whole corpus of 4,200,000 pages would be egregious, and would take a serious amount of time (even at 10 pages/second it would take 5 days). Instead, downloading a sample set of 3,200 at one request a second (which ended up yielding 1,490 actual feeds, 976 with subscribers) seemed fair and reasonable. I also don’t need to reveal things that were irrelevant to my study. And for that I must appeal to two of its major champions, whom I have reason to believe support such usage.
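The guidelines above amount to a small random sample fetched at a deliberately slow pace. A minimal sketch of that idea follows; it is not the procedure actually used here (that was wget), and `fetch` is a hypothetical callable the reader would have to supply, since no page address is given in this essay.

```python
import random
import time


def crawl_days(pages: int, pages_per_second: float) -> float:
    """Days needed to download `pages` at a steady rate."""
    return pages / pages_per_second / 86400


def polite_sample(first_id, last_id, sample_size, fetch, delay_seconds=1.0, seed=None):
    """Fetch a random sample of IDs, pausing between requests.

    `fetch` is a caller-supplied function taking an ID and returning the
    page content, or None for an empty slot (an assumption of this sketch).
    """
    rng = random.Random(seed)
    ids = rng.sample(range(first_id, last_id + 1), sample_size)
    results = {}
    for n, page_id in enumerate(ids):
        if n:  # no need to wait before the very first request
            time.sleep(delay_seconds)
        results[page_id] = fetch(page_id)
    return results


# The full-corpus estimate from the text: 4,200,000 pages at 10 pages/second.
full_crawl = crawl_days(4_200_000, 10)

# The sample from the text, 3,200 requests at one per second, in minutes.
sample_minutes = 3200 / 1 / 60
```

At one request a second the sample takes under an hour, against the better part of a week for the whole corpus even at ten times that rate.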

My Appeal

Bloglines was created by Mark Fletcher, a software engineer who had previously founded ONElist, the web-based email list management service. ONElist preferred a “clean URL” approach, a core component of which is sequential numbering– users could find and link to email messages to a list simply by adjusting the sequence number in the URL. The clean design perhaps helped in some measure to attract and retain people to the service, and ultimately made it valuable enough to be scooped up by YahooGroups.

A year ago, Fletcher was blessed to find that his startup had attracted a buyer he was willing to sell to: Ask Jeeves, which has been trying to catch up to the “big three” in search portals. Senior Vice President of Search Properties Jim Lanzone expressed his enthusiasm for the purchase in an early post on the just-launched Ask Jeeves blog. Last July he brought forth some actual data in a post called What Feeds Matter? and added in a parenthetical comment: “Maybe people will start a new game called Bloglineswhacking to find feeds with only one subscriber?”

If that’s the game, then let’s play. I’ll bring the balls.

To some this may be a sign that Lanzone and Ask Jeeves “get” the blogosphere, as is commonly touted. The crucial drawback to this slogan is that it doesn’t take much to “get” it– oh, just create a blog, and put some friendly it’s-not-really-marketing material up. Instead, I would hope that information service providers– any business in fact– would understand the special value of social data and make a commitment to regularly making it available, and also to sharing and responding to public analyses of that data.

This last part is important. When Technorati CEO David Sifry says that their service is tracking 18.9 million blogs, that number reaches a lot of people. But when M.D.-entrepreneur Christian Mayaud– or random blogger to you– finds that Technorati’s data shows that 18 million of those blogs have no links from anyone (and thus, do they exist? he asks), that should be recognized as a salient piece of social data. The data should be shared with anyone who received the original Technorati data. If the principle of a meritocracy of voices is upheld, then there should be no problem doing so.

The Challenge Going Forward

First, the challenge is to recognize what is meant by social data, and why it should be accorded a special status distinct from usage data. I hope I’ve done it justice here; I am not familiar with any other examination of the terms. As noted, distinguishing the terms does not mean that there is a strict rule for what is what, but it would help to start by understanding the kinds of things that qualify. Secondly, once we recognize the concept of social data, access to it should be an additional negotiating point between users and information services. Thirdly, companies should undertake clear initiatives to share such data once they have bought into the whole “conversations” manifesto.

We should also pause to consider some of the dangers as we move forward. If the understanding of social data encompasses data such as Google’s search queries, that might be used to justify its release to government investigators, and I am unsure whether to support that reasoning at this time. The push for classifying more information as social data reflects a technocratic/communitarian mindset — that the aggregation of all of our daily social interactions (and not just the top ten list) is something we want to know. Then again, this knowledge is already being collected, privately. If it’s something for the public to negotiate for, we should.