The House of Commons: A night of analysing data

Wednesday, May 21, 2008

A night of analysing data

Running for 8 hours, without crashing but with a little complaining about bad web pages, my analysis covered 191,093 web pages (not other file types like images) and found 179 pages that have rel="license" links (a semantic statement that the page is licensed), with a total of 288 rel="license" links (about 1.6 per page). This equates to about 1 in 1,068 pages using rel="license".
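For anyone wondering what "finding rel="license" links" looks like in practice, here is a minimal sketch using Python's standard html.parser. It is not the program I actually ran; the class and function names are made up for illustration, and it assumes the links of interest are a, link and area elements whose rel attribute contains the token "license".

```python
from html.parser import HTMLParser


class LicenseLinkCounter(HTMLParser):
    """Counts a/link/area elements whose rel attribute includes "license".

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link", "area"):
            return
        for name, value in attrs:
            # rel holds space-separated tokens; compare case-insensitively.
            if name == "rel" and value and "license" in value.lower().split():
                self.count += 1
                break


def count_license_links(html):
    """Return the number of rel="license" links found in one page."""
    parser = LicenseLinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


if __name__ == "__main__":
    page = '<a rel="license" href="http://example.org/by/">CC BY</a>'
    print(count_license_links(page))
```

A page counts as "using rel="license"" when this returns at least 1, and summing the counts over all pages gives the total number of links.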

The pages were drawn randomly from the dataset, though I'm not sure that my randomisation is great - I'll look into that. As I said in a previous post, the data aims to be a broad crawl of Australian sites, but it's neither 100% complete nor 100% accurate about sites being Australian.

By my calculations, if I were to run my analysis on the whole dataset, I'd expect to find approximately 1.3 million pages using rel="license". But keep in mind that the analysis runs over three years of data, and that the data sometimes includes the same page more than once for a given year/crawl, though much more rarely than, say, the Wayback Machine does.
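The arithmetic behind that kind of extrapolation can be sketched as follows. The sample counts are the ones from this post; the total dataset size is not stated here, so the function takes it as a parameter and the script only works out what size the 1.3 million figure would imply. The 95% interval uses a plain normal approximation and assumes a truly random sample with no duplicates, which is optimistic given the caveats above.

```python
import math

PAGES_ANALYSED = 191_093   # from the overnight run
PAGES_WITH_LICENSE = 179   # pages with at least one rel="license" link


def sample_rate():
    """Fraction of sampled pages that use rel="license"."""
    return PAGES_WITH_LICENSE / PAGES_ANALYSED


def rate_interval(z=1.96):
    """Normal-approximation confidence interval for the rate.

    Assumes independent random sampling, which the post itself doubts.
    """
    p = sample_rate()
    half_width = z * math.sqrt(p * (1 - p) / PAGES_ANALYSED)
    return p - half_width, p + half_width


def extrapolate(total_pages):
    """Expected number of rel="license" pages in a dataset of total_pages."""
    return total_pages * sample_rate()


if __name__ == "__main__":
    rate = sample_rate()
    low, high = rate_interval()
    print(f"rate: 1 in {1 / rate:.0f} pages")
    print(f"95% interval on the rate: {low:.6f} to {high:.6f}")
    # Dataset size implied by the 1.3 million estimate (not a stated figure).
    print(f"implied dataset size: {1_300_000 / rate:,.0f} pages")
```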

And of course, this statistic says nothing about open content licensing in general. I'm sure, as in I know, there are lots more licensed pages out there that don't use rel="license".

(Tech note: when doing this kind of analysis, there's a race between I/O and processor time, and ideally they're both maxed out. Over last night's analysis, the CPU load - for the last 15 minutes at least, but I think that's representative - was 58%, suggesting that I/O is so far the limiting factor.)
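One rough way to check which side of that race is winning is to compare CPU time with wall-clock time for a piece of work. This is only a sketch, and it measures the analysis process itself rather than the system-wide load figure quoted above; the function name is hypothetical.

```python
import time


def cpu_utilisation(work):
    """Run work() and return CPU seconds used divided by wall-clock seconds.

    Close to 1.0 suggests the job is CPU-bound; well below 1.0 suggests
    it spends most of its time waiting, for example on I/O.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


if __name__ == "__main__":
    # Sleeping stands in for waiting on disk or network.
    print(cpu_utilisation(lambda: time.sleep(0.2)))
```

Note that this counts a single process, so a multi-threaded or multi-process analysis would need the numbers added up across workers.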

Labels: ben, quantification, research

(permalink) posted by Ben Bildstein @ Wednesday, May 21, 2008

Comments:

Ben Bildstein said:
I came up with a neat way of collecting as many licensed pages as possible as quickly as possible. All pages are listed in a master file in alphabetical order. Now I've got my program going to a random one, and if it has a licence attached, it proceeds to the next in order (very likely to be licensed because it's part of the same web site). And if it has already recorded that page as licensed, or if that page does not turn out to be licensed, it goes to a random part of the master file to find the next URL.
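The walk described above can be sketched like this. It is a simplified illustration, not my actual program: the master file is stood in for by a sorted list of URLs, is_licensed is a hypothetical callable that checks one page, and wrapping around at the end of the list is an assumption of the sketch.

```python
import random


def collect_licensed(urls, is_licensed, max_checks, rng=None):
    """Collect licensed URLs by walking a sorted list.

    urls: alphabetically sorted list (neighbours tend to share a site).
    is_licensed: hypothetical callable taking a URL, returning True/False.
    max_checks: how many positions to visit before stopping.
    """
    rng = rng or random.Random()
    licensed = set()
    if not urls:
        return licensed
    i = rng.randrange(len(urls))
    for _ in range(max_checks):
        url = urls[i]
        if url not in licensed and is_licensed(url):
            # New licensed page: record it and try its neighbour next.
            licensed.add(url)
            i = (i + 1) % len(urls)  # wrap-around is an assumption
        else:
            # Already recorded, or not licensed: jump somewhere random.
            i = rng.randrange(len(urls))
    return licensed


if __name__ == "__main__":
    pages = sorted(
        [f"http://a.example/{n}" for n in range(5)]
        + [f"http://b.example/{n}" for n in range(5)]
    )
    found = collect_licensed(pages, lambda u: "a.example" in u, 200)
    print(len(found), "licensed pages found")
```

The trade-off is that the result is no longer a random sample: it is biased towards large licensed sites, which is fine for collecting pages but not for estimating rates.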

So far, I've identified and saved 80,000 pages as licensed. (Remember, there are duplicates in there.)

(permalink) posted by Ben Bildstein : 4:51 PM, May 21, 2008

mikal said:
Ben,

I'll have some spare compute time on my 250+ node PlanetLab slice sometime towards the end of June. Would you be interested in simply hitting all of those web pages and logging matches?

Mikal

(permalink) posted by mikal : 7:32 PM, May 21, 2008

Ben Bildstein said:
Tell me more. Are you suggesting comparison of live data with NLA's cached crawls? Or are you talking about live analysis?

When someone says "do you want to use these 250 computers", of course I say yes :) So let's talk.

(permalink) posted by Ben Bildstein : 9:24 PM, May 21, 2008

mikal said:
Email me at mikal@stillhq.com and we can chat about logistics.

(permalink) posted by mikal : 1:56 PM, May 22, 2008
