Cyberspace Law and Policy Centre, University of New South Wales

Wednesday, May 21, 2008

 

A night of analysing data

Running for 8 hours, without crashing but with a little complaining about bad web pages, my analysis processed 191,093 web pages (not other file types like images) and found 179 pages that have rel="license" links (a semantic statement that the page is licensed), with a total of 288 rel="license" links (about 1.6 per page). This equates to 1 in 1,067 pages using rel="license".
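(For the curious: the core of the check is just looking for the "license" token in the rel attribute of anchor and link elements. The actual analysis code isn't shown here, but a minimal Python sketch of the idea, using only the standard library, might look like this.)

from html.parser import HTMLParser

class RelLicenseCounter(HTMLParser):
    # Counts <a> and <link> elements whose rel attribute includes the
    # "license" token (rel can carry several space-separated values).
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            rel = dict(attrs).get("rel") or ""
            if "license" in rel.lower().split():
                self.count += 1

def count_rel_license(html_text):
    parser = RelLicenseCounter()
    parser.feed(html_text)
    return parser.count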

The pages were drawn randomly from the dataset, though I'm not sure that my randomisation is great - I'll look into that. As I said in a previous post, the data aims to be a broad crawl of Australian sites, but it's neither 100% complete nor 100% accurate about sites being Australian.

By my calculations, if I were to run my analysis on the whole dataset, I'd expect to find approximately 1.3 million pages using rel="license". But keep in mind that the analysis runs over three years of data, and that data sometimes includes the same page more than once for a given year/crawl, though much more rarely than, say, the Wayback Machine does.
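(The extrapolation itself is just the sample rate scaled up to the size of the full dataset. A rough sketch of the arithmetic; the total page count here is a placeholder, not a figure from the crawl:)

SAMPLE_PAGES = 191_093
LICENSED_PAGES = 179
TOTAL_PAGES = 1_400_000_000  # placeholder only: the post doesn't state the dataset's exact size

rate = LICENSED_PAGES / SAMPLE_PAGES    # roughly 1 in 1,067
expected = rate * TOTAL_PAGES           # roughly 1.3 million at this assumed size
print(f"1 in {1 / rate:.0f} pages; expected licensed pages: {expected:,.0f}")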

And of course, this statistic says nothing about open content licensing as a whole. I'm sure, as in I know, there are lots more licensed pages out there that just don't use rel="license".

(Tech note: when doing this kind of analysis, there's a race between I/O and processor time, and ideally they're both maxed out. Over last night's analysis, the CPU load - for the last 15 minutes at least, but I think that's representative - was 58%, suggesting that I/O is so far the limiting factor.)
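(If that ever becomes worth fixing, one generic way to keep both the disk and the CPU busy is to do the reading in one process and farm the parsing out to worker processes. A rough Python sketch, not the actual setup used for this analysis:)

from concurrent.futures import ProcessPoolExecutor
import pathlib

def analyse(html_text):
    # Stand-in for the real per-page analysis, e.g. counting rel="license" links.
    return html_text.count('rel="license"')

def run(paths, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = []
        for path in paths:
            # I/O happens in the main process...
            text = pathlib.Path(path).read_text(errors="ignore")
            # ...while the CPU-heavy parsing goes to the worker pool.
            futures.append(pool.submit(analyse, text))
        return sum(f.result() for f in futures)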



Comments:
Blogger Ben Bildstein said:
I came up with a neat way of collecting as many licensed pages as possible as quickly as possible. All pages are listed in a master file in alphabetical order. Now I've got my program going to a random one, and if it has a licence attached, it proceeds to the next in order (very likely to be licensed because it's part of the same web site). And if it has already recorded that page as licensed, or if that page does not turn out to be licensed, it goes to a random part of the master file to find the next URL (sketched below).

So far, I've identified and saved 80,000 pages as licensed. (Remember, there are duplicates in there.)
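(A rough Python sketch of that strategy; the is_licensed() check and the in-memory URL list are stand-ins, since the real master file lives on disk:)

import random

def sample_licensed(urls, is_licensed, target):
    # urls: alphabetically sorted list of URLs; is_licensed: callable URL -> bool.
    found = set()
    i = random.randrange(len(urls))
    while len(found) < target:
        url = urls[i]
        if url not in found and is_licensed(url):
            found.add(url)
            i = (i + 1) % len(urls)          # next URL in order: probably the same site
        else:
            i = random.randrange(len(urls))  # already seen or unlicensed: jump somewhere random
    return found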
 
Blogger mikal said:
Ben,

I'll have some spare compute time on my 250+ node PlanetLab slice sometime towards the end of June. Would you be interested in simply hitting all of those web pages and logging matches?

Mikal
 
Blogger Ben Bildstein said:
Tell me more. Are you suggesting comparison of live data with NLA's cached crawls? Or are you talking about live analysis?

When someone says "do you want to use these 250 computers", of course I say yes :) So let's talk.
 
Blogger mikal said:
Email me at mikal@stillhq.com and we can chat about logistics.
 



