Cyberspace Law and Policy Centre, University of New South Wales
Unlocking IP  |  About this blog  |  Contact us  |  Disclaimer  |  Copyright & licencsing  |  Privacy

Tuesday, July 08, 2008

 

iSummit Therefore I Blog

Over at the Creative Commons blog I saw that the schedule for this year's iCommons iSummit, held in Sapporo, Japan, from 29 July - 31 August has now been released. For some of the days there are still a few details to be filled in, but the speaker list is definitely impressive - Lawrence Lessig, Yochai Benkler, Joi Ito and Jimmy Wales.

Representatives from Australia include members of the Creative Commons Australia team, Delia Browne and our very own Ben Bildstein. Ben will be presenting on quantification of the digital commons, and if you've been following Ben's work about quantification here at the House of Commons, then you won't want to miss his presentation (read more about it here).

And that's enough of a shameless plug for one day...

Labels: , , ,


Friday, June 06, 2008

 

The very few domains with the very many licensed pages



'nuff said

Labels: ,


Thursday, June 05, 2008

 

Technical problems at The House of Commons

We're having one slight technical problem at the moment. Individual post pages for the current month are giving 403 forbidden errors. We're looking into it. The effected pages are http://www.cyberlawcentre.org/unlocking-ip/blog/2008/06/cctv-eye.html and http://www.cyberlawcentre.org/unlocking-ip/blog/2008/06/conferences-in-brisbane.html, and I guess the one for this post, too (click the post title, above, to see if it works).

Of course, you can still read everything on the main page.

Labels:


Friday, May 23, 2008

 

Friday... gardening tips?

It's Friday, so I don't have to keep it technical. But what the heck, I've got a little bit of tech. http://www.skymind.com/~ocrow/python_string/ was exactly what I needed yesterday when I realised I had an n-squared problem with accumulating strings into one long string. It's a complete newbie mistake, but I found it at least. Anyhoo, it was so good I decided it needed another vote.

Okay, so gardening. I'll make it quick. I read in my bonsai book that if you have problems propagating from cuttings, you can grow roots on the original plant in a process called air-layering:

I tried that with my dwarf schefflera (at least I think that's what it is). Here's the parent plant (it's in a pot with 2 chillies and a mint):

But the air layer failed. No roots grew. And as you probably guessed I failed with cuttings, too. But I don't like giving up, so I stuck the cutting in some water:

It has done well. It's in a tall thin jar half full of water, in a pot that's backfilled with pebbles. This keeps the rooting part warm, which I understand is important. It is now finally growing roots:

By the way, there was no sign of those roots when it was being air layered - they've all popped out since it has been in the water. I had it outside, but a couple of the roots died and I decided it was too cold out there so I bought it inside. And today I noticed there are lost more roots starting to stick out through the bark (from cracks that run in the direction of the stem, not from those popcorn-looking bits).

So here's my conclusion about air layering schefflera (umbrella trees). There's nothing to be gained from cutting a ring of bark off. The roots don't grow out of the cut bark - they just grow out of normal bark. In fact, they grow out of the brown ~2mm long cracks you can see on every part of every branch.

In summary:
Enjoy your weekend. I'm taking a week off and spending it on vacation on Long Island in the Whitsundays, to celebrate my partner and my 10 year anniversary.

Labels: ,


Wednesday, May 21, 2008

 

Mechanisms

Here are the ways I can think of that an automated system could know that a web page is licensed:
  1. has a link to a known licence URL
  2. has a link that has rel="license" attribute in the tag, and a legal expert confirms that the link target is a licence URL
  3. has a meta tag with name="dc:rights" content="URL", and an expert confirms that the URL is a licence
  4. has embedded or external RDF+XML with license rdf:resource="URL"
  5. natural language, such as "This web page is licensed with a Creative Commons Attribution 1.0 Australia License"
  6. system is told by someone it trusts
Here are the ways I can think of that an automated system could find new, previously undiscovered, types of licences (or at least URLs thereof):
  1. URL is in rel="license" link tag, expert confirms
  2. URL is in meta name="dc:rights" tag, expert confirms
  3. URL is in RDF license tag
  4. page contains an exact copy of a known licence
  5. system is told by someone it trusts
If you can think of any other items for either of these lists, please let me know.

Labels: ,


 

A night of analysing data

Running for 8 hours, without crashing but with a little complaining about bad web pages, my analysis analysed 191,093 web pages (not other file types like images) and found 179 pages that have rel="license" links (a semantic statement that the page is licensed) with a total of 288 rel="license" links (about 1.5 per page). This equates to 1 in 1067 pages using rel="license"

The pages were drawn randomly from the dataset, though I'm not sure that my randomisation is great - I'll look into that. As I said in a previous post, the data aims to be a broad crawl of Australian sites, but it's neither 100% complete nor 100% accurate about sites being Australian.

By my calculations, if I were to run my analysis on the whole dataset, I'd expect to find approximately 1.3 million pages using rel="licence". But keep in mind that I'm not only running the analysis over three years of data, but that data also sometimes includes the same page more than once for a given year/crawl, though much more rarely than, say, the Wayback Machine does.

And of course, this statistic says nothing about open content licensing. I'm sure, as in I know, there are lots more pages out there that don't use rel="license".

(Tech note: when doing this kind of analysis, there's a race between I/O and processor time, and ideally they're both maxed out. Over last night's analysis, the CPU load - for the last 15 minutes at least, but I think that's representative - was 58%, suggesting that I/O is so far the limiting factor.)

Labels: , ,


Monday, May 19, 2008

 

What's an Australian web page?

I said recently that defining the Australian web is an issue in itself. I thought I'd say a little more about how the National Library's crawls handled the issue.

First, the National Library's crawls were outsourced to the Internet Archive, which is a good thing - it's been done well, the data is in a well defined format (a few sharp edges, but pretty good), and there's a decent knowledge-base out there already for accessing this data.

Now, there are two ways that IA chooses to include a page as Australian:
  1. domain name ends in '.au' (e.g. all web pages on the unsw.edu.au domain)
  2. IP address is registered as Australian in a geolocation database
Number 1 is simple, and number 2 complicated. Basically, IA is using another company's geolocation database, which uses things such as the path through the Internet to the server, who the Internet service provider is, and possibly who the domain name is registered to.

Actually, there is a third kind of page in the crawls. The crawls were done with a setting that included some pages linked directly from Australian pages (example: slashdot.org), though not sub-pages of these. I'll have to address this, and I can think of a few ways:

(Thanks to Alex Osborne and Paul Koerbin from the National Library for detailing the specifics for me)

Labels:


 

National Library of Australia's web data

The National Library of Australia has been crawling the Australian web (defining the Australian web is of course an issue in itself). I'm going to be running some quantification analysis over at least some of this crawl (actually plural, crawls - there is a 2005 crawl, a 2006 crawl and a 2007 crawl), probably starting with a small part of the crawl and then scaling up.

Possible outcomes from this include:
They're all interesting. I'll blog a bit about the technology, but in a separate post.

Labels: , , , ,


Thursday, May 01, 2008

 

Get a list of all (indexable) URLs on a site from the Wayback Machine

Earlier this year I complained about the problem with search engines. Today, Alexander Osborne (from the National Library of Australia) corrected me, at least a little bit.

I said I'd like to see an interface that (among other nice-to-haves) answers questions like "give me everything you've got from cyberlawcentre.org/unlocking-ip", and it turns out that that's actually possible with The Wayback Machine. Not in a single request, that I know of, but with this (simple old HTTP) request: http://web.archive.org/web/*xm_/www.cyberlawcentre.org/unlocking-ip/*, you can get a list of all URLs under cyberlawcentre.org/unlocking-ip, and then if were to want to you could do another HTTP request for each URL.

Pretty cool actually, thanks Alex.

Now I wonder how big you can scale those requests up to... I wonder what happens if you ask for www.*? Or (what the heck, someone has to say it) just '*'. I guess you'd probably break the Internet...

Labels: ,


Wednesday, April 30, 2008

 

Quantifying open software using Google Code Search

Google Code Search lets you search for source code files by licence type, so of course I was interested in whether this could be used for quantifying indexable source code on the web. And luckily GCS lets you search for all works with a given licence. (If you don't understand why that's a big deal, try doing a search for all Creative Commons licensed work using Google Search.) Even better, using the regex facility you can search for all works! You sure as heck can't do that with a regular Google web search.

Okay, so here's the latest results, including hyperlinks to searches for you to try them yourself:

And here's a spreadsheet with graph included: However, note the discontinuity (in absolute and trend terms) between approximate and specific results in that (logarithmic) graph, which suggests Google's approximations are not very good.

Labels: , , , ,


Tuesday, April 08, 2008

 

Table comparing Yahoo and Google's commons-based advanced search options

Hi commons researchers,

I just did this analysis of Google's and Yahoo's capacities for search for commons (mostly Creative Commons because that's in their advanced search interfaces), and thought I'd share. Basically it's an update of my research from Finding and Quantifying Australia's Online Commons. I hope it's all pretty self-explanatory. Please ask questions. And of course point out flaws in my methods or examples.

Also, I just have to emphasise the "No" in Yahoo's column in row 1: yes, I am in fact saying that the only jurisdiction of licences that Yahoo recognises is the US/unported licences, and that they are in fact ignoring the vast majority of Creative Commons licences. (That leads on to a whole other conversation about quantification, but I'll leave that for now.)

(I've formatted this table in Courier New so it should come out well-aligned, but who knows).


Feature                       | Google | Yahoo |
------------------------------+--------+-------+
1. Multiple CC jurisdictions  | Yes    | No    | (e.g.)
2. 'link:' query element      | No     | Yes   | (e.g. GY)
3. RDF-based CC search        | Yes    | No    | (e.g.)
4. meta name="dc:rights" *    | Yes    | ? **  | (e.g.)
5. link-based CC search       | No     | Yes   | (e.g.)
6. Media-specific search      | No     | No    | (GY)
7. Shows licence elements     | No     | No    | ****
8. CC public domain stamp *** | Yes    | Yes   | (e.g.)
9. CC-(L)GPL stamp            | No     | No    | (e.g.)


* I can't rule out Google's result here actually being from <a rel="license"> in the links to the license (as described here: http://microformats.org/wiki/rel-license).
** I don't know of any pages that have <meta name="dc:rights"> metadata (or <a rel="licence"> metadata?) but don't have links to licences.
*** Insofar as the appropriate metadata is present.
**** (i.e. doesn't show which result uses which licence)

Notes about example pages (from rows 1, 3-5, 8-9):

Labels: , , ,


Tuesday, February 19, 2008

 

I think I found a trump card (update: no, I didn't)

(following on from this post)

http://www.archive.org/web/researcher/intended_users.php

I'll certainly be looking into this further.

(update: On further investigation, it doesn't look so good. http://www.archive.org/web/researcher/researcher.php says:
We are in the process of redesigning our researcher web interface. During this time we regret that we will not be able to process any new researcher requests. Please see if existing tools such as the Wayback Machine can accommodate your needs. Otherwise, check back with us in 3 months for an update.
This seems understandable except for this, on the same page:
This material has been retained for reference and was current information as of late 2002.
That's over 5 years. And in Internet time, that seems like a lifetime. I'll keep investigating.)

Labels: , ,


 

The problem with search engines

There is a problem with search engines at the moment. Not any one in particular - I'm not saying Google has a problem. Google seems to be doing what they do really well. Actually, the problem is not so much something that is being done wrong, but something that is just not being done. Now, if you'll bear with me for a moment...

The very basics of web search

Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies:
None of these is trivial. I'm no expert, but I suggest indexing is the easiest. Performing searches well is what made Google so successful, where previous search engines had been treating the search step more trivially.

But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.

A bit more about the web crawler

There are lots of tricky technical issues about how to do the best crawl - to cover as many pages as possible, to have the most relevant pages possible, to maintain the latest version of the pages. But I'm not worried about this now. I'm just talking about the fundamental problem of downloading web pages for later use.

Anyone who is reading this and hasn't thought about the insides of search engines before is probably wondering at the sheer amount of downloading of web pages required, and storing them. And you should be.

They're all downloading the same data

So a single search engine basically has to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be publicly using their data so we may not see them. There clearly are more at least than I've ever heard of - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.

My point is why does everyone have to download the same data? Why isn't there some open crawler somewhere that's doing it all for everyone, and then presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* be critical in listening to them. I'm not saying here that Google should give away their data - it would have to be worth $millions to them. I'm not saying anyone else should be giving away all their data. But I am saying that there should be someone doing this, from an economic point of view - everyone is downloading the same data, and there's a cost to doing that, and the cost would be smaller if they could get together and share their data.

Here's what I'd like to see specifically:
If you know somewhere this is happening, let me know, because I can't find it. I think the Wayback Machine is the closest to an open access Web cache, and http://archive-access.sourceforge.net/ is the closest I've found to generalised access to the Wayback Machine. I'll read more about it, and let you know if it comes up trumps.


* I know.

Labels: , , ,


Monday, February 18, 2008

 

Creative Commons has data!

As you aren't aware*, Creative Commons has some data on the quantification of Creative Commons licence usage (collected using search engine queries). It's great that they are a) collecting this data, and b) sharing it freely.

If you look around, you can probably find some graphs based on this data, and that's probably interesting in itself. Tomorrow I'll see about dusting off my Perl skills, and hopefully come up with a graph of the growth of Australian CC licence usage. Stay tuned.


* If you knew about this, why didn't you tell me!

Labels: , , ,


Thursday, February 14, 2008

 

CC0 - Creative Commons' new solution for the public domain

Creative Commons have come up with a better way for people to mark works as copyright-free, or part of the public domain. It's called CC0 (CC Zero). The page for using it is here: http://labs.creativecommons.org/license/zero.

There are two options here. The first is a waiver, where you can "waive all copyrights and related or neighboring interests that you have over a work". The second is an assertion, where you can "assert that a work is free of copyright as a matter of fact, for example, because the work is old enough to be in the public domain, or because the work is in a class not protected by copyright such as U.S. government works."

It's pretty neat. I've thought the idea of asserting a work's copyright status, as a matter of fact, was a good idea, and not just limited to the public domain, but also for other classes of usage rights.

Okay, so that's basically the CC0 story. I've tried it out with a trivial web page I think I would otherwise have copyright in - the result is at the bottom of this post. But I must say I'm slightly disappointed in the lack of embedded metadata. Where's the RDF? As I've talked about before, when you do things with RDF, you allow sufficiently cool search engines to understand your new technology (or licence) simply by seeing it, without first having to be told about it.

Here's my example waiver:

CC0


To the extent possible under law,
Ben Bildstein
has waived all copyright, moral rights, database rights, and any other rights that might be asserted over Sensei's Library: Bildstein/Votes.

Labels: , ,


Friday, October 19, 2007

 

Australian Elections 2.0?

The Australian public is going to the polls on 24th November to elect members of our Federal Parliament (the one that sits in Canberra, is currently led by John Howard, and is constantly under siege from members of The Chaser's War on Everything). The calling of the election was a long time coming. Even though we knew that an election would have to take place soon (so says the Constitution), Prime Minister John Howard took his time in announcing the exact date. But now, in the words of one Federal House of Representatives candidate, it's "game on". A normal part of an election process, particularly at such a high level of government, is the holding of debates between party leaders. In Australia, debate is already under way as to who, when, and how these debates should take place.

House of Commons friend and ANU academic Dr. Matthew Rimmer has called for Australia to follow the lead of US Democrats presidential candidate hopeful Barack Obama and allow these debates to be made "freely accessible across all media and technology platforms" (See the ANU Press Release here). In the United States, Obama suggested that the US Democrat debates be either placed in the public domain or licensed under a Creative Commons licence.

Dr Rimmer has said that

"Whichever television networks or internet media end up broadcasting the federal
election debates, it’s important to the health of our democracy that people are
free to capture and distribute the dialogue of our prospective leaders so that
they can make a more informed decision."
The House of Commons strongly supports Dr. Rimmer's suggestion. It is an unusual one in an Australian context - in the United States, there is no copyright in works produced by the US government and thus there is at least a precedent for this type of action. There is also the First Amendment guarantee of freedom of speech, which arguably means that this type of content gains even greater significance. However, there has been a shift in this campaign to Australian political parties embracing all that the digital revolution has to offer (just type 'Kevin07' into Google, for example). A pledge by the parties to make debate materials freely available and accessible via sites such as YouTube would be both a positive and definite step for Australian democracy in the digital age.

The logisitics of such a proposition has also caused much discussion amongst House of Commons housemates. Housemate Ben writes:
"I think election debates should belong to the commons, at least insofar as
complete reproduction is concerned. However, I do see that there are good
reasons not to allow modifications, because they could be used to spread
disinformation at such a crucial time. For these reasons, a licence such as
Creative Commons No Derivatives would be appropriate (as opposed to, say, a
public domain dedication). It's also worth noting that, even under such a
licence, derivatives could be made for the purpose of satire (correct me if I'm
wrong here!), and that could perhaps be both a good and a bad thing (I'm not
sure to what extent you could use the satire exception to spread
disinformation)."
In response, Housemate Abi has agreed (and I concur) that the parody or satire fair dealing exception in the Copyright Act could probably be used to create parodies, although there issue regarding modifications may need to be addressed.

For more information on Dr. Rimmer's proposal, the ANU Press Release can be found here.

Labels: , , , , , ,


Tuesday, October 09, 2007

 

Advance Australia Fair? The Copyright Reform Process

Housemates Abi and Catherine, and project chief investigator Graham Greenleaf have been published in the Journal of World Intellectual Property. They were too modest to post about it themselves, but something has to be said! I mean, if I was published in the Journal of World Artificial Intelligence, well I'd want to tell everyone :) (not that there's a journal called the Journal of World Artificial Intelligence, but that's not the point!).

So without further ado, here's the link: Advance Australia Fair? The Copyright Reform Process.

Now, I'm no legal expert, and I have to admit the article was kind of over my head. But, by way of advertisement, here are some keywords I can pluck out of the paper as relevant:
Not to mention all the courts and cases that get involved. Anyway, that should be enough to whet your appetite. Get reading it; get citing it ;)

Labels: , , , ,


Thursday, October 04, 2007

 

Quick question

Hello friendly readers. I just got this completely random e-mail from someone in Helsinki, I think, completely out of the blue:
I'm working on human embryonic stem cell research and patenting of those in Australia but from overseas. I was wondering whether you are aware of any Australian IP (that would cover patents) or patent blogs maybe.
I don't know of any such blogs, but I have to admit to not paying as much attention to the world of patents as I do of copyright. But for the sake of being helpful I decided to ask around. So if anyone has any good Australian patent law related web resources, drop a comment on this post and I'll pass it on.

Labels: ,


Wednesday, October 03, 2007

 

GPL didn't get its day in court

A few days ago, housemate Abi posted about action being taken in court over a GPL violation in the US. And I'm sure astute readers who looked at the facts thought "what? that's so obviously violation. how can they possibly think they had the right to do that?" I know that's what I thought. Of course along with that thought goes another thought: this will be another win we can chalk up on the side of the GPL and the Free Software Foundation.

Anyway, the point is, it won't be getting to court because the defendants capitulated. According to Linux Watch, Monsoon Multimedia "admitted today that it had violated the GPLv2 (GNU General Public License version 2), and said it will release its modified BusyBox code in full compliance with the license."

This shows that the system works. The GPL must be clear enough that it is obvious what you can't do. (Okay, there's still some discussion, but on the day to day stuff, everything is going just fine).

Labels: , ,


Friday, August 24, 2007

 

My software found something!

Ever heard of Softpedia? I hadn't. Until the web crawler I wrote found it, and the ripple down rules system I created flagged it as interesting.

Here's the run-down:
Lastly, here is a link to the Softpedia search page: http://www.softpedia.com/progSearch. Try it out.

Labels: ,


Tuesday, August 14, 2007

 

Facebook sued for patent infringement -- TechnoLlama

http://technollama.blogspot.com/2007/08/facebook-sued-for-patent-infringement.html

I was going to comment on this TechnoLlama post, but then I thought, "why not just blog about it?"

Now, before I go any further, let me reiterate, I am not a lawyer, I have no formal background in law, and the Unlocking IP project is not even particularly about patents.

Get to the point, Ben.

Okay, okay. My point is this, and please correct me if I am wrong (that's why I'm making this a whole post, so you can correct me in the comments): as I understand it, the abstract of a patent does not say what the invention is, but rather it describes the invention. I.e. the abstract is more general that the invention.

For example, say I had a patent for... *looks around for a neat invention within reach*... my discgear. Well the abstract might say something like
A device for storing discs, with a selector mechanism and an opening mechanism such that when the opening mechanism is invoked, the disc selected by the selector mechanism is presented.

But then the actual patent might talk about how:
So that's what I've patented. The abstract I gave just describes it. You can invent something else that fits that description, but as long as it's not sufficiently similar to my invention as described in detail, it's not patent infringement.

To go back to the original example, an old vinyl disc jukebox (see the third image on this page) would satisfy the description in my abstract, but not infringe my patent.

In summary, I'm not alarmed by the generality of the abstract in the Facebook case. But if it turns out I'm wrong, and abstract are not more general than the patents they describe, let me just say I will be deeply disturbed.

You have no idea what you're talking about

I thought I made that clear earlier, but yes that's true. Please correct me, or clarify what I've said, by comment (preferred), or e-mail (in case it's the kind of abuse you don't want on the public record).

Labels: ,


Friday, July 20, 2007

 

Probabilistic Ripple Down Rules

Ripple down rules (RDR) is a methodology for creating decision trees, in domains even experts have trouble mapping their knowledge in, by requiring the expert only to justify their correction of the system, in the context in which the particular error occurred. That's probably why the original article on ripple down rules was called Knowledge in context: a strategy for expert system maintenance.

Now, there's two possible approaches to making the RDR system probabilistic (i.e. making it predict the probability that it is wrong for a given input). First, we could try to predict the probabilities based on the structure of the RDR and which rules have fired. Alternatively, we could ask the expert explicitly for some knowledge of probabilities (in the specific context, of course).

Observational analysis

What I'm talking about here is using RDR like normal, but trying to infer probabilities based on the way it reaches its conclusion. The most obvious situation where this will work is when all the rules that conclude positive (interesting) fire and none of the rules that conclude negative (uninteresting) fire. (This does, however, mean creating a more Multiple Classification RDR type of system.) Other possibilities include watching over time to see which rules are more likely to be wrong.

These possibilities may seem week, but they may turn out to provide just enough information. Remember, any indication that some examples are more likely to be useful is good, because it can cut down the pool of potential false negatives from the whole web to something much, much smaller.

An expert opinion

The other possibility is to ask the expert in advance how likely the system is to be wrong. Now, as I discussed, this whole RDR methodology is based around the idea that experts are good at justifying themselves in context, so it doesn't make much sense to ask the expert to look at an RDR system and say in advance how likely a given analysis is to be wrong. On the other hand, it might be possible to ask the expert, when they are creating a new rule: what is the probability that the rule will be wrong (the conclusion is wrong), given that it fires (its condition is met)? And, to get a little bit more rigorous, we would ideally also like to know: what is the probability that the rule's condition will be met, given that the rule's parent fired (the rule's parent's condition was met)?

The obvious problem with this is that the expert might not be able to answer these questions, at least with any useful accuracy. On the other hand, as I said above, any little indication is useful. Also, it's worth pointing out that what we need is not primarily probabilities, but rather a ranking or ordering of the candidates for expert evaluation, so that we know which is the most likely to be useful (rather than exactly how likely it is to be useful).

Also the calculations of probabilities could turn out to be quite complex :)

Here's what I consider a minimal RDR tree for the purposes of calculating probabilities, with some hypothetical (imaginary) given probabilities.



Let me explain. Rule 0 is the default rule (the starting point for all RDR systems). It fires 100% of the time, and in this case it is presumed to be right 99% of the time (simulating the needle-in-a-haystack scenario). Rules 1 and 2 are exceptions to rule 0, and will be considered only when rule 0 fires (which is all the time because it is the default rule). Rule 3 is an exception to rule 2, and will be considered only when rule 2 fires.

The conclusions of rules 0 and 3 are (implicitly) 'hay' (uninteresting), while the conclusions of rules 1 and 2 are (implicitly) 'needle' (interesting). This is because the conclusion of every exception rule needs to be different from the conclusion of the parent rule.

The percentage for 'Fires' represents the expert's opinion of how likely the rule is to fire (have its condition met) given that the rule is reached (its parent is reached and fires). The percentage for 'Correct' represents the expert's opinion of how likely the rule's conclusion is to be correct, given that the rule is reached and fires.

With this setup, you can start to calculate some interesting probabilities, given knowledge of which rules fire for a given example. For example, what is the probability of 'needle' given that rules 1 and 2 both fire, but rule 3 doesn't? (This is assumedly the most positive indication of 'needle' we can get.) What difference would it make if rule 3 did fire? If you can answer either of these questions, leave a comment.

If no rules fire, for example, the probability of 'needle' is 0.89%, which is only very slightly less than the default probability of 'needle' before using the system, which was 1%. Strange, isn't it?

Labels: , ,


Tuesday, July 17, 2007

 

Possible Solutions

In a previous post, I talked about the problems of finding commons in the deep web; now I want to talk about some possible solutions to these problems.

Ripple Down Rules

Ripple down rules is a knowledge acquisition methodology developed at the University of New South Wales. It's really simple - it's about incrementally creating a kind of decision tree based on an expert identifying what's wrong with the current decision tree. It works because the expert only needs to justify their conclusion that the current system is wrong in a particular case, rather than identify a universal correction that needs to be made, and also the system is guaranteed to be consistent with the expert's evaluation of all previously seen data (though overfitting can obviously still be a problem).

The application of ripple down rules to deep web commons is simply this: once you have a general method for flattened web forms, you can use the flattened web form as input to the ripple down rules system and have the system decide if the web form hides commons.

But how do you create rules from a list of text strings without even a known size (for example, there could be any number of options in a select input (dropdown list), and any number of select inputs in a form). The old "IF weather = 'sunny' THEN play = 'tennis'" type of rule doesn't work. One solution is to make the rule conditions more like questions, with rules like "IF select-option contains-word 'license' THEN form = 'commons'" (this is a suitable rule for Advanced Google Code Search). Still, I'm not sure this is the best way to express conditions. To put it another way, I'm still not sure that extracting a list of strings, of indefinite length, is the right way to flatten the form (see this post). Contact me if you know of a better way.

A probabilistic approach?

As I have said, one of the most interesting issues I'm facing is the needle in a haystack problem, where we're searching for (probably) very few web forms that hide commons, in a very very big World Wide Web full of all kinds of web forms.

Of course computers are good at searching through lots of data, but here's the problem: while you're training your system, you need examples of the system being wrong, so you can correct it. But how do you know when it's wrong? Basically, you have to look at examples and see if you (or the expert) agree with the system. Now in this case we probably want to look through all the positives (interesting forms), so we can use any false positives (uninteresting forms) to train the system, but that will quickly train the system to be conservative, which has two drawbacks. Firstly, we'd rather it wasn't conservative because we'd be more likely to find more interesting forms. Secondly, because we'll be seeing less errors in the forms classified as interesting, we have less examples to use to train the system. And to find false negatives (interesting forms incorrectly classified as uninteresting), the expert has to search through all the examples the system doesn't currently think are interesting (and that's about as bad as having no system at all, and just browsing the web).

So the solution seems, to me, to be to change the system, so that it can identify the web form that it is most likely to be wrong about. Then we can get the most bang (corrections) for our buck (our expert's time). But how can anything like ripple down rules do that?

Probabilistic Ripple Down Rules

This is where I think the needle in a haystack problem can actually be an asset. I don't know how to make a system that can tell how close an example is to the boundary between interesting and uninteresting (the boundary doesn't really exist, even). But it will be a lot easier to make a system that predicts how likely an example is to be an interesting web form.

This way, if the most likely of the available examples is interesting, it will be worth looking at (of course), and if it's classified as not interesting, it's the most likely to have been incorrectly classified, and provide a useful training example.

I will talk about how it might be possible to extract probabilities from a ripple down rules system, but this post is long enough already, so I'll leave that for another post.

Labels: , ,


Thursday, July 12, 2007

 

The Problems

Today I just wanted to take a brief look at some of the problems I'm finding myself tackling with the deep web commons question. There's two main ones, from quite different perspectives, but for both I can briefly describe my current thoughts.

Flattening a web form

The first problem is that of how to represent a web form in such a way that it can be used as an input to an automated system that can evaluate it. Ideally, in machine learning, you have a set of attributes that form a vector, and then you use that as the input to your algorithm. Like in tic-tac-toe, you might represent a cross by -1, a naught by +1, and an empty space by 0, and then the game can be represented by 9 of these 'attributes'.

But for web forms it's not that simple. There are a few parts of the web form that are different from each other. I've identified these potentially useful places, of which there may be one or more, and all of which take the form of text. These are just the ones I needed when considering Advanced Google Code Search:
Conditions can then be made up of these. For example, when (as in the case of Advanced Google Code Search) you see a select option like "GNU General Public License", it's an indication you've found an interesting web form.

But as far as I can tell, text makes for bad attributes. Numerical is much better. As far as I can tell. But I'll talk about that more when I talk about ripple down rules.

A handful of needles in a field of haystacks

The other problem is more about what we're actually looking for. We're talking about web forms that hide commons content. Well the interesting this about that is that there's bound to be very few, compared to the rest of the web forms on the Internet. Heck, they're not even all for searching. Some are for buying things. Some are polls.

And so, if, as seems likely, most web forms are uninteresting, if we need to enlist an expert to train the system, the expert is going to be spending most of the time looking at uninteresting examples.

This makes it harder, but in an interesting way: if I can find some way to have the system, while it's in training, find the most likely candidate of all the possible candidates, it could solve this problem. And that would be pretty neat.

Labels: , ,


Tuesday, July 10, 2007

 

Current Focus - The Deep Web

My previous two posts, last week, talked about my research, but didn't really talk about what I'm researching at the moment.

The deep web

Okay, so here's what I'm looking at. It's called the deep web, and it refers to the web documents that the search engines don't know about.

Sort of.

Actually, when the search engines find these documents, they really become part of the surface web, in a process sometimes called surfacing. Now I'm sure you're wondering: what kinds of documents can't search engines find, if they're the kind of documents anyone can browse to? The simple answer is: documents that no other pages link to. But a more realistic answer is that it's documents hidden in databases, that you have to do searches on the site to find. They'll generally have URLs, and you can link to them, but unless someone does, they're part of the deep web.

Now this is just a definition, and not particularly interesting in itself. But it turns out (though I haven't counted, myself) that there are more accessible web pages in the deep web than in the surface web. And they're not beyond the reach of automated systems - the systems just have to know the right questions to ask and the right place to ask the question. Here's an example, close to Unlocking IP. Go to AEShareNet and do a search, for anything you like. The results you get (when you navigate to them) are documents that you can only find by searching like this, or if someone else has done this, found the URL, and then linked to it on the web.

Extracting (surfacing) deep web commons

So when you consider how many publicly licensed documents may be in the deep web, it becomes an interesting problem from both the law / Unlocking IP perspective and from the computer science, which I'm really happy about. What I'm saying here is that I'm investigating ways of making automated systems to discover deep web commons. And it's not simple.

Lastly, some examples

I wanted to close with two web sites that I think are interesting in the context of deep web commons. First, there's SourceForge, which I'm sure the Planet Linux Australia readers will know (for the rest: it's a repository for open source software). It's interesting, because their advanced search feature really doesn't give many clues about it being a search for open source software.

And then there's the Advanced Google Code Search, which searches for publicly available source code, which generally means free or open source, but sometimes just means available, because Google can't figure out what the licence is. This is also interesting because it's not what you'd normally think of as deep web content. After all Google's just searching for stuff it found on the web, right? Actually, I class this as deep web content because Google is (mostly) looking inside zip files to find the source code, so it's not stuff you can find in regular search.

This search, as compared to SourceForge advanced search, makes it very clear you're searching for things that are likely to be commons content. In fact, I came up with 6 strong pieces of evidence that I can say leads me to believe Google Code Search is commons related.

(As a challenge to my readers, see how many pieces of evidence you can find that the Advanced Google Code Search is a search for commons (just from the search itself), and post a comment).

Labels: , ,


 

Samba embraces GPLv3

The Samba team have announced they will start using the new version of the GPL, after initially indicating they might not. As I recall Andrew Tridgell saying, they were using GPL version 2 (the previous version) because they liked the words of the licence, and the principles of free software as embodied in it, and GPLv3 seemed to be going in a slightly different direction.

This announcement acknowledges that GPLv3 "is an improved version of the license to better suit the needs of Free Software in the 21st Century," saying "We feel this is an important change to help promote the interests of Samba and other Free Software."

Unfortunately, the announcement doesn't say much about how Samba made their decision or what swayed them.

Labels: , ,


Friday, July 06, 2007

 

Quantification

Those of you who have been paying (very) close attention would have noticed that there was one thing missing from yesterday's post - the very same topic on which I've been published: quantification of online commons.

This is set to be a continuing theme in my research. Not because it's particularly valuable in the field of computer science, but because in the (very specific) field of online commons research, no one else seems to be doing much. (If you know something I don't about where to look for the research on this, please contact me now!)

I wish I could spend more time on this. What I'd do if I could would be another blog post altogether. Suffice it to say that I envisaged a giant machine (completely under my control), frantically running all over the Internets counting documents and even discovering new types of licences. If you want to hear more, contact me, or leave a comment here and convince me to post on it specifically.

So what do I have to say about this? Actually, so much that the subject has its own page. It's on unlockingip.org, here. It basically surveys what's around on the subject, and a fair bit of that is my research. But I would love to hear about yours or any one else's, published, unpublished, even conjecture.

Just briefly, here's what you can currently find on the unlockingip.org site:
What else?

I'm also interested in the methods of quantification. With the current technologies, what is the best way to find out, for any given licence, how many documents (copyrighted works) are available with increased public rights? This is something I need to put to Creative Commons, because their licence statistics page barely addresses this issue.

Labels: , ,


Thursday, July 05, 2007

 

My research

Recently, I've been pretty quiet on this blog, and people have started noticing and suggested I post more. Of course I don't need to point out what my response was...

The reason I haven't been blogging much (apart from laziness, which can never be ruled out) is that The House of Commons has become something of an IP blog. Okay, it sounds obvious, I know. And, as it seems I say at every turn, I have no background in law, and my expertise is in computer science and software engineering. And one of the unfortunate aspects of the blog as a medium is that you don't really know who's reading it. The few technical posts I've done haven't generated much feedback, but then maybe that's my fault for posting so rarely that the tech folks have stopped reading.

So the upshot of this is a renewed effort by me to post more often, even if it means technical stuff that doesn't make sense to all our readers. It's not like I'm short of things to say.

To start with, in the remainder of this post, I want to try to put in to words, as generally as possible,