Tuesday, July 08, 2008
Representatives from Australia include members of the Creative Commons Australia team, Delia Browne and our very own Ben Bildstein. Ben will be presenting on quantification of the digital commons, and if you've been following Ben's work about quantification here at the House of Commons, then you won't want to miss his presentation (read more about it here).
And that's enough of a shameless plug for one day...
Labels: ben, catherine, conferences, Creative Commons
Friday, June 06, 2008
Thursday, June 05, 2008
Of course, you can still read everything on the main page.
Labels: ben
Friday, May 23, 2008
Okay, so gardening. I'll make it quick. I read in my bonsai book that if you have problems propagating from cuttings, you can grow roots on the original plant in a process called air-layering:

I tried that with my dwarf schefflera (at least I think that's what it is). Here's the parent plant (it's in a pot with 2 chillies and a mint):

But the air layer failed. No roots grew. And as you probably guessed I failed with cuttings, too. But I don't like giving up, so I stuck the cutting in some water:

It has done well. It's in a tall thin jar half full of water, in a pot that's backfilled with pebbles. This keeps the rooting part warm, which I understand is important. It is now finally growing roots:

By the way, there was no sign of those roots when it was being air layered - they've all popped out since it has been in the water. I had it outside, but a couple of the roots died and I decided it was too cold out there so I brought it inside. And today I noticed there are lots more roots starting to stick out through the bark (from cracks that run in the direction of the stem, not from those popcorn-looking bits).
So here's my conclusion about air layering schefflera (umbrella trees). There's nothing to be gained from cutting a ring of bark off. The roots don't grow out of the cut bark - they just grow out of normal bark. In fact, they grow out of the brown ~2mm long cracks you can see on every part of every branch.
In summary:
- If you're going to air layer a schefflera, the important points are making sure it's very wet and not cutting through the bark (though it may turn out that cutting through the bark encourages root production)
- To strike a cutting, focus on keeping the rooting area (which is the bark) wet
- But at the same time, remember that the cutting will drink through the bottom of the cutting, so make sure it is a clean cut. It easily rots. Mine did, and I just cut an extra 5mm of bark off to keep it healthy.
Wednesday, May 21, 2008
- has a link to a known licence URL
- has a link that has rel="license" attribute in the tag, and a legal expert confirms that the link target is a licence URL
- has a meta tag with name="dc:rights" content="URL", and an expert confirms that the URL is a licence
- has embedded or external RDF+XML with license rdf:resource="URL"
- natural language, such as "This web page is licensed with a Creative Commons Attribution 1.0 Australia License"
- system is told by someone it trusts
- URL is in rel="license" link tag, expert confirms
- URL is in meta name="dc:rights" tag, expert confirms
- URL is in RDF license tag
- page contains an exact copy of a known licence
- system is told by someone it trusts
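The automatic parts of the lists above can be sketched in code. This is only a rough illustration, not the actual analysis code: the class name, the function name and the single "known licence" URL prefix are assumptions made for the example, and the "expert confirms" and "told by someone it trusts" steps are deliberately left out because they need a human.

```python
from html.parser import HTMLParser

# Hypothetical list of known licence URL prefixes; a real system would need more.
KNOWN_LICENCE_PREFIXES = ("http://creativecommons.org/licenses/",)


class LicenceSignalParser(HTMLParser):
    """Collects (mechanism, url) pairs for licence hints found in a page."""

    def __init__(self):
        super().__init__()
        self.signals = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        rel_values = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and href:
            if "license" in rel_values:
                # rel="license" link: still needs an expert to confirm the target
                self.signals.append(("rel-license", href))
            elif href.startswith(KNOWN_LICENCE_PREFIXES):
                # plain hyperlink to a known licence URL
                self.signals.append(("known-url", href))
        elif tag == "meta" and (attrs.get("name") or "").lower() == "dc:rights":
            content = attrs.get("content") or ""
            if content:
                # dc:rights meta tag: also needs expert confirmation
                self.signals.append(("dc-rights", content))


def find_licence_signals(html):
    """Return every licence hint in the page, in document order."""
    parser = LicenceSignalParser()
    parser.feed(html)
    return parser.signals
```

RDF and natural language statements are not handled here; they would need their own parsers.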
Labels: ben, quantification
The pages were drawn randomly from the dataset, though I'm not sure that my randomisation is great - I'll look into that. As I said in a previous post, the data aims to be a broad crawl of Australian sites, but it's neither 100% complete nor 100% accurate about sites being Australian.
By my calculations, if I were to run my analysis on the whole dataset, I'd expect to find approximately 1.3 million pages using rel="license". But keep in mind not only that I'm running the analysis over three years of data, but also that the data sometimes includes the same page more than once for a given year/crawl, though much more rarely than, say, the Wayback Machine does.
And of course, this statistic says nothing about open content licensing. I'm sure, as in I know, there are lots more pages out there that don't use rel="license".
(Tech note: when doing this kind of analysis, there's a race between I/O and processor time, and ideally they're both maxed out. Over last night's analysis, the CPU load - for the last 15 minutes at least, but I think that's representative - was 58%, suggesting that I/O is so far the limiting factor.)
Labels: ben, quantification, research
Monday, May 19, 2008
First, the National Library's crawls were outsourced to the Internet Archive, which is a good thing - it's been done well, the data is in a well defined format (a few sharp edges, but pretty good), and there's a decent knowledge-base out there already for accessing this data.
Now, there are two ways that IA chooses to include a page as Australian:
- domain name ends in '.au' (e.g. all web pages on the unsw.edu.au domain)
- IP address is registered as Australian in a geolocation database
Actually, there is a third kind of page in the crawls. The crawls were done with a setting that included some pages linked directly from Australian pages (example: slashdot.org), though not sub-pages of these. I'll have to address this, and I can think of a few ways:
- Do a bit of geolocation myself
- Exclude pages where sibling pages aren't in the crawl
- Don't make national-oriented conclusions, or when I do, restrict to the .au domains
- Argue that it's a small portion so don't worry about it
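The third option (restricting to the .au domains) is the easiest to try. Here's a minimal sketch, assuming the crawl gives me a URL for each page; the function names are made up for this example and have nothing to do with the Internet Archive's tools.

```python
from urllib.parse import urlsplit


def is_au_domain(url):
    """True if the URL's host name ends in '.au'."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".au")


def restrict_to_au(urls):
    """Keep only the URLs whose host is under .au."""
    return [url for url in urls if is_au_domain(url)]
```

Note that this throws away pages that were included because of geolocation, so it undercounts the Australian web rather than overcounting it.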
(Thanks to Alex Osborne and Paul Koerbin from the National Library for detailing the specifics for me)
Labels: ben
Possible outcomes from this include:
- Some meaningful consideration of how many web pages, on average, come about from a single decision to use a licence. For example, if you license your blog and put a licence statement into your blog template, that would be one decision to use the licence, arguably one licensed work (your blog), but actually 192 web pages (with permalinks and all). I've got a few ideas about how to measure this, which I can go into in more depth.
- How much of the Australian web is licensed, both proportionally (one page in X, one web site in X, or one 'document' in X), and in absolute terms (Y web pages, Y web sites, Y 'documents').
- Comparison of my results to proprietary search-engine based answers to the same question, to put my results in context.
- Comparison of various licensing mechanisms, including but not limited to: hyperlinks, natural language statements, dc rights in tags, rdf+xml.
- Comparison of use of various licences, and licensing elements.
- Changes in the answers to these questions over time.
Labels: ben, licensing, open content, quantification, research
Thursday, May 01, 2008
I said I'd like to see an interface that (among other nice-to-haves) answers questions like "give me everything you've got from cyberlawcentre.org/unlocking-ip", and it turns out that that's actually possible with The Wayback Machine. Not in a single request, that I know of, but with this (simple old HTTP) request: http://web.archive.org/web/*xm_/www.cyberlawcentre.org/unlocking-ip/*, you can get a list of all URLs under cyberlawcentre.org/unlocking-ip, and then, if you wanted to, you could do another HTTP request for each URL.
Pretty cool actually, thanks Alex.
Now I wonder how big you can scale those requests up to... I wonder what happens if you ask for www.*? Or (what the heck, someone has to say it) just '*'. I guess you'd probably break the Internet...
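For what it's worth, building that kind of request for any site prefix is trivial. This sketch only constructs the URL, following the pattern quoted above; the function name is made up, and I'm not assuming anything about what the service returns or how long that URL pattern will keep working.

```python
def wayback_listing_url(prefix):
    """Build the listing URL for everything archived under `prefix`.

    `prefix` is a host plus optional path, without a scheme,
    e.g. "www.cyberlawcentre.org/unlocking-ip".
    """
    prefix = prefix.rstrip("/")
    return "http://web.archive.org/web/*xm_/" + prefix + "/*"
```

Calling it with "www.cyberlawcentre.org/unlocking-ip" gives back exactly the request above.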
Wednesday, April 30, 2008
Okay, so here are the latest results, including hyperlinks to searches for you to try them yourself:
- all (by regex: .*) : 36,700,000
- gpl : 8,960,000
- lgpl : 4,640,000
- bsd : 3,110,000
- mit : 903,000
- cpl : 136,000
- artistic : 192
- apache : 156
- disclaimer : 130
- python : 108
- zope : 103
- mozilla : 94
- qpl : 86
- ibm : 67
- sleepycat : 51
- apple : 47
- lucent : 19
- nasa : 15
- alladin : 9
And here's a spreadsheet with graph included: However, note the discontinuity (in absolute and trend terms) between approximate and specific results in that (logarithmic) graph, which suggests Google's approximations are not very good.
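You don't need the graph to see the discontinuity. Here's a quick sketch that takes the counts listed above, in the same order, and finds the biggest ratio between neighbouring results. The function name is just for this example.

```python
# Result counts exactly as listed above, largest first.
COUNTS = [
    ("all", 36_700_000), ("gpl", 8_960_000), ("lgpl", 4_640_000),
    ("bsd", 3_110_000), ("mit", 903_000), ("cpl", 136_000),
    ("artistic", 192), ("apache", 156), ("disclaimer", 130),
    ("python", 108), ("zope", 103), ("mozilla", 94), ("qpl", 86),
    ("ibm", 67), ("sleepycat", 51), ("apple", 47), ("lucent", 19),
    ("nasa", 15), ("alladin", 9),
]


def biggest_drop(ranked):
    """Return (higher, lower, ratio) for the largest ratio between neighbours."""
    best = None
    for (name_a, a), (name_b, b) in zip(ranked, ranked[1:]):
        ratio = a / b
        if best is None or ratio > best[2]:
            best = (name_a, name_b, ratio)
    return best


higher, lower, ratio = biggest_drop(COUNTS)
print(f"{higher} -> {lower}: about {ratio:.0f} times fewer results")
```

The biggest drop is from cpl to artistic, which is exactly where the results change from approximate to specific.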
Labels: ben, free software, licensing, quantification, search
Tuesday, April 08, 2008
I just did this analysis of Google's and Yahoo's capacities to search for commons (mostly Creative Commons because that's in their advanced search interfaces), and thought I'd share. Basically it's an update of my research from Finding and Quantifying Australia's Online Commons. I hope it's all pretty self-explanatory. Please ask questions. And of course point out flaws in my methods or examples.
Also, I just have to emphasise the "No" in Yahoo's column in row 1: yes, I am in fact saying that the only jurisdiction of licences that Yahoo recognises is the US/unported licences, and that they are in fact ignoring the vast majority of Creative Commons licences. (That leads on to a whole other conversation about quantification, but I'll leave that for now.)
(I've formatted this table in Courier New so it should come out well-aligned, but who knows).
Feature                       | Google | Yahoo |
------------------------------+--------+-------+
1. Multiple CC jurisdictions  | Yes    | No    | (e.g.)
2. 'link:' query element      | No     | Yes   | (e.g. G, Y)
3. RDF-based CC search        | Yes    | No    | (e.g.)
4. meta name="dc:rights" *    | Yes    | ? **  | (e.g.)
5. link-based CC search       | No     | Yes   | (e.g.)
6. Media-specific search      | No     | No    | (G, Y)
7. Shows licence elements     | No     | No    | ****
8. CC public domain stamp *** | Yes    | Yes   | (e.g.)
9. CC-(L)GPL stamp            | No     | No    | (e.g.)
* I can't rule out Google's result here actually being from <a rel="license"> in the links to the license (as described here: http://microformats.org/wiki/rel-license).
** I don't know of any pages that have <meta name="dc:rights"> metadata (or <a rel="license"> metadata?) but don't have links to licences.
*** Insofar as the appropriate metadata is present.
**** (i.e. doesn't show which result uses which licence)
Notes about example pages (from rows 1, 3-5, 8-9):
- To determine whether a search engine can find a given page, first look at the page and find enough snippets of content that you can create a query that definitely returns that page, and test that query to make sure the search engine can find it (e.g. '"clinton lies again" digg' for row 8). Then do the same search as an advanced search with Creative Commons search turned on and see if the result is still found.
- The example pages should all be specific with respect to the feature they exemplify. E.g. the Phylocom example from row 9 has all the right links, logos and metadata for the CC-GPL, and particularly does not have any other Creative Commons licence present, and does not show up in search results.
Labels: ben, Creative Commons, quantification, search
Tuesday, February 19, 2008
http://www.archive.org/web/researcher/intended_users.php
I'll certainly be looking into this further.
(update: On further investigation, it doesn't look so good. http://www.archive.org/web/researcher/researcher.php says:
"We are in the process of redesigning our researcher web interface. During this time we regret that we will not be able to process any new researcher requests. Please see if existing tools such as the Wayback Machine can accommodate your needs. Otherwise, check back with us in 3 months for an update."
This seems understandable except for this, on the same page:
"This material has been retained for reference and was current information as of late 2002."
That's over 5 years. And in Internet time, that seems like a lifetime. I'll keep investigating.)
Labels: ben, open access, quantification
The very basics of web search
Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies:
- Web crawling - downloading web pages; discovering new web pages
- Indexing - like the index in a book: figure out which pages have which features (meaning keywords, though there may be others), and store them in separate lists for later access
- Performing searches - when someone wants to do a keyword search, for example, the search engine can look up the keywords in the index, and find out which pages are relevant
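To make the indexing and searching parts concrete, here's a toy sketch. It's nothing like what a real search engine does at scale: the crawl is faked with a dictionary of already-downloaded pages, and the function names are made up for this example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each keyword to the set of page URLs that contain it.

    `pages` stands in for the crawler's output: {url: page text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of pages that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

The point is that the index is built once, and each search is then just a lookup and a set intersection.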
But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.
A bit more about the web crawler
There are lots of tricky technical issues about how to do the best crawl - to cover as many pages as possible, to have the most relevant pages possible, to maintain the latest version of the pages. But I'm not worried about this now. I'm just talking about the fundamental problem of downloading web pages for later use.
Anyone who is reading this and hasn't thought about the insides of search engines before is probably wondering at the sheer amount of downloading of web pages required, and storing them. And you should be.
They're all downloading the same data
So a single search engine basically has to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be publicly using their data so we may not see them. There clearly are more at least than I've ever heard of - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.
My point is why does everyone have to download the same data? Why isn't there some open crawler somewhere that's doing it all for everyone, and then presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* be critical in listening to them. I'm not saying here that Google should give away their data - it would have to be worth $millions to them. I'm not saying anyone else should be giving away all their data. But I am saying that there should be someone doing this, from an economic point of view - everyone is downloading the same data, and there's a cost to doing that, and the cost would be smaller if they could get together and share their data.
Here's what I'd like to see specifically:
- A good web crawler, crawling the web and thus keeping an up-to-date cache of the best parts of the web
- An interface that lets you download this data, or diffs from a previous time
- An interface that lets you download just some. E.g. "give me everything you've got from cyberlawcentre.org/unlocking-ip" or "give me everything you've got from *.au (Australian registered domains)" or even "give me everything you've got that links to http://labs.creativecommons.org/licenses/zero-assert/1.0/us/"
- Note that in these 'interface' points, I'm talking about downloading data in some raw format, that you can then use to, say, index and search with your own search engine.
* I know.
Labels: ben, open access, quantification, search
Monday, February 18, 2008
If you look around, you can probably find some graphs based on this data, and that's probably interesting in itself. Tomorrow I'll see about dusting off my Perl skills, and hopefully come up with a graph of the growth of Australian CC licence usage. Stay tuned.
* If you knew about this, why didn't you tell me!
Labels: ben, Creative Commons, licensing, quantification
Thursday, February 14, 2008
There are two options here. The first is a waiver, where you can "waive all copyrights and related or neighboring interests that you have over a work". The second is an assertion, where you can "assert that a work is free of copyright as a matter of fact, for example, because the work is old enough to be in the public domain, or because the work is in a class not protected by copyright such as U.S. government works."
It's pretty neat. I've thought the idea of asserting a work's copyright status, as a matter of fact, was a good idea, and not just limited to the public domain, but also for other classes of usage rights.
Okay, so that's basically the CC0 story. I've tried it out with a trivial web page I think I would otherwise have copyright in - the result is at the bottom of this post. But I must say I'm slightly disappointed in the lack of embedded metadata. Where's the RDF? As I've talked about before, when you do things with RDF, you allow sufficiently cool search engines to understand your new technology (or licence) simply by seeing it, without first having to be told about it.
Here's my example waiver:
To the extent possible under law,
Ben Bildstein has waived all copyright, moral rights, database rights, and any other rights that might be asserted over Sensei's Library: Bildstein/Votes.
Labels: ben, Creative Commons, search
Friday, October 19, 2007
House of Commons friend and ANU academic Dr. Matthew Rimmer has called for Australia to follow the lead of US Democrats presidential candidate hopeful Barack Obama and allow these debates to be made "freely accessible across all media and technology platforms" (See the ANU Press Release here). In the United States, Obama suggested that the US Democrat debates be either placed in the public domain or licensed under a Creative Commons licence.
Dr Rimmer has said that
"Whichever television networks or internet media end up broadcasting the federal election debates, it’s important to the health of our democracy that people are free to capture and distribute the dialogue of our prospective leaders so that they can make a more informed decision."
The House of Commons strongly supports Dr. Rimmer's suggestion. It is an unusual one in an Australian context - in the United States, there is no copyright in works produced by the US government and thus there is at least a precedent for this type of action. There is also the First Amendment guarantee of freedom of speech, which arguably means that this type of content gains even greater significance. However, there has been a shift in this campaign to Australian political parties embracing all that the digital revolution has to offer (just type 'Kevin07' into Google, for example). A pledge by the parties to make debate materials freely available and accessible via sites such as YouTube would be both a positive and definite step for Australian democracy in the digital age.
The logistics of such a proposition have also caused much discussion amongst House of Commons housemates. Housemate Ben writes:
"I think election debates should belong to the commons, at least insofar as complete reproduction is concerned. However, I do see that there are good reasons not to allow modifications, because they could be used to spread disinformation at such a crucial time. For these reasons, a licence such as Creative Commons No Derivatives would be appropriate (as opposed to, say, a public domain dedication). It's also worth noting that, even under such a licence, derivatives could be made for the purpose of satire (correct me if I'm wrong here!), and that could perhaps be both a good and a bad thing (I'm not sure to what extent you could use the satire exception to spread disinformation)."
In response, Housemate Abi has agreed (and I concur) that the parody or satire fair dealing exception in the Copyright Act could probably be used to create parodies, although the issue regarding modifications may need to be addressed.
For more information on Dr. Rimmer's proposal, the ANU Press Release can be found here.
Labels: abi, ben, catherine, Creative Commons, open content, parody, youtube
Tuesday, October 09, 2007
So without further ado, here's the link: Advance Australia Fair? The Copyright Reform Process.
Now, I'm no legal expert, and I have to admit the article was kind of over my head. But, by way of advertisement, here are some keywords I can pluck out of the paper as relevant:
- Technological protection measures (TPMs)
- Digital rights management (DRM)
- The Australia-US Free Trade Agreement (AUSFTA)
- The Digital Agenda Act (forgive me for not citing correctly!)
- The Digital Millennium Copyright Act (DMCA)
- The World Intellectual Property Organization (WIPO)
Labels: abi, ben, catherine, legislation, research
Thursday, October 04, 2007
"I'm working on human embryonic stem cell research and patenting of those in Australia but from overseas. I was wondering whether you are aware of any Australian IP (that would cover patents) or patent blogs maybe."
I don't know of any such blogs, but I have to admit to not paying as much attention to the world of patents as I do of copyright. But for the sake of being helpful I decided to ask around. So if anyone has any good Australian patent law related web resources, drop a comment on this post and I'll pass it on.
Wednesday, October 03, 2007
Anyway, the point is, it won't be getting to court because the defendants capitulated. According to Linux Watch, Monsoon Multimedia "admitted today that it had violated the GPLv2 (GNU General Public License version 2), and said it will release its modified BusyBox code in full compliance with the license."
This shows that the system works. The GPL must be clear enough that it is obvious what you can't do. (Okay, there's still some discussion, but on the day to day stuff, everything is going just fine).
Friday, August 24, 2007
Here's the run-down:
- "Softpedia is a library of over 35,000 free and free-to-try software programs for Windows and Unix/Linux,games and drivers. We review and categorize these products in order to allow the visitor/user to find the exact product they and their system needs," from the help page.
- Wikipedia has a page on it, which interestingly says that "it is one of the top 500 websites according to Alexa traffic ranking."
Tuesday, August 14, 2007
I was going to comment on this TechnoLlama post, but then I thought, "why not just blog about it?"
Now, before I go any further, let me reiterate, I am not a lawyer, I have no formal background in law, and the Unlocking IP project is not even particularly about patents.
Get to the point, Ben.
Okay, okay. My point is this, and please correct me if I am wrong (that's why I'm making this a whole post, so you can correct me in the comments): as I understand it, the abstract of a patent does not say what the invention is, but rather it describes the invention. I.e. the abstract is more general than the invention.
For example, say I had a patent for... *looks around for a neat invention within reach*... my discgear. Well the abstract might say something like
A device for storing discs, with a selector mechanism and an opening mechanism such that when the opening mechanism is invoked, the disc selected by the selector mechanism is presented.
But then the actual patent might talk about how:
- the opening mechanism is a latch;
- the discs are held in small grooves that separate them but allow them to be kept compact;
- the lid is spring loaded and damped so that when the latch is released it gently lifts up;
- there is a cool mechanism I don't even understand for having the lid hold on to the selected disc;
- said cool mechanism (the selector mechanism) has another mechanism that allows it to slide only when pressed on and not when just pushed laterally;
- etc.
To go back to the original example, an old vinyl disc jukebox (see the third image on this page) would satisfy the description in my abstract, but not infringe my patent.
In summary, I'm not alarmed by the generality of the abstract in the Facebook case. But if it turns out I'm wrong, and abstracts are not more general than the patents they describe, let me just say I will be deeply disturbed.
You have no idea what you're talking about
I thought I made that clear earlier, but yes that's true. Please correct me, or clarify what I've said, by comment (preferred), or e-mail (in case it's the kind of abuse you don't want on the public record).
Friday, July 20, 2007
Now, there are two possible approaches to making the RDR system probabilistic (i.e. making it predict the probability that it is wrong for a given input). First, we could try to predict the probabilities based on the structure of the RDR and which rules have fired. Alternatively, we could ask the expert explicitly for some knowledge of probabilities (in the specific context, of course).
Observational analysis
What I'm talking about here is using RDR like normal, but trying to infer probabilities based on the way it reaches its conclusion. The most obvious situation where this will work is when all the rules that conclude positive (interesting) fire and none of the rules that conclude negative (uninteresting) fire. (This does, however, mean creating a more Multiple Classification RDR type of system.) Other possibilities include watching over time to see which rules are more likely to be wrong.
These possibilities may seem weak, but they may turn out to provide just enough information. Remember, any indication that some examples are more likely to be useful is good, because it can cut down the pool of potential false negatives from the whole web to something much, much smaller.
An expert opinion
The other possibility is to ask the expert in advance how likely the system is to be wrong. Now, as I discussed, this whole RDR methodology is based around the idea that experts are good at justifying themselves in context, so it doesn't make much sense to ask the expert to look at an RDR system and say in advance how likely a given analysis is to be wrong. On the other hand, it might be possible to ask the expert, when they are creating a new rule: what is the probability that the rule will be wrong (the conclusion is wrong), given that it fires (its condition is met)? And, to get a little bit more rigorous, we would ideally also like to know: what is the probability that the rule's condition will be met, given that the rule's parent fired (the rule's parent's condition was met)?
The obvious problem with this is that the expert might not be able to answer these questions, at least with any useful accuracy. On the other hand, as I said above, any little indication is useful. Also, it's worth pointing out that what we need is not primarily probabilities, but rather a ranking or ordering of the candidates for expert evaluation, so that we know which is the most likely to be useful (rather than exactly how likely it is to be useful).
Also the calculations of probabilities could turn out to be quite complex :)
Here's what I consider a minimal RDR tree for the purposes of calculating probabilities, with some hypothetical (imaginary) given probabilities.

Let me explain. Rule 0 is the default rule (the starting point for all RDR systems). It fires 100% of the time, and in this case it is presumed to be right 99% of the time (simulating the needle-in-a-haystack scenario). Rules 1 and 2 are exceptions to rule 0, and will be considered only when rule 0 fires (which is all the time because it is the default rule). Rule 3 is an exception to rule 2, and will be considered only when rule 2 fires.
The conclusions of rules 0 and 3 are (implicitly) 'hay' (uninteresting), while the conclusions of rules 1 and 2 are (implicitly) 'needle' (interesting). This is because the conclusion of every exception rule needs to be different from the conclusion of the parent rule.
The percentage for 'Fires' represents the expert's opinion of how likely the rule is to fire (have its condition met) given that the rule is reached (its parent is reached and fires). The percentage for 'Correct' represents the expert's opinion of how likely the rule's conclusion is to be correct, given that the rule is reached and fires.
With this setup, you can start to calculate some interesting probabilities, given knowledge of which rules fire for a given example. For example, what is the probability of 'needle' given that rules 1 and 2 both fire, but rule 3 doesn't? (This is assumedly the most positive indication of 'needle' we can get.) What difference would it make if rule 3 did fire? If you can answer either of these questions, leave a comment.
If no rules fire, for example, the probability of 'needle' is 0.89%, which is only very slightly less than the default probability of 'needle' before using the system, which was 1%. Strange, isn't it?
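Before calculating anything, you need to know which rules fire for a given example. Here's a minimal sketch of the four-rule tree described above (rule 0 with exceptions 1 and 2, and rule 3 as an exception to rule 2). The conditions are placeholders keyed on made-up attributes 'a', 'b' and 'c', and I've left the probabilities out entirely, so this shows only the structure, not the calculation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    name: str
    conclusion: str                      # 'hay' or 'needle'
    condition: Callable[[dict], bool]    # placeholder condition
    exceptions: List["Rule"] = field(default_factory=list)


def fired_rules(rule, example):
    """Names of the rules that are reached and fire, in tree order.

    An exception is only considered when its parent fires.
    """
    if not rule.condition(example):
        return []
    fired = [rule.name]
    for exception in rule.exceptions:
        fired.extend(fired_rules(exception, example))
    return fired


# Hypothetical conditions; only the shape of the tree comes from the post.
rule3 = Rule("3", "hay", lambda e: e.get("c", False))
rule2 = Rule("2", "needle", lambda e: e.get("b", False), [rule3])
rule1 = Rule("1", "needle", lambda e: e.get("a", False))
rule0 = Rule("0", "hay", lambda e: True, [rule1, rule2])
```

For example, fired_rules(rule0, {"a": True, "b": True}) gives rules 0, 1 and 2, which is the "most positive indication" case from the question above.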
Labels: artificial intelligence, ben, research
Tuesday, July 17, 2007
Ripple Down Rules
Ripple down rules is a knowledge acquisition methodology developed at the University of New South Wales. It's really simple - it's about incrementally creating a kind of decision tree based on an expert identifying what's wrong with the current decision tree. It works because the expert only needs to justify their conclusion that the current system is wrong in a particular case, rather than identify a universal correction that needs to be made, and also the system is guaranteed to be consistent with the expert's evaluation of all previously seen data (though overfitting can obviously still be a problem).
The application of ripple down rules to deep web commons is simply this: once you have a general method for flattening web forms, you can use the flattened web form as input to the ripple down rules system and have the system decide if the web form hides commons.
But how do you create rules from a list of text strings without even a known size (for example, there could be any number of options in a select input (dropdown list), and any number of select inputs in a form)? The old "IF weather = 'sunny' THEN play = 'tennis'" type of rule doesn't work. One solution is to make the rule conditions more like questions, with rules like "IF select-option contains-word 'license' THEN form = 'commons'" (this is a suitable rule for Advanced Google Code Search). Still, I'm not sure this is the best way to express conditions. To put it another way, I'm still not sure that extracting a list of strings, of indefinite length, is the right way to flatten the form (see this post). Contact me if you know of a better way.
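Here's roughly what I mean by a question-like condition, as a sketch. It assumes the flattened form is a dictionary of lists of strings (that layout, the key name 'select_options' and the function names are all made up for this example), and it implements just the one rule quoted above.

```python
import re


def contains_word(strings, word):
    """True if any of the strings contains `word` as a whole word."""
    target = word.lower()
    return any(target in re.findall(r"[a-z0-9]+", s.lower()) for s in strings)


def classify(form, default="not commons"):
    """IF select-option contains-word 'license' THEN form = 'commons'."""
    if contains_word(form.get("select_options", []), "license"):
        return "commons"
    return default
```

The nice thing is that the condition doesn't care how many select options there are, which is the whole problem with the fixed-attribute kind of rule.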
A probabilistic approach?
As I have said, one of the most interesting issues I'm facing is the needle in a haystack problem, where we're searching for (probably) very few web forms that hide commons, in a very very big World Wide Web full of all kinds of web forms.
Of course computers are good at searching through lots of data, but here's the problem: while you're training your system, you need examples of the system being wrong, so you can correct it. But how do you know when it's wrong? Basically, you have to look at examples and see if you (or the expert) agree with the system. Now in this case we probably want to look through all the positives (interesting forms), so we can use any false positives (uninteresting forms) to train the system, but that will quickly train the system to be conservative, which has two drawbacks. Firstly, we'd rather it wasn't conservative because we'd be more likely to find more interesting forms. Secondly, because we'll be seeing fewer errors in the forms classified as interesting, we have fewer examples to use to train the system. And to find false negatives (interesting forms incorrectly classified as uninteresting), the expert has to search through all the examples the system doesn't currently think are interesting (and that's about as bad as having no system at all, and just browsing the web).
So the solution seems, to me, to be to change the system, so that it can identify the web form that it is most likely to be wrong about. Then we can get the most bang (corrections) for our buck (our expert's time). But how can anything like ripple down rules do that?
Probabilistic Ripple Down Rules
This is where I think the needle in a haystack problem can actually be an asset. I don't know how to make a system that can tell how close an example is to the boundary between interesting and uninteresting (the boundary doesn't really exist, even). But it will be a lot easier to make a system that predicts how likely an example is to be an interesting web form.
This way, if the most likely of the available examples is interesting, it will be worth looking at (of course), and if it's classified as not interesting, it's the most likely to have been incorrectly classified, and provide a useful training example.
I will talk about how it might be possible to extract probabilities from a ripple down rules system, but this post is long enough already, so I'll leave that for another post.
Labels: artificial intelligence, ben, research
Thursday, July 12, 2007
Flattening a web form
The first problem is how to represent a web form in such a way that it can be used as an input to an automated system that can evaluate it. Ideally, in machine learning, you have a set of attributes that form a vector, and then you use that as the input to your algorithm. For example, in tic-tac-toe, you might represent a cross by -1, a nought by +1, and an empty space by 0, and then the game can be represented by 9 of these 'attributes'.
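As a quick illustration of the tic-tac-toe encoding, here is a small Python sketch. The board layout (three strings of three characters) and the name encode_board are my own choices for the example, not part of any particular library.

```python
# Mapping from the post: cross -> -1, nought -> +1, empty space -> 0.
ENCODING = {"X": -1, "O": 1, " ": 0}


def encode_board(board):
    """Flatten a 3x3 board (three strings of three cells) into 9 numbers."""
    if len(board) != 3 or any(len(row) != 3 for row in board):
        raise ValueError("board must be 3 rows of 3 cells")
    return [ENCODING[cell] for row in board for cell in row]


board = ["XO ",
         " X ",
         "O  "]
vector = encode_board(board)
print(vector)
# [-1, 1, 0, 0, -1, 0, 1, 0, 0]
```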
But for web forms it's not that simple. There are a few parts of the web form that are different from each other. I've identified these potentially useful places, of which there may be one or more, and all of which take the form of text. These are just the ones I needed when considering Advanced Google Code Search:
- Form text. The actual text of the web form. E.g. "Advanced Code Search About Google Code Search Find results with the regular..."
- Select options. Options in drop-down boxes. E.g. "any language", "Ada", "AppleScript", etc.
- Field names. Underlying names of the various fields. E.g. "as_license_restrict", "as_license", "as_package".
- Result text. The text of each search result. E.g. (if you search for "commons"): "shibboleth-1.3.2-install/.../WrappedLog.java - 8 identical 26: package..."
- Result link name. Hyperlinks in the search results. E.g. "8 identical", "Apache"
But as far as I can tell, text makes for bad attributes; numerical attributes are much better. I'll talk about that more when I talk about ripple down rules.
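One way to picture the flattened form is as a record with one slot per text source from the list above, plus some crude way of turning that text into numbers. The Python sketch below is only an assumption about how that might look: the class FlattenedForm, the function keyword_counts and the choice of keywords are all hypothetical, and counting substrings is just the simplest numeric attribute I could think of.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlattenedForm:
    """The five text sources of a web form, each kept as plain text."""
    form_text: str = ""
    select_options: List[str] = field(default_factory=list)
    field_names: List[str] = field(default_factory=list)
    result_texts: List[str] = field(default_factory=list)
    result_link_names: List[str] = field(default_factory=list)


def keyword_counts(form: FlattenedForm, keywords: List[str]) -> Dict[str, int]:
    """Count case-insensitive substring occurrences of each keyword."""
    all_text = " ".join(
        [form.form_text, *form.select_options, *form.field_names,
         *form.result_texts, *form.result_link_names]
    ).lower()
    return {kw: all_text.count(kw.lower()) for kw in keywords}


# Values taken from the Advanced Google Code Search examples above.
form = FlattenedForm(
    form_text="Advanced Code Search About Google Code Search",
    select_options=["any language", "Ada", "AppleScript"],
    field_names=["as_license_restrict", "as_license", "as_package"],
    result_link_names=["8 identical", "Apache"],
)
counts = keyword_counts(form, ["license", "search"])
print(counts)
# {'license': 2, 'search': 2}
```

Each count could then serve as one numeric attribute in the vector, in the same spirit as the tic-tac-toe example.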
A handful of needles in a field of haystacks
The other problem is more about what we're actually looking for. We're talking about web forms that hide commons content. Well, the interesting thing about that is that there are bound to be very few, compared to the rest of the web forms on the Internet. Heck, they're not even all for searching. Some are for buying things. Some are polls.
And so, if, as seems likely, most web forms are uninteresting, if we need to enlist an expert to train the system, the expert is going to be spending most of the time looking at uninteresting examples.
This makes it harder, but in an interesting way: if I can find some way to have the system, while it's in training, find the most likely candidate of all the possible candidates, it could solve this problem. And that would be pretty neat.
Labels: ben, deep web, research
Tuesday, July 10, 2007
The deep web
Okay, so here's what I'm looking at. It's called the deep web, and it refers to the web documents that the search engines don't know about.
Sort of.
Actually, when the search engines find these documents, they really become part of the surface web, in a process sometimes called surfacing. Now I'm sure you're wondering: what kinds of documents can't search engines find, if they're the kind of documents anyone can browse to? The simple answer is: documents that no other pages link to. But a more realistic answer is that it's documents hidden in databases, that you have to do searches on the site to find. They'll generally have URLs, and you can link to them, but unless someone does, they're part of the deep web.
Now this is just a definition, and not particularly interesting in itself. But it turns out (though I haven't counted, myself) that there are more accessible web pages in the deep web than in the surface web. And they're not beyond the reach of automated systems - the systems just have to know the right questions to ask and the right place to ask the question. Here's an example, close to Unlocking IP. Go to AEShareNet and do a search, for anything you like. The results you get (when you navigate to them) are documents that you can only find by searching like this, or if someone else has done this, found the URL, and then linked to it on the web.
Extracting (surfacing) deep web commons
So when you consider how many publicly licensed documents may be in the deep web, it becomes an interesting problem from both the law / Unlocking IP perspective and from the computer science perspective, which I'm really happy about. What I'm saying here is that I'm investigating ways of making automated systems to discover deep web commons. And it's not simple.
Lastly, some examples
I wanted to close with two web sites that I think are interesting in the context of deep web commons. First, there's SourceForge, which I'm sure the Planet Linux Australia readers will know (for the rest: it's a repository for open source software). It's interesting, because their advanced search feature really doesn't give many clues about it being a search for open source software.
And then there's the Advanced Google Code Search, which searches for publicly available source code, which generally means free or open source, but sometimes just means available, because Google can't figure out what the licence is. This is also interesting because it's not what you'd normally think of as deep web content. After all, Google's just searching for stuff it found on the web, right? Actually, I class this as deep web content because Google is (mostly) looking inside zip files to find the source code, so it's not stuff you can find in regular search.
This search, as compared to SourceForge advanced search, makes it very clear you're searching for things that are likely to be commons content. In fact, I came up with 6 strong pieces of evidence that lead me to believe Google Code Search is commons related.
(As a challenge to my readers, see how many pieces of evidence you can find that the Advanced Google Code Search is a search for commons (just from the search itself), and post a comment).
Labels: ben, deep web, research
This announcement acknowledges that GPLv3 "is an improved version of the license to better suit the needs of Free Software in the 21st Century," saying "We feel this is an important change to help promote the interests of Samba and other Free Software."
Unfortunately, the announcement doesn't say much about how Samba made their decision or what swayed them.
Labels: ben, free software, gpl
Friday, July 06, 2007
This is set to be a continuing theme in my research. Not because it's particularly valuable in the field of computer science, but because in the (very specific) field of online commons research, no one else seems to be doing much. (If you know something I don't about where to look for the research on this, please contact me now!)
I wish I could spend more time on this. What I'd do if I could would be another blog post altogether. Suffice it to say that I envisaged a giant machine (completely under my control), frantically running all over the Internets counting documents and even discovering new types of licences. If you want to hear more, contact me, or leave a comment here and convince me to post on it specifically.
So what do I have to say about this? Actually, so much that the subject has its own page. It's on unlockingip.org, here. It basically surveys what's around on the subject, and a fair bit of that is my research. But I would love to hear about yours or anyone else's, published, unpublished, even conjecture.
Just briefly, here's what you can currently find on the unlockingip.org site:
- My SCRIPT-ed paper
- My research on the initial uptake of the Creative Commons version 2.5 (Australia) licence
- Change in apparent Creative Commons usage, June 2006 - March 2007
- Creative Commons semi-official statistics
I'm also interested in the methods of quantification. With the current technologies, what is the best way to find out, for any given licence, how many documents (copyrighted works) are available with increased public rights? This is something I need to put to Creative Commons, because their licence statistics page barely addresses this issue.
Labels: ben, quantification, research
Thursday, July 05, 2007
The reason I haven't been blogging much (apart from laziness, which can never be ruled out) is that The House of Commons has become something of an IP blog. Okay, it sounds obvious, I know. And, as it seems I say at every turn, I have no background in law, and my expertise is in computer science and software engineering. And one of the unfortunate aspects of the blog as a medium is that you don't really know who's reading it. The few technical posts I've done haven't generated much feedback, but then maybe that's my fault for posting so rarely that the tech folks have stopped reading.
So the upshot of this is a renewed effort by me to post more often, even if it means technical stuff that doesn't make sense to all our readers. It's not like I'm short of things to say.
To start with, in the remainder of this post, I want to try to put into words, as generally as possible,

