The House of Commons

Thursday, May 01, 2008

Get a list of all (indexable) URLs on a site from the Wayback Machine

Earlier this year I complained about the problem with search engines. Today, Alexander Osborne (from the National Library of Australia) corrected me, at least a little bit.

I said I'd like to see an interface that (among other nice-to-haves) answers questions like "give me everything you've got from cyberlawcentre.org/unlocking-ip", and it turns out that that's actually possible with The Wayback Machine. Not in a single request, that I know of, but with this (simple old HTTP) request: http://web.archive.org/web/*xm_/www.cyberlawcentre.org/unlocking-ip/*, you can get a list of all URLs under cyberlawcentre.org/unlocking-ip, and then if were to want to you could do a n o t h e r H T T P r e q u e s t f o r e a c h U R L.

Pretty cool actually, thanks Alex.

Now I wonder how big you can scale those requests up to... I wonder what happens if you ask for www.*? Or (what the heck, someone has to say it) just '*'. I guess you'd probably break the Internet...

Labels: ben, search

(permalink) posted by Ben Bildstein @ Thursday, May 01, 2008 0 comments links to this post

Wednesday, April 30, 2008

Quantifying open software using Google Code Search

Google Code Search lets you search for source code files by licence type, so of course I was interested in whether this could be used for quantifying indexable source code on the web. And luckily GCS lets you search for all works with a given licence. (If you don't understand why that's a big deal, try doing a search for all Creative Commons licensed work using Google Search.) Even better, using the regex facility you can search for all works! You sure as heck can't do that with a regular Google web search.

Okay, so here's the latest results, including hyperlinks to searches for you to try them yourself:

all (by regex: .*) : 36,700,000
gpl : 8,960,000
lgpl : 4,640,000
bsd : 3,110,000
mit : 903,000
cpl : 136,000
artistic : 192
apache : 156
disclaimer : 130
python : 108
zope : 103
mozilla : 94
qpl : 86
ibm : 67
sleepycat : 51
apple : 47
lucent : 19
nasa : 15
alladin : 9

And here's a spreadsheet with graph included: However, note the discontinuity (in absolute and trend terms) between approximate and specific results in that (logarithmic) graph, which suggests Google's approximations are not very good.

Labels: ben, free software, licensing, quantification, search

(permalink) posted by Ben Bildstein @ Wednesday, April 30, 2008 0 comments links to this post

Tuesday, April 08, 2008

Table comparing Yahoo and Google's commons-based advanced search options

Hi commons researchers,

I just did this analysis of Google's and Yahoo's capacities for search for commons (mostly Creative Commons because that's in their advanced search interfaces), and thought I'd share. Basically it's an update of my research from Finding and Quantifying Australia's Online Commons. I hope it's all pretty self-explanatory. Please ask questions. And of course point out flaws in my methods or examples.

Also, I just have to emphasise the "No" in Yahoo's column in row 1: yes, I am in fact saying that the only jurisdiction of licences that Yahoo recognises is the US/unported licences, and that they are in fact ignoring the vast majority of Creative Commons licences. (That leads on to a whole other conversation about quantification, but I'll leave that for now.)

(I've formatted this table in Courier New so it should come out well-aligned, but who knows).

Feature                       | Google | Yahoo |
------------------------------+--------+-------+
1. Multiple CC jurisdictions  | Yes    | No    | (e.g.)
2. 'link:' query element      | No     | Yes   | (e.g. G, Y)
3. RDF-based CC search        | Yes    | No    | (e.g.)
4. meta name="dc:rights" *    | Yes    | ? **  | (e.g.)
5. link-based CC search       | No     | Yes   | (e.g.)
6. Media-specific search      | No     | No    | (G, Y)
7. Shows licence elements     | No     | No    | ****
8. CC public domain stamp *** | Yes    | Yes   | (e.g.)
9. CC-(L)GPL stamp            | No     | No    | (e.g.)

* I can't rule out Google's result here actually being from <a rel="license"> in the links to the license (as described here: http://microformats.org/wiki/rel-license).
** I don't know of any pages that have <meta name="dc:rights"> metadata (or <a rel="licence"> metadata?) but don't have links to licences.
*** Insofar as the appropriate metadata is present.
**** (i.e. doesn't show which result uses which licence)

Notes about example pages (from rows 1, 3-5, 8-9):

To determine whether a search engine can find a given page, first look at the page and find enough snippets of content that you can create a query that definitely returns that page, and test that query to make sure the search engine can find it (e.g. '"clinton lies again" digg' for row 8). Then do the same search as an advanced search with Creative Commons search turned on and see if the result is still found.
The example pages should all be specific with respect to the feature they exemplify. E.g. the Phylocom example from row 9 has all the right links, logos and metadata for the CC-GPL, and particularly does not have any other Creative Commons licence present, and does not show up in search results.

Labels: ben, Creative Commons, quantification, search

(permalink) posted by Ben Bildstein @ Tuesday, April 08, 2008 0 comments links to this post

Tuesday, February 19, 2008

The problem with search engines

There is a problem with search engines at the moment. Not any one in particular - I'm not saying Google has a problem. Google seems to be doing what they do really well. Actually, the problem is not so much something that is being done wrong, but something that is just not being done. Now, if you'll bear with me for a moment...

The very basics of web search

Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies:

Web crawling - downloading web pages; discovering new web pages
Indexing - like the index in a book: figure out which pages have which features (meaning keywords, though there may be others), and store them in separate lists for later access
Performing searches - when someone wants to do a keyword search, for example, the search engine can look up the keywords in the index, and find out which pages are relevant

None of these is trivial. I'm no expert, but I suggest indexing is the easiest. Performing searches well is what made Google so successful, where previous search engines had been treating the search step more trivially.

But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.

A bit more about the web crawler

There are lots of tricky technical issues about how to do the best crawl - to cover as many pages as possible, to have the most relevant pages possible, to maintain the latest version of the pages. But I'm not worried about this now. I'm just talking about the fundamental problem of downloading web pages for later use.

Anyone who is reading this and hasn't thought about the insides of search engines before is probably wondering at the sheer amount of downloading of web pages required, and storing them. And you should be.

They're all downloading the same data

So a single search engine basically has to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be publicly using their data so we may not see them. There clearly are more at least than I've ever heard of - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.

My point is why does everyone have to download the same data? Why isn't there some open crawler somewhere that's doing it all for everyone, and then presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* be critical in listening to them. I'm not saying here that Google should give away their data - it would have to be worth $millions to them. I'm not saying anyone else should be giving away all their data. But I am saying that there should be someone doing this, from an economic point of view - everyone is downloading the same data, and there's a cost to doing that, and the cost would be smaller if they could get together and share their data.

Here's what I'd like to see specifically:

A good web crawler, crawling the web and thus keeping an up-to-date cache of the best parts of the web
An interface that lets you download this data, or diffs from a previous time
An interface that lets you download just some. E.g. "give me everything you've got from cyberlawcentre.org/unlocking-ip" or "give me everything you've got from *.au (Australian registered domains)" or even "give me everything you've got that links to http://labs.creativecommons.org/licenses/zero-assert/1.0/us/"
Note that in these 'interface' points, I'm talking about downloading data in some raw format, that you can then use to, say, index and search with your own search engine.

If you know somewhere this is happening, let me know, because I can't find it. I think the Wayback Machine is the closest to an open access Web cache, and http://archive-access.sourceforge.net/ is the closest I've found to generalised access to the Wayback Machine. I'll read more about it, and let you know if it comes up trumps.

* I know.

Labels: ben, open access, quantification, search

(permalink) posted by Ben Bildstein @ Tuesday, February 19, 2008 2 comments links to this post

Thursday, February 14, 2008

CC0 - Creative Commons' new solution for the public domain

Creative Commons have come up with a better way for people to mark works as copyright-free, or part of the public domain. It's called CC0 (CC Zero). The page for using it is here: http://labs.creativecommons.org/license/zero.

There are two options here. The first is a waiver, where you can "waive all copyrights and related or neighboring interests that you have over a work". The second is an assertion, where you can "assert that a work is free of copyright as a matter of fact, for example, because the work is old enough to be in the public domain, or because the work is in a class not protected by copyright such as U.S. government works."

It's pretty neat. I've thought the idea of asserting a work's copyright status, as a matter of fact, was a good idea, and not just limited to the public domain, but also for other classes of usage rights.

Okay, so that's basically the CC0 story. I've tried it out with a trivial web page I think I would otherwise have copyright in - the result is at the bottom of this post. But I must say I'm slightly disappointed in the lack of embedded metadata. Where's the RDF? As I've talked about before, when you do things with RDF, you allow sufficiently cool search engines to understand your new technology (or licence) simply by seeing it, without first having to be told about it.

Here's my example waiver:

To the extent possible under law,
Ben Bildstein has waived all copyright, moral rights, database rights, and any other rights that might be asserted over Sensei's Library: Bildstein/Votes.

Labels: ben, Creative Commons, search

(permalink) posted by Ben Bildstein @ Thursday, February 14, 2008 0 comments links to this post

Friday, August 24, 2007

My software found something!

Ever heard of Softpedia? I hadn't. Until the web crawler I wrote found it, and the ripple down rules system I created flagged it as interesting.

Here's the run-down:

"Softpedia is a library of over 35,000 free and free-to-try software programs for Windows and Unix/Linux,games and drivers. We review and categorize these products in order to allow the visitor/user to find the exact product they and their system needs," from the help page.
Wikipedia has a page on it, which interestingly says that "it is one of the top 500 websites according to Alexa traffic ranking."

Lastly, here is a link to the Softpedia search page: http://www.softpedia.com/progSearch. Try it out.

Labels: ben, search

(permalink) posted by Ben Bildstein @ Friday, August 24, 2007 0 comments links to this post

Friday, December 08, 2006

Positive feedback

I think it was Brendan Scott who coined the term kablooey (see this presentation in PDF format) for the positive feedback of Free and Open Source Software, and particularly the GNU GPL. But recently I've got to thinking: why is it that both Google and Yahoo (and hey, even Nutch) are so big on Creative Commons? Why didn't this ever happen for software? Or any other given licence?

Sure, it's lots of reasons: Creative Commons has some funky technology, like RDF, that makes it easier for search engines to find licensed stuff. They've got brilliant marketing. And there are other reasons.

But something kind of disturbing just occurred to me. These search engines have Creative Commons features, and no other open content features. That's okay, but it does, in a sense, advertise Creative Commons. And this creates positive feedback: people use Creative Commons, search engines enhance support for Creative Commons, more people learn about Creative Commons, and more people use it. Etc.

So I hear people saying "yeah, but that's natural - the cream rises to the top". But it's more than that. There are people who have never heard of any type of open content licensing. And now Creative Commons is going to be the first they hear about the idea.

In summary, I think the other great licences may have been ahead of their time. Creative Commons came along at just the right time, with Web 2.0 happening and all that. And all those people who'd never given a thought to how to share legally were waiting, even if they didn't know it.

Labels: ben, search

(permalink) posted by Ben Bildstein @ Friday, December 08, 2006 0 comments links to this post

Wednesday, November 29, 2006

A Challenge to Search Engines

The spark

I've recently been redrafting my Unlocking IP conference paper for publication, and it got me to thinking. Google (as did Nutch, when it was online at creativecommons.org) has a feature for searching for Creative Commons works, as part of Advanced Search. Graham Greenleaf, my supervisor, asked me the other day how they keep up with all the various licences - there are literally hundreds of them when you consider all the various combinations of features, the many version numbers, and the multitude of jurisdictions. Yahoo, for example, only searches for things licensed under the American 'generic' licences, and no others. But Google seems to be able to find all sorts.

Now I'm not one to doubt Google's resources, and it could well be that as soon as a new licence comes out they're all over it and it's reflected in their Advanced Search instantaneously. Or, more likely, they have good communication channels open with Creative Commons.

But it did occur to me that if I were doing it - just little old me all by myself, I'd try to develop a method that didn't need to know about the new licences to be able to find them.

As detailed in my paper, it appears that Google's Creative Commons search is based on embedded metadata. As I have said previously, I understand this standpoint, because it is, if nothing else, unambiguous (compared with linking to licences for example, which generates many false positives).

So if I were doing it, I'd pick out the metadata that's needed to decide if something is available for commercial use, or allows modification, or whatever, and I'd ignore the bit about which licence version was being used, and its jurisdiction, and those sorts of details that the person doing the search hasn't even been given the option to specify.

The challenge

Anyway, the challenge. I have created a web page that I hereby make available for everyone to reproduce, distribute, and modify. In keeping with the Creative Commons framework I require attribution and licence notification, but that's just a formality I'm not really interested in. I've put metadata in it that describes these rights, and it refers to this post's permalink as its licence. The web page is up, and by linking to it from this post, it's now part of the Web.

The challenge is simply this: Which search engines will find it and classify it as commons, creative or otherwise. Will I fool search engines into thinking it's Creative Commons content? Or will they look straight past it? Or will they rise to the challenge and see that what Creative Commons has started is bigger than just their licences, and the Semantic Web may as well be for every licence.

Let's give them a little time to index these pages, and we'll find out.

[post script: I've added a properly Creative Commons licensed page for comparison, that we can use to see when the search engines have come by (this page at least should turn up in search results).]

Labels: ben, search

(permalink) posted by Ben Bildstein @ Wednesday, November 29, 2006 1 comments links to this post

Tuesday, October 17, 2006

SWS: Conclusion and Retrospective

In my last three posts (see part 1, part 2 and part 3), I have been exploring the potential for a search engine that focuses solely on commons content (i.e. works with expanded public rights). Now I’d like to touch on where this idea might fit into the broader scheme of things, including what its shortcomings might be.

At present, Google web search is the benchmark in Internet search. I consider Google to have almost unlimited resources, so for any given feature that Google web search lacks, it is worth considering why they do not have it. In general, I think the answer to this question is either that it will clutter their user interface, it will not be of significant use to the public, or it is computationally intractable. However, given that Google have many services, many in beta and many still in Google Labs, the first problem of cluttering the user interface can clearly be solved by making a new search a new service (for example, Google Scholar). This leaves only the possibilities of the feature not being of significant use, or being computationally intractable.

As this clearly demonstrates, there is no single or simple way to establish works as part of the commons. This may go some way to answering the question of why Google hasn't done this yet. However, Google has implemented at least three of these ideas: in the advanced search, Creative Commons-based usage rights can be specified; in the new Code Search (beta), various text-based licences are recognised by the system; and in Google Book Search (beta), authorship is used to establish public-domain status (in some small subset of Google's scanned books). Google hasn’t done anything all encompassing yet, but it may be that it’s just around the corner, or it may be that Google has figured out that it’s not really what the public want. Depending on how the big players (Google, but also Yahoo and others) proceed, my research may move more towards an analysis of either why so much was done by the big search engines (and what this means to the Unlocking IP project), or alternatively why so little was done (and what this means).

Lastly…

Lastly, I should consider what this idea of a semantic web search engine, as presented, is lacking. First, there is the issue that licence metadata (URLs, if nothing else) need to be entered by someone with such knowledge – the system can not discover these on its own. Second, there are the issues of false positives (web pages that are erroneously included in search results) and false negatives (suitable web pages that are erroneously excluded from search results). The former is the most obvious problem from a user's perspective, and my description here focuses mostly on avoiding these false positives. The latter problem of false negatives is much harder to avoid, and obviously is exacerbated by the system's incomplete domain knowledge, and the inevitable non-standard licensing that will be seen on the web.

Thanks for reading all the way to the end of my last post. If you got this far, stay tuned – I’m sure there’s more to come.

Labels: ben, search

(permalink) posted by Ben Bildstein @ Tuesday, October 17, 2006 0 comments links to this post

SWS: The Tricky Bits

(This is part 3 of a series of posts on semantic web search for commons. For the previous posts, see here: part 1, part 2)

Another possible expansion to the proposed semantic web search engine is to include non-web documents hosted on the web. Google, for example, already provides this kind of search with Google Image Search, where although Google can not distil meaning from the images that are available, it can discover enough from their textual context that they can be meaningfully searched.

Actually, from the perspective of semantic web search as discussed in this series, images need not be considered, because if an image is included in a licensed web page, then that image can be considered licensed. The problem is actually more to do with non-html documents that are linked-to from html web pages, hence where the search engine has direct access to them but prima facie has no knowledge of their usage rights.

This can be tackled from multiple perspectives. If the document can be rendered as text (i.e. it is some format of text document) then any other licence mechanism detection features can be applied to it (for example, examining similarity to known text-based licences). If the document is an archive file (such as a zip or tar), then any file in the archive that is an impromptu licence could indicate licensing for the whole archive. Also, any known RDF – especially RDF embedded in pages that link to the document – that makes usage-rights statements about the document can be considered to indicate licensing.

Public domain works

The last area I will consider, which is also the hardest, is that of public domain works. These works are hard to identify because they need no technical mechanism to demonstrate that they are available for public use – simply being created sufficiently long ago is all that is needed. Because the age of the web is so much less than the current copyright term, only a small portion of the actual public domain is available online, but some significant effort has been and is continuing to be made to make public domain books available in electronic form.

The simplest starting point for tackling this issue is Creative Commons, where there is a so-called 'public domain dedication' statement that can be linked to and referred to in RDF to indicate either that the author dedicates the work to the public domain, or that it is already a public domain work (the latter is a de facto standard, although not officially promoted by Creative Commons). Both of these fit easily into the framework so far discussed.

Beyond this, it gets very difficult to establish that a web page or other document is not under copyright. Because there is no (known) standard way of stating that a work is in the public domain, the best strategy is likely to be to establish the copyright date (date of authorship) in some way, and then to infer public domain status. This may be possible by identifying copyright notices such as "Copyright (c) 2006 Ben Bildstein. All Rights Reserved." Another possibility may be to identify authors of works, if possible, and then compare those authors with a database of known authors' death dates.

But wait, there’s more

In my next and last post for this series, I will consider where Google is up to with the tackling of these issues, and consider the problems in the framework set out here. Stay tuned.

Labels: ben, search

(permalink) posted by Ben Bildstein @ Tuesday, October 17, 2006 0 comments links to this post

SWS: Many Other Mechanisms

From the starting point of semantic web search just for Creative Commons (CC) licensed works using RDF/XML metadata (see my previous post), we can expand the idea to include other mechanisms and other kinds of documents. For example, the AEShareNet licensing scheme does not promote RDF, but instead web pages simply link to AEShareNet's copy of the licence, and display the appropriate image. A semantic web search engine could then use this information to establish that such pages are licensed, and in combination with an administrator entering the appropriate rights metadata for these licences, such documents can be included in search results. Using this new architecture of link-based licensing identification, we can also expand the Creative Commons search to include pages that link to their licences but for one reason or another do not include the relevant metadata (this is the primary method that Yahoo advanced search uses). Note that such link-based searching will inevitably include false positives in the search results.

The next mechanism that can be considered is that of HTML 'meta' tags. This is part of the HTML format, and is an alternative (and older) way of putting metadata into web pages. The same information can be carried, and given the nature of 'meta' tags they are unambiguous in a similar way to RDF, so false positives should not be a problem.

Another possibility is that the RDF that describes the rights in a page will not be embedded in that page, but will exist elsewhere. This is not too much of an issue, because the search engine can certainly be made capable of reading it and correctly interpreting it. However, it is worth noting that we should have less confidence in such 'foreign RDF' than we would in locally embedded RDF, because it is more likely than otherwise to be a demonstration or illustrative example, rather than a serious attempt to convey licensing information.

Text-based licences

One mechanism that poses significant challenges is what I call 'text-based licences', as compared with RDF-based (e.g. CC) or link-based (e.g. AEShareNet) licences. What I mean by ‘text-based’ is that the licence text is duplicated and included with each licensed work. This raises two problems: What happens if the licence text undergoes some slight changes? And what happens if the licence text is stored in a different document to the document that is a candidate for a search result? (This is common practice in software, as well as uses of the LGPL in multi-page documents. Wikipedia is a prime example of the latter.)

The first question can be answered fairly simply, although the implementation might not be as easy: the search engine needs a feature that allows it to compare a licence text to a document to see if they are (a) not similar, (b) similar enough that the document can be considered a copy, or (c) similar, but with enough extra content in the document that it can be considered a licensed document in its own right.

The other question is more tricky. One possible solution might be to keep a database of all such copies of known licences (what I term 'impromptu' licences), and then, using the functionality of establishing links to licences, record every web page that links to such an impromptu licence as licensed under the original licence. This idea will be useful for all types of text-based licensed, from free software to open content.

But wait, there’s more

Stay tuned for the third instalment, where I will talk about how to utilise non-web content, and the difficulties with public domain works.

Labels: ben, search

(permalink) posted by Ben Bildstein @ Tuesday, October 17, 2006 0 comments links to this post

Tuesday, October 10, 2006

Semantic Web Search for Commons: An Introduction

As part of my research, I’ve been thinking about the problem of searching the Internet for commons content (for more on the meaning of ‘commons’, see Catherine’s post on the subject). This is what I call the problem of ‘semantic web search for commons’, and in this post I will talk a little about what that means, and why I’m using ‘semantic’ with a small ‘s’.

When I say 'semantic web search', what I’m talking about is using the specific or implied metadata about web documents to allow you to search for specific classes of documents. And of course I am talking specifically about documents with some degree of public rights (i.e. reusability). But before I go any further, I should point out that I’m not talking about the Semantic Web in the formal sense. The Semantic Web usually refers to the use of Resource Description Framework (RDF), Web Ontology Language (OWL) and Extensible Markup Language (XML) to express semantics (i.e. meaning) in web pages. I am using the term (small-s ‘semantic’) rather more generally, to refer more broadly to all statements that can meaningfully be made about web pages and other online documents, not necessarily using RDF, OWL or XML.

Commons content search engine

Now let me go a bit deeper into the problem. First, for a document to be considered 'commons' content (i.e. with enhanced public rights), there must be some indication, usually on the web and in many cases in the document itself. Second, there is no single standard on how to indicate that a document is commons content. Third, there is a broad spectrum of kinds of commons, which breaks down based on the kind of work (multi-media, software, etc.) and on the legal mechanisms (public domain, FSF licences, Creative Commons licences, etc.).

So let us consider the simplest case now, and in future posts I will expand on this. The case I shall consider first is that of a web page that is licensed with a Creative Commons (CC) licence, labelled as recommended by Creative Commons. This web page will contain embedded (hidden) RDF/XML metadata explaining its rights. This could be used as the basis for providing a preliminary semantic web search engine, by restricting results to only those pages that have the appropriate metadata for the search. This can then be expanded to include all other CC licences, and then the search interface can be expanded to include various categories such as 'modifiable' etc., and, in the case of CC, even jurisdiction (i.e. licence country). It is worth noting at this point that the details of each individual licence, specifically URL and jurisdiction, are data that essentially represent domain knowledge, meaning that they will have to be entered by an administrator of the search engine.

But wait there’s more

That’s it for now, but stay tuned for at least three more posts on this topic. In the next chapter, I will talk about the (many) other mechanisms that can be used to give something public rights. Then, I will consider non-web content, and the tricky issue of public domain works. Finally, I will look at where Google is at with these ideas, and consider the possible downfalls of a semantic web search engine as described in this series. Stay tuned.

Labels: ben, search

(permalink) posted by Ben Bildstein @ Tuesday, October 10, 2006 0 comments links to this post

Thursday, May 01, 2008

Get a list of all (indexable) URLs on a site from the Wayback Machine

Wednesday, April 30, 2008

Quantifying open software using Google Code Search

Tuesday, April 08, 2008

Table comparing Yahoo and Google's commons-based advanced search options

Tuesday, February 19, 2008

The problem with search engines

Thursday, February 14, 2008

CC0 - Creative Commons' new solution for the public domain

Friday, August 24, 2007

My software found something!

Friday, December 08, 2006

Positive feedback

Wednesday, November 29, 2006

A Challenge to Search Engines

Tuesday, October 17, 2006

SWS: Conclusion and Retrospective

SWS: The Tricky Bits

SWS: Many Other Mechanisms

Tuesday, October 10, 2006

Semantic Web Search for Commons: An Introduction

Contributors

On this page

Supporters

Archives

IP blogosphere