Monday, October 23, 2006
A Google search revealed that this topic has, unsurprisingly, been picked up by the world’s media and has been analysed on a number of sites. The majority of these sites have discussed the story with the same tongue-in-cheek quality as the AP report on the SMH site. However, all joking aside, it’s great that kids are learning about copyright and infringement. But why not take it one step further and introduce a “Copyright Commons” or similar patch? Surely the values espoused by the open content and open access movements are in the same vein as those that Boy Scouts aspire to. Sharing gives you the same warm, fuzzy feeling whether you’re sharing a canoe, tucker cooked at a campfire, or some content that you created last night on your computer. To achieve this patch, the Scouts could launch campaigns against copyright term extension in areas where copyright hasn’t been extended to life-plus-70 years or create plays based on the Eldred v Ashcroft litigation (characters could include Lawrence Lessig, the Justices of the United States Supreme Court and Mickey Mouse -- oh, hang on a second, the play probably can't include Mickey Mouse!)
Even if the Boy Scouts don’t pick up on my idea, at least this group will be unlikely, in the years to come, to ask that age-old question: “Is it legal to copy my CDs onto my iPod?” In fact, perhaps that should be the first lesson.
Tuesday, October 17, 2006
In my last three posts (see part 1, part 2 and part 3), I have been exploring the potential for a search engine that focuses solely on commons content (i.e. works with expanded public rights). Now I’d like to touch on where this idea might fit into the broader scheme of things, including what its shortcomings might be.
At present, Google web search is the benchmark in Internet search. I consider Google to have almost unlimited resources, so for any given feature that Google web search lacks, it is worth considering why Google does not have it. In general, I think the answer to this question is one of three things: the feature would clutter the user interface, it would not be of significant use to the public, or it is computationally intractable. However, given that Google has many services, many in beta and many still in Google Labs, the first problem of cluttering the user interface can clearly be solved by making a new search a new service (for example, Google Scholar). This leaves only the possibilities of the feature not being of significant use, or being computationally intractable.
As this clearly demonstrates, there is no single or simple way to establish works as part of the commons. This may go some way to answering the question of why Google hasn't done this yet. However, Google has implemented at least three of these ideas: in the advanced search, Creative Commons-based usage rights can be specified; in the new Code Search (beta), various text-based licences are recognised by the system; and in Google Book Search (beta), authorship is used to establish public-domain status (in some small subset of Google's scanned books). Google hasn’t done anything all-encompassing yet, but it may be that it’s just around the corner, or it may be that Google has figured out that it’s not really what the public want. Depending on how the big players (Google, but also Yahoo and others) proceed, my research may move more towards an analysis of either why so much was done by the big search engines (and what this means for the Unlocking IP project), or alternatively why so little was done (and what this means).
Lastly, I should consider what this idea of a semantic web search engine, as presented, is lacking. First, there is the issue that licence metadata (URLs, if nothing else) needs to be entered by someone with such knowledge – the system cannot discover it on its own. Second, there are the issues of false positives (web pages that are erroneously included in search results) and false negatives (suitable web pages that are erroneously excluded from search results). The former is the more obvious problem from a user's perspective, and my description here focuses mostly on avoiding these false positives. The latter problem of false negatives is much harder to avoid, and is obviously exacerbated by the system's incomplete domain knowledge and by the inevitable non-standard licensing that will be seen on the web.
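To make the two kinds of error concrete, here is a minimal sketch in Python. It assumes we somehow have a hand-checked set of pages that really are commons content, which is exactly what a real search engine would not have; the function name and the example URLs are mine, purely for illustration.

```python
def search_errors(returned, truly_commons):
    """Split a result list into the two kinds of error described above.

    `returned` is what the search engine showed; `truly_commons` is a
    hypothetical hand-checked set of pages that really are commons content.
    """
    returned, truly_commons = set(returned), set(truly_commons)
    false_positives = returned - truly_commons   # shown, but should not be
    false_negatives = truly_commons - returned   # suitable, but missing
    return false_positives, false_negatives


results = ["http://example.org/a", "http://example.org/b"]
checked = ["http://example.org/b", "http://example.org/c"]
wrongly_shown, wrongly_hidden = search_errors(results, checked)
```

A user only ever sees the first set, which is why false positives are the more visible problem; the second set is invisible unless someone goes looking for it.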
Thanks for reading all the way to the end of my last post. If you got this far, stay tuned – I’m sure there’s more to come.
(This is part 3 of a series of posts on semantic web search for commons. For the previous posts, see here: part 1, part 2)
Another possible expansion to the proposed semantic web search engine is to include non-web documents hosted on the web. Google, for example, already provides this kind of search with Google Image Search, where although Google cannot distil meaning from the images that are available, it can discover enough from their textual context that they can be meaningfully searched.
Actually, from the perspective of semantic web search as discussed in this series, images need not be considered, because if an image is included in a licensed web page, then that image can be considered licensed. The problem is actually more to do with non-HTML documents that are linked to from HTML web pages: the search engine has direct access to them but prima facie has no knowledge of their usage rights.
This can be tackled from multiple perspectives. If the document can be rendered as text (i.e. it is some format of text document) then any other licence mechanism detection features can be applied to it (for example, examining similarity to known text-based licences). If the document is an archive file (such as a zip or tar), then any file in the archive that is an impromptu licence could indicate licensing for the whole archive. Also, any known RDF – especially RDF embedded in pages that link to the document – that makes usage-rights statements about the document can be considered to indicate licensing.
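The archive case can be sketched roughly as follows. This is only an illustration under my own assumptions: the archive is a zip held in memory, every member can be read as text, and "is an impromptu licence" is approximated by a word-level similarity ratio from Python's difflib with an arbitrary threshold of 0.9. The function name and parameters are hypothetical.

```python
import io
import zipfile
from difflib import SequenceMatcher


def archive_licences(zip_bytes, known_licences, threshold=0.9):
    """Return the names of known licences that some file in the zip closely copies.

    `known_licences` maps a licence name to its full text (assumed to be
    entered by an administrator). The threshold is an arbitrary choice.
    """
    hits = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Treat every member as text; undecodable bytes are dropped.
            text = archive.read(info).decode("utf-8", errors="ignore")
            for name, licence_text in known_licences.items():
                matcher = SequenceMatcher(
                    None, licence_text.split(), text.split(), autojunk=False
                )
                if matcher.ratio() >= threshold:
                    hits.add(name)
    return hits
```

Comparing every file against every licence is slow for large archives, so a real system would probably look at likely candidates (small text files, names such as COPYING) first.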
Public domain works
The last area I will consider, which is also the hardest, is that of public domain works. These works are hard to identify because they need no technical mechanism to demonstrate that they are available for public use – simply being created sufficiently long ago is all that is needed. Because the age of the web is so much less than the current copyright term, only a small portion of the actual public domain is available online, but some significant effort has been and is continuing to be made to make public domain books available in electronic form.
The simplest starting point for tackling this issue is Creative Commons, where there is a so-called 'public domain dedication' statement that can be linked to and referred to in RDF to indicate either that the author dedicates the work to the public domain, or that it is already a public domain work (the latter is a de facto standard, although not officially promoted by Creative Commons). Both of these fit easily into the framework so far discussed.
Beyond this, it gets very difficult to establish that a web page or other document is not under copyright. Because there is no (known) standard way of stating that a work is in the public domain, the best strategy is likely to be to establish the copyright date (date of authorship) in some way, and then to infer public domain status. This may be possible by identifying copyright notices such as "Copyright (c) 2006 Ben Bildstein. All Rights Reserved." Another possibility may be to identify authors of works, if possible, and then compare those authors with a database of known authors' death dates.
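Both ideas can be sketched in a few lines of Python. Everything here is an assumption for illustration: the pattern only recognises notices shaped like the example above (and stops the holder's name at the first full stop), the life-plus-70 rule is just one possible term and varies by jurisdiction, and the death-date table is a made-up stand-in for a real database.

```python
import re

# Matches notices shaped like "Copyright (c) 2006 Ben Bildstein."
# The holder's name is taken to end at the first full stop or line break.
NOTICE = re.compile(r"Copyright\s+(?:\(c\)|©)?\s*(\d{4})\s+([^.\n]+)", re.IGNORECASE)


def copyright_notices(text):
    """Return (year, holder) pairs for every notice found in the text."""
    return [(int(year), holder.strip()) for year, holder in NOTICE.findall(text)]


def may_be_public_domain(death_year, current_year, term_years=70):
    """Guess public domain status under an assumed life-plus-`term_years` rule.

    The term is treated as running to the end of the calendar year, so the
    work is only counted as free from the following year onwards.
    """
    return current_year > death_year + term_years


# Hypothetical database of authors' death years (fictional entries).
DEATH_YEARS = {"Example Author": 1900, "Recent Author": 1950}

found = copyright_notices("Copyright (c) 2006 Ben Bildstein. All Rights Reserved.")
```

Note that a recent copyright year on a web page says little about the age of the underlying work (a 2006 notice may sit on a scan of a much older book), so the author-based check is probably the more reliable of the two.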
But wait, there’s more
In my next and last post for this series, I will consider where Google is up to with the tackling of these issues, and consider the problems in the framework set out here. Stay tuned.
From the starting point of semantic web search just for Creative Commons (CC) licensed works using RDF/XML metadata (see my previous post), we can expand the idea to include other mechanisms and other kinds of documents. For example, the AEShareNet licensing scheme does not promote RDF, but instead web pages simply link to AEShareNet's copy of the licence, and display the appropriate image. A semantic web search engine could then use this information to establish that such pages are licensed, and in combination with an administrator entering the appropriate rights metadata for these licences, such documents can be included in search results. Using this new architecture of link-based licensing identification, we can also expand the Creative Commons search to include pages that link to their licences but for one reason or another do not include the relevant metadata (this is the primary method that Yahoo advanced search uses). Note that such link-based searching will inevitably include false positives in the search results.
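As a rough sketch of link-based identification, the following Python uses the standard library's HTML parser to collect links that point at known licences. The table of licences and its rights metadata stand in for what an administrator would enter; the second URL and the "modifiable" field are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical administrator-entered table: licence URL -> rights metadata.
KNOWN_LICENCES = {
    "http://creativecommons.org/licenses/by/2.5/": {"modifiable": True},
    "http://example.org/licences/share-only": {"modifiable": False},
}


class LicenceLinkFinder(HTMLParser):
    """Collects every <a href="..."> that points at a known licence URL."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href in KNOWN_LICENCES:
            self.found.append(href)


def linked_licences(html):
    """Return the known licence URLs that a page links to, in page order."""
    finder = LicenceLinkFinder()
    finder.feed(html)
    return finder.found


licensed = '<p>Licensed under <a href="http://creativecommons.org/licenses/by/2.5/">CC BY</a>.</p>'
review = '<p>I dislike <a href="http://creativecommons.org/licenses/by/2.5/">this licence</a>.</p>'
```

Both example pages produce the same result, which is the false positive problem in miniature: a link alone cannot distinguish a page that is licensed from a page that merely mentions the licence.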
The next mechanism that can be considered is that of HTML 'meta' tags. This is part of the HTML format, and is an alternative (and older) way of putting metadata into web pages. The same information can be carried, and given the nature of 'meta' tags they are unambiguous in a similar way to RDF, so false positives should not be a problem.
Another possibility is that the RDF that describes the rights in a page will not be embedded in that page, but will exist elsewhere. This is not too much of an issue, because the search engine can certainly be made capable of reading it and correctly interpreting it. However, it is worth noting that we should have less confidence in such 'foreign RDF' than we would in locally embedded RDF, because it is more likely than otherwise to be a demonstration or illustrative example, rather than a serious attempt to convey licensing information.
One mechanism that poses significant challenges is what I call 'text-based licences', as compared with RDF-based (e.g. CC) or link-based (e.g. AEShareNet) licences. What I mean by ‘text-based’ is that the licence text is duplicated and included with each licensed work. This raises two problems: What happens if the licence text undergoes some slight changes? And what happens if the licence text is stored in a different document to the document that is a candidate for a search result? (This is common practice in software, and also where the GNU Free Documentation License (GFDL) is applied to multi-page documents. Wikipedia is a prime example of the latter.)
The first question can be answered fairly simply, although the implementation might not be as easy: the search engine needs a feature that allows it to compare a licence text to a document to see if they are (a) not similar, (b) similar enough that the document can be considered a copy, or (c) similar, but with enough extra content in the document that it can be considered a licensed document in its own right.
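One way such a three-way comparison might look is sketched below, using Python's difflib on word lists. This is my own illustration rather than a worked-out design: "coverage" (how much of the licence appears in the document) and "share" (how much of the document is licence text) are names I have made up, and both thresholds are arbitrary.

```python
from difflib import SequenceMatcher


def classify(licence_text, document_text, min_coverage=0.9, min_share=0.8):
    """Classify a document against one licence text.

    Returns "not similar", "copy" (the document is essentially the licence),
    or "licensed document" (the licence is present alongside other content).
    """
    licence = licence_text.lower().split()
    document = document_text.lower().split()
    if not licence or not document:
        return "not similar"
    matcher = SequenceMatcher(None, licence, document, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    coverage = matched / len(licence)   # fraction of the licence that was found
    share = matched / len(document)     # fraction of the document that is licence
    if coverage < min_coverage:
        return "not similar"
    if share >= min_share:
        return "copy"
    return "licensed document"
```

Splitting on whitespace means that changes in punctuation count as changed words, so a real implementation would want to normalise the text more carefully before comparing.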
The other question is trickier. One possible solution might be to keep a database of all such copies of known licences (what I term 'impromptu' licences), and then, using the functionality of establishing links to licences, record every web page that links to such an impromptu licence as licensed under the original licence. This idea will be useful for all types of text-based licences, from free software to open content.
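In data-structure terms this could be as simple as a lookup table from the URL of each impromptu copy to the name of the original licence. The sketch below is hypothetical throughout: the URLs and licence names are invented, and it assumes the copies have already been found and classified.

```python
# Hypothetical index: URL of an impromptu copy -> the original licence it copies.
IMPROMPTU_INDEX = {
    "http://example.org/project/COPYING": "GPL",
    "http://example.org/docs/licence.html": "GFDL",
}


def licences_for_page(outgoing_links, impromptu_index=IMPROMPTU_INDEX):
    """Return the original licences reached via links to impromptu copies."""
    return {impromptu_index[link] for link in outgoing_links if link in impromptu_index}


page_links = ["http://example.org/project/COPYING", "http://example.org/about"]
page_licences = licences_for_page(page_links)
```

As with any link-based approach, a page that links to an impromptu licence for some other reason would be recorded as licensed too.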
But wait, there’s more
Stay tuned for the third instalment, where I will talk about how to utilise non-web content, and the difficulties with public domain works.
Friday, October 13, 2006
Carman also dedicates a footnote (p. 59, fn 33) to discussing the ‘public commons’, distinguishing the commons from the public domain. Carman states that he has used the term “public commons to describe these elements which under no circumstances constitute private property.” (p. 59, fn 33). He purposely chose to avoid using the term ‘public domain’ because of the confusion that this term creates. Interestingly, it seems that the similarities and differences between the commons and the public domain are yet to be settled…and that will make an interesting post in the future! Returning to Carman, as far as I know, this is the earliest usage of the term ‘commons’ in relation to creative content – but if anyone can identify an earlier use then let us know!
We want to make sure that the term “commons” remains common for all to use and that its usage is not dominated by any particular connotation, group or organisation. We hope that naming our blog “House of Commons” promotes the ‘commons’ and encourages widespread use of the term.
Although maybe no encouragement is needed: see the Academic Commons, the Digital Library of the Commons, the Environmental Commons, the Cricket Commons (admittedly, the last one isn’t a “commons” exactly but a set of luxury suites in Philadelphia, USA, but with cricket season underway in Australia I wanted to see if there was a ‘cricket commons’….)
Tuesday, October 10, 2006
As part of my research, I’ve been thinking about the problem of searching the Internet for commons content (for more on the meaning of ‘commons’, see Catherine’s post on the subject). This is what I call the problem of ‘semantic web search for commons’, and in this post I will talk a little about what that means, and why I’m using ‘semantic’ with a small ‘s’.
When I say 'semantic web search', what I’m talking about is using the specific or implied metadata about web documents to allow you to search for specific classes of documents. And of course I am talking specifically about documents with some degree of public rights (i.e. reusability). But before I go any further, I should point out that I’m not talking about the Semantic Web in the formal sense. The Semantic Web usually refers to the use of Resource Description Framework (RDF), Web Ontology Language (OWL) and Extensible Markup Language (XML) to express semantics (i.e. meaning) in web pages. I am using the term (small-s ‘semantic’) rather more generally, to refer more broadly to all statements that can meaningfully be made about web pages and other online documents, not necessarily using RDF, OWL or XML.
Commons content search engine
Now let me go a bit deeper into the problem. First, for a document to be considered 'commons' content (i.e. with enhanced public rights), there must be some indication, usually on the web and in many cases in the document itself. Second, there is no single standard on how to indicate that a document is commons content. Third, there is a broad spectrum of kinds of commons, which breaks down based on the kind of work (multi-media, software, etc.) and on the legal mechanisms (public domain, FSF licences, Creative Commons licences, etc.).
So let us consider the simplest case now, and in future posts I will expand on this. The case I shall consider first is that of a web page that is licensed with a Creative Commons (CC) licence, labelled as recommended by Creative Commons. This web page will contain embedded (hidden) RDF/XML metadata explaining its rights. This could be used as the basis for providing a preliminary semantic web search engine, by restricting results to only those pages that have the appropriate metadata for the search. This can then be expanded to include all other CC licences, and then the search interface can be expanded to include various categories such as 'modifiable' etc., and, in the case of CC, even jurisdiction (i.e. licence country). It is worth noting at this point that the details of each individual licence, specifically URL and jurisdiction, are data that essentially represent domain knowledge, meaning that they will have to be entered by an administrator of the search engine.
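Here is a minimal sketch of that first step in Python. It assumes the metadata is an RDF/XML block (possibly hidden inside an HTML comment) that uses the http://web.resource.org/cc/ namespace and a `license` element with an `rdf:resource` attribute; that is my understanding of the markup Creative Commons recommends, but treat the details as assumptions. The rights table is a stand-in for the administrator-entered domain knowledge.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://web.resource.org/cc/"   # assumed namespace of the CC metadata

# Finds an RDF/XML block whether or not it is wrapped in an HTML comment.
RDF_BLOCK = re.compile(r"<rdf:RDF.*?</rdf:RDF>", re.DOTALL)

# Hypothetical administrator-entered domain knowledge about each licence.
LICENCE_RIGHTS = {
    "http://creativecommons.org/licenses/by/2.5/": {"modifiable": True},
}


def licence_urls(html):
    """Return the licence URLs declared in RDF/XML embedded in a page."""
    urls = []
    for block in RDF_BLOCK.findall(html):
        try:
            root = ET.fromstring(block)
        except ET.ParseError:
            continue  # malformed metadata is ignored rather than guessed at
        for node in root.iter("{%s}license" % CC_NS):
            url = node.get("{%s}resource" % RDF_NS)
            if url:
                urls.append(url)
    return urls


def is_modifiable(html):
    """True if the page declares at least one known licence that allows changes."""
    return any(LICENCE_RIGHTS.get(u, {}).get("modifiable") for u in licence_urls(html))
```

A search for 'modifiable' content would then keep only those pages for which `is_modifiable` is true; licences missing from the table are simply treated as unknown.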
But wait, there’s more
That’s it for now, but stay tuned for at least three more posts on this topic. In the next post, I will talk about the (many) other mechanisms that can be used to give something public rights. Then, I will consider non-web content, and the tricky issue of public domain works. Finally, I will look at where Google is at with these ideas, and consider the possible shortcomings of a semantic web search engine as described in this series. Stay tuned.