Cyberspace Law and Policy Centre, University of New South Wales
Unlocking IP

Thursday, May 01, 2008

 

Get a list of all (indexable) URLs on a site from the Wayback Machine

Earlier this year I complained about the problem with search engines. Today, Alexander Osborne (from the National Library of Australia) corrected me, at least a little bit.

I said I'd like to see an interface that (among other nice-to-haves) answers questions like "give me everything you've got from cyberlawcentre.org/unlocking-ip", and it turns out that that's actually possible with The Wayback Machine. Not in a single request, that I know of, but with this (plain old HTTP) request: http://web.archive.org/web/*xm_/www.cyberlawcentre.org/unlocking-ip/*, you can get a list of all URLs under cyberlawcentre.org/unlocking-ip, and then, if you wanted to, you could do another HTTP request for each URL.
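
For anyone who wants to script this, here's a rough Python sketch of that request. The prefix is copied straight from the URL above; the function names are my own, and I'm not assuming anything about the format of what comes back, so the sketch just returns the raw response body for you to inspect.

```python
from urllib.request import urlopen

# Prefix copied from the request URL quoted in this post.
WAYBACK_PREFIX = "http://web.archive.org/web/*xm_/"


def wayback_listing_url(site_path: str) -> str:
    """Build the listing request for everything under site_path."""
    # Drop any scheme and trailing slash, then add the trailing wildcard.
    path = site_path.split("://", 1)[-1].rstrip("/")
    return f"{WAYBACK_PREFIX}{path}/*"


def fetch_listing(site_path: str, timeout: float = 30.0) -> str:
    """Fetch the listing and return the raw body (format not assumed)."""
    with urlopen(wayback_listing_url(site_path), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling wayback_listing_url("www.cyberlawcentre.org/unlocking-ip") gives back exactly the URL quoted above; fetch_listing does the actual (single) HTTP request.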

Pretty cool actually, thanks Alex.

Now I wonder how big you can scale those requests up to... I wonder what happens if you ask for www.*? Or (what the heck, someone has to say it) just '*'. I guess you'd probably break the Internet...



Wednesday, April 30, 2008

 

Quantifying open software using Google Code Search

Google Code Search lets you search for source code files by licence type, so of course I was interested in whether this could be used for quantifying indexable source code on the web. And luckily GCS lets you search for all works with a given licence. (If you don't understand why that's a big deal, try doing a search for all Creative Commons licensed work using Google Search.) Even better, using the regex facility you can search for all works! You sure as heck can't do that with a regular Google web search.

Okay, so here are the latest results, including hyperlinks to the searches so you can try them yourself:

And here's a spreadsheet with graph included: However, note the discontinuity (in absolute and trend terms) between approximate and specific results in that (logarithmic) graph, which suggests Google's approximations are not very good.



 

CAL Wants YOU!....To distribute any funds that it owes you

Trawling the web in search of copyright knowledge, as one does from time to time, I came across an interesting post on the Copyright Agency Limited (CAL) site. CAL collects funds under the statutory licence scheme that is provided in the Australian Copyright Act and, according to its website, there are a number of corporations and individuals who are missing out on their moolah. Could you be one of them? Head here to find out.

Sadly this blogger is not in the money, but there are some interesting names on the lists. Some make perfect sense - for example, Australian artist Pro Hart. Then there are the numerous estates that are owed money, including those of Ernest Hemingway, AA Milne, and Australian architect Harry Seidler. Then there are the more unusual 'publishers' - for example, eBay Australia & New Zealand and Air Caledonie International. I'm not sure what Air Caledonie has published or who's reproduced it, but I want to go to New Caledonia after visiting that website.



Monday, April 28, 2008

 

CAL v NSW in the High Court

Regular readers will know of my interest in all-things-Crown-copyright, so I have come out of my blogging hiatus* to let you all know that last week argument in the appeal of Copyright Agency Limited v State of New South Wales was heard before five members of the High Court (Gleeson CJ, Gummow, Heydon, Crennan, and Kiefel JJ). As you may recall, this case considers whether the Copyright Agency Limited (CAL) can collect money from the NSW Government for the use of certain copyright-protected surveyor plans. The Full Court of the Federal Court of Australia found that CAL could not collect on these plans, on the basis that an implied licence exists permitting the NSW Government to do everything it needs to in relation to the plans, as dictated by statute.

A transcript for the hearing can be found on AustLII here. I will get some comments up within the next week.


* Self-imposed in a desperate attempt to actually write my thesis, and I am pleased to report that it's going well, in case my supervisors are reading this.



Tuesday, April 08, 2008

 

Table comparing Yahoo and Google's commons-based advanced search options

Hi commons researchers,

I just did this analysis of Google's and Yahoo's capacities for searching for commons content (mostly Creative Commons because that's in their advanced search interfaces), and thought I'd share. Basically it's an update of my research from Finding and Quantifying Australia's Online Commons. I hope it's all pretty self-explanatory. Please ask questions. And of course point out flaws in my methods or examples.

Also, I just have to emphasise the "No" in Yahoo's column in row 1: yes, I am in fact saying that the only jurisdiction of licences that Yahoo recognises is the US/unported licences, and that they are in fact ignoring the vast majority of Creative Commons licences. (That leads on to a whole other conversation about quantification, but I'll leave that for now.)

(I've formatted this table in Courier New so it should come out well-aligned, but who knows).


Feature                       | Google | Yahoo |
------------------------------+--------+-------+
1. Multiple CC jurisdictions  | Yes    | No    | (e.g.)
2. 'link:' query element      | No     | Yes   | (e.g. GY)
3. RDF-based CC search        | Yes    | No    | (e.g.)
4. meta name="dc:rights" *    | Yes    | ? **  | (e.g.)
5. link-based CC search       | No     | Yes   | (e.g.)
6. Media-specific search      | No     | No    | (GY)
7. Shows licence elements     | No     | No    | ****
8. CC public domain stamp *** | Yes    | Yes   | (e.g.)
9. CC-(L)GPL stamp            | No     | No    | (e.g.)


* I can't rule out Google's result here actually coming from <a rel="license"> in the links to the licence (as described here: http://microformats.org/wiki/rel-license).
** I don't know of any pages that have <meta name="dc:rights"> metadata (or <a rel="license"> metadata?) but don't have links to licences.
*** Insofar as the appropriate metadata is present.
**** (i.e. doesn't show which result uses which licence)
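
For the curious, the page-level signals behind rows 4 and 5 (and the rel="license" footnote) can be checked on a single page with something like the following. It's a minimal sketch using only Python's standard library: the class name, the field names and the "creativecommons.org/licenses/" substring test are my own assumptions, it makes no attempt at the RDF of row 3, and it says nothing about how Google or Yahoo actually do it.

```python
from html.parser import HTMLParser


class LicenceSignalParser(HTMLParser):
    """Collects three licence signals from one HTML page (a sketch)."""

    def __init__(self):
        super().__init__()
        self.rel_license_links = []  # hrefs of <a>/<link> with rel="license"
        self.cc_links = []           # any href pointing at a CC licence URL
        self.dc_rights = []          # content of <meta name="dc:rights">

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link"):
            href = a.get("href") or ""
            rels = (a.get("rel") or "").lower().split()
            if "license" in rels:
                self.rel_license_links.append(href)
            # Assumption: a CC licence link contains this path fragment.
            if "creativecommons.org/licenses/" in href:
                self.cc_links.append(href)
        elif tag == "meta" and (a.get("name") or "").lower() == "dc:rights":
            self.dc_rights.append(a.get("content") or "")


def licence_signals(html: str) -> dict:
    """Parse html and return the three signal lists as a dict."""
    parser = LicenceSignalParser()
    parser.feed(html)
    return {
        "rel_license": parser.rel_license_links,
        "cc_links": parser.cc_links,
        "dc_rights": parser.dc_rights,
    }
```

A page that has a dc:rights entry but empty cc_links would be exactly the kind of example I couldn't find for footnote **.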

Notes about example pages (from rows 1, 3-5, 8-9):



Tuesday, February 19, 2008

 

I think I found a trump card (update: no, I didn't)

(following on from this post)

http://www.archive.org/web/researcher/intended_users.php

I'll certainly be looking into this further.

(update: On further investigation, it doesn't look so good. http://www.archive.org/web/researcher/researcher.php says:
"We are in the process of redesigning our researcher web interface. During this time we regret that we will not be able to process any new researcher requests. Please see if existing tools such as the Wayback Machine can accommodate your needs. Otherwise, check back with us in 3 months for an update."
This seems understandable except for this, on the same page:
"This material has been retained for reference and was current information as of late 2002."
That's over 5 years. And in Internet time, that seems like a lifetime. I'll keep investigating.)



 

Reminder: Review of Private Copying Exceptions

The Government is conducting a review of the recently introduced format shifting exceptions in the Copyright Act (47J and 110AA). The review is required by the Copyright Amendment Act 2006. The Attorney-General's Department has released an issues paper inviting submissions on the operation of these provisions. More information is available here.

Submissions are due just around the corner (29 February) - so get submitting!



 

Copyright Enforcement, UK Style

Earlier in the week SMH reported that the Government is considering forcing ISPs to disconnect users who access pirated material (three strikes and you're out, UK style).

Kim Weatherall has done an excellent overview of the problems with this approach.



 

The problem with search engines

There is a problem with search engines at the moment. Not any one in particular - I'm not saying Google has a problem. Google seems to be doing what they do really well. Actually, the problem is not so much something that is being done wrong, but something that is just not being done. Now, if you'll bear with me for a moment...

The very basics of web search

Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies: web crawling, indexing, and performing searches.
None of these is trivial. I'm no expert, but I suggest indexing is the easiest. Performing searches well is what made Google so successful, where previous search engines had been treating the search step more trivially.

But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.

A bit more about the web crawler

There are lots of tricky technical issues about how to do the best crawl - to cover as many pages as possible, to have the most relevant pages possible, to maintain the latest version of the pages. But I'm not worried about this now. I'm just talking about the fundamental problem of downloading web pages for later use.

Anyone who is reading this and hasn't thought about the insides of search engines before is probably wondering at the sheer amount of downloading of web pages required, and storing them. And you should be.
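
To make "downloading web pages for later use" concrete, here's a toy crawler in Python. It's only a sketch under loud assumptions: fetch is a hypothetical function you supply (it takes a URL and returns the page's HTML, or None on failure), the store is just an in-memory dictionary, and it ignores all the tricky stuff mentioned above (robots.txt, politeness, relevance, re-crawling for freshness).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: html} for later use.

    fetch is caller-supplied (hypothetical): url -> HTML string or None.
    """
    store = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # download failed; skip this page
        store[url] = html
        extractor = _LinkExtractor()
        extractor.feed(html)
        for href in extractor.hrefs:
            # Resolve relative links and drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

Even in this toy, the dictionary ends up holding a full copy of every page visited, which is the storage problem in miniature.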

They're all downloading the same data

So a single search engine basically has to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be publicly using their data so we may not see them. There are clearly more, at least, than I've ever heard of - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.
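
If you'd rather count than read, here's a small Python sketch that pulls the crawler names out of the text of a robots.txt file. The function name is mine, and it only looks at User-agent lines (it doesn't interpret any of the rules); you'd need to download the file yourself and pass in its contents.

```python
def crawler_names(robots_txt: str) -> list[str]:
    """Return the distinct User-agent names in robots_txt, in file order."""
    names = []
    for line in robots_txt.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            name = value.strip()
            # Skip the wildcard entry and any repeats.
            if name and name != "*" and name not in names:
                names.append(name)
    return names
```

The length of the returned list is a rough lower bound on how many crawlers the site's administrators have bothered to name.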

My point is: why does everyone have to download the same data? Why isn't there some open crawler somewhere that's doing it all for everyone, and then presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* be critical in listening to them. I'm not saying here that Google should give away their data - it would have to be worth $millions to them. I'm not saying anyone else should be giving away all their data. But I am saying that there should be someone doing this, from an economic point of view - everyone is downloading the same data, and there's a cost to doing that, and the cost would be smaller if they could get together and share their data.

Here's what I'd like to see specifically:
If you know somewhere this is happening, let me know, because I can't find it. I think the Wayback Machine is the closest to an open access Web cache, and http://archive-access.sourceforge.net/ is the closest I've found to generalised access to the Wayback Machine. I'll read more about it, and let you know if it comes up trumps.


* I know.



Monday, February 18, 2008

 

Creative Commons has data!

As you aren't aware*, Creative Commons has some data on the quantification of Creative Commons licence usage (collected using search engine queries). It's great that they are a) collecting this data, and b) sharing it freely.

If you look around, you can probably find some graphs based on this data, and that's probably interesting in itself. Tomorrow I'll see about dusting off my Perl skills, and hopefully come up with a graph of the growth of Australian CC licence usage. Stay tuned.


* If you knew about this, why didn't you tell me!



Thursday, February 14, 2008

 

CC0 - Creative Commons' new solution for the public domain

Creative Commons have come up with a better way for people to mark works as copyright-free, or part of the public domain. It's called CC0 (CC Zero). The page for using it is here: http://labs.creativecommons.org/license/zero.

There are two options here. The first is a waiver, where you can "waive all copyrights and related or neighboring interests that you have over a work". The second is an assertion, where you can "assert that a work is free of copyright as a matter of fact, for example, because the work is old enough to be in the public domain, or because the work is in a class not protected by copyright such as U.S. government works."

It's pretty neat. I've long thought that asserting a work's copyright status, as a matter of fact, was a good idea, and not just for the public domain, but also for other classes of usage rights.

Okay, so that's basically the CC0 story. I've tried it out with a trivial web page I think I would otherwise have copyright in - the result is at the bottom of this post. But I must say I'm slightly disappointed in the lack of embedded metadata. Where's the RDF? As I've talked about before, when you do things with RDF, you allow sufficiently cool search engines to understand your new technology (or licence) simply by seeing it, without first having to be told about it.

Here's my example waiver:

CC0


To the extent possible under law,
Ben Bildstein
has waived all copyright, moral rights, database rights, and any other rights that might be asserted over Sensei's Library: Bildstein/Votes.



Wednesday, January 16, 2008

 

'Ideas' Now Free

Lawrence Lessig's seminal work The Future of Ideas has now been released under a Creative Commons Attribution-NonCommercial 3.0 licence and can be freely downloaded off the Internet. Read more on Lessig's blog here or download the book here.

This completes what I will describe as Lessig's trilogy in four parts: all four of his books (Code and Other Laws of Cyberspace, Code v 2.0, The Future of Ideas and Free Culture) are now available under various Creative Commons licences.



Friday, December 21, 2007

 

It Wouldn't be Christmas Without....

In 2006, the (sadly, soon to be ex-) Senator Andrew Bartlett helped us identify a new species of creature, one born out of the Copyright Act amendments, known affectionately as "the Congealed, Wobbling Blob of Copyright" (see here and here).

In 2007, it appears the Blob has moved with the times and now has its own 'BlobBook' page, and sends its silly season good wishes:



Season's Greetings from all of us at the House of Commons!



Friday, December 07, 2007

 

The Harry Potter Lexicon - Fair Use?

The Fair Use Project at the Centre for Internet and Society (Stanford Law School) will help defend a book publisher planning on releasing a print version of The Harry Potter Lexicon. Publication of the book has been blocked by JK Rowling and Warner Brothers based on claims of copyright and trademark infringement. Rowling notes:
"It is not reasonable, or legal, for anybody, fan or otherwise, to take an author's hard work, re-organize their characters and plots, and sell them for their own commercial gain. However much an individual claims to love somebody else's work, it does not become theirs to sell."
Rowling previously shared quite a close relationship with the Lexicon and has publicly praised the website. (Read more in this post on Ars Technica).

According to SMH:
"Fair Use Project Executive Director Anthony Falzone said the Lexicon is protected by US rules that have long given people 'the right to create reference guides that discuss literary works, comment on them and make them more accessible.'"



 
 
