The House of Commons: Current Focus

Tuesday, July 10, 2007

Current Focus - The Deep Web

My previous two posts, last week, talked about my research, but didn't really talk about what I'm researching at the moment.

The deep web

Okay, so here's what I'm looking at. It's called the deep web, and it refers to the web documents that the search engines don't know about.

Sort of.

Actually, when the search engines find these documents, they really become part of the surface web, in a process sometimes called surfacing. Now I'm sure you're wondering: what kinds of documents can't search engines find, if they're the kind of documents anyone can browse to? The simple answer is: documents that no other pages link to. But a more realistic answer is that they're documents hidden in databases, which you can only find by running a search on the site itself. They'll generally have URLs, and you can link to them, but unless someone does, they're part of the deep web.
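
To make the "no other pages link to it" definition concrete, here's a minimal sketch in Python. It assumes a toy link graph I made up for illustration (the page names, the `links` table and the `surface_pages` helper are all hypothetical), and it treats the surface web as whatever a crawler can reach by following links from some starting pages.

```python
from collections import deque

def surface_pages(links, seeds):
    """Return every page reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy web: both /doc pages exist and have URLs,
# but only /doc?id=1 has been linked to (from a blog).
links = {
    "/home": ["/about", "/search"],
    "/about": ["/home"],
    "/search": [],
    "/blog": ["/doc?id=1"],
}
all_pages = {"/home", "/about", "/search", "/blog", "/doc?id=1", "/doc?id=2"}

surface = surface_pages(links, ["/home", "/blog"])
deep = all_pages - surface  # pages with URLs that no crawl path reaches
```

In this toy example /doc?id=1 has been surfaced because the blog links to it, while /doc?id=2 stays in the deep web until someone links to it too.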

Now this is just a definition, and not particularly interesting in itself. But it turns out (though I haven't counted, myself) that there are more accessible web pages in the deep web than in the surface web. And they're not beyond the reach of automated systems - the systems just have to know the right questions to ask and the right place to ask the question. Here's an example, close to Unlocking IP. Go to AEShareNet and do a search, for anything you like. The results you get (when you navigate to them) are documents that you can only find by searching like this, or if someone else has done this, found the URL, and then linked to it on the web.
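
As a rough illustration of "knowing the right questions to ask and the right place to ask them", here's a small Python sketch that builds search URLs for a site's search form. Everything specific in it is an assumption: the `example.org` address, the `q` parameter name and the `candidate_queries` helper are placeholders, not how AEShareNet (or any real site) actually works.

```python
from urllib.parse import urlencode

def candidate_queries(base_url, param, terms):
    """Build one search URL per term, as a crawler might when probing a form."""
    return [f"{base_url}?{urlencode({param: term})}" for term in terms]

# Hypothetical search endpoint and parameter name, for illustration only.
urls = candidate_queries(
    "https://example.org/search",
    "q",
    ["licence", "open source"],
)
for url in urls:
    print(url)
```

An automated system would then fetch each URL and collect the result links, which is the part that gets hard once you don't know the form's parameters or which terms are worth asking about.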

Extracting (surfacing) deep web commons

So when you consider how many publicly licensed documents may be in the deep web, it becomes an interesting problem from both the law / Unlocking IP perspective and the computer science perspective, which I'm really happy about. What I'm saying here is that I'm investigating ways of building automated systems that discover deep web commons. And it's not simple.

Lastly, some examples

I wanted to close with two web sites that I think are interesting in the context of deep web commons. First, there's SourceForge, which I'm sure the Planet Linux Australia readers will know (for the rest: it's a repository for open source software). It's interesting because its advanced search feature really doesn't give many clues that it's a search for open source software.

And then there's the Advanced Google Code Search, which searches for publicly available source code. That generally means free or open source, but sometimes just means available, because Google can't figure out what the licence is. This is also interesting because it's not what you'd normally think of as deep web content. After all, Google's just searching for stuff it found on the web, right? Actually, I class this as deep web content because Google is (mostly) looking inside zip files to find the source code, so it's not stuff you can find in regular search.

This search, as compared to SourceForge advanced search, makes it very clear you're searching for things that are likely to be commons content. In fact, I came up with six strong pieces of evidence that lead me to believe Google Code Search is commons-related.

(As a challenge to my readers, see how many pieces of evidence you can find that the Advanced Google Code Search is a search for commons (just from the search itself), and post a comment).
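
On the point about looking inside zip files: here's a minimal sketch of the general idea in Python, using only the standard library. It is not a description of how Google Code Search works. The list of licence phrases, the `licence_hints_in_zip` helper and the sample archive are my own assumptions for illustration.

```python
import io
import zipfile

# A few well-known licence phrases; a real system would need far more.
LICENCE_HINTS = (
    "GNU General Public License",
    "MIT License",
    "Apache License",
)

def licence_hints_in_zip(data):
    """Map each file name in the archive to the licence phrases found in it."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            text = archive.read(name).decode("utf-8", errors="ignore")
            hits = [hint for hint in LICENCE_HINTS if hint in text]
            if hits:
                found[name] = hits
    return found

# Build a small made-up archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "src/main.c",
        "/* Released under the GNU General Public License */\n"
        "int main(void) { return 0; }\n",
    )
    archive.writestr("README", "No licence stated here.\n")

result = licence_hints_in_zip(buffer.getvalue())
```

Files with no recognisable phrase (like the README here) are simply left out of the result, which mirrors the "available, but licence unknown" case.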

Labels: ben, deep web, research

(permalink) posted by Ben Bildstein @ Tuesday, July 10, 2007

Comments:

Denis said:
hi,

"It's called the deep web, and it refers to the web documents that the search engines don't know about."
Then you might be interested in the following works on characterizing the deep Web:
- The Deep Web: Surfacing Hidden Value by Bergman (it is highly cited, so you have almost certainly seen it; note, however, that it is better to be skeptical about Bergman's estimate of the deep web's size, which looks to have been greatly(?) overestimated);
- Structured Databases on the Web: Observations and Implications by Chang et al. (reliable estimates);
- Exploring the Academic Invisible Web by Lewandowski et al. (surveys just the academic part of the deep Web, plus some criticism of Bergman's work);
- On Estimating the Scale of National Deep Web by Shestakov et al., to appear in the DEXA'07 proceedings (surveys one national segment of the deep Web).

Denis Shestakov

(permalink) posted by Denis : 3:03 AM, July 16, 2007

Ben Bildstein said:
Denis, that's really great. I look forward to investigating those resources, thanks. Yes, I have seen Surfacing Hidden Value; it is kind of hard to miss. But the others look interesting.

(permalink) posted by Ben Bildstein : 1:55 PM, July 16, 2007
