Cyberspace Law and Policy Centre, University of New South Wales

Tuesday, February 19, 2008

 

The problem with search engines

There is a problem with search engines at the moment. Not any one in particular - I'm not saying Google has a problem. Google seems to be doing what they do really well. Actually, the problem is not so much something that is being done wrong, but something that is just not being done. Now, if you'll bear with me for a moment...

The very basics of web search

Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies:
- crawling the web (downloading copies of pages);
- indexing those pages; and
- performing searches against the index.
None of these is trivial. I'm no expert, but I'd suggest indexing is the easiest. Performing searches well is what made Google so successful, whereas previous search engines had treated the search step more trivially.
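To make the division of labour concrete, here's a rough sketch in Python of how the three pieces fit together. Everything in it - the function names, the crude tag-stripping, the toy word-count scoring - is my own illustration, not how Google or anyone else actually does it:

```python
# A toy search engine pipeline: crawl -> index -> search.
# Illustrative only; real engines do each stage very differently.

import re
import urllib.request
from collections import defaultdict

def crawl(seed_urls):
    """Download each page once and return {url: html}."""
    pages = {}
    for url in seed_urls:
        with urllib.request.urlopen(url) as response:
            pages[url] = response.read().decode("utf-8", errors="replace")
    return pages

def index(pages):
    """Build an inverted index: word -> set of URLs containing it."""
    inverted = defaultdict(set)
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html).lower()   # strip tags, crudely
        for word in re.findall(r"[a-z0-9]+", text):
            inverted[word].add(url)
    return inverted

def search(inverted, query):
    """Rank URLs by how many query words they contain (a stand-in for real ranking)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in inverted.get(word, ()):
            scores[url] += 1
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    pages = crawl(["http://example.com/"])
    idx = index(pages)
    print(search(idx, "example domain"))
```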

But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.

A bit more about the web crawler

There are lots of tricky technical issues in doing the best possible crawl - covering as many pages as possible, keeping the most relevant pages, maintaining the latest versions of those pages. But I'm not worried about that here. I'm just talking about the fundamental problem of downloading web pages for later use.
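For what it's worth, that fundamental loop is simple enough to sketch in a few lines of Python. This is only an illustration - it ignores robots.txt, politeness delays, duplicate detection and everything else a real crawler has to worry about:

```python
# A bare-bones breadth-first crawler: fetch a page, store it, queue its links.

import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed, max_pages=50):
    queue = deque([seed])
    store = {}                      # url -> html: the "downloaded web"
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in store:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                # skip pages we can't fetch
        store[url] = html
        # Queue every link found on the page, resolved to an absolute URL.
        for href in re.findall(r'href="([^"]+)"', html):
            queue.append(urllib.parse.urljoin(url, href))
    return store

pages = crawl("http://example.com/")
print(len(pages), "pages fetched")
```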

Anyone reading this who hasn't thought about the insides of search engines before is probably wondering at the sheer amount of downloading and storing of web pages required. And you should be.

They're all downloading the same data

So a single search engine basically has to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be using their data publicly, so we may never see them. There are clearly more than I've ever heard of - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.
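You can get a feel for how many distinct crawlers a big site deals with by asking that file some questions yourself. Python's standard library ships a robots.txt parser; the user-agent names below are just well-known crawlers, not a claim about what Wikipedia's file actually says:

```python
# Check what Wikipedia's robots.txt says to a few well-known crawlers.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://en.wikipedia.org/robots.txt")
rp.read()

for agent in ["Googlebot", "Slurp", "msnbot", "ia_archiver"]:
    allowed = rp.can_fetch(agent, "http://en.wikipedia.org/wiki/Main_Page")
    print(agent, "may fetch the Main Page:", allowed)
```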

My point is: why does everyone have to download the same data? Why isn't there some open crawler somewhere doing it all for everyone, and then presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* listen critically. I'm not saying here that Google should give away their data - it must be worth millions of dollars to them. I'm not saying anyone else should be giving away all their data. But I am saying that, from an economic point of view, someone should be doing this: everyone is downloading the same data, that downloading has a cost, and the cost would be smaller if they could get together and share the data.

Here's what I'd like to see, specifically:
- an ongoing, open crawl of the web; and
- the resulting data made available to anyone through some simple interface.
If you know somewhere this is happening, let me know, because I can't find it. I think the Wayback Machine is the closest thing to an open access Web cache, and http://archive-access.sourceforge.net/ is the closest I've found to generalised access to the Wayback Machine. I'll read more about it, and let you know if it comes up trumps.
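To make the wish a bit more concrete, here's the sort of interface I have in mind, sketched as Python. The service, its endpoint and its response format are all hypothetical - as far as I can tell, nothing like it exists yet:

```python
# A purely hypothetical client for a shared, open web-crawl service.
# Nothing here corresponds to a real API; it only shows how simple the
# interface could be if someone crawled once and shared the data.

import json
import urllib.parse
import urllib.request

OPEN_CRAWL_ENDPOINT = "http://opencrawl.example.org/api"   # hypothetical

def fetch_cached_page(url, as_of=None):
    """Ask the shared crawl for its stored copy of `url`.

    `as_of` could select an earlier snapshot, Wayback Machine style.
    """
    params = {"url": url}
    if as_of:
        params["as_of"] = as_of
    request_url = OPEN_CRAWL_ENDPOINT + "/page?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(request_url) as response:
        return json.load(response)   # e.g. {"url": ..., "fetched": ..., "html": ...}

# A search engine, or a researcher quantifying the online commons, could then
# reuse the same crawl instead of re-downloading the web, e.g.:
# record = fetch_cached_page("http://example.com/", as_of="2008-02-19")
```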


* I know.



Comments:
Anonymous said:
The alternative to The Internet Archive is Alexa Web Search.
 
Margie Borschke said:
"My point is why does everyone have to download the same data? Why isn't there some open crawler somewhere that's doing it all for everyone, and then presenting that data through some simple interface?"

It's a criticism that immediately makes me think of how, on many mailing lists, people still apologise for "cross posting". The apology stems from an early idea of the Internet that data would 'live' in one location and one would point to it when the reference was necessary (as well as, obviously, trying to avoid becoming spam). But the reality is that everyone doesn't focus their attention in one place, so it's actually efficient to post in several relevant venues.

The real strength of the Internet comes from its *distributed* nature. Creating multiple copies, coupled with linking, is what makes access to vast amounts of information possible in a relatively short period of time. So while on the surface it seems inefficient to have multiple crawlers essentially trying to document the same thing, what's more remarkable is that none is able to traverse the entire web. (Even an engine like Google sends out multiple crawlers.)

That said, there clearly are crawlers that unnecessarily tax bandwidth.
 