Friday, October 10, 2008

Week 7: Access in Digital Libraries

Web Search Engines: Part 1

Problems for Web Searchers:
1. Infrastructure: the engine must have enough computers, storage, and network capacity to handle the volume of queries and crawling it faces in a given period of time.
2. Crawling algorithms: Bots that go around the internet and index it. Crawlers start with a list or queue of good 'seed URLs' - sites that have lots of links to other good websites. They fetch each URL, add all the unseen links to the queue, and save the content for indexing. They keep doing this until they hit the end of the queue. (A rough sketch of this loop follows the list below.)
Speed: One crawler can't do the whole internet! You need multiple crawler machines, each assigned its own share of URLs (via a hashing function), working in parallel. Each crawler machine has internal parallelism as well, with multiple threads fetching at once.
Politeness: Don't harass a website's servers - space out requests to the same host.
Excluded content: Must look at robots.txt to determine what content should not be crawled.
Duplicate content: Avoid it.
Continuous crawling: Have a priority queue so important URLs are checked more frequently than low-value/static URLs.
Spam: Prevent it! Blacklists, etc.
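
The readings describe the crawl loop only at a high level, so here is a rough single-machine sketch in Python of how I picture it. The fetch() and save_for_indexing() functions, the hash-based assignment of hosts to crawler machines, and the one-second politeness delay are all my own stand-ins, not details from any of the articles.

    # Toy version of the crawl loop: frontier queue, seen set, robots.txt check,
    # politeness delay per host, and hash-based partitioning across crawler machines.
    import hashlib
    import time
    import urllib.robotparser
    from collections import deque
    from urllib.parse import urlparse

    NUM_CRAWLERS = 4          # pretend there are 4 crawler machines
    MY_CRAWLER_ID = 0         # this process is machine 0
    POLITENESS_DELAY = 1.0    # seconds between requests to the same host

    def assigned_to_me(url):
        """Hash the host name so each crawler machine gets its own slice of the web."""
        host = urlparse(url).netloc
        return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_CRAWLERS == MY_CRAWLER_ID

    def allowed_by_robots(url):
        """Respect robots.txt: skip anything the site asks us not to crawl."""
        parts = urlparse(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()
        except OSError:
            return False
        return rp.can_fetch("*", url)

    def crawl(seed_urls, fetch, save_for_indexing):
        frontier = deque(seed_urls)   # the queue of URLs still to visit
        seen = set(seed_urls)
        last_hit = {}                 # host -> time of the last request (politeness)
        while frontier:
            url = frontier.popleft()
            if not assigned_to_me(url) or not allowed_by_robots(url):
                continue
            host = urlparse(url).netloc
            wait = POLITENESS_DELAY - (time.time() - last_hit.get(host, 0))
            if wait > 0:
                time.sleep(wait)      # don't harass the server
            last_hit[host] = time.time()
            html, links = fetch(url)           # fetch() is assumed, not shown
            save_for_indexing(url, html)       # hand the content to the indexer
            for link in links:
                if link not in seen:           # only queue unseen links
                    seen.add(link)
                    frontier.append(link)

A real engine would use a priority queue instead of this plain first-in, first-out one, so high-value URLs get re-crawled more often, as noted above.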

Web Search Engines: Part 2
Indexing algorithms: Scan each document for indexable terms, which are then weighted by position and repetition to estimate their importance.
Real indexers:
Scaling up: divide the load among many machines; each machine fills its memory with partial inverted files, which are then merged into the full index (see the sketch after this list).
Term lookup: So many phrases, so little time. Engines use trees, hierarchies, and two-level structures to make lookup more efficient.
Compression: Save space, compress data structures. This also makes searches faster.
Phrases: produce lists of common phrases.
Anchor text: words used to describe a link. A strongly repeated anchor text gives a good clue as to what the website is about.
Link popularity: the more people link to you, the better you are. This is proof that life really is a popularity contest, no matter what your mother told you.
Query-independent score: a page with a high query-independent score (link popularity, for example) can be ranked above pages that match the query text more closely.
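
None of the readings show actual indexer code, so here is a heavily simplified Python sketch of the "build partial inverted files in memory, then merge them" idea from the list above; the batch size and the (doc_id, position) postings format are my own choices for illustration.

    # Toy batch indexer: build small partial inverted indexes in memory,
    # then merge the partials into one index - a miniature of "scaling up" above.
    from collections import defaultdict

    BATCH_SIZE = 1000   # invented threshold: how many documents per partial index

    def tokenize(text):
        return text.lower().split()

    def build_partial_indexes(documents):
        """documents: iterable of (doc_id, text) pairs. Yields one partial index per batch."""
        partial = defaultdict(list)   # term -> postings list of (doc_id, position)
        count = 0
        for doc_id, text in documents:
            for position, term in enumerate(tokenize(text)):
                partial[term].append((doc_id, position))   # positions support phrase search
            count += 1
            if count % BATCH_SIZE == 0:
                yield dict(partial)
                partial = defaultdict(list)
        if partial:
            yield dict(partial)

    def merge_partials(partials):
        """Combine the partial inverted files into the full index."""
        full = defaultdict(list)
        for partial in partials:
            for term, postings in partial.items():
                full[term].extend(postings)
        return {term: sorted(postings) for term, postings in full.items()}

    index = merge_partials(build_partial_indexes([
        ("doc1", "digital libraries and web search"),
        ("doc2", "web search engines crawl the web"),
    ]))
    print(index["web"])   # [('doc1', 3), ('doc2', 0), ('doc2', 5)]

A real indexer would also compress those postings lists and keep a smarter term dictionary, as the notes above mention; both are left out here.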

Query processing algorithms: a simple query processor looks up each word in its dictionary and locates that word's postings list. It then scans the postings lists for documents they have in common.

Make them faster! Skip over unnecessary parts of the postings lists, cut the results list off early, and number the documents in order of decreasing query-independent score so the best candidates show up first. Another option: cache! Precompute and store the HTML results pages for popular searches and spit them out on request. (A small sketch of the intersection step is below.)
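
Here is a hedged sketch of that intersection step, reusing the toy index format from the previous example; the "skip ahead in whichever list is behind" logic is just the standard two-pointer merge, not any particular engine's optimization.

    # Intersect sorted postings lists of document IDs with a two-pointer walk,
    # skipping past documents that cannot match both terms.
    def intersect(postings_a, postings_b):
        results = []
        i = j = 0
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                results.append(postings_a[i])
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1    # skip ahead in the list that is behind
            else:
                j += 1
        return results

    def process_query(query, index):
        """Look up each query word and intersect the documents in their postings lists."""
        lists = [sorted({doc for doc, _ in index.get(term, [])})
                 for term in query.lower().split()]
        if not lists:
            return []
        lists.sort(key=len)          # start with the shortest list: fewer comparisons
        result = lists[0]
        for postings in lists[1:]:
            result = intersect(result, postings)
        return result

    print(process_query("web search", index))   # ['doc1', 'doc2']

The caching of whole results pages for popular queries would sit in front of all of this, but it is omitted here.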

Henzinger:
The first part of this article focused on the same issues related to web search engines as the previous two articles.

Where it differed was in the third section:

Content Quality: How do you deal with wrong or misleading information? This is a topic that has occupied librarians' attention for a while - we produce guides and tutorials and lists on how to filter out the 'junk' on the internet - but we forget that search engines try to help with that too. The issue here isn't tricking the search engine into returning results that don't fit the query; it's whether the information provided is actually correct, even when it does answer the query. PageRank and HITS are a good measure of quality, but not a perfect one. Anchor text might be useful, but junky websites can still have quality links. The most plausible approach is text-based analysis.
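
Since PageRank comes up here as a quality signal, here is a tiny, hedged illustration of the underlying idea - a power iteration over a three-page link graph. This is not Google's actual implementation; the damping factor and iteration count are just conventional defaults.

    # Minimal PageRank-style power iteration over a tiny link graph.
    # links[page] is the list of pages that `page` links out to.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, outlinks in links.items():
                if not outlinks:                      # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy_graph))   # "c", with the most incoming links, scores highest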

Quality Evaluation: Measure the number of clicks a given result gets, and then the number of click-throughs from that website.
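
A crude way to picture that kind of click-based evaluation (the log format and URLs here are entirely made up):

    # Toy click-through rate from an imaginary query log, where each record
    # says whether a result shown for a query was clicked.
    def click_through_rate(log, url):
        shown = sum(1 for record in log if record["url"] == url)
        clicked = sum(1 for record in log if record["url"] == url and record["clicked"])
        return clicked / shown if shown else 0.0

    log = [
        {"query": "digital libraries", "url": "http://example.edu/dl", "clicked": True},
        {"query": "digital libraries", "url": "http://example.edu/dl", "clicked": False},
        {"query": "digital libraries", "url": "http://example.com/spam", "clicked": False},
    ]
    print(click_through_rate(log, "http://example.edu/dl"))   # 0.5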

Web Conventions: websites need to follow these shared habits for search engines to be able to use them correctly (a small extraction sketch follows this list).
Anchor text: the text inside a link should describe what the link points to.
Links: appropriate and interesting links for the website's audience, related to the website's content.
Meta tags: like metadata in a library catalog, meta tags in a webpage can describe the site's content.
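
Here is a rough illustration of how a parser might pull two of those conventions (anchor text and the meta description) out of a page with Python's built-in HTMLParser; this is my own toy example, not how any actual engine does it.

    # Collect anchor text and the meta description from an HTML page,
    # two of the "web conventions" a search engine can lean on.
    from html.parser import HTMLParser

    class ConventionExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.anchors = []        # (href, anchor text) pairs
            self.meta_description = None
            self._current_href = None
            self._current_text = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and "href" in attrs:
                self._current_href = attrs["href"]
                self._current_text = []
            elif tag == "meta" and attrs.get("name", "").lower() == "description":
                self.meta_description = attrs.get("content")

        def handle_data(self, data):
            if self._current_href is not None:
                self._current_text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._current_href is not None:
                self.anchors.append((self._current_href, "".join(self._current_text).strip()))
                self._current_href = None

    page = '<meta name="description" content="Notes on digital libraries">' \
           '<a href="http://example.edu/dl">digital library course notes</a>'
    extractor = ConventionExtractor()
    extractor.feed(page)
    print(extractor.anchors)            # [('http://example.edu/dl', 'digital library course notes')]
    print(extractor.meta_description)   # Notes on digital libraries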

Duplicate hosts: multiple domain names resolve to the same underlying site for increased visibility. This is why typing in "pubmed.gov" sends you to http://www.ncbi.nlm.nih.gov/sites/entrez/. This is called a mirror. Search engines run the risk of returning a result for each of those names even though they point to identical content. But it can be hard to tell the content is identical: the ads on the pages may differ slightly from one viewing to the next, and if the crawl of a site is incomplete, two mirrors might not look like duplicates at all. A good way to avoid the problem is to predict whether similar-looking host names are likely to be 'duphosts' (a small sketch of that comparison follows).
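
One very rough way to flag likely duphosts is to compare fingerprints of a few sampled pages from each host; the sampling approach, the fetch() helper, and the 0.8 threshold below are my own simplifications, not the method from Henzinger's paper.

    # Flag two hosts as likely duplicates if the same sampled paths
    # return identical (normalized) content on both.
    import hashlib

    def fingerprint(text):
        """Hash the page text after collapsing whitespace, so trivial formatting
        differences (though not different ads) don't change the fingerprint."""
        normalized = " ".join(text.split()).lower()
        return hashlib.sha1(normalized.encode()).hexdigest()

    def likely_duphosts(host_a, host_b, sample_paths, fetch, threshold=0.8):
        """fetch(host, path) -> page text; assumed to exist, not shown here."""
        matches = 0
        for path in sample_paths:
            if fingerprint(fetch(host_a, path)) == fingerprint(fetch(host_b, path)):
                matches += 1
        return matches / len(sample_paths) >= threshold

    # e.g. likely_duphosts("pubmed.gov", "www.ncbi.nlm.nih.gov",
    #                      ["/", "/about", "/help"], fetch)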

Vaguely-structured data: prose on a website that is marked up with HTML to control how the viewer sees it. That HTML can give clues about the page. Large text followed by small text can imply that the small text is further detail about the large text. Pages with an image in the upper left are often personal pages. Pages with more markup mistakes are likely to be of lower quality.
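
To make the "clues from markup" idea concrete, here is a toy heuristic scorer; the specific signals and weights are invented for illustration and are not taken from the paper.

    # Toy structural heuristics over vaguely-structured HTML:
    # headings suggest organized detail, mismatched tags suggest sloppiness.
    import re

    def structure_score(html):
        headings = len(re.findall(r"<h[1-6]\b", html, re.IGNORECASE))
        open_tags = len(re.findall(r"<(p|div|span|li)\b", html, re.IGNORECASE))
        close_tags = len(re.findall(r"</(p|div|span|li)>", html, re.IGNORECASE))
        mismatches = abs(open_tags - close_tags)
        # invented weighting: reward visible structure, penalize sloppy markup
        return headings - 2 * mismatches

    print(structure_score("<h1>Title</h1><p>details</p>"))   # 1
    print(structure_score("<p>text<p>more<p>even more"))     # -6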

These readings gave me some interesting insight into how search engines work. I know that none of them are specific, because the actual algorithms are tightly guarded secrets. But they do give a clue as to why we get the results that we do, and just how hard the programmers work to fight off the spammers and such.
