Friday, October 17, 2008

Muddiest Point

I'm not sure I understand the Deep Web. Does it just mean content that hasn't been indexed by a web crawler? How can a website owner tell whether their site is in the deep web or the visible web?

Weekly Response 8

Chapter 1. Definition and Origins of OAI-PMH

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH): its goals are greater interoperability between digital libraries and more efficient dissemination of information. Gee, that sounds like the general goal of all libraries. One thing I'm learning in this class is that while digital libraries are very different from traditional libraries in terms of structure and management, they have the same goals. Get information to the people! Preserve it for future people!

Scope: metadata, expressed in XML. The protocol is moving toward other classes of metadata and even full content. The metadata is specifically for document-like objects in digital form. Digital libraries often hold not just books and papers, but digital images and other objects that require metadata.

Purpose: to define a standard way to move metadata from point A to point B on the World Wide Web, and to facilitate sharing and aggregation of that metadata.

They accomplish this by dividing the universe into OAI data providers (which hold the content and/or metadata) and OAI service providers (which harvest metadata from data providers and make it available). This follows the client/server model: data providers are the servers, service providers are the clients. This model allows one-stop shopping.
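To make the harvesting idea concrete for myself, here is a minimal sketch of what a service provider (the client) does when it harvests from a data provider (the server), using only Python's standard library. The base URL is a made-up placeholder; any repository that speaks OAI-PMH would do. The real protocol also handles resumption tokens for large result sets, which I've left out.

```python
# A minimal OAI-PMH harvest sketch using only the standard library.
# BASE_URL is hypothetical; substitute any real data provider endpoint.
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://example.org/oai"  # hypothetical data provider endpoint

def list_records(metadata_prefix="oai_dc"):
    """Ask the data provider for records; the service provider (us) harvests them."""
    params = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urlopen(f"{BASE_URL}?{params}") as response:
        tree = ET.parse(response)
    # Each <record> carries a <header> (identifier, datestamp) and a <metadata> payload.
    for record in tree.iter(f"{OAI_NS}record"):
        header = record.find(f"{OAI_NS}header")
        yield header.findtext(f"{OAI_NS}identifier")

for identifier in list_records():
    print(identifier)
```

In other words, OAI-PMH harvesting boils down to an HTTP GET with a verb parameter and an XML response: the one-stop shopping is built from very simple pieces.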

What it is not: an open access system, an archival standard, Dublin Core, or a realtime/dynamic search service.

Federated Searching: Put it in its place

Users want a search box! Give simple and easy access to information in one place, just like Google does. Whether the answer is the best one, or from the best source, is beside the point. Therefore, make federated searching mimic Google: one-stop shopping that spits out an answer.

The Truth about Federated Searching
1. Does not search everything, ever! You will still have to consult other sources.
2. You will still get duplicates. Truly de-duplicating results would require downloading every full record, which would take too long.
3. Relevancy is not perfect because it is only looking at the citation.
4. Federated searching ought to be used as a service, not purchased as software. Updates happen too often to make ownership feasible.
5. The federated search engine does not search your catalog better than you can; it only searches it as well as your own search engine does.

The Z39.50 Information Retrieval Standard

Z39.50 is a standard that lets patrons search other libraries' catalogs through their home library's interface. The client machine sends the search to the remote server, and the results come back to and are displayed on the client.

The server holds all the catalog information; it retrieves the appropriate records and returns them to the user's machine. Each database of records defines a set of access points for searching the collection.
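For my own reference, here's roughly what the client side of that exchange looks like in code. This is a sketch assuming the PyZ3950 package's ZOOM-style interface (Connection, Query, search); the Library of Congress host, port, and database name are just an example of a publicly searchable Z39.50 server.

```python
# A rough Z39.50 client sketch, assuming PyZ3950's ZOOM-style API.
from PyZ3950 import zoom

conn = zoom.Connection('z3950.loc.gov', 7090)   # the remote catalog (server)
conn.databaseName = 'VOYAGER'                   # which database to search
conn.preferredRecordSyntax = 'USMARC'           # ask for MARC records back

query = zoom.Query('CCL', 'ti="digital libraries"')  # a title search, CCL syntax
results = conn.search(query)                    # the server does the searching

for record in results:                          # records come back to the client
    print(record)

conn.close()
```

The point is that the patron never leaves the client interface: the query goes out against the remote catalog's access points, and the records come back to be displayed locally.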

Search Engine Technology and Digital Libraries
Since libraries are academic institutions with minimal universal searching capacity, and places like Google are universal search engines with minimal (although still substantial!) academic focus, the best of both worlds would be to marry the two entities: the academic internet! Google does have Google Scholar now, although I am uncertain whether it existed in June 2004, when this article was written. My understanding is that Google Scholar works by bringing up papers and publications known to be 'academic' in nature that fulfill the search request. If you are searching from an academic IP address (like Pitt!), it sorts things so that emphasis is given to information available through the databases that IP address subscribes to. So, if you search Google Scholar from a Pitt computer, you are likely to retrieve full-text items that you could have found through a database available at Pitt, but with the comfort of the Google interface.

This article appears to focus on academic libraries indexing the academic internet and making it available. Essentially, they would be putting the "LIBRARIAN APPROVED!" stamp on it. This helps the uninitiated user discern an appropriate, trustworthy source from an inappropriate, untrustworthy one.

Friday, October 10, 2008

Muddiest Point

This is not about the lecture, but I do need clarification on the final project, so here's the place.

I know we are supposed to have variation among the 3 digital collections. Could we do 2 collections of digital photographs of objects that are unrelated to each other and then a 3rd collection of scanned material? Would that be varied enough?

Week 7: Access in Digital Libraries

Web Search Engines: Part 1

Problems for Web Searchers:
1. Infrastructure: must have the computers and hardware to meet the volume of queries in a given period of time.
2. Crawling algorithms: bots that go around the internet and index it. Crawlers start with a list or queue of good 'seed URLs': sites that have lots of links to other good websites. They then add all the unseen links to the queue and save the content for indexing. They keep doing this until they hit the end of the queue. (A bare-bones crawler is sketched after this list.)
Speed: One crawler can't do the whole internet! You need multiple crawlers, each assigned different URLs via a hashing function, working in parallel. Each crawler machine has internal parallelism as well, with multiple threads working at once.
Politeness: Don't harass the website's servers!
Excluded content: must look at robots.txt to determine what content should not be crawled.
Duplicate content: Avoid it.
Continuous crawling: Have a priority queue so important URLs are checked more frequently than low-value/static URLs.
Spam: Prevent it! Blacklists, etc.
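Here is the bare-bones crawler I mentioned above, using only Python's standard library. The seed URLs, fleet size, and limits are made-up values for illustration; a real crawler would add parallel workers, per-host politeness scheduling, a priority queue for continuous crawling, and spam filtering.

```python
# A single-threaded crawler sketch: queue of seeds, robots.txt, dedup, politeness.
import hashlib
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

NUM_CRAWLER_MACHINES = 4  # made-up fleet size for the hashing example

def assigned_machine(url):
    """Hash the host so every machine in a fleet consistently owns the same URLs."""
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_CRAWLER_MACHINES

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50, delay=1.0):
    queue = deque(seed_urls)   # the frontier of URLs still to visit
    seen = set(seed_urls)      # duplicate content: don't re-queue what we've seen
    robots = {}                # cached robots.txt parsers, one per host
    pages = {}                 # url -> saved content, handed off to the indexer
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        host = urlparse(url).netloc
        if host not in robots:                       # excluded content: honor robots.txt
            parser = RobotFileParser(f"http://{host}/robots.txt")
            try:
                parser.read()
            except OSError:
                parser = None
            robots[host] = parser
        if robots[host] and not robots[host].can_fetch("*", url):
            continue
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:                 # add unseen links to the queue
            absolute = urljoin(url, link)
            # A fleet would also check assigned_machine(absolute) == my_machine_id here.
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)                            # politeness: don't hammer the server
    return pages
```

Usage would be something like `crawl(["https://example.org/"])` with a real seed list; a priority queue would replace the plain deque for continuous crawling.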

Web Search Engines: Part 2
Indexing algorithms: scan each document for indexable terms, which are then weighted by position and repetition to gauge importance. (A toy inverted index is sketched after this list.)
Real indexers:
Scaling up: divide the load among many machines, and fill up memory space with partial inverted files, and then combine the partials.
Term lookup: So many phrases, so little time. Engines use trees, hierarchies, and two-level structures to make things more efficient.
Compression: Save space, compress data structures. This also makes searches faster.
Phrases: produce lists of common phrases.
Anchor text: words used to describe a link. A strongly repeated anchor text gives a good clue as to what the website is about.
Link popularity: the more people link to you, the better you are. This is proof that life really is a popularity contest, no matter what your mother told you.
Query-independent score: high scores on other signals (like link popularity) improve a page's ranking, even if it doesn't match the query as well.
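And here is the toy inverted index mentioned above, assuming a `pages` dictionary of {url: text} like the crawler sketch returns. Real indexers shard this work across many machines, build partial inverted files in memory, merge the partials, and compress the postings.

```python
# A toy in-memory inverted index with term positions (positions support phrase queries).
import re
from collections import defaultdict

def build_index(pages):
    """pages: {url: text}. Returns (inverted index, doc_id -> url map)."""
    index = defaultdict(list)   # term -> postings list of (doc_id, [positions])
    doc_ids = {}
    for doc_id, (url, text) in enumerate(pages.items()):
        doc_ids[doc_id] = url
        positions = defaultdict(list)
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))  # one posting per document
    return index, doc_ids
```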

Query processing algorithms: a simple query processor looks up each word in its dictionary, locates each word's postings list, and scans the lists for documents they have in common.

Make them faster! Skip unnecessary parts of the lists, end the results list early, and number the documents in decreasing order of their query-independent scores. Another option: cache! Precompute and store HTML results pages for popular searches and spit those out upon request.
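A naive version of that lookup-and-intersect step, over the toy index above; the ranking comment marks where a real engine would blend in the query-independent score and where caching would kick in.

```python
# A naive AND-query processor over the toy inverted index sketched above.
def and_query(index, doc_ids, query):
    """Return the URLs of documents containing every query term."""
    terms = query.lower().split()
    postings = [{doc for doc, _ in index.get(term, [])} for term in terms]
    if not postings:
        return []
    common = set.intersection(*postings)            # documents the lists have in common
    # A real engine would sort by a blend of query-dependent relevance and the
    # query-independent score (link popularity, etc.), and cache popular queries.
    return [doc_ids[doc] for doc in sorted(common)]

# e.g. and_query(index, doc_ids, "digital libraries")
```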

Henzinger:
The first part of this article focused on the same issues related to web search engines as the previous two articles.

Where it differed was in the third section:

Content Quality: How do you deal with wrong or misleading information? This is a topic that has occupied librarians' attention for a while. We produce guides and tutorials and lists on how to filter out the 'junk' on the internet. It's easy to forget that search engines try to help with that too. It's not a question of tricking the search engine into giving results that are not appropriate to the query, but of whether the information provided is correct, even if it answers the query. Page rank and hits are good measures, but not perfect. Anchor text might be useful, but junky websites can still have quality links. The most plausible approach is text-based analysis.

Quality Evaluation: measure the number of clicks a given result gets, and then the number of click-throughs from that website.

Web Conventions: Search engines rely on websites following these conventions in order to use them correctly. (A small scanner for two of them is sketched after this list.)
Anchor Text: Text in the links describes the link.
Links: appropriate and interesting links for the website's audience, related to the website's content.
Meta tags: like metadata in a library catalog, meta tags in a webpage can describe the site's content.
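Since meta tags and anchor text are the two conventions a crawler can pick up mechanically, here's a small sketch of how one might extract them with Python's standard-library HTML parser. The snippet it parses is invented.

```python
# Pull web-convention signals (meta tags and anchor text) out of a page.
from html.parser import HTMLParser

class ConventionScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}          # e.g. {"description": "...", "keywords": "..."}
        self.anchor_text = []   # text found inside <a> tags
        self._in_anchor = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a":
            self._in_anchor = True
    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False
    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.anchor_text.append(data.strip())

scanner = ConventionScanner()
scanner.feed('<meta name="description" content="A digital library guide">'
             '<a href="/oai">OAI-PMH harvesting</a>')
print(scanner.meta, scanner.anchor_text)
```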

Duplicate hosts: multiple domain names resolve to the same site, for increased visibility. This is why typing in "pubmed.gov" sends you to http://www.ncbi.nlm.nih.gov/sites/entrez/. This is called a mirror. Search engines run the risk of returning results under each of those names, even though they have identical content. However, it can be hard to tell that the content is identical if the ads on the pages differ slightly from one viewing to the next, or if the crawl of each site is incomplete, which makes the hosts appear not to be duplicates. A good way to avoid them is to predict whether similar domain names are likely to be duphosts.
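One cheap way I can imagine checking a candidate pair of duphosts, sketched in code: sample the same paths from both hosts, fingerprint each page, and flag the pair if most fingerprints match. The hosts, paths, and threshold would all be up to the search engine; these names are hypothetical.

```python
# A toy duphost check: compare content fingerprints of shared paths on two hosts.
import hashlib

def fingerprint(html):
    # Strip whitespace so trivial formatting differences don't break the comparison;
    # a real system would also strip ads and other boilerplate first.
    return hashlib.sha1("".join(html.split()).encode()).hexdigest()

def looks_like_duphosts(pages_a, pages_b, threshold=0.8):
    """pages_a / pages_b: {path: html} samples from two candidate hosts."""
    shared_paths = set(pages_a) & set(pages_b)
    if not shared_paths:
        return False
    matches = sum(fingerprint(pages_a[p]) == fingerprint(pages_b[p])
                  for p in shared_paths)
    return matches / len(shared_paths) >= threshold
```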

Vaguely-structured data: Prose on a website that is marked up with HTML to affect how it is seen by the viewer. This HTML can give clues to the website. Large text followed by small text can imply that the small text is further details about the large text. Pages with an image in the upper left are often personal pages. Pages with more meta mistakes are likely to be of lower quality.

These readings gave me some interesting insight into how search engines work. I know that none of them are specific, because the actual algorithms are tightly guarded secrets. But they do give a clue as to why we get the results that we do, and just how hard the programmers work to fight off the spammers and such.

Friday, October 3, 2008

Week 6: Preservation in Digital Libraries

Research Challenges in Digital Libraries

We must research digital libraries in order to get a grasp on where we can take them. They are too widespread and heterogeneous to really understand everything that's going on at the moment. We also need to figure out how to preserve digital libraries as they are now for future study.

Big Issues:
1. We must figure out how to deal with all the digital libraries and preserve them while using humans as infrequently as possible.

2. We must protect the digital archives now. They require a lot of effort to maintain, so we must find a way to do that while, again, using humans as infrequently as possible.

3. We need to look at economic and business models of digital libraries to see how we can maintain these things in ways beyond technology. How can we afford to keep them up?

4. In order to expand the usefulness of digital libraries, new technologies need to be created. This needs to happen in order to make DL's cheaper while using humans as infrequently as possible.

5. We need shared and scalable infrastructure to support digital libraries. Sequestering them within institutions prevents interoperability and scalability, which hinders the usefulness of digital libraries.

Open Archival Information System Reference Model: An Introductory Guide

Open: reference model was developed in an open public forum: anyone could participate.
Archival Information System: people and institutions who agree to preserve info and make it available.

An OAIS must:
1. Get the appropriate information.
2. Make sure they have long term control of the information.
3. Know their user community.
4. Have appropriate metadata for the user to understand the info.
5. Make sure information is totally preserved.
6. Make it available to the user.

Tasks of OAIS:
-Ingestion (of data)
-Preservation Planning
-Data Management
-Archival Storage
-Administration
-Access (of data to user)

Types of information packages:
-Submission Information Package
-Archival Information Package
-Dissemination Information Package

This model provides a formula for digital library producers to follow. By doing so, they could produce an efficient, effective digital library. The paper does not provide any guidance on the technology or infrastructure to make this happen, but it does provide the guideposts of what sorts of things the technology and infrastructure must do.
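Just to fix the three package types in my head, here's a toy sketch of them as plain data structures moving through ingest. The field names and the ingest step are mine, not taken from the reference model's own schema.

```python
# A toy sketch of the OAIS package types (SIP -> AIP at ingest, DIP at access).
from dataclasses import dataclass, field

@dataclass
class InformationPackage:
    content: bytes                                # the digital object itself
    metadata: dict = field(default_factory=dict)  # what the user community needs to understand it

class SubmissionInformationPackage(InformationPackage):
    """What a producer hands to the archive at ingest."""

class ArchivalInformationPackage(InformationPackage):
    """What the archive actually stores and preserves long term."""

class DisseminationInformationPackage(InformationPackage):
    """What the archive delivers to a user on request."""

def ingest(sip: SubmissionInformationPackage) -> ArchivalInformationPackage:
    # Ingest adds preservation metadata (checksums, provenance) before storage.
    aip_metadata = dict(sip.metadata, checksum="<computed at ingest>")
    return ArchivalInformationPackage(content=sip.content, metadata=aip_metadata)
```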

Preservation Management of Digitized Materials
- The authors state that guidance is needed for digital preservation. It seems to be a recurring theme.


This book is too extensive to take notes on in much detail. However, it is an extremely interesting, useful guide for a novice in digital libraries to get a handle on the field. It introduces the reader to the vocabulary, explains why this information is vital, describes how digital libraries are made, who uses them, and what the rules and requirements are, and provides models for institutions to follow as they delve into this realm. Since this is a very new world and many librarians are long out of library school, this sort of resource, perhaps combined with additional instruction, can get them up to speed. Staying abreast of technological developments is important, and digital libraries are a huge part of that.

Actualized Preservation Threats
The National Digital Newspaper Program is an effort to "Chronicle America" by digitally preserving printed newspapers. It "also has a digital repository component that houses the digitized newspapers, supporting access and facilitating long-term preservation. Taking on access and preservation in a single system was both a deliberate decision and a deviation from past practices at LC." The authors wrote this paper to discuss the work done so far. Specifically, they discuss the preservation threats the project encountered over two years.

Types of failures:
Media- Failures in the portable hard drives transporting the digital images from the awardees to LC. This was addressed by adding 'fixity checks' (checksum verification; see the sketch after these failure types) to the transfer process and by having each awardee keep a copy until LC verified that it had received the data.

Hardware- Internal hard drives failed. They avoided data loss by storing data on multiple drives in a RAID 5 array with a hot spare, so a single drive failure caused no loss. Data was only lost when a second failure occurred in the array while the system was rebuilding the failed drive onto the hot spare.

Software- Three software problems occurred. The first involved validation: records were put into the NDNP repository that had passed validation but 'did not conform to the appropriate NDNP profile'. This was fixed with new validation rules. The second was more problematic: during transformation, the process stripped the XML from the original METS newspaper title records and also produced invalid METS records. This broke the application and made parts of the data unreadable. The third problem occurred when the XFS file system became corrupted, causing data loss. In a large, complex system such as this, it is harder to prevent problems, and to diagnose them when they occur. This is a serious weakness of huge digital libraries.

Operator- One error occurred when a series of files were deleted accidentally. Another occurred when the operator accidentally ingested the same batches multiple times, or perhaps did not purge a successful ingest before re-ingesting it. Many duplicates were produced.
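Here is the checksum sketch I mentioned under "Media": hash every file before the drive ships, send the manifest separately, and verify the hashes on arrival before the awardee deletes its copy. The paths and the choice of SHA-256 are my own assumptions, not details from the paper.

```python
# A minimal fixity-check sketch: manifest of checksums, verified on receipt.
import hashlib
from pathlib import Path

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read 1 MB at a time
            digest.update(chunk)
    return digest.hexdigest()

def manifest(directory):
    """Map each file to its checksum; shipped alongside the drive."""
    return {str(p): sha256_of(p) for p in Path(directory).rglob("*") if p.is_file()}

def verify(shipped_manifest):
    """Run on the receiving end; any False means the transfer was damaged."""
    return {path: sha256_of(path) == digest
            for path, digest in shipped_manifest.items()}
```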

The conclusions of the paper are that in a huge task such as this, errors are going to occur in many different ways, no matter what one does to protect against them. This makes performing a large digitization project extremely daunting, since one of the tasks is to make sure that the files are not only accessible but also permanently preserved.


This is Katie's favorite person in the world. His name is Kevin. Yes, all 3 of us have K names. It was not planned: Katie came prenamed, and we didn't have any choice in our names.

Muddiest Point 4

In XML, it seems like there are multiple ways of structuring things to get the same result. Are these rules hard and fast, or fairly soft?