Digital Libraries: September 2008

Sunday, September 28, 2008

Flickr Assignment

To see some cool pictures of Batman comic book covers, check out this URL:

http://www.flickr.com/photos/30893186@N05/

Friday, September 26, 2008

Is it possible for metadata to be stored with the object, and separately from the object? Would that redundancy be a problem?

(I forgot to put the puppy in the reading notes post. So here she is. Sleeping again.)

Week 5: XML Galore!

"Introducing the Extensible Markup Language"

XML is extensible: it can be altered and added to indefinitely to tweak the language to suit the needs of the user. This makes it a robust language to use for digital libraries. As things change, XML can accommodate the changes without requiring a total overhaul of the system. Libraries like things that work that way, because it doesn't require them to reinvent the wheel. It is also useful for metadata, because the tags can be used for labeling different types of metadata.
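To make the "extensible" part concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The record and its tag names (record, title, format, rights) are made up for illustration and do not come from the reading. The idea is that a new kind of metadata can be added later without touching the existing fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the tag names are invented for this sketch.
RECORD = """<record>
  <title>Introducing the Extensible Markup Language</title>
  <format>article</format>
</record>"""

root = ET.fromstring(RECORD)

# Extend the record with a tag that was not part of the original design.
rights = ET.SubElement(root, "rights")
rights.text = "unknown"

# Existing fields are untouched; the new one simply sits alongside them.
fields = {child.tag: child.text for child in root}
print(fields)
```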

"A Survey of XML Standards" is a good reference source for the different versions of XML because it provides other resources to look at for further instruction. The sheer number of versions illustrates the extensibility of XML.

"Extending your Markup" is an interesting and short overview of how XML works. Again, it is a good resource for a novice to look at to get started in this new world.

Major definitions:
DTD: document type definition. A DTD defines the structure of an XML document by declaring which elements it may contain, such as author, and what each element holds.
DTD elements:
Nonterminal: they contain a series of other choices or sequences. A DTD defining a book has a sequence of child elements following it, such as author.
Terminal: they do not have choices. They may contain parsed character data (#PCDATA), be empty (EMPTY), or be labeled ANY.

DTD attributes: do not prescribe order in the DTD, but carry further information about an element.
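The DTD terms above can be sketched with a small made-up example. The book document below is hypothetical: book is a nonterminal (a sequence of children), title and author are terminals holding #PCDATA, and lang is an attribute. Python's standard library parses the DTD but does not validate against it, so the check at the end is a hand-written stand-in for the rule "book (title, author+)", not real DTD validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an internal DTD (all names invented for this sketch).
DOC = """<!DOCTYPE book [
  <!ELEMENT book (title, author+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ATTLIST book lang CDATA #IMPLIED>
]>
<book lang="en">
  <title>An Example Title</title>
  <author>First Author</author>
  <author>Second Author</author>
</book>"""

# ElementTree reads the document but does NOT validate it against the DTD.
root = ET.fromstring(DOC)
child_tags = [child.tag for child in root]


def follows_book_rule(tags):
    """Hand-rolled check for the rule 'book (title, author+)':
    exactly one title first, then one or more authors."""
    return (
        len(tags) >= 2
        and tags[0] == "title"
        and all(tag == "author" for tag in tags[1:])
    )


print(child_tags, follows_book_rule(child_tags))
```

A validating parser (a third-party library would be needed in Python) would enforce the DTD directly instead of relying on a check like this.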

Namespaces: to prevent conflict between two fields that use the same tag but in different contexts (email address vs. postal address) namespaces define the two as distinct. Do not play well with DTDs.
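Here is a small sketch of the email address vs. postal address case, assuming two invented namespace URIs under example.org. Both elements are called address, but each lives in its own namespace, so a parser can tell them apart.

```python
import xml.etree.ElementTree as ET

# The namespace URIs below are placeholders made up for this sketch.
DOC = """<contact xmlns:mail="http://example.org/ns/email"
         xmlns:post="http://example.org/ns/postal">
  <mail:address>reader@example.org</mail:address>
  <post:address>123 Example Street</post:address>
</contact>"""

# Prefix-to-URI map used when searching; the prefixes are local shorthand only.
NS = {
    "mail": "http://example.org/ns/email",
    "post": "http://example.org/ns/postal",
}

root = ET.fromstring(DOC)
email = root.find("mail:address", NS).text
postal = root.find("post:address", NS).text

# Internally, each tag is stored with its full namespace URI attached.
print(root[0].tag)
print(email, "|", postal)
```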

Linking: goes beyond HTML to describe different types of linking
XLink: describes how 2 documents can be linked
XPointer: points to specific parts inside a document.
XPath: (used by XPointer) describes the path to the part being addressed
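As a rough illustration of what a path expression looks like, the sketch below uses the limited XPath subset that Python's xml.etree.ElementTree supports. The library document and its book, title and year names are invented for the example. This shows XPath-style addressing only; it is not XLink or XPointer.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are made up.
DOC = """<library>
  <book year="2008"><title>First Example</title></book>
  <book year="1999"><title>Second Example</title></book>
  <book year="2008"><title>Third Example</title></book>
</library>"""

root = ET.fromstring(DOC)

# Path with an attribute test: titles of every book whose year is 2008.
titles_2008 = [t.text for t in root.findall("./book[@year='2008']/title")]

# Path with a position test: the title of the second book (positions start at 1).
second_book = root.find("./book[2]/title").text

print(titles_2008)
print(second_book)
```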

XSLT: Extensible Stylesheet Language Transformations: part of XSL; transforms an XML document into another format, such as HTML.

XML Schema: overcomes the limitations of DTDs (limited expressiveness and non-XML syntax)

Document definition markup language (DDML): define datatypes
Document content description (DCD)
Schema for object-oriented XML (SOX)
XML-Data (replaced by DCD)

"Introduction to XML schema"

Schemas replace DTDs! They do the same things, such as defining the elements, their child elements, and the order of the elements. However, they are more extensible, richer, and more powerful; they support data types and namespaces; and they are themselves written in XML. Essentially, they perform the same function as DTDs, only better.

Friday, September 19, 2008

Muddiest Point week 3

Who assigns a DOI? Is it the creator of the digital object, or an outside organization?

Week 4: METADATA GALORE!

Witten:

Bibliographic systems:
1. Finding: locate item with known info.
2. Collocation: finding other things related to this item, such as other books the author has written.
3. Choice: A list of other available options arranged graphically (other editions) or topically (similar subjects).

Bibliographic entities
1. Documents: analog or digital form
2. Works: inhabitants of bibliographic universe: can have different forms, mediums and editions
3. Editions: multiple publications, revisions. Electronic form is usually a version, release or revision not an edition
4. Authors: Can have different names, numbers of authors, versions of name, can be a group or entity like the LOC. The LOC provides controlled vocabulary and standard names to clear up any problems.
5. Titles: straightforward attribution of the work
6. Subject: key-phrase extraction or key-phrase assignment. LOC uses a controlled vocabulary (LCSH) to standardize subject assignment.
7. Subject classification: organizing books on the shelf by subject. LC call number system does this automatically, as does Dewey. This allows the user to physically browse the shelves and gain access to the full content to choose materials.

Bibliographic Metadata
1. MARC: Machine-Readable Cataloging: organizes info using numerical tags
2. Dublin Core: same concept, but simplified without all the numerical tags.
3. BibTeX: preferred by scientific and technical authors who use a lot of mathematical structures.
4. Refer: basis of EndNote
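To show what "simplified without all the numerical tags" can look like, here is a minimal sketch that builds a Dublin Core style record in Python. The element names (title, creator, date) and the namespace URI are from the Dublin Core element set; the wrapping record element, the helper function, and the field values are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def dublin_core_record(fields):
    """Build a record whose children are Dublin Core elements.

    The outer 'record' wrapper is a hypothetical container for this sketch.
    """
    record = ET.Element("record")
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC}}}{name}")
        element.text = value
    return record


# Placeholder values; a MARC record would carry the same facts under numeric tags.
record = dublin_core_record(
    {"title": "Example Work", "creator": "Example, Author", "date": "2008"}
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```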

Metadata for images, etc
1. Tagged Image File Format: TIFF. Used for images. Tags describe elements of the image, such as size, colors, etc.
2. MPEG-7: multimedia content description interface. Tags describe the data in the file.

Extracting Metadata
1. Reading the document helps one understand it.
2. Markup languages give clues as to the content without reading the full document: XML, etc.
3. Extracting information: generic entity extraction can pull information out using clues in the text
4. Bibliographic references: provide information in the form of citations. A citation index, such as a 'works cited' page organizes these.

Setting the Stage
This article revisits the basics of the metadata systems already covered. What it adds is an argument that metadata, its structure, and its organization need to extend to museums and archives, especially as those institutions move to digital resources. The use of metadata is second nature to libraries because they've been doing it for generations now. The analog metadata can easily be transcribed into digital systems when the items are digitized. However, archives and museums have resisted using metadata and instead use 'finding aids' to locate their items. This precludes amateur users from independently finding items, and it precludes digitization. This is a situation that needs to be rectified in order for these institutions to move into the digital age.

Border Crossings
This article looks back on the past 10 years of the efforts of the D-Lib DCMI management team. It talks about how necessary it is to create a universal and international system of metadata management. As information and metadata become more digitized and accessible over the Internet, it becomes more important for the systems to be able to speak to each other. An overarching goal of libraries has always been to share information with each other and make materials as accessible as possible to patrons everywhere. The Internet provides the infrastructure to make that happen, but in order to work, all the different systems must be able to communicate. This article focuses on the metadata aspect of that. I found especially applicable the comparison to the rail changes between Mongolia and China. Two given libraries ought not have the animosity of centuries that those nations do, so communication between them certainly shouldn't be as complicated.

Puppy picture!

A tired puppy is a good puppy.

Friday, September 12, 2008

Week 3 Readings

Lesk Ch. 2

Computer typesetting:
1. Printers
2. Word processing
a. exact appearance of the text
b. content of the text

Text Formats
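1. ASCII standard: 7-bit code covering the 26 Latin letters (upper and lower case), digits, punctuation, and control characters
2. Unicode is gaining popularity: designed to cover the characters of all major languages, originally at 16 bits per character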
3. Higher level descriptive systems: characters are marked for meaning
a. MARC: Machine-Readable Cataloging
b. SGML: Standard Generalized Markup Language
c. HTML: Hypertext Markup Language
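The difference between ASCII and Unicode is easy to see with a quick sketch in Python. The two sample words are my own choice, and UTF-16 and UTF-8 are used here as two common Unicode encodings, which is an assumption beyond what the reading covers.

```python
# Sample strings chosen for this sketch; the second contains a non-ASCII letter.
text_ascii = "Library"
text_other = "Biblioth\u00e8que"  # the French word, with an accented e

# Every ASCII byte fits in 7 bits, so each value is below 128.
ascii_bytes = text_ascii.encode("ascii")
all_seven_bit = all(byte < 128 for byte in ascii_bytes)

# The accented letter is outside ASCII, so encoding it as ASCII fails.
try:
    text_other.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# UTF-16 spends 16 bits (2 bytes) on each of these characters;
# UTF-8 spends 1 byte on the ASCII letters and 2 on the accented one.
utf16_len = len(text_other.encode("utf-16-be"))
utf8_len = len(text_other.encode("utf-8"))

print(all_seven_bit, fits_in_ascii, utf16_len, utf8_len)
```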

Document Conversion: analog to digital forms
1. Keying in: expensive
2. Scanning: less expensive
a. Optical character recognition: improving reliability
3. Converted documents can then be put online: digital libraries!

Arms Ch. 3
1. Structure: elements of the document: font, characters, paragraphs, etc
2. Appearance: How the elements are arranged on the page
3. Page-description languages: describe appearance on the page. TeX, PostScript, PDF
4. Encoding characters: ASCII, Unicode, transliteration, SGML, HTML (simplified SGML), XML (bridge between SGML and HTML)
5. style sheets (formatting on screen/printed page)
a. Cascading style sheets (CSS): used with HTML
b. Extensible style language (XSL): used with XML
6. Page description languages: layout
a. TeX: focus on mathematics
b. PostScript: graphical output for printing, with support for fonts
c. Portable document format (PDF): from PostScript. Similar attributes to reading paper, but on the screen. Can limit unlawful printing. Adobe provides excellent, free PDF readers, making the format widely accepted.

Identifiers and Their Role In Networked Information Applications
1. ISBN, ISSN, OCLC, RLIN: make locating a given object easy.
2. New identifiers are emerging in the electronic world: URLs and URNs
a. URLs: not long lasting locators, very ephemeral.
b. URN: naming authority identifier and object identifier
c. OCLC persistent URL (PURL): maintained for a much longer time than regular URLs, so less likely to produce dead links.
d. Serial Item and Contribution identifier (SICI): using ISSN, can identify individual journal or article.
e. Book Item and Contribution Identifier (BICI): can identify individual volumes or chapters within a work.
f. Digital object identifier (DOI): based on the URN idea. Can allow copyright limitations to control who has what kind of access

Digital Object Identifier
1. DOI is the digital identifier of an object, not the identifier of a digital object. It is a persistent identifier.
2. It includes: Syntax (name), resolution of the name to the object, metadata describing the object, and social networking of the object through interoperability
3. DOI does not preserve the object: it merely finds a way of sharing information about the object.
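The syntax and resolution parts can be sketched in a few lines of Python. A DOI name is a prefix and a suffix separated by a slash, and the prefix starts with "10.". The function names below are hypothetical, the https://doi.org/ resolver address is an assumption that is not taken from the reading, and the sketch skips the character escaping a real link might need. The sample DOI 10.1000/182 is the one assigned to the DOI Handbook.

```python
def parse_doi(doi):
    """Split a DOI name into (prefix, suffix) at the first slash.

    Only the basic shape is checked: a prefix starting with '10.' and a
    non-empty suffix. This is a sketch, not a full syntax validator.
    """
    prefix, slash, suffix = doi.partition("/")
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


def resolver_url(doi):
    """Build a link that asks a resolver to turn the name into a location.

    Assumes the public resolver at https://doi.org/ and does no escaping.
    """
    parse_doi(doi)  # raises ValueError if the shape is wrong
    return "https://doi.org/" + doi


prefix, suffix = parse_doi("10.1000/182")
print(prefix, suffix)
print(resolver_url("10.1000/182"))
```

The name itself stays the same even if the object moves; only what the resolver returns changes, which is what makes the identifier persistent.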

These 4 readings are all centered around communicating meaning about a given object or text. The characters on the page don't mean anything to a computer, so it is necessary to tag them and use appropriate languages so that you can convey that meaning to the computer. When you do that, the computer can organize it in the way you want.

Affixing meaning also applies to identifiers. Without a good identifier, a given object will be very difficult to find. Providing an identifier like a DOI not only helps the user to access the object, but it also provides other information about the object that is translatable across a variety of mediums. This means that the record will be persistent.

All of this applies to digital libraries. What is the point of having a digital library if you can't find what you are looking for? Or if you may have found what you're looking for, but you're not quite sure if it is without looking at the entire object? Providing information about a given object is absolutely vital in any library, including digital libraries.

And, here is an entirely gratuitous puppy picture, for those who are interested.

We took her camping in Fayette county a few weeks ago. There was a lake there and she swam and swam and swam. She's a water dog, you might say.

Look at those little paws paddling! awwww.

Friday, September 5, 2008

Muddiest Point 1

This is a muddiest point about muddiest points. Do we have to post a muddiest point about the lecture, or can it also be the readings? We didn't have a lecture this week, so obviously this muddiest point is not about the lecture. Speaking of, are the readings for this week supposed to go with next week's lecture? Did I post my response to the readings too early? Does it not really matter?

Week 2 Response

First of all, the overarching theme of these readings is interoperability. A large emphasis is on interchangeable parts: different tools that can be exchanged and used as needed by multiple types of digital libraries. This makes sense, and is a concept that has been around for a long time. Car manufacturers save time, money and effort by building their engines and cars with a lot of parts that can be used in as many of their products as possible. By making sure that every car in their 2008 fleet uses widget A to complete task 1, they can make a whole lot of widget A's all at once and put them in every car. If some cars used widget A, others used widget B and the rest used widget C to complete task 1, they would have to make widget A's, B's and C's, and each of them would require a different factory or machine to produce. That raises the cost of completing task 1. It's what one might call 'reinventing the wheel'.

With this in mind, it is completely logical to take this concept into the digital library environment. Why reinvent the wheel? Obviously, different digital libraries are going to have different requirements, so they can pick and choose their given widgets cafeteria-style. This lowers the cost of developing the digital library. This is why the Suleman article discusses software toolkits for building digital libraries.

Furthermore, it allows for different digital libraries to talk to each other if there is a common language. This is a concept that is not new to libraries. Much of the technology that they produced before the digital age was focused on sharing information between libraries. Union catalogs filled this purpose by letting people know what various libraries had available. Bibliographies helped libraries know what's new in their particular field. In the digital universe, libraries being able to share what they have and have the collections communicate is a logical extension of this philosophy. The Payette article gives definitive protocols and evidence of their success for the interoperability of digital library systems.

Now. Is the Internet a digital library? It is a collection of data and information, in a digital format, that is stored on various servers and can be searched and accessed. By that definition, it is a digital library. However, the Internet is not maintained by a given body or individual. It is full of wrong information and a lot of the good information is hard to find. Much of it has restricted access. Amazon.com has servers storing a lot of personal data, but users can't access it using Google.

One might say that the Internet is a 'bad' digital library. It has many characteristics that the authors of these articles are specifically trying to avoid, and problems that they are trying to overcome in digital libraries. It seems unfair to declare that something is not a digital library just because it is not a good example of one. It is akin to saying that your daughter is not your child anymore because she misbehaved.

However, learning how to overcome these problems and develop robust digital library systems could revolutionize the Internet. Perhaps one day the recalcitrant child will grow up to be a fine, upstanding citizen!
