Friday, September 25, 2009

Metadata in digital libraries

Reading question - how interoperable are the different metadata schemas?

-The term metadata is less commonly used among creators and consumers of networked digital content, even though practices such as web page tagging, folksonomies, and social bookmarking are growing.

Metadata should reflect three things:
1. Content - what the object contains or is about, and is intrinsic to an information object
2. Context - indicates the who, what, why, where and how aspects associated with an object's creation and is extrinsic to an information object
3. Structure - relates to the formal set of associations within or among individual information objects and can be both intrinsic and extrinsic.

Data Structure Standards - Categories or containers of data that make up a record or information object. (MARC, EAD)

Data Value Standards - Terms, names, and other values that are used to populate data structure standards or metadata elements. (LOC Subject headings)

Data Content Standards - Guidelines for the format and syntax of the data values that are used to populate metadata elements. (DACS)

Data format/technical interchange standards - this type of standard is often a manifestation of a particular data structure standard, encoded or marked up for machine processing. (XML)

- Information communities are aware that the more highly structured an information object is, the more that structure can be exploited for searching, manipulation and interrelating with other information objects. This can only occur with strict adherence to metadata standards.
-Metadata certifies the authenticity and degree of completeness of the content.
- establishes and documents the context of the content
- identifies and exploits the structural relationship that exists within and between information objects.
- provides a range of intellectual access points for an increasingly diverse group of users.
- provides some of the info that an info professional would have provided in a reference scenario


Different types of metadata
Administrative - used in managing and administering collections and information resources

Descriptive - used to identify and describe collections and related information resources

Preservation - preservation management of collections and information resources. Documentation of physical condition of resources.

Technical - how a system functions

Use - level and type of use of collections and information resources

Attributes and characteristics of metadata
Source of metadata - internal metadata is generated by the creating agent when the item is first created (born digital) or digitized. External metadata is created by someone who is not the creator.

method of creation - automatically generated by the computer or manually by humans

nature of metadata - non-expert vs. expert creation

Status - static metadata never changes; dynamic metadata changes with use, manipulation, or preservation

Semantics - controlled metadata vs. uncontrolled metadata

- metadata creation has become a complex combination of manual and automatic processes

Primary functions of metadata
-creation, multiversioning, reuse and recontextualization of information objects.
-organization and description
-validation - users scrutinize metadata to assure authenticity and authoritativeness
-searching and retrieval
-utilization and preservation - metadata related to user annotations, rights tracking and version control
-disposition - accession and deaccessioning

Bibliographic entities
documents, works, editions, authors, titles and subjects

MARC
-governed by AACR2R
-stored as a collection of tagged fields in a fairly complex format. It is also used to represent authority records, which are standardized forms that are part of a controlled vocabulary.

Dublin Core
-designed for nonspecific use
-simple/flexible has only 15 elements compared to hundreds in MARC
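
To make the "only 15 elements" point concrete for myself, here is a minimal Python sketch that lists the 15 elements of the simple Dublin Core set and writes a small record as XML. The wrapper element name `record` and the sample values are my own assumptions for illustration, not taken from the reading.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core set.
DC_NS = "http://purl.org/dc/elements/1.1/"

DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_record_to_xml(record):
    """Serialize a dict of Dublin Core values to a small XML string.

    The <record> wrapper is a hypothetical container chosen for this sketch.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in record.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values.
example = {
    "title": "Reading notes week 3",
    "creator": "a student",
    "date": "2009-09-17",
}
xml_text = dc_record_to_xml(example)
print(xml_text)
```

Every element is optional and repeatable in simple Dublin Core, which is why a plain dict with three keys is already a usable record here.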

BibTeX - used with TeX/LaTeX, the typesetting system popular for documents with mathematical notation. Manages bibliographic data and references within docs. Similar to EndNote?

Refer - similar to BibTeX

Thursday, September 24, 2009

Assignment 2 link to flickr

http://www.flickr.com/photos/mcrib/sets/72157622447509222/

Thursday, September 17, 2009

Reading notes week 3

Reading question: How often are identifiers used in place of URLs for digital objects in DLs? How often are they used outside DLs, on a site similar to flickr?

Identifiers and Their Role in Networked Information Applications

-Bibliographic utility identifier numbers such as the OCLC and RLIN numbers are used in duplicate detection and consolidation in the construction of online union catalog databases.

-"The assignment of identifiers to works is a very powerful act; it states that, within a given intellectual framework, two instances of a work that have been assigned the same identifier are the same, while two instances of a work with different identifiers are distinct."

- URLs serve as the key links between physical artifacts and content on the Web, as well as providing linkage between objects within the Web.

- URLs are not really names, merely instructions on how to access an object. URLs were never intended to be long lasting names for content; they were designed to be flexible, easily implemented and easily extensible ways to make reference to materials on the Net.

URN- uniform resource names. the syntax of a URN for a digital object is defined as consisting of a naming authority identifier and an object identifier which is assigned by that naming authority to the object in question; the specific content of the identifier may have structure and significance to users familiar with the practices of a given naming authority, but has no predefined meaning within the overall URN framework.
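
A minimal sketch of the two-part URN structure described above, assuming the common `urn:<naming authority>:<object identifier>` text form. The function name `parse_urn` and the sample ISBN string are my own illustrations.

```python
def parse_urn(urn):
    """Split a URN into (naming authority identifier, object identifier).

    Assumes the textual form urn:<authority>:<object-id>. The object
    identifier is returned untouched because only the naming authority
    knows what its internal structure means.
    """
    scheme, sep, rest = urn.partition(":")
    if not sep or scheme.lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    authority, sep, object_id = rest.partition(":")
    if not sep or not authority or not object_id:
        raise ValueError(f"URN needs an authority and an object id: {urn!r}")
    # The authority part is treated as case-insensitive in this sketch.
    return authority.lower(), object_id


authority, object_id = parse_urn("urn:isbn:0451450523")
print(authority, object_id)
```

The parser deliberately does nothing with the object identifier, which mirrors the point that it has no predefined meaning within the overall URN framework.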

- Could you talk in class about the function of "resolvers" within the URN framework?

-browsers do not understand URNs

-A PURL server creates a database entry linking an object's current hostname and filename to the identifier that will appear in the PURL. When the PURL server is contacted because someone is resolving a PURL, it looks up the identifier in its database, finds out where the object in question currently resides, and uses the redirect feature of the HTTP protocol to connect the requester to the host housing the object.
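
The lookup-and-redirect behavior described above can be sketched as a toy in-memory class. This is not real PURL server software: the class name `PurlServer`, the example hostnames, and the choice of status code 302 for the redirect are all assumptions made for illustration.

```python
class PurlServer:
    """Toy stand-in for a PURL server: identifier -> current location."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, hostname, filename):
        # Create or update the database entry for this identifier.
        self._locations[identifier] = (hostname, filename)

    def resolve(self, identifier):
        """Return (status, headers) shaped like an HTTP redirect response."""
        location = self._locations.get(identifier)
        if location is None:
            return 404, {}
        hostname, filename = location
        return 302, {"Location": f"http://{hostname}/{filename}"}


server = PurlServer()
server.register("notes/week3", "old-host.example.org", "week3.html")
before = server.resolve("notes/week3")

# The object moves, but the identifier in the PURL stays the same.
server.register("notes/week3", "new-host.example.org", "archive/week3.html")
after = server.resolve("notes/week3")
print(before, after)
```

Only the database entry changes when the object moves, so anyone holding the PURL keeps a working link.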

-SICI- Serial Item and Contribution Identifier. can be used to identify a specific issue of a serial, or a specific contribution within an issue (such as an article or table of contents)

-BICI - Book Item and Contribution Identifier. Can be used to identify specific volumes within a multivolume work, or components such as chapters within a book

Digital Object Identifier - provides a mechanism for implementing a naming system that fits roughly within the URN framework and that provides a mechanism for implementing naming systems for arbitrary digital objects.

-DOI provides a method for collecting revenue for access to material that is described by a DOI, if the organization that owns the rights to the material wishes to do so. DOIs in and of themselves are only identifiers and do not imply that any sort of copyright enforcement mechanisms will be bundled with the objects that they describe; the presence or absence of such copyright enforcement technologies is an entirely separate issue.

Digital Object Identifier System (Paskin)

Identifier is

- a string, typically a number or name denoting a specific entity. Think ISBN

- A specification, which prescribes how such strings are constructed.

- a scheme, which implements the specification. Typically such schemes provide a managed registry of the identifiers within their control, in order to offer a related service.

Uniqueness - is the requirement that one string denotes one and only one entity (the "referent").

Resolution - is the process in which an identifier is the input to a service to receive in return a specific output of one or more pieces of current information related to the identified entity.

Persistence - is the requirement that once assigned an identifier denotes the same referent indefinitely.

URLs do not refer to the identity of an entity but its location on a network.

The DOI system is a managed system for persistent identification of content on digital networks, using a federation of registries following a common specification. Information such as where to find an object may change over time (URL?), but its DOI will not change. It brings together four components:
- a syntax specification, defining the construction of a string
- a resolution component, providing the mechanism to resolve the DOI name to data specified by the registrant
- a metadata component, defining an extensible model for associating descriptive and other elements of data with the DOI name
- a social infrastructure, defining the full implementation through policies and shared technical infrastructure in a federation of registration agencies
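
A small sketch of the syntax and resolution pieces, assuming the usual DOI name form of a prefix starting with `10.` followed by a slash and a suffix, and assuming the public `https://doi.org/` proxy for resolution. The function names and the sample DOI string are illustrative only.

```python
def split_doi(doi):
    """Split a DOI name into (prefix, suffix) at the first slash.

    Assumes the form 10.<registrant code>/<suffix>. The suffix is opaque:
    it is chosen by the registrant and is not interpreted here.
    """
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


def resolver_url(doi, proxy="https://doi.org/"):
    """Build the URL a browser could use to resolve the DOI name."""
    split_doi(doi)  # validate first
    return proxy + doi


prefix, suffix = split_doi("10.1000/182")
print(prefix, suffix, resolver_url("10.1000/182"))
```

The string itself carries no location, so the registrant can change where the DOI resolves to without changing the DOI.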

Arms Chapter 9

Methods for storing textual materials must represent two different aspects of a document: its structure and its appearance. The structure describes the division of a text into elements such as characters, words, paragraphs and headings. It identifies parts of the documents that are emphasized, material placed in tables or footnotes, and everything that relates one part to another. The structure of text stored in computers is often represented by a mark-up specification. In recent years, SGML (Standard Generalized Markup Language) has become widely accepted as a generalized system for structural mark-up.

The appearance is how the document looks when displayed on a screen or printed on paper. The appearance is closely related to the choice of format: the size of font, margins and line spacing, how headings are represented, the location of figures, and the display of mathematics or other specialized notation. In a printed book, decisions about the appearance extend to the choice of paper and the type of binding. Page-description languages are used to store and render documents in a way that precisely describes their appearance. This chapter looks at three, rather different, approaches to page description: TeX, PostScript, and PDF.

style sheet - describes how each structural element is to appear, with comprehensive rules for every situation that can arise.

- Mark-up languages can represent almost all structures, but the variety of structural elements that can be part of a document is huge, and the details of appearance that authors and designers could choose are equally varied

OCR - Optical character recognition is the technique of converting scanned images of characters to their equivalent characters. The basic technique is for a computer program to separate out the individual characters, and then to compare each character to mathematical templates

- Computers store a character, such as "A" or "5", as a sequence of bits, in which each distinct character is encoded as a different sequence

- Since it is impossible to represent all languages using the 256 possibilities represented by an eight-bit byte, there have been several attempts to represent a greater range of character sets using a larger number of bits. Recently, one of these approaches has emerged as the standard that most computer manufacturers and software houses are supporting. It is called Unicode.
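
A quick Python check of the two points above: each character is stored as a distinct bit sequence, and one eight-bit byte is not enough for every character. UTF-8 is used here as one common way of storing Unicode characters; that choice is my assumption, since the reading only names Unicode itself.

```python
def bits(data):
    """Show a bytes value as groups of eight bits."""
    return " ".join(f"{byte:08b}" for byte in data)


# "A" and "5" each fit in a single byte, with different bit patterns.
a_bits = bits("A".encode("ascii"))
five_bits = bits("5".encode("ascii"))

# The Greek capital omega has a Unicode code point above 255, so it
# cannot be one of the 256 values of a single eight-bit byte.
omega = "\u03a9"
omega_code_point = ord(omega)
omega_bytes = omega.encode("utf-8")

print(a_bits, five_bits, omega_code_point, bits(omega_bytes))
```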

- SGML is a system to define mark-up specifications. An individual specification defined within the SGML framework is called a document type definition.

- SGML is firmly established as a flexible approach for recording and storing high-quality texts. Its flexibility permits creators of textual materials to generate DTDs that are tailored to their particular needs.

- html is considered a simplified DTD

- xml is designed to bridge the gap between html and the full power of sgml

- Every time a new feature is added to HTML it becomes less elegant, harder to use, and less of a standard shared by all browsers. SGML is the opposite. It is so flexible that almost any text description is possible, but the flexibility comes at the cost of complexity. Even after many years, only a few specialists are really comfortable with SGML and general-purpose software is still scarce.

- Since XML is a subset of SGML, every document is based on a DTD, but the DTD does not have to be specified explicitly. If the file contains previously undefined pairs of tags, which delimit some section of a document, the parser automatically adds them to the DTD.
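
A small sketch of reading XML that has no explicit DTD, using Python's standard `xml.etree.ElementTree`. This parser is non-validating: it only checks that tags are properly paired and nested, and it does not literally build a DTD, so treat it as an approximation of the idea above. The tag names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tags invented on the spot; no DTD is declared anywhere.
doc = "<note><course>Digital Libraries</course><week>3</week></note>"
root = ET.fromstring(doc)
tags = [child.tag for child in root]
week = root.find("week").text

# Tags still have to come in matching pairs, or parsing fails.
try:
    ET.fromstring("<note><week>3</note>")
    well_formed = True
except ET.ParseError:
    well_formed = False

print(tags, week, well_formed)
```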


Thursday, September 10, 2009

Reading Notes for week two

Framework for Building a Digital Library:

- DLs are expected to remain stable, but the computer science field has to make sure they stay stable despite rapid advancements in Internet technology

- Systems are expected to be interoperable with other DLs

-Existing systems classified as DLs have resulted from custom-built software development projects. They are built in isolation to suit the needs of a specific community. Most DLs are quick responses to urgent needs by a community of users. As DL systems get more complex, extensibility becomes more difficult and maintainability is compromised. There are few software toolkits available to build DLs.


The solution to this problem is the creation of software toolkits. The existing toolkits have two main problems:
1. The range of possible workflows is restricted by the design of the system
2. The software is either built as a monolithic system or as components that communicate using non-standard protocols.

In 1999 the OAI was launched in an attempt to address issues of interoperability among dls. The resulting protocol is simple and popular. In the OAI dls are modeled as networks of extended open archives, with each extended OA being a source of data and/or a provider of services. Componentization and standardization are built into the system. Closely resembles the way physical libraries work.

How OAI can provide higher level dl services
1.All dl services should be encapsulated within components that are extensions of open archives (I am not sure what this means)
2.All access to the dl services should be through their OAI interfaces
3.The semantics of the OAI protocol should be extended or overloaded as allowed by the OAI protocol, but without contradicting its essential meaning
4.All dl services should get access to other data using extended OAI protocol.
5. Dls should be constructed as networks of extended open archives
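
To get a feel for what "access through the OAI interface" looks like, here is a sketch that builds request URLs, assuming the OAI protocol in the reading is the OAI Protocol for Metadata Harvesting (OAI-PMH) with its six request verbs. The base URL `http://archive.example.org/oai` is hypothetical, and the function name is my own.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for an open archive.

    base_url is whatever address the archive publishes for its OAI
    interface; the one used below is a made-up example.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Ask a (hypothetical) archive for all its records in simple Dublin Core.
url = oai_request_url(
    "http://archive.example.org/oai", "ListRecords", metadataPrefix="oai_dc"
)
print(url)
```

Every service talks to every archive through the same small set of requests, which seems to be what makes the components interchangeable.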

Digital Libraries and the Problem of Purpose:

-Problem facing public libraries? How will the Internet affect the accepted library purpose? Will they fashion themselves into portals for using the Internet?

-Problem facing academic libraries? How can they perform their traditional functions when faced with increases in prices for materials? Perhaps work with scholarly societies to create their own journals.

Purpose issues with DLs
1.The idea of an all-digital world will probably not come to pass. By subscribing to this idea, creators of DLs are missing out on opportunities to integrate heterogeneous collections into DLs.
2.More information is not always better. So the push to continually put more content on DLs does not improve them and in some cases makes them less functional.
3.The DL agenda has been largely set by the computer science community. DLs need input from social scientists and the traditional library community.

The Internet and the World Wide Web


-The Internet is an interconnected group of independently managed networks. Each network supports the technology for inter-connection

Local Area Networks - created to link computers within a department or organization
Wide Area Networks - National Networks

IP - Internet Protocol. Joins together the separate network segments that constitute the Internet. Assigns a unique (IP) address to every computer on the Internet.
TCP - Transmission Control Protocol. Takes a message, divides it into packets, labels each packet with a destination IP address and a sequence number, and sends it out on the network. The receiving computer reassembles the message, sends it to the application program, and acknowledges that the message has been received.

-Not all packets are received successfully. Overloaded routers drop (ignore) some packets, so the sending computer never gets an acknowledgement that the packet was received and, after a time-out, sends the packet again.
Dropping-a-packet - overloaded router
Time-out - resending packet
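
The split, number, drop, time-out, and resend cycle above can be sketched as a toy simulation. This is a heavy simplification (one packet at a time, no real network, no real timers), and every name in it, including the `lost_once` parameter and the sample IP address, is my own invention for illustration.

```python
def split_into_packets(message, dest_ip, size):
    """Divide a message into packets labeled with destination and sequence number."""
    packets = []
    for seq, start in enumerate(range(0, len(message), size)):
        packets.append(
            {"dest": dest_ip, "seq": seq, "data": message[start:start + size]}
        )
    return packets


def send_reliably(packets, lost_once=frozenset()):
    """Toy sender: packets whose seq is in lost_once are dropped on the
    first attempt, so no acknowledgement arrives and they are resent."""
    received = {}
    transmissions = 0
    for packet in packets:
        attempt = 0
        while True:
            attempt += 1
            transmissions += 1
            dropped = packet["seq"] in lost_once and attempt == 1
            if dropped:
                continue  # no acknowledgement: time out and resend
            received[packet["seq"]] = packet["data"]  # arrival is acknowledged
            break
    # The receiver reassembles the message in sequence-number order.
    message = "".join(received[seq] for seq in sorted(received))
    return message, transmissions


original = "hello digital library"
packets = split_into_packets(original, "192.0.2.7", 5)
rebuilt, sent = send_reliably(packets, lost_once={1, 3})
print(len(packets), sent, rebuilt)
```

Five packets with two of them dropped once means seven transmissions in total, and the application still sees the complete message.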

UDP - the sending computer sends out a sequence of packets, hoping they all arrive. UDP does its best but makes no guarantee that all packets will arrive. Think streaming audio

Domain Names - links multiple IP addresses under one domain name.

TCP/IP suite - a group of programs
-Terminal Emulation - telnet is a program that allows personal computers to emulate a terminal that relies on a remote computer for processing. Typically used for system administration.
-File Transfer - FTP is the basic protocol for moving files from one computer to another across the Internet.
-Email - uses the Simple Mail Transfer Protocol (SMTP)

World Wide Web - a linked collection of information on many computers around the world. It provides a convenient way to distribute information over the Internet. Individuals can publish information and users can access that information by themselves without training.

URL - uniform resource locator. Provides a simple/flexible addressing mechanism that allows the web to link info on computers all over the world. Three parts
1. http is the name of the protocol
2. www.blah.com is the domain name
3. andrew.html is the file on that computer.
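
The three parts listed above can be pulled apart with Python's standard `urllib.parse`. The full URL below is simply the three example parts from my notes put together.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.blah.com/andrew.html")

protocol = parts.scheme     # 1. the name of the protocol
domain_name = parts.netloc  # 2. the domain name
file_path = parts.path      # 3. the file on that computer

print(protocol, domain_name, file_path)
```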

http is the protocol that is used to send messages from web browsers to web servers

MIME types - specifies the data type of a file being sent across the Internet
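
A short sketch of how software can pick a MIME type from a filename, using Python's standard `mimetypes` module. Guessing from the file extension is only one approach and is my assumption here; the filename `photo.jpg` is made up.

```python
import mimetypes

# guess_type returns (MIME type, encoding); only the type matters here.
html_type, _ = mimetypes.guess_type("andrew.html")
jpeg_type, _ = mimetypes.guess_type("photo.jpg")

print(html_type, jpeg_type)
```
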
Reading question
I was hoping we could spend some time discussing OAI. I understand the need for standardization as a means of increasing interoperability among different DLs, but the specifics of the OAI, such as the components, are confusing to me.