Muddiest Point - why have digital materials in a library that you don't own the copyright to? I would assume these are born-digital materials, but if not, why incur the expense of digitizing materials when access to them is going to be restricted?
Reading notes - for several reasons, such as copyright and patient privacy (medical records), user access to digital libraries has to be restricted. Different users can only be permitted access to certain records contained in a digital library. There are several ways to accomplish this: one is password and ID verification; another is through the user's IP address. But it is important to keep in mind that access management will have adverse effects on the user interface.
The central concept of the general model framework is that access is controlled through the creation of policies. These policies assign each user a set of digital materials they have permission to access while denying access to the rest.
Materials can also be assigned a level of risk and categorized that way, so the level of access can be determined through that risk assessment. Users and their roles must also be analyzed when implementing access management.
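The policy idea above - users get roles, materials get risk levels, and a policy decides who may see what - can be sketched in a few lines. This is a minimal illustration only; the role names, risk levels, and the `Policy`/`can_access` names are my own assumptions, not taken from the reading.

```python
# Risk levels assigned to materials; a higher number means more restricted.
RISK_LEVELS = {"public": 0, "restricted": 1, "confidential": 2}

class Policy:
    """Maps a user role to the maximum risk level it may access."""
    def __init__(self, role, max_risk):
        self.role = role
        self.max_risk = RISK_LEVELS[max_risk]

    def permits(self, item_risk):
        return RISK_LEVELS[item_risk] <= self.max_risk

# Hypothetical roles for a digital library with restricted records.
POLICIES = {
    "patron": Policy("patron", "public"),
    "staff": Policy("staff", "restricted"),
    "archivist": Policy("archivist", "confidential"),
}

def can_access(role, item_risk):
    """Policy check: does this role have permission for this material?"""
    policy = POLICIES.get(role)
    return policy is not None and policy.permits(item_risk)
```

A patron would see only public materials, while an archivist clears every level - which is exactly the trade-off the reading notes flag: the more such checks sit between the user and the material, the more the interface is affected.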
Friday, December 4, 2009
Thursday, November 19, 2009
Reading notes
no muddiest point.
The concept of creating separate digital libraries, from an institutional standpoint, makes a good deal of sense. But while a good deal of attention has been focused on creating interoperability between these distinct digital libraries, some feel that, despite these efforts, the approach is not in users' best interest. That has led people to consider the World Wide Web as a model for the future of digital libraries. Google Books represents this model to a degree: its goal seems to be to become the digital library for digital books. Users have embraced and become accustomed to this structure because of their web experiences. They do not care who has created or digitized the material/information they are looking for, only that the information is easily retrieved. So the question is: do we move toward creating the digital library as opposed to organizational digital libraries?
Thursday, November 5, 2009
Reading notes digital preservation
Muddiest Point: none this week
Main requirements of OAIS
1. provide long-term persistence of digital information
2. ensure access to that information
3. negotiate for and accept appropriate information from information producers.
4. determine the scope of the archive's user community
5. ensure that the preserved information can be understood by users without the assistance of the information producer.
6. make the preserved information available to the user community.
An OAIS must retain sufficient intellectual property rights, along with custody of the information, in order to guarantee preservation of those materials. It involves three distinct parts: the producers of the information, the managers of the information, and the consumers of the information. Management provides strategic planning, defines the scope of the collections, and ensures preservation of materials. Producers submit the information to be preserved, along with associated metadata, for ingest. Consumers are the users of the information.
The OAIS functional model is a collection of six high-level services, or functional components, that taken together fulfill the OAIS's dual function of providing access and preservation. Those components are:
1. ingest - the set of processes responsible for accepting information submitted by producers.
2. archival storage - part of the system that handles long-term storage and maintenance of ingested information.
3. data management - maintains a database of descriptive metadata identifying and describing the archived information in support of the OAIS's finding aids.
4. Preservation planning - responsible for mapping out the OAIS's preservation strategy.
5. Access - manages the processes and services by which consumers locate, request, and receive delivery of items residing in the OAIS.
6. Administration - day-to-day management and coordination of the previous five elements.
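The six components above can be caricatured as a tiny in-memory archive: Ingest accepts a submission, Archival Storage holds the content, Data Management indexes the descriptive metadata, and Access finds and delivers objects. This is an illustrative sketch of the division of labor only, not the OAIS specification; the class and method names are my own.

```python
class Archive:
    def __init__(self):
        self.storage = {}   # archival storage: object id -> content
        self.catalog = {}   # data management: object id -> descriptive metadata

    def ingest(self, object_id, content, metadata):
        """Accept a submission from a producer (Ingest)."""
        self.storage[object_id] = content
        self.catalog[object_id] = metadata

    def find(self, keyword):
        """Finding aid over descriptive metadata (Data Management + Access)."""
        return [oid for oid, md in self.catalog.items()
                if keyword.lower() in md.get("title", "").lower()]

    def retrieve(self, object_id):
        """Deliver a preserved object to a consumer (Access)."""
        return self.storage[object_id]

# A producer submits an object; a consumer later finds and retrieves it.
archive = Archive()
archive.ingest("obj-1", b"...scanned pages...", {"title": "Annual Report 1912"})
```

Preservation Planning and Administration have no code here because they are policy and coordination functions rather than data paths, which is the point of separating them in the model.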
There are many research challenges associated with long-term preservation in the digital realm, where traditional preservation practices seem insufficient. Reasons a paradigm shift in digital preservation is needed include:
1. Traditional digital preservation tools can no longer keep pace with the complexity of dynamic, multimedia digital objects.
2. If long-term preservation is going to span decades, the threat of interrupted management of digital objects is critical.
3. There are no formal models for dealing with the economic, social, and technical aspects of preserving digital materials over time.
4. New tools and technologies are needed to streamline many of the processes associated with digital preservation and that support human decision-making.
5. Infrastructure needs to be created so that digital preservation becomes sustainable and effective.
Sunday, October 25, 2009
Reading Notes Retrieval:
Muddiest Point - How common is collaborative filtering in DLs?
How common is it for DL users to use Internet search engines to find digital content?
Can you talk more about web crawling technology? Specifically why is so much of the deep web academic in nature?
Reading question - How can DLs be structured to be accessed easily by Internet search engines?
Federated Searching:
- average users seeking information lack sophisticated search techniques; they don't want to search, "they want to find."
-the success of Google demonstrates what type of searching the average information seeker wants to use.
-the universe of available content is no longer limited to that stored within library walls; the type of content users are looking for is less commonly cataloged than it was in the past.
-"We shouldn't force users to predetermine the information source as a precondition to asking their question."
-Google proves that the best way to access information is often the simplest; more complex ways of accessing information block users from materials stored within that system.
-"Not all federated search engines can search all databases, although most can search Z39.50 and free databases. But many vendors that claim to offer federated search engines cannot currently search all licensed databases for both walk-up and remote users."
-"A federated search engine searches databases that update and change an average of 2 to 3 times per year. This means that a system accessing 100 databases is subject to between 200 and 300 updates per year—almost one per day! Subscribing to a federated searching service instead of installing software eliminates the need for libraries to update translators almost daily so they can avoid disruptions in service."
Z39.50 - "Information Retrieval (Z39.50): Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-1995" - a protocol which specifies data structures and interchange rules that allow a client machine (called an "origin" in the standard) to search databases on a server machine (called a "target" in the standard) and retrieve records that are identified as a result of such a search.
-"Z39.50 is one of the few examples we have to date of a protocol that actually goes beyond codifying mechanism and moves into the area of standardizing shared semantic knowledge. The extent to which this should be a goal of the protocol has been an ongoing source of controversy and tension within the developer community, and differing views on this issue can be seen both in the standard itself and the way that it is used in practice."
-Recent versions of the standard are highly extensible, and the consensus process of standards development has made it hospitable to an ever-growing set of new communities and requirements.
-The OSI, or Open System Interconnection, model defines a networking framework for implementing protocols in seven layers. Control is passed from one layer to the next, starting at the application layer in one station, proceeding to the bottom layer, over the channel to the next station and back up the hierarchy.
-The protocol defines interactions between two machines only
-The basic architectural model that Z39.50 uses is as follows: A server houses one or more databases containing records. Associated with each database are a set of access points (indices) that can be used for searching. This is a much more abstract view of a database than one finds with SQL, for example. Relatively arbitrary server-specific decisions about how to segment logical data into relations and how to name the columns in the relations are hidden; one deals only with logical entities based on the kind of information that is stored in the database, not the details of specific database implementations
-A search produces a set of records, called a "result set", that are maintained on the server; the result of a search is a report of the number of records comprising the result set. The standard is silent as to whether the result set is materialized or maintained as a set of record pointers, and as to how the result set may interact with database updates that may be taking place at the server. Result sets can be combined or further restricted by subsequent searches
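The architectural model described above - a server housing databases with named access points, where a search leaves a named result set on the server and the client learns only the hit count - can be mocked up as plain classes. This is a toy model of the abstraction, not the wire protocol; all class, field, and record names here are my own assumptions.

```python
class Database:
    def __init__(self, records, access_points):
        self.records = records              # records as simple dicts
        self.access_points = access_points  # the only searchable fields

    def search(self, field, term):
        if field not in self.access_points:
            raise ValueError(f"{field!r} is not an access point")
        return [r for r in self.records
                if term.lower() in r.get(field, "").lower()]

class Server:
    def __init__(self):
        self.databases = {}
        self.result_sets = {}  # result sets are maintained on the server

    def search(self, db_name, field, term, set_name):
        hits = self.databases[db_name].search(field, term)
        self.result_sets[set_name] = hits
        return len(hits)       # the origin only gets a hit count back

    def present(self, set_name, start, count):
        """Fetch records from a previously created result set."""
        return self.result_sets[set_name][start:start + count]

# An origin searches a target database and later presents from the result set.
server = Server()
server.databases["books"] = Database(
    [{"title": "Digital Libraries", "author": "Arms"},
     {"title": "Modern Archives", "author": "Schellenberg"}],
    {"title", "author"},
)
hit_count = server.search("books", "title", "digital", "rs1")
```

Note how the client never sees how records are stored, only the logical access points - the abstraction the reading contrasts with SQL - and how a later `present` call or a follow-up search can reuse the server-side result set.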
Search Engine Technology:
-How should libraries see the future of their information discovery services? Instead of a highly fragmented landscape that forces users to visit multiple, distributed servers, libraries will provide a search index, which forms a virtual resource of unprecedented comprehensiveness to any type and format of academically relevant content
-provide metadata-based subject gateways to distributed content. Based on the OAI initiative, libraries and library service organisations are following the idea of "OAI Registries" as central points of access to worldwide distributed OAI repositories
-First of all, this is an acknowledgement that, particularly at universities, libraries deal with a range of users with often different usage behaviours
-Most systems focus solely on the search of metadata (bibliographic fields, keywords, abstracts). The cross-search of full text has only recently been introduced and is often restricted to a very limited range of data formats (primarily "html" and "txt").
Saturday, October 10, 2009
XML reading notes
Reading Question - How prevalent have XML schemas become, and will they replace traditional DTDs?
muddiest point - As we contemplate moving toward a standard metadata schema for all groups, will that require something like Dublin Core to continue adding elements, and will that make it far less simple if it were to grow to include groups such as archives? Does interoperability matter as much with EAD, since it is the accepted standard within the archival community and so all sharing of metadata will be done in EAD?
Reading Notes:
XML is designed to make it easier to interchange structured documents over the Internet. Defines how structured URLs can be used to identify components of XML data streams.
XML elements ensure that document creators put information in its appropriate place; the Document Type Definition defines what that place is.
Allows users to:
-bring multiple files together to form compound documents.
-identify where illustrations are to be incorporated into text files and the format used to encode each illustration.
-provide processing control information to supporting programs such as doc. validators and browsers.
-add editorial comments to a file.
Core XML technologies:
-XML 1.0, built on Unicode, defines strict rules for the text format as well as the DTD validation language
-XML is a simplification of SGML and includes adjustments that make it better suited to the web environment.
XML Catalogs - defines a format for instructions on how an XML processor resolves XML entity identifiers into actual documents
URIs - Uniform Resource Identifiers; a generalization of URLs
XML Namespaces - provides a mechanism for universal naming of elements and attributes in XML
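The universal-naming idea behind XML Namespaces can be seen with Python's standard library: a prefixed element name expands to its full namespace URI, so two vocabularies can use the same local name without collision. The namespace URI and element names below are made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<lib:record xmlns:lib="http://example.org/library">
  <lib:title>Digital Libraries</lib:title>
</lib:record>"""

root = ET.fromstring(doc)

# ElementTree rewrites each prefixed name as {namespace-uri}localname,
# which is the element's universal name.
print(root.tag)  # {http://example.org/library}record

# Queries use a prefix-to-URI mapping rather than the document's prefix.
ns = {"lib": "http://example.org/library"}
title = root.find("lib:title", ns)
print(title.text)  # Digital Libraries
```

The prefix itself (`lib:`) is arbitrary; only the URI it maps to matters, which is why a different document could use a different prefix for the same vocabulary.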
XML Schema:
-defines elements that can be in a doc.
-what attributes can be in a doc.
-which elements are child elements
-the order of child elements
-the # of child elements
-whether an element is empty or can contain text
-defines data types for elements and attributes
-defines default and fixed values for elements and attributes
XML Schemas are the successor to DTDs because:
-They are extensible to future additions
-richer and more powerful than DTDs
-Schemas are written in XML
-support data types
-support namespaces
Schema support data types:
-easier to describe allowable document content, verify the correctness of data, work with data from a database, and define restrictions on data. Also easier to define data formats and convert data between different formats.
-even well-formed XML documents can still contain errors, but most of these will be found by validation against an XML Schema.
Simple Element - an XML element that can contain only text, with no other elements or attributes allowed.
attributes - simple types such as string, decimal, integer, boolean, date, time
restrictions - used to define acceptable values for XML elements or attributes.
Complex Elements:
-empty elements
-elements that contain only other elements
-elements that contain only text
-elements that contain both elements and text
Indicators:
order indicators - used to define the order of elements
-all - child elements can appear in any order, each at most once
-choice - specifies that one child element or another can occur
-sequence - child elements must occur in a specific order
Occurrence Indicators:
maxOccurs
minOccurs
Group Indicators - related elements are defined together with a group declaration
-group name
-attributeGroup name
Any Element - allows an XML document to contain elements not declared in the schema
String data type - used for values that contain character strings; can contain characters, line feeds, carriage returns, and tab characters
Misc. data types - boolean, base64Binary, etc.
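The occurrence indicators (minOccurs/maxOccurs) boil down to counting a child element's appearances and checking them against declared bounds. The sketch below illustrates just that semantics from scratch, with made-up element names; it is not a schema validator and the `occurs_ok` helper is my own invention.

```python
import xml.etree.ElementTree as ET

def occurs_ok(parent, child_tag, min_occurs=1, max_occurs=1):
    """Check a child element's occurrence count against declared bounds,
    the way minOccurs/maxOccurs constrain it in an XML Schema."""
    n = len(parent.findall(child_tag))
    return min_occurs <= n <= max_occurs

order = ET.fromstring("<order><item/><item/><note/></order>")

# Two <item> children fit a declaration of minOccurs=1, maxOccurs=5 ...
print(occurs_ok(order, "item", min_occurs=1, max_occurs=5))  # True
# ... but violate the default bounds of exactly one occurrence.
print(occurs_ok(order, "item"))  # False
```

In a real schema these bounds sit on the element declaration and the validator does this counting for you; the default for both indicators is 1.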
Friday, September 25, 2009
Metadata in digital libraries
Reading question - how interoperable are the different metadata schemas?
-The term metadata is less commonly used among creators and consumers of networked digital content. Using web page tags, folksonomies, and social bookmarks are growing practices.
Metadata should reflect three things:
1. Content - what the object contains, or what is intrinsic to an information object
2. Context - indicates the who, what, why, where, and how aspects associated with an object's creation and is extrinsic to an information object
3. Structure - relates to the formal set of associations within or among individual information objects and can be both intrinsic and extrinsic.
Data Structure Standards - Categories or containers of data that make up a record or information object. (MARC, EAD)
Data Value Standards - Terms, names, and other values that are used to populate data structure standards or metadata elements. (LOC Subject headings)
Data Content Standards - guidelines for the format and syntax of the data values that are used to populate metadata elements (DACS)
Data Format/Technical Interchange Standards - often a manifestation of a particular data structure standard, encoded or marked up for machine processing. (XML)
- information communities are aware that the more highly structured an information object is, the more that structure can be exploited for searching, manipulation, and interrelating with other information objects. This can only occur with strict adherence to metadata standards.
-certifies the authenticity and degree of completeness of the content.
- establishes and documents the context of the content
- identifies and exploits the structural relationship that exists within and between information objects.
- provides a range of intellectual access points for an increasingly diverse group of users.
- provides some of the info that an info professional would have provided in a reference scenario
Different types of metadata
Administrative - used in managing and administering collections and information resources
Descriptive - used to identify and describe collections and related information resources
Preservation - preservation management of collections and information resources. Documentation of physical condition of resources.
Technical - how a system functions
Use - level and type of use of collections and information resources
Attributes and characteristics of metadata
source of metadata - internal metadata is generated by the creating agent when the item is digitized or born digital; external metadata is created by someone who is not the creator.
method of creation - automatically generated by the computer or manually by humans
nature of metadata - non-expert vs. expert creation
status - static metadata never changes; dynamic metadata changes with use, manipulation, or preservation
Semantics - controlled metadata vs. uncontrolled metadata
- metadata creation has become a complex combination of manual and automatic processes
Primary functions of metadata
-creation, multiversioning, reuse, and recontextualization of information objects
-organization and description
-validation - users scrutinize metadata to assure authenticity and authoritativeness
-searching and retrieval
-utilization and preservation - metadata related to user annotations, rights tracking and version control
-disposition - accessioning and deaccessioning
Bibliographic entities
documents, works, editions, authors, titles and subjects
MARC
-governed by AACR2R
-stored as a collection of tagged fields in a fairly complex format; also used to represent authority records, which are standardized forms that are part of a controlled vocabulary.
Dublin Core
-designed for nonspecific use
-simple/flexible has only 15 elements compared to hundreds in MARC
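Dublin Core's simplicity can be made concrete: the 15-element set is real, but the flat dict representation below and the sample values in it are just one convenient, hypothetical way to show it.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher", "contributor",
    "date", "type", "format", "identifier", "source", "language",
    "relation", "coverage", "rights",
}

# A partial record with made-up values. Every DC element is optional and
# repeatable, so even this sparse record is a valid description.
dc_record = {
    "title": "Reading Notes on Metadata",
    "creator": "Anonymous student",
    "date": "2009-09-25",
    "type": "Text",
    "format": "text/html",
    "language": "en",
    "subject": "digital libraries; metadata",
}

assert set(dc_record) <= DC_ELEMENTS  # uses only the 15 DC elements
```

Contrast this with MARC, where describing the same item means choosing among hundreds of numbered fields and subfields; that trade-off between simplicity and expressiveness is the point of the comparison above.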
BibTeX - used with LaTeX documents (common for mathematical notation); manages bibliographic data and references within documents. Similar to EndNote?
Refer - similar to BibTeX
-The term metadata is less commonly used among creators and consumers of networked digital content. Using web page tabs, folksonomies, and social bookmarks are growing practices.
Metadata should reflect three thing
1. Content - what the object contains or what is intrinsic to an information object
2. Context - indicates the who, what, why, where and how aspects associated with an objects creation and is extrinsic to an information object
3. Structure - relates to the formal set of associations with or among individual information objects and can be both intrinsic and extrensic.
Data Structure Standards - Catagories or containers of data that make up a record or information object. (MARC, EAD)
Data Value Standards - Terms, names, and other values that are used to populate data structure standards or metadata elements. (LOC Subject headings)
Data Content Standards - guidlines for the format and syntax of the data values that are used to populate metadata elements (DACS)
Data format/ technical interchange - type of standard is often a manifestation of a particular data structure standard, encoded or marked up for machine processing. (XML)
- information communities are aware that the more highly structured on information object is, the more that structure can be exploited for searching, manipulation and interrelating with other information objects. This can only occur with strict adherence to metadata standards.
-certifies the authenticity and degree of completeness of the content.
- est. and documents the context of the content
- identifies and exploits the structural relationship that exists within and between information objects.
- provides a range of intellectual access points for an increasingly diverse group of users.
- provides some of the info that an info professional would have provided in a reference scenario
Different types of metadata
Administrative - used in managing and administering collections and information resources
Descriptive - used to identify and describe collections and related information resources
Preservation - preservation management of collections and information resources. Documentation of physical condition of resources.
Technical - how a system functions
Use - level and type of use of collections and information resources
Attributes and characteristics of metadata
source of metadata - internal metadata is generated by the creating agent when the item is digitized or born digital; external metadata is created by someone who is not the creator.
method of creation - automatically generated by the computer or manually by humans
nature of metadata - non-expert vs. expert creation
status - static metadata never changes; dynamic metadata changes with use, manipulation, or preservation
Semantics - controlled metadata vs. uncontrolled metadata
- metadata creation has become a complex combination of manual and automatic processes
Primary functions of metadata
-creation, multiversioning, reuse, and recontextualization of information objects
-organization and description
-validation - users scrutinize metadata to assure authenticity and authoritativeness
-searching and retrieval
-utilization and preservation - metadata related to user annotations, rights tracking and version control
-disposition - accessioning and deaccessioning
Bibliographic entities
documents, works, editions, authors, titles and subjects
MARC
-governed by AACR2R
-stored as a collection of tagged fields in a fairly complex format; also used to represent authority records, which are standardized forms that are part of a controlled vocabulary.
Dublin Core
-designed for nonspecific use
-simple/flexible has only 15 elements compared to hundreds in MARC
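Dublin Core's simplicity makes it easy to sketch: a record is just a set of element/value pairs drawn from its 15 elements. A minimal serialization in Python (the record values are made-up examples, and only 4 of the 15 elements appear):

```python
# A hypothetical Dublin Core record serialized as XML with Python's
# standard library; the values are invented examples.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # the standard dc namespace
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
for element, value in [
    ("title", "Digital Libraries"),    # name given to the resource
    ("creator", "William Y. Arms"),    # entity responsible for the content
    ("date", "2000"),                  # a date in the resource's lifecycle
    ("format", "text/html"),           # media type of the resource
]:
    child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
    child.text = value

xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

Because the element set is so small and general, the same sketch works for books, images, or web pages; MARC would need a far richer (and stricter) record.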
BibTeX - used with LaTeX documents for mathematical notation; manages bibliographic data and references within documents. EndNote?
Refer - similar to BibTeX
Thursday, September 24, 2009
Thursday, September 17, 2009
Reading notes week 3
Reading question: How prevalent are identifiers used in place of URLs for digital objects in DLs? How prevalent are they outside DLs on sites similar to Flickr?
-Bibliographic utility identifier numbers, such as OCLC and RLIN numbers, are used in duplicate detection and consolidation in the construction of online union catalog databases.
-"The assignment of identifiers to works is a very powerful act; it states that, within a given intellectual framework, two instances of a work that have been assigned the same identifier are the same, while two instances of a work with different identifiers are distinct."
- URLs serve as the key links between physical artifacts and content on the Web, as well as providing linkage between objects within the Web.
- URLs are not really names, merely instructions on how to access an object. URLs were never intended to be long lasting names for content; they were designed to be flexible, easily implemented and easily extensible ways to make reference to materials on the Net.
URN - Uniform Resource Name. The syntax of a URN for a digital object is defined as consisting of a naming authority identifier and an object identifier which is assigned by that naming authority to the object in question; the specific content of the identifier may have structure and significance to users familiar with the practices of a given naming authority, but has no predefined meaning within the overall URN framework.
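The naming-authority/object-identifier split described above can be shown with a small parser (the URN used here is a made-up example):

```python
# Minimal URN parser following the syntax above: "urn:" scheme, then a
# naming authority identifier, then the object identifier that authority
# assigned. The example URN is hypothetical.
def parse_urn(urn: str):
    scheme, authority, object_id = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError(f"not a URN: {urn}")
    return authority, object_id

print(parse_urn("urn:example-library:doc/1234"))
# ('example-library', 'doc/1234')
```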
- Could you talk in class about the function of "resolvers" within the URN framework?
-browsers do not understand URNs
-PURL server creates a database entry linking this hostname and filename to the identifier that will appear in the PURL. When the PURL server is contacted because someone is resolving a PURL, it looks up the identifier in its database, finds out where the object in question currently resides, and uses the redirect feature of the HTTP protocol to connect the requester to the host housing the object.
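That lookup-then-redirect behavior can be sketched as a table plus an HTTP 302 response (the PURL path and target host below are hypothetical):

```python
# Toy model of a PURL server: the database maps a persistent identifier to
# the object's current location, and resolution answers with an HTTP
# redirect (status 302). The entries are hypothetical.
purl_table = {
    "/net/example/report42": "http://host-a.example.org/report42.pdf",
}

def resolve_purl(path: str):
    target = purl_table.get(path)
    if target is None:
        return 404, None      # unknown PURL: not found
    return 302, target        # redirect the requester to the current host

status, location = resolve_purl("/net/example/report42")
print(status, location)
```

If the object moves, only the table entry changes; the PURL the outside world cites stays stable.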
-SICI - Serial Item and Contribution Identifier; can be used to identify a specific issue of a serial, or a specific contribution within an issue (such as an article or table of contents)
-BICI - Book Item and Contribution Identifier; can be used to identify specific volumes within a multivolume work, or components such as chapters within a book
Digital Object Identifier - fits roughly within the URN framework and provides a mechanism for implementing naming systems for arbitrary digital objects.
-DOI provides a method for collecting revenue for access to material that is described by a DOI, if the organization that owns the rights chooses to charge for it. DOIs in and of themselves are only identifiers and do not imply that any sort of copyright enforcement mechanism will be bundled with the objects that they describe; the presence or absence of such copyright enforcement technologies is an entirely separate issue.
Digital Object Identifier System (Paskin)
Identifier is
- a string, typically a number or name denoting a specific entity. Think ISBN
- A specification, which prescribes how such strings are constructed.
- a scheme, which implements the specification. Typically such schemes provide a managed registry of the identifiers within their control, in order to offer a related service.
Uniqueness - is the requirement that one string denotes one and only one entity (the "referent").
Resolution - is the process in which an identifier is the input to a service to receive in return a specific output of one or more pieces of current information related to the identified entity.
Persistence - is the requirement that, once assigned, an identifier denotes the same referent indefinitely.
URLs do not refer to the identity of an entity but its location on a network.
The DOI system is such a managed system for persistent identification of content on digital networks, using a federation of registries following a common specification. Information such as where to find an object may change over time (the URL?), but its DOI will not change. It brings together:
- a syntax specification, defining the construction of a string
- a resolution component, providing the mechanism to resolve the DOI name to data specified by the registrant
- a metadata component, defining an extensible model for associating descriptive and other elements of data with the DOI name
- a social infrastructure, defining the full implementation through policies and shared technical infrastructure in a federation of registration agencies
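As a sketch of how the resolution component works in practice: a DOI name is a "prefix/suffix" string, and handing it to a resolver service such as doi.org returns a redirect to the object's current location. The DOI below is a made-up example.

```python
# Build the resolver URL for a DOI name ("prefix/suffix", with the prefix
# starting in "10."). The DOI here is hypothetical; https://doi.org/ is a
# public DOI resolver.
def doi_resolver_url(doi: str, resolver: str = "https://doi.org/") -> str:
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi}")
    return resolver + doi

print(doi_resolver_url("10.1000/example.123"))
# https://doi.org/10.1000/example.123
```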
Arms Chapter 9
Methods for storing textual materials must represent two different aspects of a document: its structure and its appearance. The structure describes the division of a text into elements such as characters, words, paragraphs and headings. It identifies parts of the documents that are emphasized, material placed in tables or footnotes, and everything that relates one part to another. The structure of text stored in computers is often represented by a mark-up specification. In recent years, SGML (Standard Generalized Markup Language) has become widely accepted as a generalized system for structural mark-up.
The appearance is how the document looks when displayed on a screen or printed on paper. The appearance is closely related to the choice of format: the size of font, margins and line spacing, how headings are represented, the location of figures, and the display of mathematics or other specialized notation. In a printed book, decisions about the appearance extend to the choice of paper and the type of binding. Page-description languages are used to store and render documents in a way that precisely describe their appearance. This chapter looks at three, rather different, approaches to page description: TeX, PostScript, and PDF.
style sheet - describes how each structural element is to appear, with comprehensive rules for every situation that can arise.
- Mark-up languages can represent almost all structures, but the variety of structural elements that can be part of a document is huge, and the details of appearance that authors and designers could choose are equally varied
OCR - Optical character recognition is the technique of converting scanned images of characters to their equivalent characters. The basic technique is for a computer program to separate out the individual characters, and then to compare each character to mathematical templates
- Computers store a character, such as "A" or "5", as a sequence of bits, in which each distinct character is encoded as a different sequence
- Since it is impossible to represent all languages using the 256 possibilities represented by an eight-bit byte, there have been several attempts to represent a greater range of character sets using a larger number of bits. Recently, one of these approaches has emerged as the standard that most computer manufacturers and software houses are supporting. It is called Unicode.
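The byte/code-point distinction can be seen directly in Python:

```python
# Each character has a Unicode code point; an encoding such as UTF-8 maps
# code points to one or more bytes. An 8-bit character set like Latin-1
# cannot represent characters outside its 256 codes.
print(ord("A"))                 # 65, the code point for "A"
print("A".encode("utf-8"))      # b'A': one byte in UTF-8
print("é".encode("utf-8"))      # two bytes in UTF-8
print("文".encode("utf-8"))     # three bytes in UTF-8
try:
    "文".encode("latin-1")
except UnicodeEncodeError:
    print("no 8-bit Latin-1 code for this character")
```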
- SGML is a system to define mark-up specifications. An individual specification defined within the SGML framework is called a document type definition.
- SGML is firmly established as a flexible approach for recording and storing high-quality texts. Its flexibility permits creators of textual materials to generate DTDs that are tailored to their particular needs.
- HTML is considered a simplified DTD
- XML is designed to bridge the gap between HTML and the full power of SGML
- Every time a new feature is added to HTML it becomes less elegant, harder to use, and less of a standard shared by all browsers. SGML is the opposite. It is so flexible that almost any text description is possible, but the flexibility comes at the cost of complexity. Even after many years, only a few specialists are really comfortable with SGML and general-purpose software is still scarce.
- Since XML is a subset of SGML, every document is based on a DTD, but the DTD does not have to be specified explicitly. If the file contains previously undefined pairs of tags, which delimit some section of a document, the parser automatically adds them to the DTD.
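A tiny document makes the structural-markup idea concrete: tags delimit elements, and an XML parser recovers the tree without an explicit DTD (the tag names here are invented for illustration):

```python
# Parse a minimal structurally marked-up document; the tags delimit
# elements, and the parser exposes the resulting tree.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<article>"
    "<heading>Storing text</heading>"
    "<paragraph>Structure and appearance are separate concerns.</paragraph>"
    "</article>"
)
print(doc.tag)                          # article
print([child.tag for child in doc])     # ['heading', 'paragraph']
```

Note that nothing here says how a heading should look; appearance would come from a separate style sheet or page-description language.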
Thursday, September 10, 2009
Reading Notes for week two
Framework for Building a Digital Library:
- DLs are expected to remain stable, but the computer science field has to make sure they stay stable despite rapid advancements in Internet technology
- Systems are expected to be interoperable with other DLs
-Existing systems classified as DLs have resulted from custom-built software development projects. They are built in isolation to suit the needs of a specific community. Most DLs are quick responses to urgent needs by a community of users. As DL systems get more complex, extensibility becomes more difficult and maintainability is compromised. There are few software toolkits available to build DLs.
The solution to this problem is the creation of software toolkits. The existing toolkits have two main problems:
1. The range of possible workflows is restricted by the design of the system
2. The software is either built as a monolithic system or as components that communicate using non-standard protocols
In 1999 the OAI was launched in an attempt to address issues of interoperability among DLs. The resulting protocol is simple and popular. In the OAI, DLs are modeled as networks of extended open archives, with each extended OA being a source of data and/or a provider of services. Componentization and standardization are built into the system. This closely resembles the way physical libraries work.
How OAI can provide higher-level DL services:
1. All DL services should be encapsulated within components that are extensions of open archives (I am not sure what this means)
2. All access to the DL services should be through their OAI interfaces
3. The semantics of the OAI protocol should be extended or overloaded as allowed by the OAI protocol, but without contradicting the essential meaning
4. All DL services should get access to other data using the extended OAI protocol
5. DLs should be constructed as networks of extended open archives
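Part of why the protocol is simple and popular: an OAI-PMH request is just an HTTP GET with a "verb" parameter naming the operation. A request-building sketch (the repository base URL is hypothetical):

```python
# Build an OAI-PMH request URL. The protocol uses HTTP GET with a "verb"
# parameter; ListRecords with metadataPrefix=oai_dc asks for Dublin Core
# records. The base URL is a made-up example.
from urllib.parse import urlencode

def oai_request(base_url: str, verb: str, **params) -> str:
    return f"{base_url}?{urlencode({'verb': verb, **params})}"

url = oai_request("http://repository.example.edu/oai",
                  "ListRecords", metadataPrefix="oai_dc")
print(url)
# http://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc
```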
Digital Libraries and the Problem of Purpose:
-Problem facing public libraries? How will the Internet affect the accepted library purpose? Will they fashion themselves into portals for using the Internet?
-Problem facing academic libraries? How can they perform their traditional functions when faced with increases in prices for materials? Perhaps work with scholarly societies to create their own journals.
Purpose issues with DLs
1. The idea of an all-digital world will probably not come to pass. By subscribing to this idea, creators of DLs are missing out on opportunities to integrate heterogeneous collections into DLs.
2. More information is not always better, so the push to continually put more content in DLs does not improve them and in some cases makes them less functional.
3. The DL agenda has been largely set by the computer science community. DLs need input from social scientists and the traditional library community.
The Internet and the World Wide Web
-The Internet is an interconnected group of independently managed networks. Each network supports the technology for inter-connection
Local Area Networks - created to link computers within a department or organization
Wide Area Networks - National Networks
IP - Internet Protocol. Joins together the separate network segments that constitute the Internet. Assigns a unique (IP) address to every computer on the Internet.
TCP - Transmission Control Protocol. Takes a message, divides it into packets labeled with a destination IP address and sequence number, and sends them out on the network. The receiving computer reassembles the message, sends it to the application program, and acknowledges that the message has been received.
-Not all packets are received successfully; overloaded routers drop/ignore some packets, meaning the sending computer never gets an acknowledgement that a packet has been received and sends it again.
Dropping-a-packet - overloaded router
Time-out - resending packet
UDP - the sending computer sends out a sequence of packets hoping they all arrive; UDP makes no guarantee that every packet will arrive, only a best effort. Think streaming audio.
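The reassembly behavior described above can be sketched as follows: the receiver orders packets by sequence number and notices gaps, which in TCP would trigger a resend after a time-out. The packet contents are invented.

```python
# Toy receiver: packets are (sequence_number, bytes) pairs that may arrive
# out of order; missing sequence numbers mean a packet was dropped.
def reassemble(packets, total):
    received = dict(packets)
    missing = [seq for seq in range(total) if seq not in received]
    if missing:
        return None, missing    # receiver would wait and ask for a resend
    return b"".join(received[seq] for seq in range(total)), []

data, missing = reassemble([(1, b"lo "), (0, b"hel"), (2, b"world")], total=3)
print(data)       # b'hello world'
```

A UDP receiver, by contrast, would simply deliver whatever arrived and move on, which is why it suits streaming audio.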
Domain Names - links multiple IP addresses under one domain name.
TCP/IP suite - a group of programs
-Terminal Emulation - telnet is a program that allows personal computers to emulate a terminal that relies on a remote computer for processing. Typically used for system administration.
-File Transfer - FTP is the basic protocol for moving files from one computer to another across the Internet. Email uses the Simple Mail Transfer Protocol (SMTP).
World Wide Web - a linked collection of information on many computers around the world. It provides a convenient way to distribute information over the Internet; individuals can publish information and users can access that information by themselves, without training.
URL - Uniform Resource Locator. Provides a simple/flexible addressing mechanism that allows the web to link info on computers all over the world. Three parts:
1. http is the name of the protocol
2. www.blah.com is the domain name
3. andrew.html is the file on that computer.
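The three parts listed above fall out directly from the standard library's URL parser:

```python
# Split a URL into the protocol, domain name, and file path, using the
# example URL from the notes above.
from urllib.parse import urlparse

parts = urlparse("http://www.blah.com/andrew.html")
print(parts.scheme)   # http          -> the protocol
print(parts.netloc)   # www.blah.com  -> the domain name
print(parts.path)     # /andrew.html  -> the file on that computer
```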
http is the protocol that is used to send messages from web browsers to web servers
MIME types - specifies the data type of a file being sent across the Internet
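A MIME type is a type/subtype string sent in the HTTP Content-Type header; Python's standard library can guess one from a file name:

```python
# Guess MIME types from file extensions; a web server sends this value so
# the receiving browser knows how to handle the data.
import mimetypes

print(mimetypes.guess_type("andrew.html")[0])   # text/html
print(mimetypes.guess_type("photo.jpeg")[0])    # image/jpeg
```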
Reading question
I was hoping we could spend some time discussing OAI. I understand the need for standardization as a means of increasing interoperability among different DLs, but the specifics of the OAI, such as the components, are confusing to me.