The Open Archives Initiative: Forging a Path Toward Interoperable Author Self-Archiving Systems
Richard E. Luce, Research Library, Los Alamos National Laboratory
College & Research Libraries News: March 2000, p 184.
Efforts aimed at giving authors control over the communication and distribution of their work, in the form of electronic author self-archiving systems, are gaining ground. The Universal Preprint Service (UPS) initiative, later known as the Open Archives Initiative, was widely publicized and often misreported. The initiative is intended to develop a framework for a "universal e-print archive" that establishes interoperability standards supporting the search and retrieval of e-print papers from all disciplines. The hope is to catalyze progress in new scholarly publishing models over the next five to ten years.
Background
Scholarly communication has long suffered from the lag time between summarizing research results in a publishable article and the formal publication of the article. In certain areas of scholarly activity, electronic preprint archives have become an established medium for quickly communicating non-peer-reviewed results of ongoing research.
The trend began in high-energy physics in 1991. Since then, the centralized xxx preprint archive founded by Paul Ginsparg at Los Alamos has become a global repository for research in physics. xxx (also known as arXiv.org) houses over 122,000 papers, is mirrored in 15 countries worldwide, and serves over 60,000 users daily. xxx has also expanded to incorporate mathematics, non-linear sciences and computer science.
Similar efforts in other disciplines are noteworthy. CogPrints is modeled on xxx and focuses mainly on a collection of papers in cognitive science, psychology, neurology, linguistics, and related fields. NCSTRL is a similar initiative, providing a point of access for technical reports in computer science that are submitted either to CoRR (the Computing Research Repository), a part of xxx, or to decentralized departmental archives that cooperate in the initiative. Archives in the NCSTRL initiative share the Dienst protocol, which enables the creation of library-like services that support searching and browsing the archive.
Along the same lines, the RePEc (Research Papers in Economics) initiative provides authors with the option to submit working papers to a departmental archive or - if one does not exist - to the EconWPA archive at Washington University.
The NDLTD project aims at building a digital library of electronic theses and dissertations (ETDs) authored by students of member institutions. NDLTD addresses issues such as the creation of a workflow for submitting ETDs, the development of an XML DTD for ETDs and the support of a digital library for ETDs. Recently, NIH has expressed a strong interest in the establishment of an e-print initiative for biology. All the preprint initiatives share the same goal, which is to optimize scholarly communication by overcoming the barriers (financial, legal, and others) that the traditional framework has established.
While other disciplines and institutions have begun to create public research archives along the lines pioneered at LANL, what is needed are conventions that archives can adopt to ensure that they work together. Ideally, any paper in any of these preprint or e-print archives should be findable from any desktop worldwide, as if all the papers were in one virtual public library. Slowly, the information industry is beginning to understand the potential of the preprint concept, regarding it as an opportunity for collaboration, as a challenge, or as a threat.
Taking the First Step - Universal Preprint Service Initiative (UPS)
In April 1999 a call for participation went out to existing e-print systems. It was intended to mobilize a core technical group to work toward a universal service for non-peer-reviewed scholarly literature. Such a universal service is regarded as the fundamental and free layer of scholarly information, on top of which both free and commercial services could flourish.
Paul Ginsparg, Herbert Van de Sompel and I, from Los Alamos National Laboratory, initiated the Universal Preprint Service Initiative (UPS) call for participation. We believed that important steps towards the establishment of such a universal service could be taken by identifying or creating interoperable technologies and frameworks for the dissemination of author self-archived documents (termed e-prints). The driving forces behind the initiative are the perception that many years of theoretical discourse have resulted in few fundamental methodological changes, and our hope that more rapid progress could be catalyzed by a consortium of interested parties focusing directly on the relevant technological issues.
The UPS meeting was held in Santa Fe, N.M., on October 21-22, 1999. The participants in the meeting were digital librarians and computer scientists specializing in archiving, metadata, and interoperability, and they included the founders of the principal public research archives that exist so far. The participants were diverse in their underlying motivations, but entirely unified in their objective of paving the way for universal public archiving of the scientific and scholarly research literature on the Web. Sponsorship for the meeting was obtained from the Council on Library and Information Resources, the Digital Library Federation, SPARC, ARL, and the Research Library at Los Alamos National Laboratory.
A set of objectives was outlined for the meeting. They addressed some of the purely technical obstacles to a more effective electronic scholarly communication system and centered on the following concepts:
Stimulating the adoption of the preprint concept in all areas of scholarly research;
Integrating preprint services into the scholarly document system of scholarly journals, A&I services and libraries;
Creating search and retrieval functionality for preprint archives that is simultaneously useful for discipline-specific, cross-disciplinary, inter-institutional and intra-institutional purposes;
Developing user-friendly systems, i.e., along the lines of established search and retrieval methods;
Including the full range of metadata, full text and citation data.
The group agreed on minimal technical requirements for archives. These will be published separately as the "Santa Fe Conventions" and, during the next six months, will be implemented in the existing archives.
Technical Summary
All the participants agreed that scientific papers should be freely accessible to the public, although individual participants differed on specifics, such as how to handle non-peer-reviewed material. The first meeting concentrated on the creation of cross-archive end-user services. The aim was to identify general archive characteristics that would facilitate the creation of such services. These characteristics could then be used as recommendations for existing and upcoming initiatives.
The meeting began with a presentation and demonstration by a team consisting of Herbert Van de Sompel (University of Ghent and Los Alamos National Laboratory), Michael Nelson (NASA Langley and Old Dominion University) and Thomas Krichel (University of Surrey and RePEc initiative). This group had built an experimental end-user service providing access to data originating from the main archives.
A variety of technologies were used in the project, including NCSTRL+ as the digital library service, intelligent objects called buckets as a means to store the archive metadata and the SFX linking solution as a means to inter-link the e-print data with the traditional scholarly communication mechanism. The presentation identified problems that arose during the project, and discussion of those problems served to launch the meeting discussions.
Participants concluded that many different archive initiatives were likely to emerge, with different conceptual, organizational and technical foundations. In order for such initiatives to successfully become part of the scholarly communication system, interoperability was essential.
Consensus was reached that interoperability hinges on a fundamental distinction between archive functions, which include data collection and maintenance, and end-user functions, like the cross-system search and linking prototype service described in the opening session. Although archive initiatives can implement their own end-user services, it is essential that the archives remain "open" so that others can create such services as well (see footnote 1).
A discussion on the technicalities of creating end-user services for data originating from different archives followed. The group recognized that there are basically two ways to implement these: a distributed searching approach and a harvesting approach. The former would require archives to implement a joint distributed search protocol, which would be difficult. Moreover, distributed search solutions face important problems of scale, in light of the possible emergence of thousands of institutional and/or subject-oriented archives worldwide. The group agreed that this was not a realistic approach at this point, and that a harvesting solution was more appropriate. Such a harvesting solution would allow trusted parties -- the ones that subscribe to the Santa Fe Conventions -- to selectively collect data from different archives. The conventions propose adoption of portions of the Dienst protocol for the harvesting of data and a minimal Dublin Core-compliant metadata set, called the Santa Fe Set, which all archives should make available in response to harvesting requests.
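The harvesting approach described above can be illustrated with a small sketch. It is not the Dienst protocol and not the Santa Fe Conventions themselves; the class names (Record, Archive, SearchService), the record fields and the `since` parameter are all hypothetical, chosen only to show the idea that a service selectively collects metadata from several archives and then searches its own local copy instead of querying every archive at search time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal metadata record (not the actual Santa Fe Set)."""
    identifier: str
    title: str
    creator: str
    added: date


class Archive:
    """Hypothetical data provider: holds records and answers harvest requests."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)

    def harvest(self, since=None):
        """Return records added on or after `since` (all records if None)."""
        return [r for r in self._records if since is None or r.added >= since]


class SearchService:
    """Hypothetical end-user service: harvests into a local index, then searches it."""

    def __init__(self):
        self._index = {}

    def collect(self, archive, since=None):
        """Selectively copy metadata from one archive into the local index."""
        for record in archive.harvest(since):
            self._index[(archive.name, record.identifier)] = record

    def search(self, term):
        """Search harvested titles locally; no archive is contacted here."""
        term = term.lower()
        return sorted(
            (name, record.identifier)
            for (name, _), record in self._index.items()
            if term in record.title.lower()
        )


physics = Archive("physics", [
    Record("p1", "Quark confinement on the lattice", "A. Author", date(1999, 5, 1)),
    Record("p2", "Lattice gauge theory notes", "B. Author", date(1999, 11, 2)),
])
cogsci = Archive("cogsci", [
    Record("c1", "Lattice models of memory", "C. Author", date(1999, 10, 30)),
])

service = SearchService()
service.collect(physics, since=date(1999, 10, 1))  # selective harvest
service.collect(cogsci)                            # full harvest
print(service.search("lattice"))  # [('cogsci', 'c1'), ('physics', 'p2')]
```

In this sketch the archives only need to answer one simple request, which is the scaling argument made for harvesting: the cost of cross-archive search falls on the service that builds the index, not on each of the possibly thousands of archives.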
The representatives of existing archive initiatives at the meeting, as well as those from institutions that are in the process of setting up archive initiatives, agreed to comply with those guidelines. The Dienst protocol will be enhanced to allow for the functions mentioned above, and a minimal Dienst release will be made available to facilitate the process of making an archive compliant with the required aspects of Dienst. A transport format for MARC-formatted metadata will be proposed, as well as an XML Document Type Definition for the description of the Santa Fe Set. The recommendations will be extensively documented on a Web site and adoption of the recommendations will be promoted worldwide.
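To make the idea of an XML description of a Dublin Core-compliant record concrete, here is a minimal sketch. The article does not list the elements of the Santa Fe Set or give its DTD, so everything here is an assumption: the four fields are ordinary Dublin Core element names picked for illustration, the `record` wrapper element and the `record_to_xml` function are hypothetical, and the sample values are invented.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def record_to_xml(fields):
    """Serialize a dict of Dublin Core element names to an XML string.

    The <record> wrapper is a hypothetical container, not part of any
    published convention.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample values, for illustration only.
xml_text = record_to_xml({
    "title": "An example e-print",
    "creator": "A. Author",
    "date": "1999-10-21",
    "identifier": "example:0001",
})
print(xml_text)

# Round-trip check: a harvester could parse the record back the same way.
parsed = ET.fromstring(xml_text)
print(parsed.find(f"{{{DC_NS}}}title").text)
```

A shared, minimal record format of this kind is what lets a harvesting service treat records from very different archives uniformly.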
The Path Forward
The Open Archives Initiative has created a forum to discuss and solve technical matters of interoperability between author self-archiving solutions, as a way to promote their global acceptance (see http://www.openarchives.org).
Agreement was reached on the following to pave a path forward:
Some Issues and Questions
The initiative discussed above raises several social issues concerning scholarly communication. Among the issues of relevance to academic and research institutions are the following:
While it is not the intent of the Open Archives Initiative to deal with those social issues, their resolution will be an important factor in determining how quickly the paradigm for scholarly communication will change. At our meeting in October, we tried to lay the groundwork for technical standards that will support new models of publishing.
Footnote:
1. This concept was formalized in the distinction between providers of data (archive initiatives) and implementers of data services (initiatives to create end-user services). The group agreed that an essential feature of the Santa Fe Conventions would be that providers of data use a standard mechanism to state the conditions under which their data sets can be used by implementers of data services. Similarly, the implementers of data services could describe the use they make of archive data.