Softening the borderlines of archives through XML - a case study

Stephan Heuscher
Swiss Federal Archives, 3003 Bern, Switzerland
stephan.heuscher@bar.admin.ch

Abstract. Archives have always had trouble obtaining metadata in formats they can process. With XML, these problems are lessening. Many applications today can export data into an application-defined XML format that can easily be post-processed using XSLT, schema mappers, etc., to fit the archives' needs. This paper highlights two practical examples of the use of XML in the Swiss Federal Archives and discusses the advantages and disadvantages of XML in these examples. The first is the import of existing metadata describing debates in the Swiss parliament; the second concerns the preservation of metadata in the archiving of relational databases. We have found that using XML for metadata encoding is beneficial for the archives, especially because of its ease of editing, built-in validation and ease of transformation.

1 Introduction

Metadata acquisition has always been one of the important tasks of an archive. In the digital age this task has become herculean. More and more digital documents are created, modified and copied without recording the trail of these actions. Even if the basic metadata of a digital document, such as provenance, creation date and chain of custody, is stored, extracting this metadata from an arbitrary system is always tedious, especially since there are so many different systems. Another problem arises when the extracted data comes in an uncommon, poorly documented, or overly complex format. The resources needed to convert the metadata into usable form often exceed what the technical, budgetary and personnel constraints of the archives allow. This leads to data islands with high borders that hinder data exchange.

The emergence of XML [1] promises a more unified and system-independent way of representing data. Most current software offers an export in some XML format. Supporters of XML point out that the broad range of technologies spawned by and connected to XML is much easier for the user to learn and integrate, as they share the common XML base. The wide availability of affordable XML-processing tools also lessens the strain on the budget. This paper briefly discusses possible uses of XML in an archival context and the policies of the Swiss Federal Archives concerning this use (Section 2), provides a rough overview of our applications that use XML (Section 3) and reports the experience we have gained (Section 4).

2 Uses of XML in an Archival Context

There are three forms in which XML can be used in an archive, namely as metadata format, primary data format and exchange format. This section discusses the advantages and disadvantages of these forms and the grounds for the decisions taken in the applications described in Section 3.

2.1 XML as metadata format

Metadata is data that is used to describe data. There are multiple metadata standards; they do not prescribe a single data format, but can be encoded in several different ways. Most of the standards define an encoding in an XML format, but none is widespread enough to be identified as the single standard. The Encoded Archival Description (EAD) is "a standard for encoding archival finding aids" [2] and is the one most widely used in archives. The EAD is very promising for interoperability between archives due to its compatibility with ISAD(G) [3], but it is not precise enough for automatic metadata harvesting. The most widely used metadata standard overall is the Dublin Core Metadata Standard, which defines a basic set of elements that is "likely to be useful across a broad range of vertical industries and disciplines of study" [4]. This standard is widespread but also very open, so it still has to be adapted until it fits the needs of the data it describes. It is not well suited to the description of archival collections because it does not support multilevel description.
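To make the flat nature of a Dublin Core description concrete, the following minimal sketch builds a small record with Python's standard library. The namespace URI is the one defined for the Dublin Core element set; the wrapper element `record` and the sample values are assumptions made for illustration and are not taken from any system of the Swiss Federal Archives.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def dublin_core_record(fields):
    """Build a flat record from (element name, value) pairs."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


# Illustrative values only.
record = dublin_core_record([
    ("title", "Official Bulletin, winter session"),
    ("date", "2002-12-02"),
    ("language", "de"),
])
print(ET.tostring(record, encoding="unicode"))
```

Every element sits at the same level, so a hierarchy such as fonds, series and file would have to be expressed by conventions outside the element set itself.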

In our projects we use specifically tailored XML formats for the storage of metadata. Naturally, the metadata standards serve as a guide to what should be included, but the decision on the metadata XML format always lies with the implementation. This permits a maximum of flexibility and, combined with the ease of conversion provided by XML, this approach has proven effective so far.

2.2 XML as primary data format

Most of today's data formats are proprietary and binary. While proprietary formats fall outside the scope of archives for obvious reasons, well-defined binary formats cannot be dismissed a priori, as they are specified to fit their purpose. They are more widespread, better supported and better documented than a would-be self-made XML format. We currently do not support any XML formats for primary data. XML formats are not per se non-proprietary, as the owner of a format can change it at will.

There do exist some non-proprietary XML formats that are quite widespread, XHTML [5] or SVG [6] coming to mind, but they do not normally hold structured data and are mainly used for user interaction and mark-up.

2.3 XML as exchange format

Since XML was developed for interoperability and given the current XML hype in the IT industry, most modern (newly developed) systems can communicate using XML and for most of them it is even the main interface.

We use XML formats as the standard for importing data into a system, first converting the data to a self-defined format using an XSLT stylesheet [7]. This self-defined format varies from application to application, but it is always defined by an XML schema [8], which is used to guarantee the structural integrity of every XML document sent or received.
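The two import steps, conversion followed by a structural check, can be sketched as follows. The sketch uses only Python's standard library, which offers neither XSLT nor XML Schema processing, so the conversion is written as a plain function standing in for the stylesheet and the schema is reduced to a hand-written test for required child elements. All element names (`export`, `item`, `records`, `record`, `title`, `date`) and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical application-defined export format (the source side).
SOURCE = """<export>
  <item id="1"><name>Budget debate</name><when>2002-12-02</when></item>
  <item id="2"><name>Question time</name><when>2002-12-03</when></item>
</export>"""

# Child elements every <record> of the hypothetical target format must have.
REQUIRED_CHILDREN = ("title", "date")


def to_internal(source_root):
    """Map the source format onto the self-defined format (stand-in for XSLT)."""
    target = ET.Element("records")
    for item in source_root.findall("item"):
        record = ET.SubElement(target, "record", id=item.get("id", ""))
        ET.SubElement(record, "title").text = item.findtext("name", default="")
        ET.SubElement(record, "date").text = item.findtext("when", default="")
    return target


def structural_errors(root):
    """Return a list of problems; an empty list means the structure is accepted."""
    errors = []
    if root.tag != "records":
        errors.append("root element must be <records>")
    for position, record in enumerate(root.findall("record"), start=1):
        for child in REQUIRED_CHILDREN:
            if record.find(child) is None:
                errors.append(f"record {position}: missing <{child}>")
    return errors


internal = to_internal(ET.fromstring(SOURCE))
print(ET.tostring(internal, encoding="unicode"))
print(structural_errors(internal))
```

In a real deployment the same two roles would be filled by an XSLT processor and a schema validator; the point of the sketch is only the order of the steps and the fact that nothing enters the system before the structural check has passed.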

2.4 Not using XML

As the previous sections have shown, XML is not always a good solution, especially if another standard format could be used which is simple, open, clearly defined and documented, and widespread.

Currently, these points hold true particularly for primary data formats, as these are widely used in the IT industry and a great deal of standardization effort has gone into them. A typical example is the TIFF [9] format, the current version being 6.0, which is a standard format for storing image data. TIFF is also a good example of the need for strict checks on ingest (using OAIS [10] terms), because a single file can hold several images with different compression schemes. A digital archive that does not accept JPEG images but requires TIFF images could be tricked (not necessarily in bad faith) into accepting JPEG [11] images when they are put into a TIFF wrapper.
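A check of this kind can be sketched in a few lines. The following minimal example, which is an illustration and not the ingest check used at the Swiss Federal Archives, walks the image file directories (IFDs) of a TIFF byte string and reads the Compression tag (tag 259) of each image. The values 6 and 7 denote JPEG compression; 1 is the default and means no compression. The helper `minimal_tiff` is hypothetical and only builds enough bytes to exercise the check, not a viewable image.

```python
import struct

COMPRESSION_TAG = 259
# 6: JPEG as defined in TIFF 6.0; 7: JPEG as defined in the later TIFF Technical Note 2.
JPEG_COMPRESSIONS = {6, 7}


def tiff_compressions(data):
    """Return the Compression value of every image (IFD) in a TIFF byte string."""
    if data[:2] == b"II":
        order = "<"  # little-endian
    elif data[:2] == b"MM":
        order = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: bad magic number")
    values, seen = [], set()
    while offset != 0 and offset not in seen:  # 'seen' guards against IFD loops
        seen.add(offset)
        (entry_count,) = struct.unpack(order + "H", data[offset:offset + 2])
        compression = 1  # default when the tag is absent: no compression
        for index in range(entry_count):
            start = offset + 2 + 12 * index
            tag, field_type, count = struct.unpack(order + "HHI", data[start:start + 8])
            if tag == COMPRESSION_TAG:
                # A single SHORT value is stored left-justified in the value field.
                (compression,) = struct.unpack(order + "H", data[start + 8:start + 10])
        values.append(compression)
        end = offset + 2 + 12 * entry_count
        (offset,) = struct.unpack(order + "I", data[end:end + 4])
    return values


def contains_jpeg(data):
    """True if any image in the TIFF byte string declares JPEG compression."""
    return any(value in JPEG_COMPRESSIONS for value in tiff_compressions(data))


def minimal_tiff(compression):
    """Build a header plus one IFD holding only a Compression entry (demo only)."""
    header = struct.pack("<2sHI", b"II", 42, 8)
    entry = struct.pack("<HHIHH", COMPRESSION_TAG, 3, 1, compression, 0)
    return header + struct.pack("<H", 1) + entry + struct.pack("<I", 0)


print(contains_jpeg(minimal_tiff(1)))
print(contains_jpeg(minimal_tiff(7)))
```

The sketch trusts the declared tag and does not inspect the image data itself; truncated input raises `struct.error`, which an ingest routine would have to treat as a rejection.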

3 System Descriptions

3.1 AMDA: Audio Metadata Acquisition

The Swiss Federal Archives has acquired the official bulletin (stenographic proceedings) of the Swiss parliament in paper form since 1891. The recordings of the debates on magnetic tape date back to the early 1980s, and a trial of recording the debates in digital form has just started. Because the sound quality of the magnetic tapes deteriorates, and the tapes are difficult to handle and require specialized hardware for playback, they are also digitized and post-processed to improve the sound quality of the debates. This is done manually, with every debate on each topic stored in a file. The descriptive metadata is taken from the printed version of the bulletin and copied into a Microsoft Access database.

The requirements for AMDA are manifold. The main driving force was the possibility of acquiring metadata automatically from the system [12] in the parliamentary services that is used to pre-process the bulletin before it is printed. This frees the operators from the manual and error-prone copying of the printed bulletin into the database. Secondly, the data already contained in the Access table should be migrated into AMDA, and this data should then be quality-audited to ensure the integrity of the AMDA data. Thirdly, there is the need to export the stored data into as yet unknown formats.

Fig. 1. The basic working of AMDA.

AMDA uses XML to define a unifying import format for any data source. This import format is defined as an XML schema that is used to validate the import data as a first step in the validation chain. As a schema cannot fully specify the dependencies in an XML document, consistency and integrity are additionally checked on import, because much of the imported information was entered manually. The checks are very thorough, as the import is the step where mistakes can easily be detected and efficiently corrected. This would be much harder once the data is inside AMDA. The import from the Microsoft Access database should occur only once, when AMDA replaces this mode of data entry. After each session period of the parliament the metadata of the debates is imported, which means that this has to be done three to four times per year. The data inside AMDA can be accessed, edited and refined through XHTML pages in a standard web browser, with the possibility not only of changing entries but also of adding new ones.
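The kind of dependency that goes beyond a purely structural check can be illustrated with a short sketch. The element and attribute names (`import`, `session`, `debate`, `start`, `end`) and the three rules are assumptions chosen for illustration; they are not AMDA's actual import format or rule set.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical import document; the second debate violates two rules on purpose.
IMPORT = """<import>
  <session id="s1"/>
  <debate id="d1" session="s1" start="2002-12-02T08:00:00" end="2002-12-02T09:30:00"/>
  <debate id="d2" session="s9" start="2002-12-02T10:00:00" end="2002-12-02T09:00:00"/>
</import>"""


def consistency_errors(root):
    """Check cross-references and value dependencies between elements."""
    errors = []
    sessions = {session.get("id") for session in root.findall("session")}
    seen = set()
    for debate in root.findall("debate"):
        debate_id = debate.get("id")
        if debate_id in seen:
            errors.append(f"{debate_id}: duplicate debate id")
        seen.add(debate_id)
        if debate.get("session") not in sessions:
            errors.append(f"{debate_id}: unknown session {debate.get('session')}")
        start = datetime.fromisoformat(debate.get("start"))
        end = datetime.fromisoformat(debate.get("end"))
        if end <= start:
            errors.append(f"{debate_id}: end is not after start")
    return errors


for message in consistency_errors(ET.fromstring(IMPORT)):
    print(message)
```

Collecting all messages instead of stopping at the first one suits a workflow in which an operator corrects the source data and repeats the import.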

It has to be emphasized that the raw data of the digitized debates never enters AMDA but is stored externally. The data about these debates is only merged when the data is exported from AMDA. AMDA does not provide any functionality to manage the raw data.

The exported result is a self-defined XML format closely related to the internal data structure. Additionally, the possibility of an integrated XSL transformation exists, so the system using the AMDA output can import it in the desired format.

3.2 SIARD: Software Invariant Archiving from Relational Databases

The Swiss Federal Archives Act (SR 152.1) [13] mandates the Swiss Federal Archives to archive the records of governmental agencies. This mandate is independent of the information carrier to which the records are tied. Important data collections from database systems of the Swiss federal government have been transferred to the Swiss Federal Archives since the early 1980s.

Fig. 2. The basic working of SIARD.

SIARD is a pure Java client that connects to an SQL-based database through a JDBC driver. SIARD analyzes the system dictionary and presents an overview of all database elements (catalogs, schemas, tables, views, constraints, etc.), then looks for elements that do not conform to the SQL3 ISO standard [14] (e.g. proprietary data types or functions). Some elements are automatically mapped to correct SQL3 equivalents; others that cannot be converted to generic SQL3 are marked "not archiveable". Furthermore, the user can choose to exclude any element (schema, table, view, etc.) from the archiving process. All manipulations are logged, and all elements excluded from the archiving process are documented for evidentiary purposes.
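The classification step can be sketched as follows. This is a simplified illustration in Python, not SIARD's implementation (which is written in Java and reads the system dictionary through JDBC). The mapping table, the set of accepted standard types and the sample columns are assumptions.

```python
# Hypothetical mapping from vendor-specific column types to standard SQL types.
TYPE_MAP = {
    "VARCHAR2": "CHARACTER VARYING",
    "NUMBER": "NUMERIC",
}

# Standard types accepted without conversion (a small illustrative subset).
STANDARD_TYPES = {"CHARACTER VARYING", "NUMERIC", "INTEGER", "DATE", "TIMESTAMP"}


def plan_archive(columns, excluded_tables=()):
    """Classify (table, column, type) triples; return the plan and a log."""
    plan, log = [], []
    for table, column, declared in columns:
        name = f"{table}.{column}"
        if table in excluded_tables:
            log.append(f"{name}: excluded by user")
        elif declared in STANDARD_TYPES:
            plan.append((table, column, declared))
        elif declared in TYPE_MAP:
            plan.append((table, column, TYPE_MAP[declared]))
            log.append(f"{name}: {declared} mapped to {TYPE_MAP[declared]}")
        else:
            log.append(f"{name}: {declared} has no standard equivalent")
    return plan, log


# Illustrative column list, as it might be read from a system dictionary.
columns = [
    ("PERSON", "NAME", "VARCHAR2"),
    ("PERSON", "BORN", "DATE"),
    ("PERSON", "PHOTO", "BFILE"),
    ("TMP", "X", "INTEGER"),
]
plan, log = plan_archive(columns, excluded_tables={"TMP"})
for line in log:
    print(line)
```

Every decision that changes or drops an element produces a log line, which mirrors the requirement that manipulations and exclusions be documented.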

Next, the database administrators provide an additional description (metadata) of the database in a standardized way (full-text meaning of keywords, code tables, information concerning provenance and usage of the original system, etc.). This information is typically not included in a database, but it is required for long-term preservation. After these two steps, the database is archived in a fully standardized form (generic SQL3 plus a standardized metadata description in XML), while the data remains human-readable as plain text.

In essence, by using SIARD, the database to be archived undergoes a process in which it is completely detached from its proprietary software and operating environment. Furthermore, it is also detached from any target software environment required for its usability and administration: after a database has been archived, SIARD is not required to maintain and use the archive over long periods. As long as the ISO SQL3 documentation is available, the archived data can be reloaded into any database system supporting the DDL core of SQL3, or it can be migrated from SQL3 to a future description language for relational data models. Currently SIARD supports reloading archived databases into an Oracle database.

4 Experiences and Future Work

The systems described above are now just being deployed into real world use, so the experiences presented here are drawn from the development process and preliminary testing. No hard facts in testing the sustainability of XML could be gathered, as the test is time itself. This test will be passed when we can still access the data stored today, including all metadata, in ten or twenty years.

We have seen that XML lives up to many of its promises. It is very easy to edit an XML document, because any text editor can be used. Although this does not guarantee that the same will hold in the future, it is a strong sign of the operating system independence of XML.

Another positive feature of XML is the built-in validation of the structure and integrity of documents. This is especially important when receiving data from outside the system. DTDs and schemas offer a good tool for specifying and testing the delivered XML data. This matters because once corrupted data is in the system, it is very difficult to remove it again.

The ease of transforming one XML format into another was of great help, but required some initial effort in understanding the function-based nature of XSLT and the finer points of XPath [15]. The effort was well worth it in the end, as XSLT is a powerful tool not only for transformation but also for standards-based error detection. Further work will be done on integrating the error detection into XML processing.

The main problem areas in our applications were the encoding of the XML documents and the non-standard XML document generation of some applications. When dealing with the different encodings (UTF-8, UTF-16, ISO-8859-1, etc.), some applications declared a different encoding in the header of the XML document than the actual encoding of the document. These errors were quickly identified, as no application was able to read the documents.
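A mismatch of this kind can often be caught before parsing. The following minimal sketch reads the encoding named in the XML declaration and tries to decode the whole document with it. It is an illustration under stated assumptions: it only handles ASCII-compatible encodings without a byte order mark (so not UTF-16), and the function names are hypothetical.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
DECLARATION = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or UTF-8 if none is given."""
    match = DECLARATION.match(data)
    return match.group(1).decode("ascii") if match else "UTF-8"


def encoding_matches(data):
    """True if the bytes can be decoded with the encoding the document declares."""
    try:
        data.decode(declared_encoding(data))
    except (UnicodeDecodeError, LookupError):
        return False
    return True


text = '<?xml version="1.0" encoding="UTF-8"?><title>Débat</title>'
print(encoding_matches(text.encode("utf-8")))
print(encoding_matches(text.encode("iso-8859-1")))
```

The test is one-sided: a document declared as ISO-8859-1 always decodes, whatever its real encoding, because every byte is valid in that encoding. It therefore catches the case of a declared UTF-8 header on non-UTF-8 content, but not the reverse.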

Acknowledgements

This work would not have been possible without the support of Peter Keller-Marxer and the ARELDA [16] team.

References

[1] Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 06 October 2000, http://www.w3.org/TR/REC-xml

[2] Encoded Archival Description (EAD), http://www.loc.gov/ead/

[3] ISAD(G): General International Standard Archival Description Second Edition, Adopted by the Committee on Descriptive Standards, Stockholm, Sweden, 19-22 September 1999, http://www.ica.org/biblio/cds/isad_g_2e.pdf

[4] Dublin Core Metadata Initiative, http://dublincore.org/

[5] XHTML 1.0 The Extensible HyperText Markup Language (Second Edition): A Reformulation of HTML 4 in XML 1.0, W3C Recommendation, 26 January 2000, revised 01 August 2002, http://www.w3.org/TR/xhtml1/

[6] Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 04 September 2001, http://www.w3.org/TR/SVG/

[7] XSL Transformations (XSLT): Version 1.0, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/xslt

[8] XML Schema Part 0: Primer, W3C Recommendation, 02 May 2001, http://www.w3.org/TR/xmlschema-0/

[9] Adobe Developers Association, TIFF: Revision 6.0, Final, 03 June 1992, http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf

[10] Space data and information transfer systems - Open archival information system - Reference model, ISO 14721, 16 December 2002, http://www.ccsds.org/documents/650x0b1.pdf

[11] Information technology - Digital compression and coding of continuous-tone still images, ISO/IEC 10918, 1994, http://www.w3.org/Graphics/JPEG/itu-t81.pdf

[12] Official Bulletin of the Swiss parliament, http://www.parlament.ch/ab/frameset/d/index.htm (in German)

[13] Bundesgesetz über die Archivierung (Archivierungsgesetz, BGA), 152.1, http://www.admin.ch/ch/d/sr/1/152.1.de.pdf (in German)

[14] Information technology - Database languages - SQL, ISO/IEC 9075-2, First edition, December 1999

[15] XML Path Language (XPath): Version 1.0, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/xpath

[16] Archiving of Electronic Digital Data and Records in the Swiss Federal Archives (ARELDA): e-government project ARELDA, Management Summary, Bern, March 2001, http://www.bundesarchiv.ch/webserver-static/docs/e/arelda_expose_0301_e.pdf

All links were working as of 31 January 2003.
