Systematic Characterisation of Objects in Digital Preservation: The eXtensible Characterisation Languages
Christoph Becker (Vienna University of Technology, Austria)
Andreas Rauber (Vienna University of Technology, Austria)
Volker Heydegger (University of Cologne, Germany)
Jan Schnasse (University of Cologne, Germany)
Manfred Thaller (University of Cologne, Germany)
Abstract: Over the last decades, digital objects have become the primary medium for creating, shaping, and exchanging information. However, in contrast to analog objects such as books, which directly represent their content, digital objects are not usable without a corresponding technical environment. Rapid changes in these environments, and in formats and technologies, mean that digital documents have a short lifespan before they become obsolete. Digital preservation, i.e. actions to ensure longevity of digital information, has thus become a pressing challenge. The dominant strategies today are migration and emulation; for each strategy, different tools are available. When an object is converted to a different representation, the content must be validated to verify that the transformed object still authentically represents the same intellectual content. So far, this validation is largely done manually, which is infeasible for large collections.
Preservation planning supports decision makers in reaching accountable decisions by evaluating potential strategies against well-defined requirements. In particular, the evaluation of different migration tools for digital preservation has to rely on validating the converted objects and thus on an analysis of the logical structure and the content of documents. Existing approaches for characterising and describing objects do not attempt to fully extract the informational content of digital objects and are thus not sufficient for an in-depth validation of transformed content.
This paper describes the eXtensible Characterisation Languages (XCL) that support the automatic validation of document conversions and the evaluation of migration quality by hierarchically decomposing a document and representing documents from different sources in an abstract XML language. The description language XCDL provides an abstract representation of digital content in XML, while the extraction language XCEL allows an extraction engine to create such an abstract description by mapping file format structures to XCDL concepts.
We present the context of the development of these languages and tools and describe the overall concept and features of the languages. We further give examples and show how the languages can be applied to the evaluation of digital preservation solutions in the context of preservation planning.
Keywords: XML languages, content characterisation, digital libraries, digital preservation, evaluation, file conversion, file formats, migration, preservation planning
Categories: H.3.7, I.7