Digitization of Printed Material: The Metadata Engine Project (METAe)

Simon Tanner

The METAe project http://meta-e.uibk.ac.at/ is a highly collaborative research and software development project in which university departments, libraries, archives and software companies from seven European countries and the USA are cooperating in order to develop application software for the digitization of printed material. Initial prototypes of the software will be available in 2002. The METAe project is co-funded by the European Commission, "Digital heritage and cultural content" http://www.cordis.lu/ist/ka3/digicult/

The main objectives of METAe are to:

ease the digitization of books, journals and magazines in terms of cost-effectiveness and degree of automation;

enrich the output of the conversion process in terms of structural metadata capturing; and

enhance the opportunity for successful digital preservation from the very beginning of life-cycle-management by producing highly standardized information objects.

The METAe software is designed to be a comprehensive software package where all tasks within a digitization workflow can be carried out according to the standards currently emerging, such as: the Open Archival Information System; the NISO working draft for Technical Metadata for Digital Still Images http://www.niso.org/commitau.html; or the NISO draft standard for Book Item and Component Identifier http://www.niso.org/pdfs/BICI-DS.pdf

The functionality of the software will include:

image creation;

image enhancement and pre-processing;

capturing descriptive metadata from electronic library catalogues;

carrying out the OCR-processing;

creating technical and administrative metadata;

extracting structural metadata;

organizing permanent quality control.

The key technology behind this increase in automation and in the richness of conversion output is layout and document analysis, combined with capturing techniques. Since the layout and structure of printed material are not arbitrary, but follow strong and often ancient rules, the project partners hope to extract more information from the page images, in a more highly automated way, than is usually possible. Page numbers, headlines, footnotes, graphs and caption lines can be extracted. Beyond this, the hierarchical structure of books and journals (for example: periodical; issue; single article; graph within that article) will also be automatically recognized and captured.
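The hierarchy described above (periodical, issue, article, graph) can be pictured as a simple tree. The following is a minimal sketch in Python, not METAe's actual data model: the class name StructuralNode, its fields and the sample labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class StructuralNode:
    """One element of a document's logical structure (hypothetical model)."""
    kind: str    # e.g. "periodical", "issue", "article", "graph"
    label: str   # human-readable title or caption
    children: List["StructuralNode"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "StructuralNode"]]:
        """Yield (depth, node) pairs in document order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Invented example data, for illustration only.
graph = StructuralNode("graph", "Figure 1")
article = StructuralNode("article", "Sample article", [graph])
issue = StructuralNode("issue", "Issue 1", [article])
periodical = StructuralNode("periodical", "Sample journal", [issue])

# Flatten the tree into an indented outline.
outline = [(depth, node.kind) for depth, node in periodical.walk()]
for depth, kind in outline:
    print("  " * depth + kind)
```

Each level only needs to know its own children, so the same structure would serve a monograph (book, chapter, section) as well as a periodical.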

The METAe software package will also include a specialized Optical Character Recognition (OCR) engine adapted to recognize old typefaces and historical texts. This is an overdue task, especially for the German typeface "Fraktur", a derivative of the gothic letter (used in a large majority of printed texts in Central Europe and the Nordic countries until the middle of the twentieth century). Five historical dictionaries representing the historical orthography of the English, French, German, Italian and Spanish languages will support the OCR engine. The software package is completed by an XML/SGML search engine that is intended to perform queries on the full text as well as on the structure of XML documents.
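To illustrate what querying both the full text and the structure of an XML document might look like, here is a small sketch using Python's standard xml.etree.ElementTree module. It is not the METAe search engine; the element names (periodical, issue, article, headline, paragraph, footnote) and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element and attribute names are hypothetical
# and do not reflect any METAe schema.
SAMPLE = """
<periodical title="Sample journal">
  <issue number="1">
    <article>
      <headline>On old typefaces</headline>
      <paragraph>Fraktur was widely used in Central Europe.</paragraph>
      <footnote>See the historical dictionaries.</footnote>
    </article>
    <article>
      <headline>Layout rules</headline>
      <paragraph>Page numbers and captions follow conventions.</paragraph>
    </article>
  </issue>
</periodical>
"""

root = ET.fromstring(SAMPLE)


def articles_with_footnotes(tree_root):
    """Structural query: headlines of articles that contain a footnote."""
    return [a.findtext("headline") for a in tree_root.findall(".//article[footnote]")]


def full_text_hits(tree_root, term):
    """Full-text query: headlines of articles whose text mentions `term`."""
    needle = term.lower()
    hits = []
    for article in tree_root.iter("article"):
        text = " ".join(article.itertext()).lower()
        if needle in text:
            hits.append(article.findtext("headline"))
    return hits


print(articles_with_footnotes(root))
print(full_text_hits(root, "fraktur"))
```

The first function looks only at how elements are nested, while the second ignores the markup and searches the running text; a search engine of the kind described would combine both.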

Keep informed about the progress of the project:

METAe mailing list: http://meta-e.uibk.ac.at/contact/

METAe newsletter: http://meta-e.uibk.ac.at/newsletter/news.htm

Simon Tanner works for the Higher Education Digitization Service of the University of Hertfordshire, Hatfield, UK (s.g.tanner@herts.ac.uk)
