The LAI Cataloguing and Metadata Group's AGM (2020) and networking event took place on 10th March, 2021, hosted virtually by TCD Library courtesy of the LAI Zoom account. The event featured a number of presentations on remote cataloguing projects undertaken during lockdown at the Library of Trinity College Dublin, with a particular focus on the 1872 Printed Catalogue Conversion Project.
Christoph Supprian-Schmidt (Acting Keeper, Collection Management), in his opening remarks, outlined the situation for cataloguers during lockdown and explained how the Library of Trinity College Dublin used the COVID-19 lockdown period to work on a range of remote cataloguing projects. He noted that, at the one-year mark of the original lockdown, these projects will have added well over 200,000 records to Trinity's main online catalogue, Stella Search, with most records coming from the 1872 Printed Catalogue Conversion Project, the focus of the evening's presentations. While the loading of e-book records, the cataloguing of digital collections and the remote cataloguing of new books (from scanned title and key pages and other cataloguing data) continued, TCD Library also used this time to work on the 1872 printed catalogue conversion project. This work was facilitated by access to scanned pages (and OCR data) of legacy catalogues.
The TCD Library Printed Catalogue 1835-1887 by Trevor Peare, former Keeper (Readers’ Services)
John Gabriel Byrne entered TCD in 1952 and graduated top of the class in engineering in 1956. He also studied French, Latin and Greek. He completed his PhD between 1957 and 1961 and began lecturing in 1963. He was appointed the first Chair of Computer Science in 1973 and became interested in the printed catalogue in 1985. He arranged for scanning of the original text to begin in 1990, and the first database and search system was available in-house in TCD in 1993 and on the internet by 2005. The 5,121 pages of one set of the eight volumes were separated in 1987 in order to make a microfiche copy, and these pages, provided by Dr. Charles Benson, Keeper of Early Printed Books, were used to develop this online system. There are about 250,000 entries in the catalogue (including 'see references'). The catalogue contains entries in at least eighteen languages. English and Latin occur most frequently; other languages in the Roman alphabet include French, Italian, Spanish, Portuguese, German, Dutch, Icelandic, Danish, Norwegian, Swedish, Welsh and Irish.
Dirty secrets of OCR, or, how to wrangle a big set of bibliographic legacy data by Joe Nankivell, Junior Bibliographer (Early Printed Books and Special Collections)
Joe Nankivell described the process of transforming the raw data from John Byrne’s OCR project into MARC records, with a particular focus on the data-cleaning side. The data had been shared on a memory stick that contained all Professor Byrne’s project files, including the records that formed the basis of his searchable online version of the printed catalogue. These records were distributed across over 5,000 files, one for each printed page, each containing between 30 and 50 records. The first task was to merge all these records into a single file where they could be manipulated in bulk, to impose consistency across the dataset prior to line-by-line proofreading by the wider team.
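The merge step described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the project's actual code: the file naming and page format are assumptions, with one plain-text file per scanned page whose filenames sort into page order.

```python
from pathlib import Path


def merge_pages(pages):
    """Merge per-page record texts into one blob, in page order.

    `pages` is a mapping of filename -> file text; filenames are assumed
    (hypothetically) to sort into page order. Each page's text is
    newline-terminated so records never run together across a page break.
    """
    parts = []
    for name in sorted(pages):
        text = pages[name]
        parts.append(text if text.endswith("\n") else text + "\n")
    return "".join(parts)


def merge_page_files(src_dir):
    """Read every .txt page file under src_dir and merge them into one string."""
    files = sorted(Path(src_dir).glob("*.txt"))
    return merge_pages({p.name: p.read_text(encoding="utf-8") for p in files})
```

With everything in a single file, batch operations can be applied across the whole dataset at once rather than 5,000 times over.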
The OCR records had a superficial resemblance to rudimentary MARC records, with clearly identifiable bibliographic elements – a main heading (usually the author), the title, an imprint statement with place and date of publication, and a shelfmark. They could not simply be transformed and loaded, however, for three main reasons:

- Mapping difficulties, due to inconsistent data structure and more complex records that needed a more nuanced approach.
- OCR errors, as the original scans were low resolution.
- Missing data: information not captured by OCR, or lacking in the original record.
The talk focused mostly on the first of these problems. One of the largest issues was how the 1872 catalogue handled multiple editions of the same title. These needed to be represented in MARC with an individual record for each edition, but in the printed catalogue they are filed under a single uniform title, usually reflecting the earliest edition held by TCD. This in turn appeared in the OCR data as a single record. Joe described the process of separating these out into new records using OpenRefine data-cleaning software, which proved to be the ideal tool for working with such a large and complex dataset.
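The kind of transformation carried out in OpenRefine can be illustrated with a small Python sketch. The intermediate record shape here (a shared heading and title with a list of edition imprints) is a hypothetical simplification of the real data, not the project's actual schema:

```python
def explode_editions(record):
    """Split one uniform-title catalogue entry into one record per edition.

    `record` is a hypothetical intermediate form: a shared `heading` and
    `title`, plus an `editions` list holding one imprint string for each
    edition filed under the uniform title in the printed catalogue.
    """
    shared = {k: v for k, v in record.items() if k != "editions"}
    # If no separate editions are listed, the entry stands for a single record.
    editions = record.get("editions") or [record.get("imprint", "")]
    return [dict(shared, imprint=imp) for imp in editions]
```

Each output record keeps the common heading and title but carries its own imprint, so it can later become an individual MARC record.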
Some of the OCR errors could also be identified and cleaned at the batch-edit stage, as they followed predictable patterns. The data was further enriched at this stage by separating the imprint out into fresh fields for place and date of publication, as well as printer, series, date range, language and other information that was available in some of the more detailed records. This allowed the creation of more technically precise MARC records, populating fields for country, language and date.
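Two of these batch operations can be sketched as follows. The imprint pattern and the OCR confusion handled here (letters O and l misread inside a year) are illustrative guesses at the sort of predictable patterns involved, not the project's actual rules:

```python
import re

# Hypothetical imprint shapes such as "Lond. 1668" or "Dublin, 1791."
IMPRINT_RE = re.compile(r"^(?P<place>.+?)[,.]?\s+(?P<date>1[4-9]\d\d)\.?$")


def fix_ocr_digits(text):
    """Correct a predictable OCR confusion: O/o/I/l misread inside a year."""
    return re.sub(
        r"\b1[OoIl\d]{3}\b",
        lambda m: m.group().translate(str.maketrans("OoIl", "0011")),
        text,
    )


def parse_imprint(imprint):
    """Split an imprint string into place and date fields, where possible."""
    m = IMPRINT_RE.match(imprint.strip())
    if not m:
        # No recognisable date: keep the raw string as the place, leave date empty.
        return {"place": imprint.strip(), "date": None}
    return {"place": m.group("place").rstrip(",."), "date": m.group("date")}
```

Separated place and date values can then drive the fixed-field country and date codes in the eventual MARC records.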
With the batch edits complete by the end of April, the dataset was shared among colleagues from across TCD Library, who painstakingly compared each line of the data with all 5,121 pages of the printed catalogue. This work went on over the rest of the year, and was finally complete just before Christmas 2020. In the final phase, the proofread data was given its final integrity checks and further augmented by Niamh Harte, the project manager who converted the records into MARC format and loaded them into TCD’s live online catalogue one volume at a time. All the presenters paid tribute to the work of their TCD colleagues Niamh Harte, Barbara McDonald and John Byrne on this project.
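The final conversion step, turning a cleaned record into MARC, might look something like the sketch below. The field mapping (100 for the heading, 245 for the title, 260 for the imprint, 852 for the shelfmark) and the MarcEdit-style mnemonic output are illustrative assumptions, not the project's actual conversion profile:

```python
def to_marc_mnemonic(rec):
    """Render a cleaned record as MarcEdit-style mnemonic MARC lines.

    `rec` is a hypothetical cleaned record with heading, title, place,
    date and shelfmark keys; missing values simply leave subfields empty.
    """
    lines = [
        "=LDR  00000nam a2200000 a 4500",
        f"=100  1\\$a{rec.get('heading', '')}",
        f"=245  10$a{rec.get('title', '')}",
        f"=260  \\\\$a{rec.get('place', '')}$c{rec.get('date', '')}",
        f"=852  \\\\$j{rec.get('shelfmark', '')}",
    ]
    return "\n".join(lines)
```

In practice a library like pymarc would typically be used to emit binary MARC for loading into the catalogue; the mnemonic form above just makes the mapping visible.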
*Special thanks to Joe Nankivell for his help in summarising his work on data-wrangling and OCR for this blogpost.
Patricia Moloney is secretary of the LAICMG and works as a cataloguing librarian on the Dónal Ó Súilleabháin Collection in Special Collections, Glucksman Library, University of Limerick.