3 May 2012

Digitising legacy theses

Quite a few libraries hold theses in analogue, hardbound format. In many cases they are available to students for reference-only use in the library. Demand varies throughout the academic year, but tends to peak around the time when final-year students work on their projects. Half-empty shelves and frantic enquiries about the whereabouts of this or that thesis are typical tell-tale signs.

There are good reasons to get rid of hardcopies and replace them with digitised versions.
  • Create space on library shelves
  • Incidents of lost and possibly irreplaceable theses are eliminated
  • Enable safe and perpetual access via an online archive
  • Facilitate multi-user access (fluctuating demand can be met)
  • E-theses can be retrieved more easily (full-text searchable, accessible via OPAC/online archive)
  • Plagiarism risk is reduced significantly (all works become transparent and traceable)
  • Shelving load is reduced (free up staff time)
  • Annual stock-take workload is reduced (less stuff to count)

But there are risks too. Former students still own the copyright in their works. Chasing them up to get them to license their work (via, say, a Creative Commons licence) is often not practical. For this reason, providing public, full-text open access to archived works can become an issue. One solution is to limit access to the campus IP range. From a conversion-process point of view, only theses that are robust enough to withstand manual handling can feasibly be processed, so a residual set might still end up sitting on the shelves after all.
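
To illustrate the campus-only idea, here is a minimal sketch of an IP-range check using Python's standard `ipaddress` module. The network ranges and the function name `is_on_campus` are made up for illustration (the addresses are reserved documentation ranges, not a real campus network). In practice a repository platform or web server would normally handle this restriction through its own configuration.

```python
import ipaddress

# Hypothetical campus ranges (reserved documentation addresses, not a real network).
CAMPUS_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_on_campus(client_ip: str) -> bool:
    """Return True if client_ip falls inside any configured campus network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Anything that is not a valid IP address is treated as off-campus.
        return False
    return any(address in network for network in CAMPUS_NETWORKS)


print(is_on_campus("192.0.2.17"))   # inside the first hypothetical range
print(is_on_campus("203.0.113.5"))  # outside both ranges
```

A request from outside the listed ranges would then see the catalogue record but not the full text.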

I’d like to emphasize that there’s no right or wrong workflow for realising a conversion project swiftly. Local conditions, such as funding, available equipment and time, play their part. Below is a brief outline of how I went about this job (launched recently and ongoing) with the tools available to me.

Available equipment:
  • Fujitsu fi-6140 multi-sheet, high-speed scanner
  • Adobe Acrobat Pro 9 (not the ideal choice for OCR as Adobe’s OCR correction function is dodgy at best; Abbyy is king)

Steps involved (high-level outline):
  1. Dismantle and prepare source document for scanning
  2. Scan at 300dpi (generic capture figure)
  3. Insert copyright disclaimer
  4. Adjust image size (crop image etc.)
  5. Retain uncompressed/lossless .PDF master file
  6. Optimise file size
  7. Convert to PDF/A-1b
  8. Publish item in repository
  9. Clean up existing catalogue record held in the Library Management system and embed full-text link pointing towards the archived item
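
To give a feel for why steps 5 and 6 are separate, the arithmetic behind a 300dpi capture can be sketched as follows. This assumes A4 pages (210 x 297 mm) and 24-bit colour, neither of which is stated above, so treat the numbers as rough orders of magnitude rather than measurements from this project.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm: float, height_mm: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size scanned at dpi."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int = 24) -> int:
    """Raw image size in bytes before any compression is applied."""
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: A4 (210 x 297 mm); assumed colour depth: 24-bit.
width_px, height_px = scan_pixels(210, 297)
page_bytes = uncompressed_bytes(width_px, height_px)

print(width_px, height_px)                          # 2480 3508
print(round(page_bytes / 1_000_000, 1), "MB/page")  # 26.1 MB/page
```

Under these assumptions a single raw page comes to roughly 26 MB, which is why the master file and the optimised access copy are kept as two different things.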

A note on OCR...
One objective is to enable full-text indexing as far as possible. However, it’s important to emphasize that OCR here aims to provide a find/search capability rather than faithful replicas of the original sources. The OCR has little to no understanding of layout, formatting, or word, line and paragraph structure.

A note on PDF/A-1b...
The PDF/A standard defines provisions for long-term archiving of electronic documents in the PDF format on different platforms. All content in a PDF/A file must be contained in such a way that the file can be viewed or printed reliably over a long period of time. PDF/A-1b requires that all page content and all resources necessary for displaying or printing the document are included in the PDF file. It does not require the page content to be structured. PDF/A-1b is therefore recommended whenever no structure information is present, as in scanned documents or PDFs that have been created without structure information.
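
As a quick sanity check after conversion, one can look at which PDF/A level a file claims for itself. PDF/A files carry this claim in their XMP metadata as `pdfaid:part` and `pdfaid:conformance`. The sketch below simply searches the raw bytes for those two properties, on the assumption that the XMP packet is stored uncompressed in the file. The function name `claimed_pdfa_level` is my own, and this only reads the claim: it is no substitute for running a proper PDF/A validator.

```python
import re

# XMP can serialise a property as an attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>); match either form.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes):
    """Return e.g. 'PDF/A-1b' if the file claims a PDF/A level, else None."""
    part = _PART.search(pdf_bytes)
    conformance = _CONFORMANCE.search(pdf_bytes)
    if part is None or conformance is None:
        return None
    return f"PDF/A-{part.group(1).decode()}{conformance.group(1).decode().lower()}"


# Stand-in for the bytes of a converted file (a real call would read the PDF from disk).
sample = b'<rdf:Description pdfaid:part="1" pdfaid:conformance="B"/>'
print(claimed_pdfa_level(sample))  # PDF/A-1b
```

A result of `None` means no claim was found in plain text, which is a hint to re-check the conversion settings.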

How long does it take to process a thesis from start to end? Again, this depends on several variables, including the condition of the source document, scanner speed, OCR speed, etc. I broke the conversion process down and timed each step. With theses running to about 55–80 pages on average, the job takes about 45 minutes per thesis (give or take).
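
For planning purposes, those figures can be turned into a rough estimate. The sketch below uses only the numbers quoted above (45 minutes per thesis, 55 to 80 pages); the collection size of 200 theses is a purely hypothetical figure for illustration.

```python
def minutes_per_page(minutes_per_thesis: float, pages: int) -> float:
    """Average handling time per page for a thesis of the given length."""
    return minutes_per_thesis / pages


def project_hours(thesis_count: int, minutes_per_thesis: float = 45) -> float:
    """Total working hours to convert thesis_count theses."""
    return thesis_count * minutes_per_thesis / 60


# Per-page time at the two ends of the quoted page range.
print(round(minutes_per_page(45, 55), 2))  # 0.82
print(round(minutes_per_page(45, 80), 2))  # 0.56

# Hypothetical collection of 200 theses.
print(project_hours(200))  # 150.0
```

So each page costs somewhere between half a minute and a minute of effort, and a collection of that hypothetical size would take around 150 working hours.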

The bottom line is that shifting legacy theses from analogue to digital takes some work. However, the benefits more than justify the effort involved.

