8 Dec 2013

Google/OpenRefine for metadata cleanup and linked data

Last Friday, I attended a half-day workshop at the RIA, which provided a bird's-eye introduction to Google Refine. The morning session kicked off with a 45-minute recap of the history, complexity and limitations of databases. The rest of the time was spent playing with OpenRefine.

Google Refine (now OpenRefine) is a standalone, open-source desktop application for data cleanup and transformation. It presents data as a flat table but behaves like a relational database. It’s a hugely powerful tool, but it takes some legwork and practice to fully exploit its potential.

An immediate and very practical use of the software is cleaning up messy metadata. Say you have a text-file export containing semi-structured data; you can use transformations, facets and clustering to restructure it.
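
To give a flavour of what facets and clustering do, here is a rough Python sketch. It is not OpenRefine's own code: the `fingerprint` function is only loosely modelled on the key-collision idea behind its clustering, the function names are mine, and the sample values are invented rather than taken from the workshop data set.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def facet(values):
    """Count how often each distinct value occurs, much like a text facet."""
    return Counter(values)


def fingerprint(value):
    """Normalise a string so that trivially different spellings share one key."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, remove punctuation, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", "", ascii_value.lower()).split()
    # Sort and de-duplicate the tokens so that word order does not matter.
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group distinct values whose fingerprints collide."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one spelling are worth reviewing.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented sample values (an assumption for illustration only).
categories = [
    "Glass plate negatives",
    "glass plate negatives ",
    "Negatives, glass plate",
    "Photographs",
    "Photographs",
]

counts = facet(categories)
clusters = cluster(categories)
```

Here `counts` tells you how many rows carry each exact value, and `clusters` lists the spellings that probably mean the same thing and could be merged into one.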

Screenshot of “categories” for the sample data set:
http://data.freeyourmetadata.org/powerhouse-museum/phm-collection.zip

Google Refine can also be used to convert data values to other formats, and it can be extended with web services, for example to geocode addresses into geographic coordinates.
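
As a hedged illustration of the geocoding idea, the sketch below adds latitude and longitude columns to each row by calling a geocoder. Everything here is an assumption for illustration: `extend_with_coordinates` is a hypothetical helper, the address and coordinates are made up, and a dictionary lookup stands in for a real web service so the example runs offline.

```python
def extend_with_coordinates(rows, geocode, address_field="address"):
    """Return copies of the rows with 'lat' and 'lon' columns added.

    `geocode` is any callable that takes an address string and returns
    a (lat, lon) tuple, or None when the address cannot be resolved.
    """
    extended = []
    for row in rows:
        result = geocode(row[address_field])
        lat, lon = result if result is not None else (None, None)
        # Build a new dict so the original rows are left untouched.
        extended.append({**row, "lat": lat, "lon": lon})
    return extended


# Stand-in for a real geocoding service: invented address and coordinates.
STUB_COORDINATES = {"1 Example Street": (1.0, 2.0)}

rows = [
    {"name": "Known place", "address": "1 Example Street"},
    {"name": "Unknown place", "address": "nowhere"},
]

extended = extend_with_coordinates(rows, STUB_COORDINATES.get)
```

In practice you would swap `STUB_COORDINATES.get` for a function that queries whichever geocoding service you use; rows that cannot be resolved keep empty coordinates rather than failing.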

Check out http://collection.cooperhewitt.org/people/18060335/ for a good example of linked metadata (person search).

Resources:
OpenRefine (Project homepage)
Getting started with OpenRefine
Using OpenRefine: a manual

3 comments:

  1. Hi Alexander, I came across a post on the Programming Historian site today which covers a lot of the material from the workshop: http://programminghistorian.org/lessons/cleaning-data-with-openrefine

  2. Excellent stuff, Padraic. Cheers :-)

  3. Nice write up Alexander. I came across a useful listing of regular expression "recipes" on github when playing around with google refine last week:
    https://github.com/OpenRefine/OpenRefine/wiki/Recipes

    John
