The European Library's Deduplication Case Study


A catalogue of the duplicate books, The Wellcome Library, CC BY

Under DSI-1, The European Library (TEL) was tasked with deduplicating records and carried out a case study on the Europeana Newspapers and Europeana 1914-1918 collections. Duplicate records occur when the same record is submitted in two different collections or datasets, for example in a general digital library containing digital objects from across an institution and in a more specific thematic collection. Some duplicating collections are subsets, with all of their records included in a larger set; others overlap only partially, with some but not all of their records duplicated.

The European Library's deduplication workflow differs depending on the type of collection being checked. The Newspapers collections are structured around titles, and each title is shared across many records. This allows for a systematic analysis: checking whether a particular newspaper title (such as Berliner Börsenzeitung or Le Siècle) is also found in a different collection. If so, the records are checked to see whether they describe the same issues (based on the date) and whether they carry a digital object rather than only a bibliographic record. The Europeana Collections 1914-1918 collections are structured by set and subset, organised thematically (e.g. the Royal Library of Belgium's collections include a parent set 'Picture material', which contains several subsets such as 'Posters' and 'Drawings'). In this case the analysis relies on manually selecting metadata and title terms and searching for them in the library's overall catalogue. Through this process, ten institutions were identified as having duplicated records, of which three were deduplicated under DSI-1.
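The title-based check described for the Newspapers collections can be sketched in Python. This is only an illustration under stated assumptions, not TEL's actual tooling: the IssueRecord type, its field names, and the find_duplicate_issues function are hypothetical, and the sketch assumes that two records count as duplicates only when both point to a digital object.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IssueRecord:
    # Hypothetical record shape; field names are illustrative only.
    record_id: str
    collection: str
    title: str
    issue_date: date
    has_digital_object: bool


def find_duplicate_issues(records):
    """Return pairs of record ids from different collections that share a
    newspaper title and an issue date."""
    by_issue = defaultdict(list)
    for rec in records:
        if not rec.has_digital_object:
            # Assumption: bibliographic-only records are left out of the check.
            continue
        # Group by normalised title plus issue date.
        key = (rec.title.casefold().strip(), rec.issue_date)
        by_issue[key].append(rec)

    pairs = []
    for group in by_issue.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                # Only matches across different collections are of interest.
                if first.collection != second.collection:
                    pairs.append((first.record_id, second.record_id))
    return pairs
```

Grouping on the title and date pair keeps the check systematic: each record is visited once, and only records that land in the same group are compared with one another.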

From this analysis TEL has learned that deduplication, while valuable, is labour-intensive and varies in effectiveness. Some forms of deduplication are very easy, such as deactivating one of two identical sets, or deactivating a subset whose parent set is also in Europeana. In these cases the primary challenge is identifying the collection rather than a range of records. When two sets overlap partially, meaning that some records are in both but each set also contains records found in no other set, deduplication is much more difficult, as not only the collections but all the individual duplicated records must be identified.
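The difference between the easy and the hard cases can be illustrated with plain set operations. The sketch below assumes, purely for illustration, that each collection can be reduced to a set of comparable record identifiers; the classify_overlap function and its labels are hypothetical and not part of any TEL or Europeana tool.

```python
def classify_overlap(set_a, set_b):
    """Classify how two collections of record identifiers relate and
    return the relation together with the sorted shared identifiers."""
    a, b = set(set_a), set(set_b)
    shared = sorted(a & b)

    if not shared:
        relation = "disjoint"        # nothing to deduplicate
    elif a == b:
        relation = "identical"       # deactivate one whole set
    elif a < b:
        relation = "a_subset_of_b"   # deactivate set A as a whole
    elif b < a:
        relation = "b_subset_of_a"   # deactivate set B as a whole
    else:
        relation = "partial"         # each shared record must be handled
    return relation, shared


relation, shared = classify_overlap({"r1", "r2"}, {"r2", "r3"})
print(relation, shared)
```

In the identical and subset cases the decision is made once per collection, whereas in the partial case the list of shared identifiers is what has to be worked through record by record.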

TEL and Europeana have also investigated the possibilities for further development and what Europeana can learn from TEL's experience. TEL's current method can work for small collections, but would be ineffective at the scale that Europeana would require. Europeana encourages investigation into automating the identification of duplicate records. Deduplication would be of value to Europeana and its partners, and an automatic method would overcome both the difficulty of identifying individual records rather than whole collections and the amount of time required.

