Informatics efforts by the Mass Digitization Program at the Digitization Program Office, OCIO

Increase the quantity, quality, and throughput of digitization - DPO Strategic Goal 1

We are developing systems and tools to help Smithsonian units create and expand their digital collection records. These include:

  • Simplifying data cleanup (Excel is not always the best tool)
  • Scripting data conversion (If you need to do something more than once, a script saves you time)
  • Linking records to taxonomies or other databases (Doing this by hand is not sustainable in any large collection; we let the computers do the tedious tasks)
  • Spatial databases to improve georeferencing processes for large numbers of records
    • Detect common issues like an inverted sign in coordinates or values outside the country (A missing sign can send a record from Brazil to Madagascar)
    • Obtain approximate coordinates from a location string using natural language (There is not enough time or people to georeference records one by one)
  • Big Data tools to analyze datasets with millions of records (e.g. Google BigQuery)
  • Consolidation of strings, matching terms in a database to a single taxonomy
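
The coordinate checks mentioned in the list above (inverted signs, values outside the country) can be illustrated with a minimal Python sketch. This is not the DPO spatial database itself: the function names are hypothetical, and the bounding box for Brazil is an approximate, illustrative assumption. A real check would use country polygons in a spatial database rather than a rectangle.

```python
# Approximate bounding box for Brazil (illustrative assumption):
# (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
BRAZIL_BBOX = (-33.75, 5.27, -73.99, -34.79)


def in_bbox(lat, lon, bbox):
    """Return True if the point falls inside the rectangular bounding box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinates(lat, lon, bbox):
    """Classify a coordinate pair against the expected country's bounding box."""
    # Values that cannot be valid latitude/longitude at all.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "out of range"
    if in_bbox(lat, lon, bbox):
        return "ok"
    # Try flipping signs: if a flipped point lands inside the box,
    # a missing or inverted sign is the likely cause.
    candidates = [
        ("longitude sign inverted", lat, -lon),
        ("latitude sign inverted", -lat, lon),
        ("both signs inverted", -lat, -lon),
    ]
    for label, flipped_lat, flipped_lon in candidates:
        if in_bbox(flipped_lat, flipped_lon, bbox):
            return label
    return "outside country"


# A record from Brazil whose longitude lost its minus sign
# ends up in the Indian Ocean near Madagascar.
print(check_coordinates(-10.0, 50.0, BRAZIL_BBOX))
```

Flagged records would still need review by a person; the sketch only suggests the most likely cause.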

These are some of the projects we have been working on:

  • GBIF Issues Explorer - This Shiny app allows researchers and data/collection managers to navigate the records with issues in a GBIF Darwin Core Archive. Winner of 2nd Place Award in GBIF 2018 Ebbe Nielsen Challenge!
  • Match Getty AAT - A prototype app that matches terms in a file to the Getty Art & Architecture Thesaurus using their Linked Open Data portal. When keywords are included with a row, the app uses them to disambiguate the usage and find the best match. For terms where many matches are found, the app allows the user to select the best one. Once the process is completed, the results file can be downloaded for further processing or for importing into the CIS or another database.
  • Locality Matching - A Shiny app that takes unclear locations from transcriptions and matches them to known locations. It runs an approximate match using databases from EMu and GBIF.
  • EPICC Name Match - A Shiny app to help the NMNH Paleobiology collection with their digitization efforts by matching the scientific names from labels in the collection to the taxonomy by EPICC (Eastern Pacific Invertebrate Communities of the Cenozoic). The app tries to find a match taking into account the variety of ways that a scientific name can be written.
  • Virtual Barcodes - This system allows the vendor to look up an item in a database and scan the unique identifier from a computer screen. This reduces production time and error rate compared with other methods like paper barcodes, spreadsheets, or call and response. The database is updated twice a day with a view from the unit's CIS.
  • Packages to query EDAN from both R and Python.
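
Several of the apps above rely on approximate matching of transcribed strings against a list of known values. As a rough illustration only, the sketch below uses Python's standard difflib module; the apps themselves are Shiny apps and may use a different technique. The function names, the list of known localities, and the 0.8 cutoff are all assumptions made for this example.

```python
import difflib


def normalize(text):
    """Lowercase, drop commas and periods, and collapse repeated whitespace."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def match_locality(raw, known_localities, cutoff=0.8):
    """Return the known locality closest to the raw string, or None.

    cutoff is the minimum similarity ratio (0 to 1) accepted as a match;
    0.8 is an arbitrary starting point, not a recommended value.
    """
    # Map normalized forms back to the original spelling.
    lookup = {normalize(name): name for name in known_localities}
    hits = difflib.get_close_matches(
        normalize(raw), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# Hypothetical reference list; a real one would come from EMu or GBIF.
known = ["Rio de Janeiro", "Manaus", "Belo Horizonte"]

for transcribed in ["Rio de Janiero", "MANAUS.", "Xyzzy"]:
    print(transcribed, "->", match_locality(transcribed, known))
```

Returning None for strings below the cutoff leaves those records for a person to review, which is safer than forcing a weak match.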

We are also looking into training needs at the institution, digitization and data scrubbing tools that can benefit more than one unit, and other innovative approaches to improve the digital records. 

Some areas we are working on:

Software Tools

  • Linking data to enhance the collections


  • What informatics training is needed?
  • Resources

Informatics Resources

  • What is available at SI

DPO Projects

  • Admin and hardware projects at DPO

Spatial Database

  • Georeference
  • Check for errors in coordinates

Publications and Reports

  • By DPO Informatics

  • With contributions by DPO

Reference Info

  • APIs
  • Data sources

Social Media

  • Follow our projects

Have a data problem? We are looking for ways to help the Smithsonian units create and enhance the digital records of the collections. 

Contact us: 

Informatics Training and Events @ SI
