Informatics efforts by the Mass Digitization Program at the Digitization Program Office, OCIO

Increase the quantity, quality, and throughput of digitization - DPO Strategic Goal 1

We are developing systems and tools to help the Smithsonian units create and expand their digital collection records. Some of our work includes:

  • Simplifying data cleanup (Excel is not always the best tool)
  • Scripting data conversion (If you need to do something more than once, write a script)
  • Developing data scrubbing workflows (Optimize, standardize, and document the fixes to the data)
  • Linking records to taxonomies or other databases (Doing this by hand is not sustainable in any large collection; we let the computers do the tedious tasks)
  • Using spatial databases to improve georeferencing and detect errors across large numbers of records
    • Detect common issues like an inverted sign in the coordinates or values outside the country (A missing sign can send a record from Brazil to Madagascar)
    • Use other databases to estimate coordinates for localities (Other collections with duplicates or objects from nearby areas can help speed up georeferencing)
    • Obtain approximate coordinates from a location string using natural language processing (There is not enough time, and there are not enough people, to georeference records one by one)
  • Using Big Data tools to analyze datasets with millions of records (e.g. High-Performance Cluster, Google BigQuery)
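
The sign and out-of-country checks in the list above can be illustrated with a minimal Python sketch. It is not the team's implementation: it uses a rough rectangular bounding box instead of real country polygons in a spatial database, and the box values, `BRAZIL_BBOX`, `in_bbox`, and `check_coordinates` are all assumptions made for this example.

```python
# Approximate bounding box for Brazil, rounded and assumed for illustration:
# (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
BRAZIL_BBOX = (-33.8, 5.3, -74.0, -34.7)


def in_bbox(lat, lon, bbox):
    """Return True if the point lies inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinates(lat, lon, bbox):
    """Classify a point as 'ok', 'sign_fix', or 'outside'.

    For 'sign_fix', the returned coordinates are the first sign-flipped
    variant that falls inside the box. It is a suggestion for a person
    to review, not an automatic correction.
    """
    if in_bbox(lat, lon, bbox):
        return ("ok", lat, lon)
    # Try flipping the longitude, then the latitude, then both.
    for lat_sign, lon_sign in ((1, -1), (-1, 1), (-1, -1)):
        new_lat, new_lon = lat * lat_sign, lon * lon_sign
        if in_bbox(new_lat, new_lon, bbox):
            return ("sign_fix", new_lat, new_lon)
    return ("outside", lat, lon)


# A record labeled "Brazil" whose longitude lost its minus sign,
# which places it in Madagascar.
print(check_coordinates(-19.0, 47.0, BRAZIL_BBOX))
# prints ('sign_fix', -19.0, -47.0)
```

A bounding box is a deliberately crude test: it accepts points in neighboring countries and in the ocean. Checking against actual boundary polygons would replace `in_bbox` while keeping the same flip-and-retest logic.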


These are some of the projects we are working on:

  • Zonal OCR using Machine Learning - We are developing tools that use AI to detect text in images, a task usually referred to as optical character recognition (OCR). Because OCR can give mixed results, our approach splits each image into sections and evaluates each section separately. This allows us to reduce the amount of manual work needed.
  • Mass Georeferencing Tool - We are building a tool to georeference records at a massive scale. The tool will use new workflows to georeference many records in the least amount of time possible.
    Figure: Georeferencing Tool example
  • Spatial Data API - An API inside the SI firewall that allows you to query for political boundaries, protected areas, and other named areas in the world. Documentation coming soon.
  • Spatial Data Scrubbing Tool - Backed by a PostGIS spatial server, the tool will allow users to quickly check that the coordinates in a file are correct. The system will check by comparing the coordinates with several known databases of spatial data, including political boundaries, protected areas, species distributions, and others. 
  • Match Getty AAT - A web app that matches terms in a file to the Getty Art & Architecture Thesaurus using their Linked Open Data portal. When a row includes a set of keywords, the app uses them to disambiguate the usage and find the best match. For terms where many matches are found, the app allows the user to select the best one. Once the process is completed, the results file can be downloaded for further processing or for importing into the CIS or another database.
  • GBIF Issues Explorer - R/Shiny app that allows researchers and data/collection managers to navigate the records with issues in a GBIF Darwin Core download.
  • Virtual Barcodes - This system allows the vendor in a digitization project to look up an item in a database and scan its unique identifier from a computer screen. This reduces production time and the error rate compared with other methods like paper barcodes, spreadsheets, or call and response. The database is updated from a view into the unit's CIS at a set frequency (once or twice a day).
  • Packages to query EDAN from both R and Python.
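
The keyword-based disambiguation described for Match Getty AAT can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's code: it does not query the Getty Linked Open Data portal, the candidate records and their ids are invented placeholders, and the scoring rule (counting keywords shared with a candidate's descriptive note) is an assumed stand-in for whatever the app actually does.

```python
import re

# Invented candidate records for the term "plates"; ids and notes are
# placeholders, not real thesaurus entries.
CANDIDATES = [
    {"id": "example-1", "label": "plates (dishes)",
     "note": "shallow vessels for serving food, ceramic tableware"},
    {"id": "example-2", "label": "plates (printing)",
     "note": "flat surfaces used in printing, engraving, etching"},
]


def tokenize(text):
    """Lowercase a string and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_match(row_keywords, candidates):
    """Return (best_candidate, needs_review).

    Candidates are scored by how many row keywords appear in their note.
    A zero score or a tie for first place is flagged for a person to
    resolve, mirroring the manual selection step described above.
    """
    keywords = tokenize(row_keywords)
    scored = [(len(keywords & tokenize(c["note"])), c) for c in candidates]
    if not scored:
        return None, True
    # Stable sort: candidates with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best = scored[0]
    tied = len(scored) > 1 and scored[1][0] == best_score
    return best, (best_score == 0 or tied)


match, needs_review = pick_match("ceramic, tableware", CANDIDATES)
print(match["label"], needs_review)
```

Rows whose keywords do not separate the candidates come back with `needs_review` set to `True`, so they can be routed to the user instead of being matched silently.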


We are also looking into training needs at the institution, digitization and data scrubbing tools that can benefit more than one unit, and other innovative approaches to improve the digital records. 

Have a data problem? We are looking for ways to help the Smithsonian units create and enhance the digital records of the collections. Contact us: 




Records to Digital Data

Informatics Training and Events @ SI
