
Tools for Data Scrubbing

Here are some software tools and packages useful for data scrubbing, analysis, and visualization in almost any area. Priority is given to Open Source software. 

Software | Language or Technology | Summary


(as of 2019-07-29)



OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

Version 3.2 or higher with Java 8

  • NMNH
  • OCIO

R + Tidyverse


The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. 

  • FSG



RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

  • OA
  • FSGA
  • STRI
  • NMNH
  • NMAH

Python + Pandas


pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


  • STRI
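
As a rough illustration of the kind of scrubbing pandas is typically used for, the sketch below trims stray whitespace, turns empty strings into missing values, and drops exact duplicate rows. The function name `scrub_records` and the column names `catalog_number` and `collector` are hypothetical and only serve the example; real collection data will need its own rules.

```python
import pandas as pd


def scrub_records(df, text_columns):
    """Return a cleaned copy of df (hypothetical helper for illustration).

    - strips leading/trailing whitespace in the given text columns
    - converts empty strings in those columns to missing values
    - drops rows that are exact duplicates after cleaning
    """
    cleaned = df.copy()
    for col in text_columns:
        stripped = cleaned[col].str.strip()
        # mask() replaces values with NaN wherever the condition is True
        cleaned[col] = stripped.mask(stripped == "")
    return cleaned.drop_duplicates().reset_index(drop=True)


# Made-up sample data with trailing spaces, a duplicate, and an empty value
records = pd.DataFrame(
    {
        "catalog_number": ["A-001 ", "A-001", "A-002", "A-003"],
        "collector": ["  Smith", "Smith", "", "Jones"],
    }
)
cleaned = scrub_records(records, ["catalog_number", "collector"])
```

Returning a cleaned copy instead of modifying the input keeps the original records available for comparison, which is usually helpful when reviewing what a scrubbing step changed.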

Geospatial Data Abstraction Library (GDAL)

Library and command line tools

GDAL is a translator library for raster and vector geospatial data formats.

  • NZP

Jupyter Notebook

Python

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

  • STRI


Application

A Free and Open Source Geographic Information System.

  • STRI


File-based Database (SQL)

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world.

  • STRI
  • OCIO
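
A minimal sketch of using SQLite for a scrubbing check, here through Python's built-in `sqlite3` module and an in-memory database. The table name `specimens`, its columns, and the helper `find_duplicate_ids` are assumptions made for this example, not part of any tool listed on this page.

```python
import sqlite3


def find_duplicate_ids(rows):
    """Return (catalog_number, count) pairs for identifiers that occur more than once.

    rows is an iterable of (catalog_number, taxon) tuples (hypothetical schema).
    """
    conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
    try:
        conn.execute("CREATE TABLE specimens (catalog_number TEXT, taxon TEXT)")
        conn.executemany("INSERT INTO specimens VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT catalog_number, COUNT(*) FROM specimens "
            "GROUP BY catalog_number HAVING COUNT(*) > 1 "
            "ORDER BY catalog_number"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

For data kept between sessions, the same code works with a file path in place of `":memory:"`.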

ZBar Barcode Reader

Application and library

ZBar is an open source software suite for reading bar codes from various sources, such as video streams, image files and raw intensity sensors.

Modern Python 2.7-3.6 library: pyzbar

  • None

Tools in development

Tool | Purpose | Roadblocks or tasks to do

Georeferencing System

System that accepts a variety of inputs (CSV, Excel, database) for georeferencing against several databases. The system will seek to reduce time by combining records, comparing information between data sources, and using known previous data to estimate best matches.

  • PostGIS backend with multiple databases:
    • Political boundaries of the world
    • Protected areas
    • Geonames
  • Other APIs
    • Getty TGN
    • EDAN
    • Wikidata
  • Mapping of results
  • Approximate string search (for cases with typos, abbreviations, alternative spellings)

  • Need sources of historical data. 
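
The approximate string search listed above can be sketched with Python's standard `difflib` module. This is only a stand-in to show the idea; the page does not say which matching method the georeferencing system will use, and the function name `best_matches`, the cutoff, and the sample place names are all assumptions.

```python
import difflib


def best_matches(query, known_names, cutoff=0.8, limit=3):
    """Return up to `limit` known names that resemble `query`, best match first.

    Comparison is case-insensitive; `cutoff` is the minimum similarity ratio
    (0 to 1) that difflib requires before reporting a candidate.
    """
    # Map case-folded names back to their original spelling
    lookup = {name.casefold(): name for name in known_names}
    hits = difflib.get_close_matches(
        query.strip().casefold(), list(lookup), n=limit, cutoff=cutoff
    )
    return [lookup[hit] for hit in hits]


# Hypothetical gazetteer entries and a locality with a typo
gazetteer = ["Panama City", "Colón", "Gamboa"]
candidates = best_matches("Panma City", gazetteer)
```

A lower cutoff catches more abbreviations and alternative spellings at the cost of more false matches, so results would normally be reviewed before being accepted.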

Taxonomy matching

Will include retaining IDs so that updates, splits, and other changes can be tracked.

  • Getty AAT (see above)
  • Biological taxonomies
  • Other widely used taxonomies
  • Identify other taxonomies to use


R package that will provide a number of tools for data validation and scrubbing of collection data.

  • Make functions useful for more data types
  • Link other data scrubbing packages
  • Add documentation

Tools Released by DPO

Here are the software tools we have released. The links will take you to the source code repository. We are making the software tools from DPO available under an Apache 2.0 open source license. 

If you are interested in using these tools, we can provide a server location and customize them to your specifications. 

Software | Language or Technology | Summary | URLs



An R package to query EDAN and obtain item information.

You will need your own EDAN API key.

EDAN Python


Basic code to query EDAN from Python.

You will need your own EDAN API key.

Virtual Barcodes

R + Shiny

This system allows the vendor to look up an item in a database and scan the unique identifier from a computer screen. This reduces production time and error rate compared with other methods such as paper barcodes, spreadsheets, or call and response. The database is updated twice a day with a view from the unit's CIS.

GBIF Issues Explorer

R + Shiny

This Shiny app allows researchers and data/collection managers to navigate the records with issues in a GBIF Darwin Core Archive.

Winner of 2nd Place Award in GBIF 2018 Ebbe Nielsen Challenge!

Match Getty AAT

R + Shiny

Shiny app that takes terms from a CSV file and matches them to the Getty Art & Architecture Thesaurus. The user can also provide keywords to improve the match.

Locality Matching - Botany

R + Shiny

Shiny app that takes unclear locations from transcriptions and matches them to known locations. Runs an approximate match using databases from EMu, GBIF, and EDAN.

EPICC Name Match

R + Shiny

A Shiny app to help the NMNH Paleo collection with its digitization efforts. The app tries to match the scientific names from labels in the collection to the taxonomy by EPICC (Eastern Pacific Invertebrate Communities of the Cenozoic).

This app takes the string in the "Taxonomy" column of a CSV file and matches it with the taxonomy from EPICC. The process tries to find a match while taking into account the variety of ways that a scientific name can appear.


Raspberry Pi 3 B+

and Raspbian

Image and video kiosk using Raspberry Pi computers.

MD5 tool


Utility for Mac and Windows that calculates, and saves to a file, the MD5 hash of all files in a directory.
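
The same idea can be sketched in a few lines of Python using the standard `hashlib` module. This is not the released utility's code; the function names, the manifest file name `md5_manifest.txt`, and the output layout (hash, two spaces, file name) are assumptions chosen for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 hash of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_manifest(directory, manifest_name="md5_manifest.txt"):
    """Hash every file directly inside `directory` and save the results to a file.

    Subdirectories are not traversed, and the manifest itself is skipped.
    Returns the path of the manifest that was written.
    """
    directory = Path(directory)
    manifest_path = directory / manifest_name
    lines = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and path != manifest_path:
            lines.append(f"{md5_of_file(path)}  {path.name}")
    manifest_path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return manifest_path
```

Reading in chunks matters for large image or video files, where loading a whole file into memory just to hash it would be wasteful.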
