
Tools for Data Scrubbing

Here are some software tools and packages useful for data scrubbing, analysis, and visualization in almost any area. Priority is given to Open Source software. 

Software

Language or Technology

Summary

In TRM? (as of 2019-06-19)

URLs

OpenRefine

Java

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

NMNH

http://openrefine.org

R + Tidyverse

R

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. 

NZP/CEC; FSG; OCIO/DPO

https://www.tidyverse.org/

Python + Pandas

Python

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

SLCDA; NMNH/MCI; OCIO/DPO; STRI

https://pandas.pydata.org/

Geospatial Data Abstraction Library (GDAL)

Library and command line tools

GDAL is a translator library for raster and vector geospatial data formats.

NZP

https://gdal.org/

Jupyter Notebook

Python

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

STRI

https://jupyter.org/

QGIS

Application

A Free and Open Source Geographic Information System.

NZP/SCBI; STRI

https://qgis.org/

Tools in development

Tool

Purpose

Roadblocks or tasks to do

Georeferencing System

System that accepts a variety of inputs (CSV, Excel, database) for georeferencing against several databases. The system will seek to reduce processing time by combining duplicate records, comparing information between data sources, and using previously georeferenced data to estimate the best matches.

  • PostGIS backend with multiple databases:
    • Political boundaries of the world
    • Protected areas
    • Geonames
  • Other APIs
    • Getty TGN
    • EDAN
    • Wikidata
  • Mapping of results
  • Approximate string search (for cases with typos, abbreviations, alternative spellings)


  • Need sources of historical data. 
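The approximate string search listed for the Georeferencing System can be illustrated with a small sketch. This is not the system's code: it uses Python's standard `difflib` module as a stand-in for whatever matching method the system ends up using, and the function names, the sample place names, and the 0.6 cutoff are assumptions chosen for illustration.

```python
import difflib


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def best_matches(query, known_places, n=3, cutoff=0.6):
    """Return up to n (place, score) pairs ranked by similarity to query.

    known_places is a hypothetical list of reference names; in a real
    system it would come from a gazetteer or database table.
    """
    by_normalized = {normalize(place): place for place in known_places}
    wanted = normalize(query)
    # get_close_matches returns candidates sorted from most to least similar.
    candidates = difflib.get_close_matches(
        wanted, list(by_normalized), n=n, cutoff=cutoff
    )
    return [
        (by_normalized[c], difflib.SequenceMatcher(None, wanted, c).ratio())
        for c in candidates
    ]


# Hypothetical reference list and a query containing a typo and extra spaces.
known_places = ["Washington", "Panama City", "Front Royal", "Edgewater"]
for place, score in best_matches("Panma  city", known_places):
    print(f"{place}: {score:.2f}")
```

Normalizing both sides before comparing keeps case and spacing differences from lowering the score, so the cutoff only has to absorb real typos, abbreviations, and alternative spellings.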

Taxonomy matching

Will include keeping IDs so that updates, splits, and other changes can be tracked.

  • Getty AAT (see above)
  • Biological taxonomies
  • Other widely used taxonomies
  • Identify other taxonomies to use

collexScrubber

R package that will provide a number of tools for data validation and scrubbing of collection data.

https://github.com/Smithsonian/collexScrubber

  • Make functions useful for more data types
  • Link other data scrubbing packages
  • Add documentation

Tools Released by DPO

Here are the software tools we have released. Each link points to the tool's source code repository. The software tools from DPO are available under the Apache 2.0 open source license.

If you are interested in using these tools, we can provide a server location and customize them to your specifications. 

Software

Language or Technology

Summary

URLs

EDANr

R

An R package to query EDAN and obtain item information.

You will need your own EDAN API key.

https://github.com/Smithsonian/EDANr

EDAN Python

Python

Basic code to query EDAN from Python.

You will need your own EDAN API key.

https://github.com/Smithsonian/EDAN-python

Virtual Barcodes

R + Shiny

This system allows the vendor to look up an item in a database and scan the unique identifier from a computer screen. This reduces production time and the error rate compared with other methods such as paper barcodes, spreadsheets, or call and response. The database is updated twice a day with a view from the unit's CIS.

https://github.com/Smithsonian/Virtual-Barcodes

GBIF Issues Explorer

R + Shiny

This Shiny app allows researchers and data/collection managers to navigate the records with issues in a GBIF Darwin Core Archive.

Winner of 2nd Place Award in GBIF 2018 Ebbe Nielsen Challenge!

https://github.com/Smithsonian/GBIF-Issues-Explorer

Match Getty AAT

R + Shiny

Shiny app that takes terms from a CSV file and matches them to the Getty Art & Architecture Thesaurus. The user can also provide keywords to improve the match.

https://github.com/Smithsonian/Match-Getty-AAT

Locality Matching - Botany

R + Shiny

Shiny app that takes unclear locations from transcription and matches them to known locations. Runs an approximate match using databases from EMu, GBIF, and EDAN.

https://github.com/Smithsonian/Locality-Matching-Botany

EPICC Name Match

R + Shiny

A Shiny app to help the NMNH Paleo collection with its digitization efforts. The app tries to match the scientific names from labels in the collection to the EPICC (Eastern Pacific Invertebrate Communities of the Cenozoic) taxonomy.

The app takes the string in the "Taxonomy" column of a CSV file and matches it against the EPICC taxonomy. The matching process takes into account the variety of ways that a scientific name can appear.

https://github.com/Smithsonian/EPICC-name-match
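As a rough illustration of matching names that appear in different forms, the sketch below reads a "Taxonomy" column from CSV text, reduces each string to a canonical "Genus species" form, and checks it against a reference list. It is not the app's code (the app is written in R): the normalization rules, the function names, and the sample names are all invented for this example, and only the "Taxonomy" column name comes from the description above.

```python
import csv
import io
import re

# Qualifiers that often accompany a name on a label (assumed list).
QUALIFIERS = {"cf", "aff", "sp", "spp"}


def canonical_name(raw: str) -> str:
    """Reduce a label string to 'Genus species' (or just 'Genus')."""
    text = raw.strip()
    if text.isupper():  # all-caps labels: lowercase before parsing
        text = text.lower()
    # Drop parenthetical parts (e.g. a subgenus) and question marks.
    text = re.sub(r"\([^)]*\)", " ", text).replace("?", " ")
    tokens = [t for t in text.split() if t.lower().rstrip(".") not in QUALIFIERS]
    if not tokens:
        return ""
    parts = [tokens[0].capitalize()]
    # Keep the next token only if it looks like a lowercase epithet;
    # this skips authorship and year such as "Author, 1900".
    if len(tokens) > 1 and tokens[1].isalpha() and tokens[1].islower():
        parts.append(tokens[1])
    return " ".join(parts)


def match_rows(csv_text, reference_names):
    """Return (raw, canonical, matched) for each row's 'Taxonomy' value."""
    reference = {canonical_name(name) for name in reference_names}
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["Taxonomy"]
        name = canonical_name(raw)
        results.append((raw, name, name in reference))
    return results


# Invented placeholder names, not real taxa and not from EPICC.
SAMPLE_CSV = '''Taxonomy
"Examplea fictiva Author, 1900"
EXAMPLEA FICTIVA
Examplea (Subexamplea) fictiva?
Examplea cf. fictiva
Otherus sp.
'''

for raw, name, matched in match_rows(SAMPLE_CSV, ["Examplea fictiva"]):
    print(f"{raw!r} -> {name!r} matched={matched}")
```

Canonicalizing both the label strings and the reference names with the same function means that exact set lookup is enough for these variants; an approximate comparison would still be needed to catch misspellings.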

Pi-Kiosks

Raspberry Pi 3 B+ and Raspbian

Image and video kiosk using Raspberry Pi computers.

https://github.com/Smithsonian/Pi-Kiosk

MD5 tool

C++

Utility for Mac and Windows that calculates, and saves to a file, the MD5 hash of all files in a directory.

https://github.com/Smithsonian/MD5_tool
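For readers who want to see what hashing every file in a directory involves, here is a minimal Python sketch of the same idea. It is not the C++ tool's code, and several details are assumptions: the function names, the `checksums.md5` output name, the "hash, two spaces, filename" line format, and the choice to process only the top level of the directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large image files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_manifest(directory, manifest_name="checksums.md5"):
    """Hash each file in directory (non-recursive) and save the results.

    The manifest file itself is skipped so that rerunning the function
    does not add a hash of the previous manifest.
    """
    directory = Path(directory)
    manifest = directory / manifest_name
    files = sorted(
        p for p in directory.iterdir() if p.is_file() and p != manifest
    )
    lines = [f"{md5_of_file(p)}  {p.name}" for p in files]
    manifest.write_text(
        "\n".join(lines) + ("\n" if lines else ""), encoding="utf-8"
    )
    return manifest
```

Calling `write_md5_manifest("path/to/images")` on a hypothetical folder would write one line per file to `checksums.md5` inside that folder and return the manifest path.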
