Informatics efforts by the Mass Digitization Program at the Digitization Program Office, OCIO
Increase the quantity, quality, and throughput of digitization - DPO Strategic Goal 1
We are developing systems and tools to help the Smithsonian units create and expand their digital collection records. Some of our work includes:
- Simplifying data cleanup (Excel is not always the best tool)
- Scripting data conversion (If you need to do something more than once, write a script)
- Developing data scrubbing workflows (Optimize, standardize, and document the fixes to the data)
- Linking records to taxonomies or other databases (Doing this by hand is not sustainable in any large collection, so we let the computers do the tedious tasks)
- Using spatial databases to improve georeferencing and detect errors in large numbers of records
- Detecting common issues like an inverted sign in coordinates or values outside the country (A missing sign can send a record from Brazil to Madagascar)
- Using other databases to estimate coordinates for localities (Other collections with duplicates or objects from nearby areas can help speed up georeferencing)
- Obtaining approximate coordinates from a location string using natural language processing (There is not enough time or staff to georeference records one by one)
- Big Data tools to analyze datasets with millions of records (e.g. High-Performance Cluster, Google BigQuery)
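As an illustration of the inverted-sign check mentioned in the list above, here is a minimal sketch in Python. It is not the code used by our tools: the `BBox` class, the `diagnose` function, and the rough bounding box for Brazil are all assumptions made up for this example. A production check would compare against real country polygons in a spatial database rather than a rectangle.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in decimal degrees (hypothetical helper)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon: float, lat: float) -> bool:
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


# Rough, illustrative extent of mainland Brazil. These values are an
# approximation chosen for the example, not authoritative boundaries.
BRAZIL = BBox(min_lon=-74.0, min_lat=-34.0, max_lon=-34.0, max_lat=5.5)


def diagnose(lon: float, lat: float, box: BBox) -> str:
    """Classify a coordinate pair against the expected country box.

    Returns "ok" if the point falls inside the box, a description of the
    sign flip that would bring it inside, or "outside" if no flip helps.
    """
    if box.contains(lon, lat):
        return "ok"
    # Try each possible sign error and see whether it lands in the box.
    candidates = {
        "longitude sign inverted": (-lon, lat),
        "latitude sign inverted": (lon, -lat),
        "both signs inverted": (-lon, -lat),
    }
    for label, (x, y) in candidates.items():
        if box.contains(x, y):
            return label
    return "outside"


if __name__ == "__main__":
    # A record from central Brazil, then the same record with a lost
    # minus sign on the longitude.
    print(diagnose(-47.9, -15.8, BRAZIL))
    print(diagnose(47.9, -15.8, BRAZIL))
```

The function only suggests a likely cause. Records flagged this way would still need to be reviewed before the coordinates are changed.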
These are some of the projects we are working on:
- Mass Georeferencing Tool - We are building a tool to georeference records at a massive scale. The tool will serve as a testbed for workflows and techniques that georeference many records in the least amount of time possible.
- Spatial Data API - An API inside the SI firewall that allows you to query for political boundaries, protected areas, and other named areas in the world. Documentation coming soon.
- Spatial Data Scrubbing Tool - Backed by a PostGIS spatial server, the tool will allow users to quickly check that the coordinates in a file are correct. The system will check by comparing the coordinates with several known databases of spatial data, including political boundaries, protected areas, species distributions, and others.
- Match Getty AAT - A web app that matches terms in a file to the Getty Art & Architecture Thesaurus using their Linked Open Data portal. The app tries to find the best match by using a set of keywords included with each row, when available, to disambiguate the usage. For terms where many matches are found, the app allows the user to select the best one. Once the process is completed, the results file can be downloaded for further processing or for importing into the CIS or another database.
- Virtual Barcodes - This system allows the vendor in a digitization project to look up an item in a database and scan the unique identifier from a computer screen. This reduces the production time and error rate compared with other methods like paper barcodes, spreadsheets, or call and response. The database is updated from a view into the unit's CIS at a set frequency (once or twice a day).
- Packages to query EDAN from both R and Python.
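The keyword-based disambiguation described for Match Getty AAT can be sketched roughly as follows. This is only an illustration of the idea and not the app's implementation: it does not call the Getty Linked Open Data portal, and the `Candidate` class, the `rank_matches` and `best_match` functions, the identifiers, and the vocabulary entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    term_id: str  # made-up identifier, not a real AAT ID
    label: str    # preferred term
    note: str     # short scope note used for disambiguation


# Hypothetical vocabulary entries, for illustration only.
VOCABULARY = [
    Candidate("ex-1", "plates", "dishes: shallow vessels for serving food"),
    Candidate("ex-2", "plates", "printing: flat surfaces used to transfer an image"),
    Candidate("ex-3", "bowls", "dishes: deep round vessels for food or liquid"),
]


def rank_matches(term, keywords, vocabulary=VOCABULARY):
    """Return (score, candidate) pairs for exact label matches.

    The score is the number of row keywords found in the candidate's note.
    """
    term_norm = term.strip().lower()
    wanted = {k.strip().lower() for k in keywords}
    scored = []
    for cand in vocabulary:
        if cand.label.lower() != term_norm:
            continue
        note_words = set(
            cand.note.lower().replace(":", " ").replace(",", " ").split()
        )
        scored.append((len(wanted & note_words), cand))
    # Highest keyword overlap first; ties keep vocabulary order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def best_match(term, keywords, vocabulary=VOCABULARY):
    """Return the single best candidate, or None when there is no match
    or the top candidates tie and a person has to choose."""
    ranked = rank_matches(term, keywords, vocabulary)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][0] == ranked[1][0]:
        return None
    return ranked[0][1]


if __name__ == "__main__":
    print(best_match("plates", ["printing", "image"]))
    print(best_match("plates", []))
```

Returning `None` on a tie mirrors the behavior described above, where the user selects the best match when several candidates are found.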
We are also looking into training needs at the institution, digitization and data scrubbing tools that can benefit more than one unit, and other innovative approaches to improve the digital records.
Have a data problem? We are looking for ways to help the Smithsonian units create and enhance the digital records of the collections. Contact us:
Informatics Training and Events @ SI