Dugénie edited this page Nov 9, 2017 · 14 revisions

Introduction

Background

For several centuries, Natural History Collection (NHC) institutes (i.e. museums and botanical gardens) across the world have been responsible for preserving the physical copies of herbaria. These herbaria are collections of plants mounted on sheets, with annotations that describe each specimen. Every herbarium specimen has been collected, carefully prepared and annotated by botanists.

Herbaria hold vast collections: approximately 22 million herbarium specimens exist as botanical reference objects in Germany, 20 million in France and about 500 million worldwide. Over the years, these collections have been studied and enriched by taxonomists. Together they form a very precious yet fragile scientific foundation. In the digital era, it has therefore become an obvious step for NHC institutes to launch vast digitization campaigns of their herbaria and to limit the handling of the physical copies.

Nevertheless, the responsibility of preserving the digital versions is a new challenge. High-resolution images of herbarium specimens require substantial bandwidth and disk space. Moreover, ensuring long-term preservation of digital objects is not a straightforward task for organizations that cannot afford to manage high data volumes or to acquire the necessary knowledge of data formats that will still be readable many years from now.

Another significant challenge is the new prospect of performing all kinds of image analysis using intensive computing and post-processing techniques. Again, this requires computing skills that are not always available in NHC institutes.

Since data storage and image processing are not core skills of NHC institutes, some of them have decided to rely on a third party that can provide a Trusted Digital Repository (TDR) and shared access to the data for the whole community. Thus, all demanding tasks, such as meeting operational constraints and complying with formal Open Archival Information System (OAIS) processes, are delegated.

Objectives

The two core objectives of the Herbadrop data pilot are:

  1. long-term preservation of scientific natural heritage: collections of digitized herbaria are transferred from several European museums and botanical gardens to a TDR.

  2. extraction of written information from these images through Optical Character Recognition (OCR) analysis, which relies on intensive computing.
