Dugénie edited this page Nov 9, 2017 · 14 revisions

Introduction

Background

For several centuries, Natural History Collection (NHC) institutes (i.e. museums and botanical gardens) across the world have been responsible for preserving the physical copies of herbaria. These herbaria are collections of plants mounted on sheets, with annotations that describe each specimen. Each herbarium specimen has been collected, carefully prepared and annotated by botanists.

Herbaria hold large numbers of collections: approximately 22 million herbarium specimens exist as botanical reference objects in Germany, 20 million in France and about 500 million worldwide. Through the years, these collections have been studied and enriched by taxonomists. Altogether, they represent a very precious yet fragile scientific foundation. Therefore, in the digital era, it has become an obvious step for the NHC institutes to launch a vast digitization campaign of their herbaria and to limit the handling of the physical copies.

Nevertheless, the responsibility of preserving the digital versions is a new challenge. High-resolution images of herbarium specimens require substantial bandwidth and disk space. Moreover, ensuring long-term preservation of digital objects is not a straightforward task for organisations that cannot afford to manage large data volumes or to acquire the necessary knowledge of data formats that will remain readable for many years.

Another significant challenge is the new prospect of performing all kinds of image analysis using intensive computing and post-processing techniques. Again, this requires computing skills that are not always available in NHC institutes.

Since data storage and image processing are not core skills of NHC institutes, some of them decided to rely on a third party that can provide a Trusted Digital Repository (TDR) and shared access to the data for the whole community. All demanding tasks, such as meeting operational constraints and complying with formal OAIS processes, are thus delegated.

Objectives

The two core objectives of the Herbadrop data pilot are:

  1. long-term preservation of scientific natural heritage: collections of digitized herbaria are transferred from several European museums and botanical gardens to a TDR.

  2. extraction of written information from these images through Optical Character Recognition (OCR) analysis, using intensive computing.

The long-term preservation is an ongoing task that mainly concerns operational aspects. Therefore, for the purpose of this paper, the focus is on reporting the analysis of the OCR results.

Community partners

Initially, the consortium was formed by five NHC institutes from Finland, France, Germany, the Netherlands and Scotland. Their common objective was to share their herbaria for future research projects. Making the specimen images and data from different institutes available online allows cross-domain research and data analysis for botanists and researchers with diverse interests (e.g. ecology, social and cultural history, climate change).

BGBM (De.): The work of the Botanischer Garten und Botanisches Museum (BGBM) of Berlin is to a large extent based on its scientific plant collections. A central element of its activities is taxonomic research, through which plants are identified, described, named and classified.

MNHN (Fr.): The Muséum National d'Histoire Naturelle (MNHN) of Paris is in charge of the main collection of botanical and zoological specimens in France. Between 2008 and 2012, it completed a massive digitization program of its herbarium specimens, putting nearly 6 million images online. It will greatly benefit from the pilot for both long-term preservation of image files and extraction of the label information.

RBGE (Sco.): The Royal Botanic Garden Edinburgh (RBGE) has a very active herbarium of 3 million specimens and a living collection of around 64,000 plants. All of the living collection records, including more than 40,000 linked images, are online, and 300,000 of the herbarium specimens are available online as high-resolution images. RBGE has incorporated OCR technology into its digitisation workflow and is currently testing Handwritten Text Recognition.

Digitarium (Fin.): Digitarium is the digitisation centre of the Finnish Museum of Natural History and the University of Eastern Finland. In 2014, Digitarium coordinated an H2020 proposal for designing a European distributed digitisation infrastructure for natural heritage (acronym: ICEDIG).

Naturalis (NL): Naturalis Biodiversity Center (Naturalis, Leiden, The Netherlands) is the merger of the National Museum of Natural History, the Zoological Museum of Amsterdam and the National Herbarium of the Netherlands. Naturalis has just finished its mass digitisation project, in which 4.2 million higher plants were scanned, databased and published. Naturalis decided not to extend its membership after the first phase of Herbadrop, but will come back since it is a partner in the forthcoming ICEDIG project.

A new partner, Botanic Garden Meise, Belgium, joined the consortium during 2017. The herbarium of Botanic Garden Meise houses around 4 million specimens. The Vascular Plant Herbarium contains three main collections: the General Herbarium with more than one million specimens; the Belgian Herbarium with about 200,000 specimens; and the African Herbarium comprising at least one million specimens (of which over half are from central Africa). The 800,000 specimens in the Cryptogam Herbarium consist of mosses, lichens, algae, fungi and myxomycetes.
