Tom the Librarian

19 March 2008

Biodiversity Geek’s Haven – model for the BLC and others?

Folks from the Biodiversity Heritage Library gave a presentation to Boston Library Consortium (BLC) members today about how they are using books and serials scanned from their collections into the Internet Archive (as charter participants in the Open Content Alliance) to create a scholarly portal (geek’s haven) for accessing their content in a variety of interesting ways. The natural science collections they are scanning, some of the oldest yet still currently used scientific literature, lend themselves to searching by species and other like names. The most intriguing tool they have developed cross-indexes all the content of the books and journals they have scanned (and are continuing to scan) against the NameBank taxonomic classification system (currently at 10,775,553 records) created by the Marine Biological Laboratory in Woods Hole, Massachusetts, whose library, the MBLWHOI Library, is also a member of the BLC. As they explained it, names of plants, animals, insects, etc. in scientific literature very much depend upon history and precedence – where does this fit in with what has been observed and classified before? – which sounds to me a lot like the ISI principle of citation history – who cites whom – tracking the growth and development of a scholarly body of literature.

There’s no reason these same principles could not be applied to other scholarly schemes. Someone mentioned, for example, that tracking every instance of the words “Tom Sawyer” in fiction not written by Samuel Clemens, utilizing a human “namebank,” would yield some fascinating results. A multi-type library academic consortium such as the BLC could provide fascinating “windows” into its scanned collection(s) this way. It also strikes me that there are a lot of institutional repository-like lessons to be learned here, as well as a striking example of creating a sophisticated web interface using a dazzling variety (“purposeful emerging technology”) of off-the-shelf web tools / software / applications, etc.

Hit more for my detailed notes on today’s meeting-

Boston Library Consortium
Boston Public Library, Mezzanine Meeting Room
Tuesday, March 18, 2008, 10am-Noon

Robert Miller, Internet Archive: Scanning Update

Over 300,000 books total; 10,308 scanned in Boston. Scanning approximately 1,000 books/day, anticipating capacity of 1,500 soon; 7 million pages/month going up to 10 million pages/month.

  • 2006: 20 million pages scanned
  • 2007: 50 million pages scanned
  • 2008: shooting for 100 million pages scanned

Scanning locations: Los Angeles, San Francisco, Toronto, Boston Public Library, New York Public Library, Library of Congress, Microfiche scanning at Univ. of Alberta, Microfilm scanning at Univ. of New Brunswick. Satellite scanning locations: Univ. of North Carolina, North Carolina State, Johns Hopkins, Getty LA, Smithsonian, Univ. of Illinois, Guatemala, London, National Library of Scotland, National Library of Sweden.

Technical improvements: near the end of developing effective procedures for handling foldouts of up to 18 X 24 inches (188dpi @ 18X24); can scan microfilm en masse at San Francisco and is deploying the same at Univ. of Alberta. Migrated to Abbyy (FineReader) 8.0 OCR software from 6.0 (skipped 7.0). A surprising development has been the need to repair 40-45 cameras monthly. Moved from the Canon 1D (EOS-1D), which had a shelf life of 1,000,000 shots (advertised at 250,000), to the Canon 5D (EOS 5D), which, with a published life of 100,000 shots, tends to fail at 150,000. Overhauled some scribes (scanning work stations); the engineer working on the project declared that with proper maintenance (oiling joints, etc.) scribes should have a useful life of 100 years. Miller spoke about the web application that records downloads of IA text works. Archive folks are confident that this application does not record spiders and other automated inquiries, but is designed to record instances of people downloading a publication and viewing individual pages. The BLC Executive Director mentioned emailing information about this, which I cannot find, and I have placed a follow-up inquiry to learn more.

Staff: 30% of staff stay one year or longer; acceptable considering repetitive nature of work and modest pay; enthusiasm and commitment high. Expects current economic climate will yield better educated and more motivated work force.

Delivery issues: BLC Executive Director announced a need for feedback on the three available delivery options for transporting materials to the BPL for scanning and back. She will want one response per institution, and will probably use Doodle to record those responses.

Global Library for Life: Biodiversity Heritage Library

Martin Kalfatovic
Head, New Media Office and Preservation Services Department
Smithsonian Coordinator, Biodiversity Heritage Library
Smithsonian Institution Libraries

Slide show for Martin Kalfatovic’s and Chris Freeland’s presentations found at http://www.slideshare.net/Kalfatovic/global-library-of-life-the-biodiversity-heritage-library

BHL – Biodiversity Heritage Library is a collaborative content project. Cites Charles Darwin on how any study of natural science requires an extensive reference library. Libraries house 250 years of published records dating back to Carl Linnaeus and are the trusted resource of deposit for taxonomic literature. Among all the sciences, taxonomy has the longest half-life with much of this literature being continually consulted. The idea for the BHL found genesis in a 2003 Telluride meeting conceptually developing the Encyclopedia of Life (EOL). The idea became concrete among natural history libraries at a 2005 London conference, including the MBLWHOI Library among its core members. There are also partner libraries.

BHL is first focused upon literature. Why? The NOW factor. Print can be scanned at a low cost of 10-19 cents/page. The time for mass digitizing has arrived and the Internet Archive scanning project provides an effective vehicle. Natural history literature has a well-defined domain of pre-1923 literature with a core of 100 million pages. There is a convergence of interest now, too, as evidenced by the Global Biodiversity Information Facility (GBIF) and the Darwin Declaration.
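For a rough sense of scale, the per-page cost and the core page count quoted above can simply be multiplied out. This is only my own back-of-the-envelope sketch: the function name `scanning_cost_range_dollars` is made up for illustration, and the inputs are just the figures mentioned in the talk, not a BHL budget.

```python
# Figures as quoted in the presentation; everything else here is illustrative.
CORE_PAGES = 100_000_000        # core of pre-1923 natural history literature
COST_CENTS_PER_PAGE = (10, 19)  # low and high scanning cost per page, in cents


def scanning_cost_range_dollars(pages, cents_range):
    """Return (low, high) total scanning cost in whole dollars."""
    low_cents, high_cents = cents_range
    # Work in integer cents to avoid floating point rounding.
    return pages * low_cents // 100, pages * high_cents // 100


low, high = scanning_cost_range_dollars(CORE_PAGES, COST_CENTS_PER_PAGE)
print(f"${low:,} to ${high:,}")  # prints "$10,000,000 to $19,000,000"
```

So the whole core, at the quoted rates, would land somewhere between ten and nineteen million dollars in scanning costs alone.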

BHL Tools: Serial titles – there is a “bidding” (this is not the term used by the British for the process as that would be too commercial and competitive sounding) mechanism for institutions to select serial titles for scanning in order to avoid duplication; followed by a partial bidding to allow for needed fill-ins in the run. Monographs – there is a de-duping tool used to avoid unnecessary scanning. WonderFetch allows additional XML information alongside the MARC XML: intellectual property info, due diligence info, other IDs, enumeration and chronological information for serials, etc. OCLC Collections Analysis Tool has been used for gross collection analysis.

BHL’s connection to the Encyclopedia of Life is key. BHL content will underpin EOL entries, and EOL is a key vehicle for entry into BHL content. Legacy literature is a critical part of the EOL.

BHL principals are utilizing IA’s divergent scanning centers to cope with libraries in multiple locations producing a single content enterprise. Have already added 3.5 million scans to a pre-IA base of 2 million from early projects.

Why not just leave it to Google? Libraries are the depository of record!

  • Will Google be here in 50 years? The long now – institutions persist thru time and provide more stability than the commercial world. Institutional libraries are more likely to still be here.
  • Bibliographical accuracy; what editions exactly? Google has already combined multiple print copies into single online iterations without detailed bibliographic information demonstrating a re-use and re-purposing philosophy that does not respect congruence of original to digital.
  • Quality! Shows a Google scan with fingers prominently displayed and the poor quality of the resulting image once the fingers have been “cleaned” out.
  • Need for persistent identifiers for scholarly use.
  • Superior structural markup of the publication using CiteSeer from Penn State.
  • Semantic markup classification of species mentioned in text, etc., using GoldenGATE and INOTAXA systems. Chris later noted that semantic markup is controversial and BHL goal is agnosticism using whatever tools are developed / available.
  • BHL is using an opt-in approach with publishers and NOT Google’s opt-out approach. BHL hopes to work with publishers and scholarly societies, and cites Herpetological Review as an example of folks with whom they are working. This approach offers increased use of publishers’ and societies’ journal literature (higher citation rates) and effective long-term management of archived content.

What is the one most unanticipated thing in this whole project? More staff needed!

Chris Freeland
Director of Bioinformatics, Missouri Botanical Garden
Technical Director, Biodiversity Heritage Library

Describes himself as a technologist: not a librarian (although he has worked in libraries since middle school), not a programmer (gave that up to the more able), not strictly information technology. The BHL uses purposeful emerging technology, NOT playing around with technology. Goal is to unlock the library collections’ literature upon which scientific assertions are based. Connect:

  • scientist –> literature
  • literature –> users
  • users –> users (inspire and create community, web 2.0)

What does the BHL do? Ingests, in weekly downloads, information about newly scanned material from 1) the Internet Archive, 2) Botanicus (BHL predecessor model limited to plant life), and 3) other sources, utilizing a quality assurance process to identify only the best. Takes Library of Congress subject headings for the materials, parses them, and creates a tag cloud interface for those headings, reflecting the content within, as an initial opening screen. At 3,500 titles now, the default limits the initial display to the top 100 subject headings (clicking to display all subject headings in a tag cloud just about choked my computer). Takes geographic subject headings and applies them to the Google Maps API – some anomalies, pointing out for example that Google Maps places the center of the United States in the Northwest, taking Alaska and Hawaii into consideration. For display uses the jpeg 2000 at 85% original size (usually from RAW image file or TIFF). To accommodate variable browser and connection capability, uses the LizardTech decoder (see the Internet Archive listed as a public site utilizing LizardTech products) and loads jpeg 2000 in 256 tiled squares with the help of GSIV (tiles onscreen like Google Maps; an AJAX-like product). Very effective platform for distributing high quality images, especially since BHL publications are full of paintings, illustrations, etc. The BHL portal interface was launched February 27, 2008. Integrates a blog and conference info, and provides RSS feeds and stable URLs (not yet officially “persistent”).
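To make the subject-heading tag cloud idea a bit more concrete for myself, here is a minimal sketch of how parsed headings could be counted and trimmed to a top-100 display. None of this is BHL’s actual code: the function `top_subject_tags`, the assumption that subdivisions are separated by “--”, the once-per-title counting, and the font-size scaling are all my own guesses at one reasonable approach.

```python
from collections import Counter


def top_subject_tags(headings_per_title, limit=100, min_size=10, max_size=32):
    """Count subject terms across titles; return (term, count, font_size) for the top ones.

    headings_per_title: one list of subject heading strings per scanned title.
    """
    counts = Counter()
    for headings in headings_per_title:
        terms = set()
        for heading in headings:
            # Assumption: subdivisions are separated by "--", as in many catalog displays.
            for part in heading.split("--"):
                term = part.strip().rstrip(".")
                if term:
                    terms.add(term)
        counts.update(terms)  # each term counts once per title

    top = counts.most_common(limit)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    spread = (highest - lowest) or 1  # avoid dividing by zero when all counts tie
    return [
        (term, n, min_size + (n - lowest) * (max_size - min_size) // spread)
        for term, n in top
    ]
```

Capping the list with `limit` before any rendering happens is the part that would have spared my browser: only the most common headings ever reach the page unless the user asks for everything.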

Text OCR is run through the MBLWHOI taxon finder server. First removes known English, then crosses remaining text against the NameBank IDs looking to identify scientific names. So far 2.9 million pages have yielded 14 million scientific names, exhibiting the taxonomic strength of this literature, with a heavy preponderance so far toward insects and the marine. False positives, of course, require context to identify. Earliest currently displayable publication is from 1490. Demonstrates 39 titles tagged as “pictorial works” and shows an exquisite Album of Abyssinian Birds. BHL changes IA OCR for the book using CiteSeer to parse the book into individual pages. Identifies pages as to type: map, illustration, text cover, end plate, etc. Pages with figures may be “false drops” in that the figure may be embedded within text on the page and not solely a figure page. Does not download IA’s jpeg 2000 scanned images; leaves them on IA’s servers and retrieves them from there. Sees jpeg 2000 images as an excellent building block to move forward. Have not implemented RDF (Resource Description Framework) yet, but the goal is to make it open, and movement is towards the Fedora Commons platform, which is RDF based.
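The two-step name lookup described above (drop known English words, then check what remains against a name list) can be sketched roughly as follows. This is only an illustration under my own assumptions: the tiny `ENGLISH_WORDS` and `NAMEBANK` sets, their made-up IDs, and the `find_scientific_names` function are hypothetical stand-ins, not the actual MBLWHOI taxon finder or real NameBank data.

```python
import re

# Hypothetical stand-ins for a full English word list and the NameBank index.
ENGLISH_WORDS = {"the", "was", "found", "near", "and", "a", "specimen", "of", "shore"}
NAMEBANK = {"homarus americanus": 1001, "apis mellifera": 1002, "homarus": 1003}


def find_scientific_names(ocr_text):
    """Return (name, id) pairs found in OCR text.

    Step 1: drop tokens that are known English words.
    Step 2: match what remains (two-word names first, then single words)
            against the name list.
    """
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    candidates = [t for t in tokens if t.lower() not in ENGLISH_WORDS]

    hits = []
    i = 0
    while i < len(candidates):
        # Prefer a two-word binomial (genus + species) when one matches.
        if i + 1 < len(candidates):
            pair = f"{candidates[i]} {candidates[i + 1]}".lower()
            if pair in NAMEBANK:
                hits.append((pair, NAMEBANK[pair]))
                i += 2
                continue
        single = candidates[i].lower()
        if single in NAMEBANK:
            hits.append((single, NAMEBANK[single]))
        i += 1
    return hits
```

Note that this naive version pairs up whatever words are left after the English ones are removed, even if they were not adjacent in the original text, which is one easy way to see how false positives creep in and why context is needed to weed them out.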

Developer Tools: Disambiguation required with metadata coming in from multiple source catalogs (authority / consistency issues). What about user-supplied metadata to be added to illustrations, etc.? Many good ideas yet to incorporate. Some issues over counting by IA – what is it counting as use? Demos Scientific Name page – right now taxonomy is not linked to hierarchy, goal for near future. BHL is a Frankenportal – not a single algorithm has yet been written; everything is grabbed from available tools off-the-shelf. Moving to LMN DT (lost on me).

Driver is: Think how a scientist will search the collection –> a geek’s interface per Brewster Kahle. Reminds me of the ISI who-cites-whom principle – taxonomic literature as described today is driven by citation and precedence, and BHL is providing access to that history back to its roots. Another driver is provider integration, especially Encyclopedia of Life, so related to BHL founding. Wikipedia also mentioned. Would love to take EOL development and repurpose, “classification banks.”

There is no done. There will always be standards to which to adhere and functionality to improve.

During the question and answer period several in attendance asserted that some BLC members asked / are de-duping against Google Scholar content. There was much puzzlement over this for all of the reasons listed by the BHL folks.

Serendipitous use: there is no predicting what will get used, how, and why. UMass Amherst has already been delighted to find the popularity of scanned copies of a local, just-this-side-of-xeroxed publication called Fruit Notes.

   

2 Comments »

  1. Tom, many thanks for the excellent coverage here, and for your questions during the meeting! The slides from our talks are online at:
    http://www.slideshare.net/chrisfreeland/biodiversity-heritage-library-for-boston-library-consortium/

    and

    http://www.slideshare.net/Kalfatovic/global-library-of-life-the-biodiversity-heritage-library

    Comment by Chris Freeland — 20 March 2008 @ 9:47 pm

  2. Tom,

    Thanks for the good report. I’ve heard brief descriptions of this. It is an exciting project.

    You might be interested in poking around in what’s being done in various “semantic web” projects. A couple places to start:

    http://www.w3.org/2001/sw/
    http://simile.mit.edu/

    Jack

    Comment by Jack Ammerman — 27 March 2008 @ 4:57 pm

