Raptor Research News
With Scientist & Citizen-Scientist Participation
Sponsored by Hancock Wildlife Foundation

Preface to The Electronic Edition of Raptor Research and Management Techniques

Why an "Electronic Edition," not just an E-book?

The Electronic Edition of RRMT is not just an e-book. If it were, there would be no reason for a preface explaining what was done and why; you would simply have an electronic presentation of exactly what was in the paper book, little if at all different from the PDF file used for the original printing.

Normally a book of scientific papers such as this would be presented as a whole, with a fairly minimal "meta" information page telling something about what might be found in the book, who wrote it, who edited it, who published it, and when it was published.

In fact, this electronic edition includes exactly that: a PDF version of the whole book. It also includes a PDF version of each chapter and of each individual page, along with several additional presentations of the original text and graphics in other formats, for the reasons explained below.

Beyond simply presenting this book in an electronically readable format, we expect it to serve as the basis for extending and widely disseminating the knowledge base it represents. To do this, the text, images and extended meta information needed to be available in HTML format as well as PDF.

The problem with PDF presentation is that a number of web-based facilities are either not possible or not practical when only PDF is used. The objectives of the Raptor Research News site require several of these, including in-depth discussion of and reference to the work, its graphs and its images, and real-time translation of the text (via the Google Translate function) into any of the supported languages. Translation matters because this information is borderless: the book has already been translated into Japanese for print, and 3 other language versions are in progress.

So, in addition to the exact PDF replication of the book, we have also included straight text, graphic presentations of the pages in various sizes, and individually presented images from within the text. We have also used a variety of tools and facilities to create a keyword index and meta information, not only for use within the site but for the web in general.

Technical Challenges

Our objectives included a need to have the full text of the original book available for various extended facilities such as keyword indexing and meta information creation. Normally, the original text of a book can be extracted from the manuscript after editing but before layout. Unfortunately this was not possible in this case.

For technical reasons we had to start with a PDF with no original text available.

In the preparation of the printed version of this specific book, many different original papers were brought together and combined in the master layout program. During the preparation there were edits applied after the original text was pulled into this proprietary layout program and, unfortunately, there is no way to get just the original text back out again. The only export is to PDF.

This meant that some of the extra work had to be done with text generated by Optical Character Recognition software. This technology inevitably introduces some errors, which will require manual editing. We're hoping that eligible individuals will step up to the plate and do this, as "many hands make light work". Of course we could have had the book copy-typed, but this too carries the likelihood of transcription errors, and in any case there was neither the time nor the funding to have it done.

The Technologies Involved

The original layout document was re-rendered into a "web resolution" PDF file as one document. This was then broken up into individual pages using an open source facility called the PDF Toolkit (pdftk).
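As an illustration of the splitting step, here is a minimal Python sketch that builds a pdftk command line using its "burst" operation, which writes one PDF per page. The exact command used for this book is not recorded here; the file name book_web.pdf, the output pattern and the helper name burst_command are assumptions made for the example.

```python
import shlex


def burst_command(source_pdf, output_pattern="page_%04d.pdf"):
    """Build the pdftk argument list that splits a PDF into one file per page.

    Both the source file name and the output pattern are hypothetical;
    pdftk fills in the page number where the printf-style %04d appears.
    """
    return ["pdftk", source_pdf, "burst", "output", output_pattern]


cmd = burst_command("book_web.pdf")
print(shlex.join(cmd))  # show the command line that would be executed

# To actually run it (requires pdftk to be installed and on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command before anything is written to disk.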

A stand-alone, bootable version of Linux (WatchOCR), set up with other open source tools, was then applied to these 464 individual page files to render them into TIFF images and then to run Optical Character Recognition (OCR) software against those images. I'll note here that the OCR software (Cuneiform) in its original form has been used by a number of manufacturers as the "proprietary" package included with their scanners and fax modems for some years now.

The OCR scripting software was modified slightly to leave the TIFF image files intact rather than throwing them away after the OCR process was done, and the OCR processing was done a second time to produce a "web" version of the results with the images as separate files.

Once these text and image files were available, their contents were used to produce first a master word list, then word lists by chapter, and finally word lists by page. These lists had "uninteresting" words removed, and the results were used as "tag" key words for the keyword indexing function of the website, as well as to create the "meta keyword" HTML tags that would allow search engines to focus on the strengths of each page.
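The word-list step can be sketched roughly as follows. This is not the script actually used for the site, only an illustration of the idea under stated assumptions: the stop-word set, the minimum word length, and the function names keywords and word_lists are all hypothetical, and the OCR text is assumed to be held in a dictionary keyed by (chapter, page number).

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical list of "uninteresting" words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
              "was", "for", "on", "with"}


def keywords(text, stop_words=STOP_WORDS, min_length=3):
    """Count the words in one page of text, skipping short and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words
                   if len(w) >= min_length and w not in stop_words)


def word_lists(pages):
    """Return (master, by_chapter, by_page) word counts.

    pages maps (chapter, page_number) -> OCR text of that page.
    """
    by_page = {key: keywords(text) for key, text in pages.items()}
    by_chapter = {}
    master = Counter()
    for (chapter, _page), counts in by_page.items():
        by_chapter.setdefault(chapter, Counter()).update(counts)
        master.update(counts)
    return master, by_chapter, by_page


sample_pages = {
    (1, 1): "The eagle and the nest",
    (1, 2): "Eagle banding",
    (2, 3): "Owl nest",
}
master, by_chapter, by_page = word_lists(sample_pages)
print(master.most_common(3))
```

Counting once per page and then summing upward keeps the three lists consistent with each other, so a word removed at the page level disappears from the chapter and master lists as well.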

The full text itself was used in two ways. First, it was used to create a "meta description" tag for the search engines (which in some cases needs manual editing, a task for the future). Second, it provides a text version of each page's PDF/image so that the search engines do not have to OCR either the PDF or the image themselves. This text is normally hidden from viewers unless they press the "OCR" link near the bottom of each page.

A script pulled all these various versions and pieces of the pages and chapters into a format that could be directly imported into our chosen Content Management System, GLfusion, on a page-by-page basis. We note that, prior to any manual editing of the current pages, this script can be re-run with minor modifications so that the basic layout of the 464 individual pages can be changed if this is deemed necessary or desirable. Once manual changes have been made to the individual page content, this is no longer practical.

The Results

The initial presentation, which is open to suggestions from the members of the site, both for re-implementation of this specific book and for future book renderings, consists of a hierarchy of: Book, Sections/Chapters, Pages, PDF/Image/Text/Meta-tags/Keywords

The book's main page is available from the navigation menu under "Books Online" and consists of a lightened version of the cover over which a slightly extended HTML presentation of the table of contents is rendered, with links to the chapters either in the HTML hierarchy or as PDF files.

The Chapter HTML pages present a large thumbnail of each contained page, which is "hot" (a clickable link) to the HTML version of that page, as well as a link to the specific page's original PDF file.

The individual pages present a JPG graphic rendition of the original page in large enough format that it should be easily readable (we've aimed this whole site at viewers with minimum 1280x1024 monitors as this is the "sweet spot" for today's viewing web audience). This JPG image is "hot" to the original PDF. Links at the top and bottom allow navigation to the next/previous pages and to the chapter page or book main page.

In addition to the graphic presentation of the original page, the OCR-extracted versions of the images and tables are presented if the page had any. The originals of these graphics can be linked here if the OCR version is inadequate (for instance, if a color version is available).

A link "OCR" will display the editable text version of the page as the OCR read it.

If the page has key words from the list of such words for the whole book, a "Tag" cloud line of links is included for them. Each link takes the viewer to a list of other pages in the web site that contain that key word.
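The lookup behind such tag links amounts to an inverted index from key word to pages. The sketch below shows the idea only; on the site itself this is handled by the CMS, and the page identifiers and the function names build_tag_index and tag_links are hypothetical.

```python
from collections import defaultdict


def build_tag_index(page_keywords):
    """Invert {page: set of key words} into {key word: set of pages}."""
    index = defaultdict(set)
    for page, words in page_keywords.items():
        for word in words:
            index[word].add(page)
    return index


def tag_links(page, page_keywords, index):
    """For each key word on a page, list the other pages that share it."""
    return {word: sorted(index[word] - {page})
            for word in sorted(page_keywords[page])}


page_keywords = {
    "page-010": {"eagle", "nest"},
    "page-011": {"eagle", "banding"},
    "page-200": {"owl", "nest"},
}
index = build_tag_index(page_keywords)
print(tag_links("page-010", page_keywords, index))
```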

In addition to these viewable features, the page includes two "meta-tag" HTML constructs that provide both "description" and "keyword" information to web search engines that need or will use them (not all do).
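For readers unfamiliar with these constructs, here is a minimal sketch of how the two tags might be generated from a page's text and key words. The 160-character limit on the description and the function name meta_tags are assumptions made for the example, not details of the site's actual script.

```python
from html import escape


def meta_tags(description, keywords, max_description=160):
    """Return "description" and "keywords" meta tags as an HTML string.

    The description is collapsed to single spaces and, if too long,
    cut back to the last whole word within max_description characters
    (a hypothetical limit chosen for this sketch).
    """
    summary = " ".join(description.split())
    if len(summary) > max_description:
        summary = summary[:max_description].rsplit(" ", 1)[0]
    safe_summary = escape(summary, quote=True)
    safe_keywords = escape(", ".join(keywords), quote=True)
    return ('<meta name="description" content="' + safe_summary + '">\n'
            '<meta name="keywords" content="' + safe_keywords + '">')


print(meta_tags("Field methods for   trapping & banding raptors.",
                ["trapping", "banding", "raptors"]))
```

Escaping the text matters here because OCR output can contain quotation marks or ampersands that would otherwise break the tag's attribute.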

Potential Extensions

There is both a comment (per page) and a discussion forum (per chapter) facility available. At the present time the comment facility is turned off for this book's pages.

In addition, a WIKI feature (the dokuwiki plugin to GLfusion) is available under the CMS. It could potentially be pre-loaded with the original text of the book's chapters, from which fully documented modifications and extensions by members or other authorized persons could be created, if this is deemed feasible.