About Odessa

What Is Odessa?

Odessa is a digital library dedicated to the cultural and family history of the millions of Germans who emigrated to Russia in the 1800s and their descendants, who are now scattered throughout the world.

The Odessa document collection consists primarily of digitized books and records plus indexes of microfilms and research aids that enable users to trace individual and family migrations since the early 1800s.

Odessa is made freely available by the author, Roger Ehrich, on whose cloud server the library resides.

Contacting People

One of the most important goals of the Odessa Library has been to facilitate communication among people searching for information. Unfortunately, email address harvesting for spam and the frequency with which people change email addresses make it impractical to publish email addresses openly in library documents. Hence in this version of the library, all email addresses have been removed from web pages and documents.

In their place is an author search facility on the library search page. Each library document has a copyright notice containing the name of the document author or compiler. Just enter the author's surname (spelled correctly) into the Author Surname text box on the Odessa search page and click Submit. You will receive the latest known contact information for that author. Should this information be out of date, you can help other users of the library by sending updated information to the librarian.

For the reasons above, this is the only place in the Odessa Library showing the email address of the librarian, which must be typed manually:

Librarian Address

Copyright Notice

All documents in this library are copyrighted; additional details are provided in the individual document headers. They may be freely used for personal, nonprofit purposes or linked by other WWW sites. They may also be shared with others for personal use, provided headers with copyright notices are included. However, no document may be republished in any form or embedded in public databases without permission of the copyright owner, since that represents theft of personal property.

For an introduction to copyright law, please read An Intellectual Property Law Primer for Multimedia and Web Developers.

Introduction

Since late 1993, a small group has been considering ways to apply state-of-the-art networking technologies to the problem of conducting genealogical research. Already in 1993 the developers recognized that finding information distributed across thousands of personal websites would pose considerable difficulties. Three main approaches to search are common today: personal websites give authors control over their information at the risk of obscurity, centralized libraries offer superior search capabilities but give contributors less control over their data, and webrings seek a compromise between the distributed and centralized approaches.

This digital library differs from a paper library in two main respects: the index is hierarchical instead of flat like a card catalog, and much of the collection can be searched electronically. The article on retrieval and copyright is recommended. Library size as of December 2006 is about 360 Mbytes.

Full Text Search

Users are strongly encouraged to acquire a full text retrieval program and to index downloaded documents with it, so that they can be searched locally in much the same way as Google is used to search the Internet. A number of good programs are available; my personal favorite is DTSearch. The Odessa 3 search program is unique in that it returns not only the documents containing the query but also the actual information that it finds in each document. Thus, with the Odessa search program it isn't necessary to download any documents unless additional information is desired.

Those who do not have a full text retrieval program can read the data files on their own computers by opening the documents in a text editor. Most word processors can also import plain text documents. However, for aligned data to be displayed properly, it is important to use a non-proportional text font. 8-point Lucida Sans Typewriter is an excellent choice, since it is available in the Microsoft font collections both for the PC and for the Mac.

Each Odessa document contains a header that describes the document and the document fields. The header also provides a version date and the name of the compiler and copyright owner. If you have questions about the data that are not explained here, please contact the copyright owner.

Contributing Documents to Odessa

Authors who would like to share their work should contact the librarian (see Contacting People, above). All Odessa documents are in tab-free plain text format. Umlauts are permitted but discouraged, since users may forget to use them in search queries. Submitted documents should not contain email addresses, and WWW URLs should be used sparingly, since these also tend to become obsolete quickly.

A signed paper copy of the copyright release form must be received by the librarian before a document is published. The mailing address can be found on the form, which is available in English and in German.

Following are the guidelines for preparing Odessa documents. Authors who need assistance should contact the librarian.

TITLE:
A short document title is the first line of text. This is used for the library menu entry and for indexers that automatically extract titles from the document itself.
COPYRIGHT STATEMENT:
Published by the Odessa Digital Library - 31 May 1999
     http://www.odessa3.org

This document may be freely used for personal, nonprofit
purposes or linked by other WWW sites.  It may also be
shared with others, provided the header with copyright
notice is included.  However, it may not be republished
in any form without permission of the copyright owner.

Copyright 1999, Hans Liebenthal
HEADER:
Provide information for document users, including the document contents, data sources, author, dates of original documents, compilation, or revision, and additional notes about the data. For example:
This file contains birth records from the village of
GrossLiebental, South Russia for the decade 183x. This
information was compiled by Dale Wahl and coworkers
December 12, 1994 from the St. Petersburg Lutheran
Evangelical archives published by the LDS.
DOCUMENT BODY:
  • No TABs or characters outside the ASCII code range 32-126. Note: Standard German character forms are ae, oe, ue, and ss, per Library of Congress standard. WORD users: please disable "smartquotes", since this feature introduces non-standard characters.
  • Full names, directory style in tabular data
  • Surname sorted in tabular data
  • Columns minimized, files not more than 120 characters wide, whenever possible
  • Dates in standard format in one of the following forms:
    12 Mar 1833
     5 Mar 1883
       Mar 1883
           1883
              ? (when a date is present but unreadable)
    Dates should not contain ? in other positions, since the meaning of a question mark there is ambiguous.
  • Whenever possible, place most important information at the left of each record, least important information at the end.
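
The body guidelines above lend themselves to an automated check before submission. The following is only a minimal sketch, not a tool provided by the library: the function names are hypothetical, the English three-letter month abbreviations are an assumption extended from the "Mar" examples, and the capitalized replacements (Ae, Oe, Ue) are likewise assumed.

```python
import re

MAX_WIDTH = 120  # guideline: files not more than 120 characters wide

# Assumption: months are English three-letter abbreviations, as in "Mar".
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_PATTERN = re.compile(r"^(?:(?:\d{1,2} )?(?:" + MONTHS + r") )?\d{4}$")

# Standard German character forms (ae, oe, ue, ss); capital forms are assumed.
GERMAN_FORMS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def to_standard_forms(text):
    """Replace umlauts and sharp s with their two-letter standard forms."""
    return text.translate(GERMAN_FORMS)


def is_valid_date(text):
    """Accept '12 Mar 1833', 'Mar 1883', '1883', or a lone '?'."""
    text = text.strip()
    return text == "?" or DATE_PATTERN.match(text) is not None


def check_line(line):
    """Return the guideline problems found in one line (newline already removed)."""
    problems = []
    if "\t" in line:
        problems.append("contains a TAB")
    if any(not 32 <= ord(ch) <= 126 for ch in line if ch != "\t"):
        problems.append("contains a character outside ASCII 32-126")
    if len(line) > MAX_WIDTH:
        problems.append("wider than 120 characters")
    return problems
```

A document could be checked by calling check_line on each element of text.splitlines() and reporting any line for which the returned list is not empty.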

Updating Data

No documents will be published in the Odessa Library without the written consent of the document and copyright owners. Documents may be updated or removed at any time by those who submitted them.

The procedure for updating a document is simple. The document author should first download the current copy from the library by clicking View/PageSource on their web browser and then saving the document. Document edits should be made using a standard text editor without disturbing data that is to be retained. No tabs should be present in the edited data. The revised text file data should be returned with the same filename to the librarian by email attachment. The email address can be found by querying the author database on the search page.

Technical Stuff: How Odessa 3 Works

Odessa 3 is the third technical generation of the Odessa search program. Odessa uses four principal files, called the document dictionary, the lexicon, the index, and the stoplist. Each document in Odessa is described in the document dictionary, which provides the program with the document's name and location, latest revision date, and formatting instructions for the result presentation code. Since documents may be extremely long and since important information may turn up in obscure places, Odessa returns retrieved information in context so that in most cases the documents themselves need not be downloaded.

Odessa stores an alphabetic list of every term it finds in the Odessa documents in a file called the lexicon. The lexicon also tells Odessa where it can find a list of all documents that contain at least one instance of that term. These document lists are stored in another file called the index, which is much like the index of a book. At the present time, the lexicon stores some 343,000 unique terms extracted from Odessa's 1300 data files. Each new or misspelled word finds its way into the lexicon. However, if a word contains only a single character, if it is entirely numeric, or if it is in a list of useless words called a stoplist, it is not included in the lexicon. Click on the stoplist if you wish to see the words that are excluded from the lexicon. The lexicon contains only words that begin with a letter, followed by letters, numerals, or underscores. Accented characters count as letters.
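
The term rules above can be illustrated with a small in-memory sketch. This is not the Odessa indexer itself: the function names are hypothetical, the lexicon and index are held as Python objects rather than files, and folding terms to lower case is an assumption that the description does not confirm.

```python
import re
from collections import defaultdict

# A letter followed by letters, numerals, or underscores. In Python 3,
# [^\W\d_] matches any Unicode letter, so accented characters count as letters.
TERM_PATTERN = re.compile(r"[^\W\d_]\w*")


def extract_terms(text, stoplist):
    """Return the set of terms in text that qualify for the lexicon."""
    terms = set()
    for token in re.findall(r"\w+", text):
        term = token.lower()  # assumption: terms are case-folded
        if len(term) < 2 or term in stoplist:
            continue  # single characters and stoplist words are excluded
        if TERM_PATTERN.fullmatch(term):
            terms.add(term)  # purely numeric tokens fail the pattern
    return terms


def build_index(documents, stoplist):
    """documents maps a document name to its text.

    Returns (lexicon, index): an alphabetic list of terms, and a mapping
    from each term to the set of documents containing it.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for term in extract_terms(text, stoplist):
            index[term].add(name)
    return sorted(index), dict(index)
```

With two short documents and a stoplist such as {"the", "in", "of"}, a year like 1833 and a one-letter word would be left out of the lexicon, while a word that appears in both documents would map to both document names in the index.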

When any changes are made to the Odessa document collection or to the document information stored in the document dictionary, all of Odessa's information must be regenerated. First a program is executed to extract information about all the document files and to create the file index and list of recent library additions that are linked from the Odessa "content" page. Then a program called an indexer is executed to rebuild the entire lexicon and index. Indexing Odessa's one-third of a gigabyte takes less than a minute, during which time users are not permitted to do any searching.

When a query is made to Odessa, a check is made to ensure that at least one term in the query is in the lexicon. Using the lexicon and the index, Odessa creates a list of all the documents that contain all of the query terms that are in the lexicon. Other terms are acceptable, too, if they are entirely numeric (such as a year) or if they are in the stoplist. Any query containing a term that is not in the lexicon, in the stoplist, or numeric, is rejected.
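
The acceptance rules and the document selection described above can be sketched as follows. This is an illustration under stated assumptions, not the Odessa code: the names QueryRejected and candidate_documents are hypothetical, the index is assumed to be an in-memory mapping from each lexicon term to the set of documents containing it, and query terms are assumed to be folded to lower case.

```python
class QueryRejected(ValueError):
    """Raised when a query does not meet the acceptance rules."""


def candidate_documents(query_terms, index, stoplist):
    """Return the documents that contain every query term found in the lexicon.

    index maps each lexicon term to a set of document names (assumed layout).
    """
    lexicon_terms = []
    for raw in query_terms:
        term = raw.lower()  # assumption: case-insensitive matching
        if term in index:
            lexicon_terms.append(term)
        elif term.isdigit() or term in stoplist:
            continue  # acceptable, but cannot narrow the candidate list
        else:
            raise QueryRejected("unknown term: " + term)
    if not lexicon_terms:
        raise QueryRejected("no query term is in the lexicon")

    # Intersect the document lists of all lexicon terms.
    candidates = set(index[lexicon_terms[0]])
    for term in lexicon_terms[1:]:
        candidates &= index[term]
    return candidates
```

Numeric and stoplist terms are accepted here but play no part in choosing candidates, which matches the description: only terms in the lexicon have document lists to intersect.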

Each of the selected documents is searched sequentially to determine whether the query is satisfied. What makes Odessa relatively fast is that each candidate document is scanned only once, regardless of the complexity of the query.
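
A single-pass scan of a candidate document might look like the sketch below. The description does not say exactly when a query counts as satisfied, so this example assumes, purely for illustration, that every query term must occur on the same line (one record); the function name scan_document is hypothetical.

```python
import re


def scan_document(lines, query_terms):
    """Scan a document once and return (line_number, line) for matching lines.

    Assumption: a line matches when it contains every query term as a word.
    """
    wanted = {term.lower() for term in query_terms}
    hits = []
    for number, line in enumerate(lines, start=1):
        words = {word.lower() for word in re.findall(r"\w+", line)}
        if wanted <= words:  # all query terms appear on this line
            hits.append((number, line.rstrip("\n")))
    return hits
```

Because all the terms are tested against each line as it is read, adding more terms to the query does not add more passes over the document, and the matching lines themselves can be returned as the result shown in context.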

