Transcription Rules for JewishGen Databases
Here are the guidelines for the data to be submitted to a JewishGen
searchable database, as per the standard
"Contributing Databases to JewishGen" procedures.
Data sent to the JewishGen database managers should be in a
database or spreadsheet format.
Any standard database or spreadsheet format (rows of records and
columns of fields) is acceptable, such as: Microsoft Excel (any version),
Microsoft Access, Corel Paradox, or Quattro Pro spreadsheets, etc.
Word processor files are more difficult to work with than spreadsheet
or database files, but may be acceptable if the data is in a regular
format (one record per line, with each field separated by commas,
or tabs, or otherwise delimited).
In all cases, please be sure that each field in your database is
clearly labelled, and that a full database description is provided,
using the guidelines.
I. Templates
The manager of each transcription project should create a data entry
template to contain the transcribed data. The template design
and data entry instructions should be reviewed by JewishGen before
proceeding with data entry. The template may evolve over time,
as you gain experience with transcribing the original records.
Templates for certain standard types of records may be found at
http://www.jewishgen.org/databases/templates.
-
Surnames and Given Names:
- Surnames should be in ALL CAPITAL LETTERS and be in
separate fields.
Each "Surname" field should contain only a surname.
Place given names and other items in their own separate fields.
Having the surname in its own field will allow the surnames to
be searchable, via a variety of methods.
- All other proper nouns (given names, town names, etc.)
should be in Mixed Case (i.e. Initial Capital Letters,
with all subsequent letters in lower case).
- If surnames appear in other fields (e.g.: maiden names,
alternate surnames, etc.) — and if you want these surnames
to be searchable as surnames — then copy those surnames
to an additional field, called "Other Surnames".
This "Other Surnames" field will not appear in the displayed
search results, but is used only for database indexing.
(See Section V, below).
- If there are optional surnames (multiple names, such as
"SMITH or JONES") in a surname field, then code it as
"SMITH / JONES".
(The same applies to the "Other Surnames" field, which can hold
multiple names, separated by spaces and '/' delimiters).
- If a person has no surname or given name in this record (e.g. records that use
only patronymics), then indicate this with a dash ("-") character,
rather than leaving a name field empty or using any other
indicator.
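The surname rules above can be checked mechanically before a spreadsheet is submitted. The following is only a minimal sketch in Python: the function name `check_surname_field` and the wording of its messages are illustrative assumptions, not part of any JewishGen tool.

```python
import re

def check_surname_field(value: str) -> list[str]:
    """Return the problems found in one "Surname" cell (empty list = OK)."""
    text = value.strip()
    if text == "":
        return ["empty cell: use '-' when the record has no surname"]
    if text == "-":
        return []  # a lone dash is the agreed 'no name' marker
    problems = []
    # Surnames must be in ALL CAPITAL LETTERS.
    if any(ch.islower() for ch in text):
        problems.append("surname is not in ALL CAPITAL LETTERS")
    # Alternatives are coded as 'SMITH / JONES': a slash with a space on each side.
    if re.search(r"\S/|/\S", text):
        problems.append("alternative surnames must be separated by ' / '")
    return problems

print(check_surname_field("SMITH / JONES"))
print(check_surname_field("Smith"))
```

A check like this only flags formatting; it cannot tell whether the cell really contains a surname.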
-
Town Names:
- Place town names in separate fields.
Each “Town” field should contain only a town name.
Place any qualifying county / district / province / state / country
names each in their own separate fields.
Having the town name in its own field will allow the town names
to be searchable, via a variety of methods.
- Transcribe town names exactly as they are written in the
original source document.
For example, in a German-language record, the capital city of
Poland would be written as “Warschau”.
Do not transform it to “Warszawa” or
“Warsaw” — preserve the original.
- If you wish, you may also include the modern native
town name, as per the
JGFF model —
either in an additional separate column, or together
in the same column as a conjecture
in square brackets, separated by spaces and '/' delimiters,
e.g.: “Warschau / [Warszawa]”.
Our search engines will then be able to pick up both names.
- If town names also appear in other fields — and if
you want these town names to be searchable as town names,
then copy the town names to an additional field, called
“Other Towns”.
This “Other Towns” field will not appear in the
displayed search results, but is used only for database indexing.
(See Section V, below).
- If there is more than one town for a field (multiple towns, such
as 'places of former residence' = “Vilnius and Keidanai”),
then encode this field as “Vilnius / Keidanai”.
(The same applies to the “Other Towns” field,
which can hold multiple town names, each separated by spaces
and '/' delimiters).
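To illustrate why the ' / ' delimiter and the square brackets matter, here is a small Python sketch of how a multi-name cell might be split into individually searchable names. The function `searchable_names` is a hypothetical helper; how the JewishGen search engines actually parse these cells is not described here.

```python
def searchable_names(cell: str) -> list[str]:
    """Split a cell such as 'Warschau / [Warszawa]' into individual names."""
    names = []
    for part in cell.split(" / "):
        # Drop the square brackets that mark a conjecture,
        # and the '-' marker used for missing data.
        name = part.strip().strip("[]").strip()
        if name and name != "-":
            names.append(name)
    return names

print(searchable_names("Warschau / [Warszawa]"))
print(searchable_names("Vilnius / Keidanai"))
```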
-
Dates:
- If supplying the data in dBase format, all fields must be
CHARACTER fields; do not use DATE fields.
- If supplying the data in Microsoft Excel format, all fields
must be generic TEXT fields; do not use DATE fields,
as Excel cannot handle historical dates (its date type does not
cover dates before 1900).
- Make sure that all years contain four digits, to avoid
ambiguity.
- Make sure that the day and month fields are distinguishable.
Europeans and Americans interpret numeric dates differently:
"04/05/1892" means 4 May to a European reader, but April 5 to an American.
If possible, use "DD-MMM-YYYY" format, using the three-letter
English month abbreviation, for example "21-APR-1892".
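As a sketch of the recommended date format, the Python below builds and checks "DD-MMM-YYYY" text values. The names `format_date` and `is_unambiguous` are illustrative assumptions. The month table is written out by hand rather than taken from a date library, so the result is always English and always plain text, and no calendar conversion is attempted.

```python
import re

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Two-digit day, three-letter English month, four-digit year.
DATE_PATTERN = re.compile(
    r"(0[1-9]|[12][0-9]|3[01])-(" + "|".join(MONTHS) + r")-[0-9]{4}"
)

def format_date(day: int, month: int, year: int) -> str:
    """Build a 'DD-MMM-YYYY' text value from numeric parts."""
    if not (1 <= day <= 31 and 1 <= month <= 12 and 1000 <= year <= 9999):
        raise ValueError("day, month or year out of range")
    return f"{day:02d}-{MONTHS[month - 1]}-{year}"

def is_unambiguous(text: str) -> bool:
    """True if the text is already in 'DD-MMM-YYYY' form."""
    return DATE_PATTERN.fullmatch(text) is not None

print(format_date(21, 4, 1892))
print(is_unambiguous("04/05/92"))
```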
-
Sparse columns:
Columns which rarely contain data should be avoided, because they can
take up considerable horizontal space when displaying search results.
The more columns a spreadsheet has, the more difficult it is to display
the data meaningfully.
Try to have as few columns as is reasonably possible.
Consider combining several sparse columns into a single more generic
"Comments" or "Notes" column.
-
Source Indicator:
Every record (i.e. each row) should have some type of source
information — column(s) containing an identifier by which a
researcher using the database can independently find this record
in the original source: A page number, a record number, a line number,
etc., or any necessary combination thereof.
II. Data Entry
All data should be transcribed as faithfully as possible
to the original source document, with as little interpretation as possible.
Interpretation is the job of the researcher using the resulting
database, not the job of the transcriber.
The data transcriber should write only what is in the original source
document.
If the transcriber or editor of the database wishes to add conjectures,
interpretation, or editorial comments, these all should be made within
square brackets (“[ ]”), to indicate that these
comments are not part of the original source.
(See Section II.4, below).
-
Missing Data:
If a data item is missing in the original source, indicate this with a
dash or hyphen (“-”) character, rather than leaving
a blank field or using any other indicator.
-
Illegible Data:
If a data item is illegible or questionable in the original record,
transcribe as much as you can, and use the following indicators:
- Questionable entries should be followed by a question mark
(“?”).
- If a data item is totally illegible, just place a single question
mark (“?”) in the cell.
- Use an ellipsis (“...”) to indicate illegible parts
of a name.
For example, write “SM...TH”, if you can't determine
what the letters are between the “SM” and
“TH”.
- If you can't decide which of two possibilities a partially legible
name represents, write both interpretations, separated by
a slash and spaces.
For example, write “STEIN? / STERN?” or
“PERL? / BERL?”.
Our search engines will then be able to pick up both names.
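The illegibility markers can also be read by a program. The Python sketch below tests whether a candidate name is compatible with a transcribed value, treating "..." as one or more unknown letters and " / " as a choice between readings. The function `could_match` is a hypothetical helper for illustration only; it is not how the JewishGen search engines are known to work.

```python
import re

def could_match(transcribed: str, candidate: str) -> bool:
    """True if candidate is compatible with a partly legible transcription."""
    for alternative in transcribed.split(" / "):
        text = alternative.strip().rstrip("?")
        if not text:
            continue  # a lone '?' (totally illegible) tells us nothing
        # Each '...' stands for one or more letters that could not be read.
        pattern = ".+".join(re.escape(piece) for piece in text.split("..."))
        if re.fullmatch(pattern, candidate, flags=re.IGNORECASE):
            return True
    return False

print(could_match("SM...TH", "SMITH"))
print(could_match("STEIN? / STERN?", "STERN"))
```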
-
Ditto fields:
Data which is the same as the previous row must be filled in;
you cannot leave any cell blank, or use a ditto mark (") or
other indicator, because when the data is sorted by a different
criterion, the context is lost.
For example:
Incorrect:
Year | # | Surname | Given Name |
1847 | 1 | SCHWARTZ | Moshe |
 | 2 | KOHEN | Ryfka |
 | 3 | LEVIN | Shmuel |

Correct:
Year | # | Surname | Given Name |
1847 | 1 | SCHWARTZ | Moshe |
1847 | 2 | KOHEN | Ryfka |
1847 | 3 | LEVIN | Shmuel |
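If a transcription already contains blank or ditto cells, they can be filled in before submission. The Python sketch below copies the value from the row above into such cells; the function name `fill_down` and the dictionary-per-row layout are assumptions made for illustration. It should be applied only to columns where a blank really means "same as above", since missing data is marked with "-" instead.

```python
def fill_down(rows: list[dict], columns: list[str]) -> list[dict]:
    """Return copies of rows with blank or ditto cells filled from the row above."""
    filled = []
    previous = {}  # last value seen in each column
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for column in columns:
            value = new_row.get(column, "").strip()
            if value in ("", '"'):
                new_row[column] = previous.get(column, "")
            previous[column] = new_row[column]
        filled.append(new_row)
    return filled

rows = [
    {"Year": "1847", "#": "1", "Surname": "SCHWARTZ", "Given Name": "Moshe"},
    {"Year": "", "#": "2", "Surname": "KOHEN", "Given Name": "Ryfka"},
    {"Year": '"', "#": "3", "Surname": "LEVIN", "Given Name": "Shmuel"},
]
for row in fill_down(rows, ["Year"]):
    print(row["Year"], row["#"], row["Surname"], row["Given Name"])
```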
-
Conjectural Information:
Should always be indicated within square brackets
(“[ ]”).
Conjectural information is information which is not in the
original source document, but has been conjectured by a
database transcriber or editor.
For example, a conjectured surname of “EPSZTEIN”
for a record which has no surname — but for which this surname
has been deduced from other sources — should appear within
square brackets, as “[EPSZTEIN]”.
Other uses of conjectural data include:
- the expansion of abbreviations, e.g.:
—— “AEF [American Expeditionary Forces]”
- corrections to items obviously misspelled in the original source, e.g.:
—— “GOLDCERG / [GOLDBERG]”
- the nominative form of a declined name, e.g.:
—— “KAGANIENE / [KAGAN]”
- the addition of modern town names
(see Section I.2.b.i. above), e.g.:
—— “Lemberg / [L'viv]”
All other editorial comments and explanations should also appear
within square brackets, to indicate that those items are not
in the original record.
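Because conjectures are always bracketed, the original text and the editor's additions can be separated again later. A minimal Python sketch, assuming a hypothetical helper named `split_conjecture` and assuming brackets are never nested:

```python
import re

BRACKETED = re.compile(r"\[([^\]]*)\]")

def split_conjecture(cell: str) -> tuple[str, list[str]]:
    """Separate what the source says from what the editor added in [ ]."""
    conjectures = BRACKETED.findall(cell)
    # Remove the bracketed parts, then any ' / ' delimiter left at the ends.
    original = BRACKETED.sub("", cell).strip(" /")
    return original, conjectures

print(split_conjecture("GOLDCERG / [GOLDBERG]"))
print(split_conjecture("AEF [American Expeditionary Forces]"))
print(split_conjecture("[EPSZTEIN]"))
```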
-
Prohibited Characters:
Avoid the use of the double-quote character ("),
and line breaks.
- Double-Quote character:
The inclusion of double-quote characters causes problems with our
internal data conversion routines (the procedures which convert
data from Excel to dBase format).
Use single quote characters (') instead.
- Line-Break character:
Do not use line breaks in your data entry.
Line breaks are normally added in Excel by holding the ALT key
and pressing the ENTER or RETURN key, which results in the
contents of a cell spreading over more than one line.
-
Maximum Field Size:
The maximum size of any field is 254 characters.
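The prohibited characters and the size limit can be handled in one pass over each cell. This Python sketch is an illustration only: `clean_cell` and `too_long` are hypothetical names, and replacing a line break with a single space is an assumption about what the transcriber would want.

```python
MAX_FIELD_SIZE = 254  # maximum number of characters in any field

def clean_cell(value: str) -> str:
    """Replace double quotes with single quotes and remove line breaks."""
    cleaned = value.replace('"', "'")
    # Join the lines of a multi-line cell with single spaces.
    return " ".join(cleaned.splitlines())

def too_long(value: str) -> bool:
    """True if the cell exceeds the maximum field size."""
    return len(value) > MAX_FIELD_SIZE

print(clean_cell('Known as "Moshe"\nthe tailor'))
print(too_long("x" * 255))
```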
III. Grouped Records
Some sources, such as Census Records, Czarist Revision Lists, etc.,
group people together into households or families. When transcribing
data like this, each person in the data should still have their own
record (i.e. their own row in the spreadsheet), but we can also group the
family/household together in the database's results display, if
a "Glue" field is used in the spreadsheet to group rows together.
For example, here's an input spreadsheet containing the two family groups:
Family # | Surname | Forename | Patronymic | Age | Relation | Birthplace | Gubernia | District | Town | Address | Fond # |
4118 | LEWIN | Haim | Mowscha | 40 | head | Jekapils | Vitebsk | Dvinsk | Rezekne | Soldatskaya 12-3 | 2706-1-156 |
4118 | LEWIN | Rocha | Shmuel | 38 | wife | Jekapils | Vitebsk | Dvinsk | Rezekne | Soldatskaya 12-3 | 2706-1-156 |
4118 | GLEBERMAN | Pesia | Haim | 21 | daughter | Ludza | Vitebsk | Dvinsk | Rezekne | Soldatskaya 12-3 | 2706-1-156 |
4119 | DORFMANN | Simon | Itzik | 28 | head | Rezekne | Vitebsk | Dvinsk | Rezekne | Ludzenskaya 45-6 | 2706-1-156 |
4119 | DORFMANN | Esther | Abram | 25 | wife | Rezekne | Vitebsk | Dvinsk | Rezekne | Ludzenskaya 45-6 | 2706-1-156 |
4119 | DORFMANN | Gita | Mowscha | 50 | mother | Rezekne | Vitebsk | Dvinsk | Rezekne | Ludzenskaya 45-6 | 2706-1-156 |
4119 | KAGANSKI | Hana | Mowscha | 60 | aunt | Rezekne | Vitebsk | Dvinsk | Rezekne | Ludzenskaya 45-6 | 2706-1-156 |
4119 | LEWIN | Malka Sura | Rachmiel | 30 | cousin | Ludza | Vitebsk | Dvinsk | Rezekne | Ludzenskaya 45-6 | 2706-1-156 |
which could be displayed as:
Town District Gubernia | Surname, Forename | Patronymic | Age | Relation | Birthplace | Address Fond # |
Rezekne Dvinsk Vitebsk | LEWIN, Haim | Mowscha | 40 | head | Jekapils | Soldatskaya 12-3 2706-1-156 |
 | LEWIN, Rocha | Shmuel | 38 | wife | Jekapils | |
 | GLEBERMAN, Pesia | Haim | 21 | daughter | Ludza | |

Rezekne Dvinsk Vitebsk | DORFMANN, Simon | Itzik | 28 | head | Rezekne | Ludzenskaya 45-6 2706-1-156 |
 | DORFMANN, Esther | Abram | 25 | wife | Rezekne | |
 | DORFMANN, Gita | Mowscha | 50 | mother | Rezekne | |
 | KAGANSKI, Hana | Mowscha | 60 | aunt | Rezekne | |
 | LEWIN, Malka Sura | Rachmiel | 30 | cousin | Ludza | |
Here we are using the "Family #" column as the "glue" field, to glue
all members of the household together, for a more attractive and
meaningful display of the data.
Note how the common fields (data which is common to every member
of the household/family) are "banded" together, in the row-spanning
fields on the left and right. This redundant data is displayed
only once per family/household, in a vertically "stacked" fashion,
thus saving considerable display space.
The "glue" field is also needed to ensure that the entire
family group (i.e. all rows with the same "glue" value) is displayed
together, even when only one member of the family matches the
search criteria.
For example, the above display would result from a search for the
surname "LEWIN". In the second family, only one member has
the surname "LEWIN", yet the entire family group is displayed, because the
"glue" field keeps the entire family together.
The simplest use of a "glue" field is in a marriage record —
to tie the bride and groom together. If the groom and bride are
each entered in their own row in the spreadsheet, the use of a "glue"
field will ensure that both rows are displayed when a user
searches for either one of the parties' surnames.
Also note that the "glue" field is not necessarily a displayed field.
(In the example above, the "Family #" is not displayed in the
search results screen). The "glue" field can be a hidden
column, which is not displayed in the search results — this
column is used only for the internal purpose of creating the database
indexes.
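The effect of a "glue" field can be sketched in a few lines of Python. This is an illustration under stated assumptions, not JewishGen's implementation: the function name `search_with_glue` is hypothetical, the rows are reduced to three columns, and an exact surname comparison stands in for the real search methods.

```python
from collections import defaultdict

def search_with_glue(rows: list[dict], surname: str, glue: str = "Family #") -> list[list[dict]]:
    """Return every whole group that has at least one member matching surname."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[glue]].append(row)
    matched = []  # glue values, in order of first match
    for row in rows:
        if row["Surname"] == surname and row[glue] not in matched:
            matched.append(row[glue])
    return [groups[key] for key in matched]

rows = [
    {"Family #": "4118", "Surname": "LEWIN", "Forename": "Haim"},
    {"Family #": "4118", "Surname": "LEWIN", "Forename": "Rocha"},
    {"Family #": "4118", "Surname": "GLEBERMAN", "Forename": "Pesia"},
    {"Family #": "4119", "Surname": "DORFMANN", "Forename": "Simon"},
    {"Family #": "4119", "Surname": "DORFMANN", "Forename": "Esther"},
    {"Family #": "4119", "Surname": "DORFMANN", "Forename": "Gita"},
    {"Family #": "4119", "Surname": "KAGANSKI", "Forename": "Hana"},
    {"Family #": "4119", "Surname": "LEWIN", "Forename": "Malka Sura"},
]
for family in search_with_glue(rows, "LEWIN"):
    print([f"{r['Surname']}, {r['Forename']}" for r in family])
```

A search for "LEWIN" returns both households in full, although only one person in household 4119 carries that surname.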
IV. Transliteration
JewishGen has established no universal transliteration standards
for data written in non-Latin alphabets (e.g. the Hebrew or Cyrillic alphabets)
since each database is different, and there are so many languages,
alphabets, dialects, and regional variants across the wide scope of
Jewish genealogical data. Each database is free to use its
own transliteration methods, as long as they are reasonable.
The introductory remarks
for each database should indicate or explain which transliteration
method has been used for that database.
Here are some general ideas and guidelines:
-
Reflect the original:
The transliteration should reflect the original document, to the
degree possible. Names should not be 'standardized';
they should be entered exactly as written on the original document.
For example: 'Movsha', 'Moishe', etc., should not become 'Moshe';
and should certainly never be 'translated' or 'transformed' to 'Moses'.
-
Pronunciation should reflect local use, e.g. distinctions between
Litvak and Galitzianer pronunciations can be retained.
-
Soundex:
Since Daitch-Mokotoff Soundex searching will find most evident name
variations, we needn't worry excessively about standard transliteration
of Cyrillic-to-English, Yiddish-to-English, or Hebrew-to-English.
- Transliteration Guides:
For Yiddish (Hebrew letters), you can use the
YIVO Romanization Standard, but
the Library of Congress Standard and others are acceptable as well.
For Russian (Cyrillic letters) into English (Latin letters),
you can use the tables in the
Wikipedia
article "Romanization of Russian", especially the table
BGN/PCGN
Romanization of Russian.
-
Cyrillic:
Transliteration from Cyrillic to Latin characters should reflect the
local language, if that local language uses the Latin
alphabet. For example, civil records in the Kingdom of Poland
(Congress Poland) after 1868 were written in Cyrillic, and should be
transliterated into Polish spelling rather than English spelling
(as JRI-Poland does).
Where the local language does not use the Latin alphabet (e.g. Belarus,
Ukraine), Cyrillic should be transliterated into English phonetics.
-
If your original source data is in Russian (Cyrillic alphabet),
you may do your data entry directly in Cyrillic, if you are more
comfortable in that language, and have the appropriate keyboard.
We have Excel macros that can transliterate data in Cyrillic into
the Latin alphabet.
-
Retain the Original:
If possible, data in Latin characters should be transcribed in
the original language (i.e., leave occupations written in German in
German), rather than translated; and then provide a separate table of
translations. It is always best to keep the transcript as close
to the original as possible, without any interpretation —
and let the end-users of the database do that interpretation.
V. The “Other Surnames”,
“Other GivenNames” and “Other Towns” Columns:
As mentioned above in sections I.1.c and
I.2.c, certain datasets might want to make use
of the special hidden columns called “Other Surnames”,
“Other GivenNames” and/or “Other Towns”.
These columns are needed when there are surnames, given-names or
town names embedded within the text of other columns, and you wish
those items to be fully searchable.
-
Example #1: If you have a column entitled "Survived by" which
contains "his daughter Mollie SMITH, and his brother Robert BERNSTEIN",
and you want the surnames SMITH and BERNSTEIN to be searchable as
surnames, then you will need to copy those surnames into a
separate column, called "Other Surnames".
In this case, the "Other Surnames" column should contain
"SMITH / BERNSTEIN".
Similarly copy those given-names into a separate column,
called "Other GivenNames".
In this case, the "Other GivenNames" column should contain
"Mollie / Robert".
-
Example #2: If you have a "Comments" column which contains
miscellaneous information, such as "Father was born in Minsk, is
currently residing in Pinsk, and working in Linsk", and you want
the town names Minsk, Pinsk and Linsk to be searchable as town names,
then you will need to copy those town names into a separate
column, called "Other Towns".
In this case, the "Other Towns" column should contain
"Minsk / Pinsk / Linsk".
The sole purpose of the hidden "Other Surnames", "Other GivenNames"
and "Other Towns" columns is for database indexing only —
so that the database search engine will know that a particular
word is a surname, given-name or town name, respectively, and thus can
locate it when doing a Soundex or Phonetic search.
These hidden columns will not be displayed in the search results.
When a surname, given-name or town name is contained within a larger
text field (such as a "Comments" field), the database search engine
has no way of knowing that that particular word is a surname, given-name
or town name.
Copying these words into an "Other Surnames", "Other GivenNames" or
"Other Towns" column makes this association explicit.
While a search for "BERNSTEIN" using a global text search would find
a record with the word "BERNSTEIN" anywhere within any column,
a Soundex or Phonetic search would not.
So if a Soundex or Phonetic search is made for the surname "BURNSTINE",
it wouldn't find "BERNSTEIN" contained within a "Comments" field.
To enable its Soundex searchability, the word "BERNSTEIN" needs to
be copied into an "Other Surnames" column.
The database creator/editor should copy all surnames,
given-names and town names contained within the text of these
other fields into a separate "Other Surnames", "Other GivenNames"
or "Other Towns" column, respectively.
This action allows those words to be identifiable and fully searchable
as surnames, given-names and/or town names.
[Note that if a particular town name is already in another indexed
Town column, then you don't really need to copy it into the
"Other Towns" column — although it doesn't hurt, it's
simply redundant.
For example, if you have a "Town of Birth" column which contains
"Minsk", and you also have a "Comments" column that contains the
words "Father is a resident of Minsk", then in this instance you
don't need to copy "Minsk" into the "Other Towns" column,
because this record already contains "Minsk" in the searchable
"Town of Birth" column — a search for "Minsk" would
already find this record. However, it does no harm to
have "Minsk" in the "Other Towns" column in this instance.]
There should never be anything in the "Other Surnames",
"Other GivenNames" or "Other Towns" columns which doesn't
also appear somewhere else in the row.
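This rule lends itself to an automatic check. The Python sketch below lists any name in an "Other ..." column that cannot be found elsewhere in the same row. The function name `stray_other_names` is a hypothetical helper, and the simple substring comparison is an assumption that may miss or over-match in unusual cases.

```python
def stray_other_names(row: dict, other_column: str = "Other Surnames") -> list[str]:
    """List names in the hidden column that appear nowhere else in the row."""
    # All other cells of the row, joined into one upper-case string.
    rest = " ".join(
        str(value) for key, value in row.items() if key != other_column
    ).upper()
    stray = []
    for name in row.get(other_column, "").split(" / "):
        name = name.strip()
        if name and name.upper() not in rest:
            stray.append(name)
    return stray

row = {
    "Surname": "COHEN",
    "Survived by": "his daughter Mollie SMITH, and his brother Robert BERNSTEIN",
    "Other Surnames": "SMITH / BURNSTINE",
}
print(stray_other_names(row))
```

In this example "BURNSTINE" is reported, because the row itself spells the name "BERNSTEIN".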
The "Other Surnames", "Other GivenNames" and "Other Towns" columns
are hidden columns, which are not displayed in the
search results — these columns are only used internally,
for the purpose of creating database indexes.
VI. Other guides to data transcription:
Excel templates for some common types of records (e.g.: Czarist
vital records and revision lists, cemetery records) have been created,
so there is no need to re-invent them.
There are instructions and examples included with each template.
The "JewishGen Database Templates" can be found at
http://www.jewishgen.org/databases/templates.
Last Revised Jun 18, 2012.