DIDO — Design Specification
Version 1.01
Warren Blatt
April 2005
Last Update: December 2011
JewishGen needs a way to accommodate small datasets and miscellaneous
lists of names in various formats, make them soundex-searchable, and
integrate them into geographic-based “All Country” databases.
Many years ago (circa 1997), we envisioned a database system
for managing small sets of data, which we nicknamed “DIDO”
(“Data In, Data Out”).
This system would allow for small user-contributed datasets to be
placed online easily, and become searchable in a central database.
Any user can contribute a "dataset" of any size or type —
even just a handful of records — as long as it conforms to
the DIDO data specifications.
In some ways, DIDO is similar to Ed Rosenbaum's
"Belarus Static Index",
"Belarus Names Database"
and
"Galicia Surname Index"
— except that DIDO data will be in an actual database,
and thus be phonetically and soundex-searchable, sortable, filterable,
etc., and with the data integrated into the JewishGen
“All Country”
databases.
DIDO is intended for simple lists of data — it is
not for extensive lists with many data fields per record.
Examples of simple lists are: membership lists of landsmanshaftn
and other organizations, voter/electors' lists, prenumerantn or
other donor lists, name indexes to books, etc. —
data which contains only a person's name plus perhaps one or two
other data fields. There are many such lists throughout
the JewishGen web site, in static form — on the
Yizkor Book Project site,
JewishGen KehilaLinks
pages (which the KehilaLinks coordinator is currently inventorying),
on various SIG pages, etc.
[Volunteers for the JewishGen Yizkor Book Project
had identified about 40,000 names on various lists on the website
(see internal list)
which could be included in such an index.
In July 2010, this data became the
Yizkor Book Master Name Index
(YBMNI), which as of July 2011 includes 57,000 names].
DIDO is to be used only for data which does not
fall into one of our pre-existing categories of
“All Topic”
databases, i.e.:
Cemetery/burial data (JOWBR),
Yizkor Book Necrologies,
Russian Business Directories,
Polish vital records (JRI-Poland),
Duma Voter Lists, Czarist Revision Lists, Czarist Vital
Records (see templates),
etc.
DIDO is for miscellaneous data only.
Structure of the DIDO Database
Internally, the DIDO database consists of two related tables:
- Dataset Table —
describes the datasets.
One row for each dataset.
- Data Table —
the actual individual data.
One row for each person.
This structure is conceptually similar
to that of the JOWBR database,
as described in the
original JOWBR Design Spec, where there is one table for the
cemeteries, and one table for the individual burials within the
cemeteries.
Dataset Table:
The Dataset Table contains
one row for each dataset. Its columns are:
Dataset ID# |
An ID# (arbitrary, never displayed),
which is used only to tie the two tables together. |
Dataset Title |
A short description of the dataset, to be used for display.
Should be limited to 50 or so characters. |
Dataset Description |
A long description of the dataset.
Should contain a complete and thorough description of
this dataset, its source, and any interpretation needed. |
URL |
Web address of a page with additional information
about this dataset. Optional? Or maybe
this should be part of the “Dataset Description” field.
Or maybe the entire “Dataset Description” should
be an external HTML page...
TBD. |
Region |
Used to determine into which (if any) of the “All Country”
databases this dataset should be incorporated, and into which
sub-region(s).
See regions as defined in
http://www.jewishgen.org/databases/Cemetery/JOWBR_Regions.htm.
Issue: Should this be specified as a
Region Name, a set of Region Names, and/or a set of
“All Country” database names? |
Contributor |
Contact information about the person who submitted this dataset.
Use the submitter's JGID Number, to link to their full contact
information in CURE. |
Data Table:
The Data Table contains the actual data. Because
of the wide variety of data which the DIDO database can accept
(membership lists, book indexes, etc.), we are keeping the number
of columns to a minimum, and having the columns be extremely generic,
so that all of the varied data can fit. Any data that does
not fit should be placed in the last column, "Other".
The Data Table contains
one row for each individual. Its columns are:
Dataset ID# |
Used to identify the dataset.
A link to the information in the
Dataset Table.
It will be the same number for all rows in the dataset.
Non-displayed; used only to create linkage to the Dataset Table info. |
Name |
Surname |
Last Name of the individual. |
Given Name(s) |
First Name(s) of the individual. |
Patronymic |
??? — Could perhaps be placed in the GivenName field. |
Location |
Town |
The name of the locality associated with this record,
as indicated in the original record.
If there are multiple towns, separate them with a slash ("/"),
as per Transcription
Rule I.2.d.
|
District | ??? —
The town's state / province / uyezd / gubernia —
optional, as provided in the original record. |
Country |
Country where the town was located, as of the time of the record
— optional, as provided in the original record. |
USBGN |
??? The town's USBGN Feature Code Number for linkage to the
"JewishGen Locality Page".
Non-displayed. |
Date |
Date of the record.
Can be a complete date, or just a year —
whatever is in the original record.
Can be blank if unknown.
The "Dataset Description" field of the corresponding row in
the Dataset Table should state what this date represents,
e.g.: a date of birth, date of voter registration, date of
membership, date of immigration, date of publication, etc.
|
Other |
A large text field to contain all of the other
data which doesn't fit into any other column.
Maximum of 254 characters.
The contents should be described in the "Dataset Description"
field of the corresponding row in the Dataset Table.
It could be a page number, an age, an address,
amount of a donation, an occupation, etc., etc.
|
- The "Dataset ID#" field will be filled in by the DIDO Database
Coordinator, not by the original contributor.
- Any of the "Location" fields may be blank. For some datasets, they
may all contain the same data in every row of the table.
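The two-table structure described above can be sketched as a small SQLite schema. This is only an illustration under stated assumptions: the spec names the fields but not the SQL identifiers, so the table names (`dataset`, `data`), the column names, the column types, and the sample row are all hypothetical. The "Region" column is left as free text because its representation is still an open issue.

```python
import sqlite3

# Hypothetical identifiers: the spec defines the fields, not their SQL names or types.
SCHEMA = """
CREATE TABLE dataset (
    dataset_id   TEXT PRIMARY KEY,  -- arbitrary, never displayed
    title        TEXT NOT NULL,     -- short display title, about 50 characters
    description  TEXT,              -- long description of source and interpretation
    url          TEXT,              -- optional; still TBD in the spec
    region       TEXT,              -- representation is an open issue
    contributor  INTEGER            -- submitter's JGID Number
);
CREATE TABLE data (
    dataset_id   TEXT NOT NULL REFERENCES dataset(dataset_id),
    surname      TEXT,
    given_names  TEXT,
    patronymic   TEXT,
    town         TEXT,              -- multiple towns separated by "/"
    district     TEXT,
    country      TEXT,
    usbgn        TEXT,              -- non-displayed
    record_date  TEXT,              -- full date or just a year, as in the original
    other        TEXT CHECK (other IS NULL OR length(other) <= 254)
);
"""


def build_demo_db():
    """Create an in-memory database holding one invented dataset and one invented row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO dataset (dataset_id, title, contributor) VALUES (?, ?, ?)",
        ("k3x9", "Sample Society Membership List", 12345),
    )
    conn.execute(
        "INSERT INTO data (dataset_id, surname, given_names, town, record_date, other) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ("k3x9", "Sampleman", "Avram", "Minsk/Pinsk", "1905", "page 12"),
    )
    conn.commit()
    return conn


def rows_with_source(conn):
    """Join each person row to the title of the dataset it came from."""
    return conn.execute(
        "SELECT d.surname, d.given_names, d.town, d.record_date, d.other, s.title "
        "FROM data AS d JOIN dataset AS s ON s.dataset_id = d.dataset_id "
        "ORDER BY d.surname"
    ).fetchall()
```

The join in `rows_with_source` is the only place the "Dataset ID#" is used, which matches its role as a non-displayed link between the two tables.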
Open Issues / Questions:
- USBGN Feature Codes:
Perhaps the "USBGN" field in the Data Table should be filled in
by the DIDO Database Coordinator, rather than by the dataset's
contributor. This will be less onerous for the contributor.
Or maybe we allow only one USBGN Feature Code for each dataset,
and associate it with the Dataset, rather than with each
individual item in the Data Table... TBD.
- "Other Surnames" and "Other Towns":
Do we want to have an "Other Surnames" and/or "Other Towns"
column in the Data Table, to include items mentioned within the
"Other" column?
(See Transcription
Rules for JewishGen Databases, Section V).
This can complicate things for the average submitter.
We want to keep things as simple as possible in DIDO.
Perhaps these could be optional 'advanced' columns... or
we could just forgo them, and assume that there is only one
surname and one town for each item in a DIDO record.
If multiple surnames/towns do exist, they could be entered into
the existing fields, separated with slashes, as specified in
Section I.1.d
of the "Transcription Rules for JewishGen Databases".
- Regions:
How should the associated Region(s) be specified?
And how should we specify the “All Country” Database(s) in which the
dataset is placed?
Should the "Region" column in the Dataset Table be a single
Region Name, a set of Region names, and/or a set of “All Country”
database names?
Or should there be a set of boolean “All Country” and
“All Topic” database flags?
This needs to be coordinated with our existing
"REGIONS"
SQL database.
What about inclusion in
“All Topic”
databases? As of now, the
“JewishGen Holocaust Database”
is the only applicable “All Topic” database...
but a "Sephardic Database" is potentially in the works.
Can this also somehow be specified within the "Regions" field...
or should we use separate boolean indicators in the Dataset Table?
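If the slash convention from the second open issue is adopted (multiple surnames or towns entered into the existing fields, separated with slashes), the search code would need to split those fields before indexing them. A minimal sketch follows; the function name `split_multi` is hypothetical, and trimming whitespace around each item is an assumption rather than something the Transcription Rules are known to require.

```python
def split_multi(value):
    """Split a slash-separated field into a list of trimmed, non-empty items."""
    if not value:
        return []
    return [part.strip() for part in value.split("/") if part.strip()]
```

For example, `split_multi("Minsk / Pinsk")` yields two separate town names that can each be indexed, while a blank field yields an empty list.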
Search Results Display
The search results data display will contain the following columns:
Name (Surname, Given Name(s), Patronymic) | "WHO" |
Location (Town, District, Country) | "WHERE" |
Date | "WHEN" |
Other (Comments) | "HOW" |
Source (ID# → Dataset Title) | "WHAT" |
- The "Source" column's data will display the dataset's
"Dataset Title", and be a hyperlink to the "Dataset
Description", the full description of the Dataset.
- The "Location" column's data will be hyperlinked to that location's
"JewishGen Locality Page",
and have the Communities Database's Ajax mouse-over feature,
if the "USBGN" field in the Data Table is filled in.
We would need a second-level display page —
similar to JOWBR's "Cemetery Information" page (for example,
Vienna's "Wiener Zentralfriedhof") —
to display full information about the Dataset.
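The mapping from Data Table fields to the five display columns could look roughly like the sketch below. It assumes each row is a dictionary with hypothetical keys (`surname`, `given_names`, `patronymic`, `town`, `district`, `country`, `usbgn`, `date`, `other`), and it only reports whether the location should be hyperlinked; building the actual link to the Locality Page is left out because the spec does not give its URL format.

```python
def display_columns(row, dataset_title):
    """Build the search-result columns for one Data Table row (a dict)."""

    def join(*keys):
        # Skip blank or missing fields so no stray commas appear.
        return ", ".join(row[k] for k in keys if row.get(k))

    return {
        "Name": join("surname", "given_names", "patronymic"),   # WHO
        "Location": join("town", "district", "country"),        # WHERE
        "Date": row.get("date") or "",                          # WHEN
        "Other": row.get("other") or "",                        # HOW
        "Source": dataset_title,                                # WHAT
        # Hyperlink the location only when the USBGN field is filled in.
        "link_location": bool(row.get("usbgn")),
    }
```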
Integration into “All Country” and
“All Topic” Databases
Component datasets of DIDO can then be incorporated into the
various “All Country” and “All Topic”
systems, as appropriate, as controlled by the "Region" field
of the Dataset Table.
(See Open Issue #3, above).
Display of entire datasets
The facility should also include a programmatic mechanism
to display an entire dataset, i.e. a search based
on the "Dataset ID#" — so that a KehilaLinks page
could link to a list, displayed much like one in its current
static form. This allows datasets to be "browsable"
by the user, just like the static datasets are today.
In order to deter data-mining, the "Dataset ID#" should
probably be a random set of alphanumeric characters, rather
than a sequential integer.
Perhaps this should be an optional feature, on a
case-by-case basis per dataset, as determined by the dataset's
contributor and/or JewishGen.
This feature would require an additional boolean field in
the Dataset Table.
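A non-sequential "Dataset ID#" and the per-dataset browse flag could be sketched as follows. The ID length of 8, the lowercase-plus-digits alphabet, the `browsable` key, and both function names are assumptions made for illustration; the spec only asks for a random set of alphanumeric characters and an additional boolean field.

```python
import secrets
import string

# Assumed alphabet and length; the spec only says "random alphanumeric".
ALPHABET = string.ascii_lowercase + string.digits


def new_dataset_id(length=8, existing=()):
    """Generate a random alphanumeric ID that is not already in use."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            return candidate


def browse_dataset(datasets, rows, dataset_id):
    """Return every row of a dataset, or None if it is unknown or not browsable.

    `datasets` maps dataset ID to a dict carrying a hypothetical "browsable" flag;
    `rows` is a list of Data Table rows (dicts with a "dataset_id" key).
    """
    info = datasets.get(dataset_id)
    if info is None or not info.get("browsable"):
        return None
    return [row for row in rows if row["dataset_id"] == dataset_id]
```

Returning the same result for an unknown ID and a non-browsable one avoids revealing which IDs exist, which fits the goal of deterring data-mining.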
Procedures
The method for submitting a dataset to DIDO should be relatively
straightforward, and not as onerous as that for a full-blown
JewishGen database, in order to encourage people to submit data.
This will require a DIDO Database Coordinator and a
DIDO Admin Panel.
- A DIDO Database Coordinator will supervise the
entire operation on an ongoing basis: correspondence, quality control,
database maintenance, etc.
- A DIDO Admin Panel interface should be developed, for
the DIDO Database Coordinator to manage the DIDO Dataset Table,
similar to the
JOWBR
Admin Panel.
The DIDO Database Coordinator should be able to add / remove / replace
datasets in the live search engine with little or no intervention by the
JewishGen staff.
The procedures for data preparation and submission for DIDO
could be modeled after the "Database Factory" concept that we
discussed and began prototyping in 2002-2003.
There should be a downloadable data template with instructions
for submitters, similar to the way we've done the
JOWBR Template
and the other templates currently available at
http://www.jewishgen.org/databases/templates.
The DIDO Database Coordinator would receive the completed
templates, do some minimal QA, and add them to the DIDO Admin Panel.
The Coordinator should be able to put the data into "test" mode, and to
make the data "live" — all without any intervention by the
JewishGen staff or highly technical volunteers.
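One simple way to model the "test" and "live" modes is a per-dataset status that the search engine consults. The sketch below is an assumption about how this might work, not a description of the Admin Panel: the `Mode` enum, the `visible_datasets` function, and the idea of a coordinator preview are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    TEST = "test"   # loaded, but visible only to the Coordinator
    LIVE = "live"   # included in public searches


def visible_datasets(modes, coordinator_preview=False):
    """Return the sorted dataset IDs that a search should include.

    `modes` maps dataset ID to Mode. Public searches see only LIVE datasets;
    a coordinator preview also sees datasets still in TEST mode.
    """
    return sorted(
        dataset_id
        for dataset_id, mode in modes.items()
        if mode is Mode.LIVE or coordinator_preview
    )
```

Making a dataset live is then a single status change (`modes[dataset_id] = Mode.LIVE`), which is the kind of operation the Coordinator could perform without staff intervention.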