Gigatrees Blog

The GEDCOM Validator

The GEDCOM Validator

GEDCOM Validation, in general, requires that a GEDCOM file meets the specification requirements of both the GEDCOM Grammar (line and file syntax) and the GEDCOM Dictionary (record format, data types, data formats and data values). Data consistency is not part of a GEDCOM Validation. The GEDCOM Validator is only a partial validator in that it does not in all cases, validate data formats or data values. It also does not have 100% coverage of the validation tests listed below.

The GEDCOM Validator will validate that files meet the GEDCOM 5.5 grammar specification and whichever GEDCOM dictionary is appropriate. The following dictionaries are currently supported:

  • GEDCOM 5.5 Rev. 1 (January 2, 1996)
  • GEDCOM 5.5 Rev. 2 (January 10, 1996)
  • GEDCOM 5.5.1
  • GEDCOM 5.6
The following sections provide technical details on how Gigatrees performs GEDCOM Validation.

Validation Legend

 

Symbols

() parentheses  = grouped components
[] brackets = optional components
* astricks = multiple occurrences of a component
- dash = range of values of a component
| pipe = component or

Characters

Character               ASCII value
========= ===========
tab = 0x09
line feed = 0x0A
carriage return = 0x0D
space = 0x20
exclamation point (!) = 0x21
cross hatch (#) = 0x23
colon (:) = 0x3A
ampersand (@) = 0x40
underscore (_) = 0x5F

Character Sets

Character set           ASCII range
============= ===========
number digit (0-9) = (0x30 - 0x39)
alpha char (a-zA-Z_) = (0x41 - 0x5A) | (0x61 - 0x7A) | 0x5F
non-alpha char = (0x21 - 0x2F) | (0x3A - 0x3F) | (0x5B - 0x5E) | (0x7B - 0x7E) | (0x80 - 0xFE) | 0x60

Character Groups

alphanum                = (alpha char | number digit)
printable character = alphanum | non-alpha char | space | cross hatch

Strings

double-at string (@)   = ampersand + ampersand
number string = number digit + [number digit]*
alphanum string = alphanum + [alphanum]*
pointer id = (alphanum | exclamation point) + [printable character]*
pointer string = ampersand + pointer id + ampersand
embedded id string = ampersand + [pointer id +] exclamation point + pointer_id + ampersand
escape string = ampersand + cross hatch + (printable character | double-at string)* + ampersand + [space] + (printable character)*
value string = printable character + [printable character]*
data string = (value string | escape string) [+ (value string | escape string)]*
delimiter = space
terminator = carriage return | line feed | (carriage return + line feed) | (line feed + carriage return)
whitespace = ([tab]* + [space]* + [terminator]*)*

Validation Tests

GEDCOM Validation testing includes two types of tests, GEDCOM Grammar and the GEDCOM Dictionary.

Grammar Line Syntax

All of the supported GEDCOM Dictionaries use the same GEDCOM 5.5 Grammar, which defines a line as having the following syntax:

line = [whitespace +] level + [delim + record_id +] delim + tag + [delim + reference_id +] terminator

or

line = [whitespace +] level + [delim + record_id +] delim + tag + [delim + line_value +] terminator

Grammar Tests

The following is a list of requirements of the GEDCOM 5.5 Grammar. Unsupported tests will be noted. String lengths are measured in characters, not bytes.

  1. The level is a number string.
  2. Level numbers should not contain leading zeroes.
  3. The minimum level number is 0.
  4. The maximum level number is 99.
  5. The maximum level number increment is 1.
  6. The level must be followed by a delimiter.

  7. A record_id can be a pointer string or an embedded id string.
  8. The length of a record_id is between 3 and 22 characters
  9. The record_id must be followed by a delimiter.
  10. The record_id must be unique to the file.

  11. for example:

    0 @I1@ INDI
    1 @!O1@ OBJE (I1 is implied)
    1 @I1!O1@ OBJE (duplicates not allowed)

    0 @I1@ INDI (duplicates not allowed)

  12. The tag is a alphanum string.
  13. The length of the tag is between 1 and 31 characters.
  14. The first 15 characters of the tag must be unique.

  15. A reference_id is a pointer string.
  16. The length of a reference_id is between 3 and 22 characters
  17. The reference_id must be preceded by a delimiter.
  18. The reference_id must be followed by a terminator.
  19. The presence of a reference_id implies that the record_id exists in the file unless a colon is present.
  20. If the reference_id contains an exclamation point, the record_id must exist in an embedded record contained within the same logical record.

    for example:

    0 @I1@ INDI
    1 @I1!O1@ OBJE
    1 OBJE @I1!O1@
    1 OBJE @!O1@ (I1 is implied)

    0 @I2@ INDI
    1 OBJE @I1!O1@     (not allowed)

  21. A line_value is a data string.
  22. The line_value must be preceded by a delimiter.
  23. The line_value must be followed by a terminator.
  24. If an ampersand is desired as part of the line_value, it must be included as a double-at string (i.e. name@school.edu).

  25. The maximum length of a line is 255 characters.
  26. The maximum length of a logical record is 32 kilobytes. Logical records are delineated by level numbers equal to 0 (zero). [NOT SUPPORTED]

Dictionary Tests

To validate the dictionary, Gigatrees compares the structure of the logical records to the dictionary template associated with its GEDCOM version. It also validates general dictionary constructs common to all supported GEDCOM versions.

  1. The GEDCOM version must be either "5.5", "5.5.1" or "5.6".
  2. Each line should match the dictionary template unless the line has a user defined tag beginning with an underscore.
  3. Each record_id should be referenced from within the same file.
  4. If the template expects a record_id, then the line must have a record_id of the same type.
  5. If the template expects no record_id, then the line must not have a record_id.
  6. If the template expects a reference_id, then the line must have a reference_id of the same type.
  7. If the template expects no reference_id, then the line must not have a reference_id.
  8. If the template defines a minimum number of record occurrences, then the record should not have fewer.
  9. If the template defines a maximum number of record occurrences, then the record should not have more.
  10. If the template defines a minimum line_value length, then the line_value should not be shorter.
  11. If the template defines a maximum line_value length, then the line_value should not be longer.

Validation Statuses

Gigatrees somewhat arbitrarily, divides its validation statuses into three categories, Errors, Warnings, and Alerts. Errors are critical line failures that will more than likely prevent the line from being usable by importing applications. Warnings violate the letter of the specification, but are likely to not interfere with their usability by importing applications. Alerts are not violations and are provided for information purposes only. All warnings and alerts can be easily ignored by disabling the ShowValidationWarnings and ShowValidationAlerts options. Additional options are available for controlling how and which statuses are provided in a Validation Report.

Errors

  • Unsupported GEDCOM version detected
  • Level number expected
  • Level number gap
  • Invalid ID length
  • ID missing
  • Invalid ID reference length
  • Tag Expected
  • Data contains non-printable characters
  • ID reference missing
  • Unexpected ID reference
  • Invalid ID reference type
  • ID reference substitution
  • Duplicate record found
  • Referenced record not found

Warnings

  • Level number exceeds limit
  • Level has leading zero
  • ID delimiter missing
  • Invalid ID length
  • Invalid ID character
  • Invalid ID reference length
  • Invalid ID reference character
  • Invalid tag length
  • Invalid tag character
  • Too few occurrences of tag
  • Too many occurrences of tag
  • Data contains tabs
  • Maximum line length exceeded
  • Data missing
  • Insufficient data
  • Maximum data length exceeded
  • Data not expected
  • Trailing spaces not expected
  • Trailing data not expected
  • Unpaired ampersand (@)
  • Undefined record found
  • Record not referenced

Alerts

  • User defined record found
Built with Innuendo (0.1.4)