GEDCOM validation, in general, requires that a GEDCOM file meet the specification requirements of both the GEDCOM Grammar (line and file syntax) and the GEDCOM Dictionary (record format, data types, data formats and data values). Data consistency is not part of GEDCOM validation. The Gigatrees GEDCOM Validator is only a partial validator in that it does not, in all cases, validate data formats or data values. It also does not have 100% coverage of the validation tests listed below.
The GEDCOM Validator will validate that files meet the GEDCOM 5.5 grammar specification and whichever GEDCOM dictionary is appropriate. The following dictionaries are currently supported:
- GEDCOM 5.5 Rev. 1 (January 2, 1996)
- GEDCOM 5.5 Rev. 2 (January 10, 1996)
- GEDCOM 5.5.1
- GEDCOM 5.6
() parentheses = grouped components
[] brackets = optional components
* asterisk = multiple occurrences of a component
- dash = range of values of a component
| pipe = alternative components (or)
Character                 ASCII value
=========                 ===========
tab                     = 0x09
line feed               = 0x0A
carriage return         = 0x0D
space                   = 0x20
exclamation point (!)   = 0x21
cross hatch (#)         = 0x23
colon (:)               = 0x3A
ampersand (@)           = 0x40
underscore (_)          = 0x5F

Note: throughout this grammar, "ampersand" denotes the at sign (@, 0x40), not the "&" character.
Character set             ASCII range
=============             ===========
number digit (0-9)      = (0x30 - 0x39)
alpha char (a-zA-Z_)    = (0x41 - 0x5A) | (0x61 - 0x7A) | 0x5F
non-alpha char          = (0x21 - 0x2F) | (0x3A - 0x3F) | (0x5B - 0x5E) | (0x7B - 0x7E) | (0x80 - 0xFE) | 0x60
alphanum = (alpha char | number digit)
printable character = alphanum | non-alpha char | space | cross hatch
double-at string (@@) = ampersand + ampersand
number string = number digit + [number digit]*
alphanum string = alphanum + [alphanum]*
pointer id = (alphanum | exclamation point) + [printable character]*
pointer string = ampersand + pointer id + ampersand
embedded id string = ampersand + [pointer id +] exclamation point + pointer id + ampersand
escape string = ampersand + cross hatch + (printable character | double-at string)* + ampersand + [space] + (printable character)*
value string = printable character + [printable character]*
data string = (value string | escape string) [+ (value string | escape string)]*
delimiter = space
terminator = carriage return | line feed | (carriage return + line feed) | (line feed + carriage return)
whitespace = ([tab]* + [space]* + [terminator]*)*
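As an illustration only, a few of the definitions above can be transcribed into Python regular expressions. This is a minimal sketch and not how Gigatrees implements its checks. The names `NUMBER_STRING`, `ALPHANUM_STRING`, `POINTER_STRING` and `is_pointer_string` are hypothetical. Because the exclamation point is itself a printable character, the pointer string pattern also accepts ids of the form `@I1!O1@`.

```python
import re

# Character classes transcribed from the character set definitions.
DIGIT = r"0-9"
ALPHA = r"A-Za-z_"
ALPHANUM = ALPHA + DIGIT
# non-alpha char: 0x21-0x2F, 0x3A-0x3F, 0x5B-0x5E, 0x60, 0x7B-0x7E, 0x80-0xFE
NON_ALPHA = r"\x21-\x2F\x3A-\x3F\x5B-\x5E\x60\x7B-\x7E\x80-\xFE"
# printable character = alphanum | non-alpha char | space
# (the cross hatch, 0x23, already falls inside 0x21-0x2F; 0x40 is excluded)
PRINTABLE = ALPHANUM + NON_ALPHA + r"\x20"

NUMBER_STRING = re.compile(rf"[{DIGIT}]+")
ALPHANUM_STRING = re.compile(rf"[{ALPHANUM}]+")
# pointer string = ampersand + pointer id + ampersand, where
# pointer id = (alphanum | exclamation point) + [printable character]*
POINTER_STRING = re.compile(rf"@[{ALPHANUM}!][{PRINTABLE}]*@")


def is_pointer_string(text):
    """Return True if the whole text is a pointer string."""
    return POINTER_STRING.fullmatch(text) is not None
```

For example, `is_pointer_string("@I1@")` is true, while `is_pointer_string("@I1")` and `is_pointer_string("@@")` are false.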
Validation Tests
GEDCOM validation testing includes two types of tests: GEDCOM Grammar and GEDCOM Dictionary.
Grammar Line Syntax
All of the supported GEDCOM Dictionaries use the same GEDCOM 5.5 Grammar, which defines a line as having the following syntax:
line = [whitespace +] level + [delim + record_id +] delim + tag + [delim + reference_id +] terminator
line = [whitespace +] level + [delim + record_id +] delim + tag + [delim + line_value +] terminator
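The line syntax can be sketched as a single regular expression that splits a line into its components. This is a simplified illustration under stated assumptions, not the validator's actual parser: `LINE_RE` and `parse_line` are hypothetical names, the delimiter is taken to be exactly one space, no length limits are enforced, and whatever follows the tag (a reference_id or a line_value) is returned as `value`.

```python
import re

LINE_RE = re.compile(
    r"^[\t \r\n]*"                   # optional leading whitespace
    r"(?P<level>[0-9]+)"             # level: a number string
    r"(?: (?P<record_id>@[^@]+@))?"  # optional record_id between at signs
    r" (?P<tag>[A-Za-z0-9_]+)"       # tag: an alphanum string
    r"(?: (?P<value>.*))?$"          # optional reference_id or line_value
)


def parse_line(line):
    """Split one GEDCOM line into its parts, or return None if it does not fit."""
    match = LINE_RE.match(line.rstrip("\r\n"))  # drop the terminator first
    return match.groupdict() if match else None
```

With this sketch, `parse_line("0 @I1@ INDI\n")` yields a level of `"0"`, a record_id of `"@I1@"`, a tag of `"INDI"` and no value, while `parse_line("1 OBJE @I1!O1@")` yields no record_id and a value of `"@I1!O1@"`.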
The following is a list of requirements of the GEDCOM 5.5 Grammar. Unsupported tests will be noted. String lengths are measured in characters, not bytes.
- The level is a number string.
- Level numbers should not contain leading zeroes.
- The minimum level number is 0.
- The maximum level number is 99.
- The maximum level number increment is 1.
- The level must be followed by a delimiter.
- A record_id can be a pointer string or an embedded id string.
- The length of a record_id is between 3 and 22 characters.
- The record_id must be followed by a delimiter.
- The record_id must be unique to the file.
- The tag is an alphanum string.
- The length of the tag is between 1 and 31 characters.
- The first 15 characters of the tag must be unique.
- A reference_id is a pointer string.
- The length of a reference_id is between 3 and 22 characters.
- The reference_id must be preceded by a delimiter.
- The reference_id must be followed by a terminator.
- The presence of a reference_id implies that the record_id exists in the file unless a colon is present.
- If the reference_id contains an exclamation point, the record_id must exist in an embedded record contained within the same logical record.
0 @I1@ INDI
1 @I1!O1@ OBJE
1 OBJE @I1!O1@
1 OBJE @!O1@ (I1 is implied)
0 @I2@ INDI
1 OBJE @I1!O1@ (not allowed)
- A line_value is a data string.
- The line_value must be preceded by a delimiter.
- The line_value must be followed by a terminator.
- If an ampersand is desired as part of the line_value, it must be included as a double-at string (e.g. user@@example.com).
- The maximum length of a line is 255 characters.
- The maximum length of a logical record is 32 kilobytes. Logical records are delineated by level numbers equal to 0 (zero). [NOT SUPPORTED]
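The line_value rules on doubled at signs and on line length can be approximated with a few helper functions. This is a rough sketch with hypothetical names (`escape_at_signs`, `has_unpaired_at`, `line_too_long`). It assumes that pointers and escape strings contribute at signs in pairs, so an odd count after removing doubled at signs signals an unpaired one, and that the terminator is not counted toward the 255-character limit. Neither assumption is confirmed by the specification text above.

```python
MAX_LINE_LENGTH = 255  # measured in characters, not bytes


def escape_at_signs(text):
    """Double every at sign so the text can be written as a line_value."""
    return text.replace("@", "@@")


def has_unpaired_at(value):
    """Heuristic: drop doubled at signs, then look for an odd number of the rest."""
    return value.replace("@@", "").count("@") % 2 == 1


def line_too_long(line):
    """Check the line length, assuming the terminator is not counted."""
    return len(line.rstrip("\r\n")) > MAX_LINE_LENGTH
```

Python's `len` counts characters rather than encoded bytes, which matches the requirement that string lengths be measured in characters.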
0 @I1@ INDI
1 @!O1@ OBJE (I1 is implied)
1 @I1!O1@ OBJE (duplicates not allowed)
0 @I1@ INDI (duplicates not allowed)
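The level-number rules and the uniqueness rule for record_ids, including the implied-parent case shown in the example above, could be checked along the following lines. This is a sketch only: `check_levels_and_ids` is a hypothetical function, the input is assumed to be already split into `(level_text, record_id)` pairs, and pairing each rule with a status name from this document's status list is an assumption.

```python
def check_levels_and_ids(lines):
    """lines: iterable of (level_text, record_id or None) in file order."""
    problems = []
    seen = set()
    previous_level = None
    current_root = None  # id of the enclosing level-0 record, without at signs
    for number, (level_text, record_id) in enumerate(lines, start=1):
        if not (level_text.isascii() and level_text.isdigit()):
            problems.append((number, "Level number expected"))
            continue
        if len(level_text) > 1 and level_text.startswith("0"):
            problems.append((number, "Level has leading zero"))
        level = int(level_text)
        if level > 99:
            problems.append((number, "Level number exceeds limit"))
        if previous_level is not None and level > previous_level + 1:
            problems.append((number, "Level number gap"))
        previous_level = level
        if level == 0:
            current_root = record_id.strip("@") if record_id else None
        if record_id is not None:
            full_id = record_id
            # "@!O1@" inside record I1 is shorthand for "@I1!O1@".
            if record_id.startswith("@!") and current_root:
                full_id = "@" + current_root + record_id[1:]
            if full_id in seen:
                problems.append((number, "Duplicate record found"))
            seen.add(full_id)
    return problems
```

Applied to the four lines of the duplicate example, the sketch reports a duplicate on the third and fourth lines.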
To validate the dictionary, Gigatrees compares the structure of the logical records to the dictionary template associated with its GEDCOM version. It also validates general dictionary constructs common to all supported GEDCOM versions.
- The GEDCOM version must be either "5.5", "5.5.1" or "5.6".
- Each line should match the dictionary template unless the line has a user-defined tag beginning with an underscore.
- Each record_id should be referenced from within the same file.
- If the template expects a record_id, then the line must have a record_id of the same type.
- If the template expects no record_id, then the line must not have a record_id.
- If the template expects a reference_id, then the line must have a reference_id of the same type.
- If the template expects no reference_id, then the line must not have a reference_id.
- If the template defines a minimum number of record occurrences, then the record should not have fewer.
- If the template defines a maximum number of record occurrences, then the record should not have more.
- If the template defines a minimum line_value length, then the line_value should not be shorter.
- If the template defines a maximum line_value length, then the line_value should not be longer.
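The occurrence and length checks above can be illustrated with a toy template. Everything in this sketch is hypothetical: the `TEMPLATE` layout, the limits in it (which are not taken from any GEDCOM dictionary), the `check_children` function and the wording of the messages. It only shows the general shape of comparing a record's child lines against per-tag limits.

```python
from collections import Counter

# Hypothetical template for the children of one record: tag -> limits.
# The numbers are illustrative only.
TEMPLATE = {
    "NAME": {"min": 1, "max": 3, "min_len": 1, "max_len": 120},
    "SEX": {"min": 0, "max": 1, "min_len": 1, "max_len": 1},
}


def check_children(children, template):
    """children: list of (tag, value) pairs found under one record."""
    problems = []
    counts = Counter(tag for tag, _ in children)
    for tag, rule in template.items():
        if counts[tag] < rule["min"]:
            problems.append((tag, "too few occurrences"))
        if counts[tag] > rule["max"]:
            problems.append((tag, "too many occurrences"))
    for tag, value in children:
        if tag.startswith("_"):
            continue  # user-defined tags are not matched against the template
        rule = template.get(tag)
        if rule is None:
            problems.append((tag, "not in template"))
        elif len(value) < rule["min_len"]:
            problems.append((tag, "value too short"))
        elif len(value) > rule["max_len"]:
            problems.append((tag, "value too long"))
    return problems
```

For instance, a record with two `SEX` lines and no `NAME` line is reported as having too few `NAME` and too many `SEX` occurrences, while a `_UID` line is skipped.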
Gigatrees, somewhat arbitrarily, divides its validation statuses into three categories: Errors, Warnings, and Alerts. Errors are critical line failures that will more than likely prevent the line from being usable by importing applications. Warnings violate the letter of the specification but are unlikely to interfere with the line's usability by importing applications. Alerts are not violations and are provided for informational purposes only. All warnings and alerts can be suppressed by disabling the ShowValidationWarnings and ShowValidationAlerts options. Additional options are available for controlling how and which statuses are provided in a Validation Report.
- Unsupported GEDCOM version detected
- Level number expected
- Level number gap
- Invalid ID length
- ID missing
- Invalid ID reference length
- Tag Expected
- Data contains non-printable characters
- ID reference missing
- Unexpected ID reference
- Invalid ID reference type
- ID reference substitution
- Duplicate record found
- Referenced record not found
- Level number exceeds limit
- Level has leading zero
- ID delimiter missing
- Invalid ID length
- Invalid ID character
- Invalid ID reference length
- Invalid ID reference character
- Invalid tag length
- Invalid tag character
- Too few occurrences of tag
- Too many occurrences of tag
- Data contains tabs
- Maximum line length exceeded
- Data missing
- Insufficient data
- Maximum data length exceeded
- Data not expected
- Trailing spaces not expected
- Trailing data not expected
- Unpaired ampersand (@)
- Undefined record found
- Record not referenced
- User defined record found
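The effect of the ShowValidationWarnings and ShowValidationAlerts options can be pictured as a filter over categorized statuses. This is a conceptual sketch, not the Gigatrees configuration format or API: the `Severity` enum, the `filter_statuses` function and its keyword arguments are hypothetical stand-ins for the two options, and the sample messages are placeholders.

```python
from enum import Enum


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    ALERT = "alert"


def filter_statuses(statuses, show_warnings=True, show_alerts=True):
    """Keep errors always; keep warnings and alerts only when enabled."""
    keep = {Severity.ERROR}
    if show_warnings:
        keep.add(Severity.WARNING)
    if show_alerts:
        keep.add(Severity.ALERT)
    return [status for status in statuses if status[0] in keep]


statuses = [
    (Severity.ERROR, "sample error"),
    (Severity.WARNING, "sample warning"),
    (Severity.ALERT, "sample alert"),
]
errors_only = filter_statuses(statuses, show_warnings=False, show_alerts=False)
```

Here `errors_only` contains just the error entry, since errors are never filtered out in this sketch.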