[QUIZ] GEDCOM Parser (#6)

Ruby Quiz · Nov 5, 2004

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.grayproductions.net/ruby_quiz/

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Jamis Buck

GEDCOM Parser

GEDCOM is the "GEnealogical Data COMmunication" file format. It is a plain-text
electronic format used to transfer genealogical data. (The purpose of this quiz
is not to debate whether it is a particularly *good* file format or not--but it
is certainly more compact than the corresponding XML would be, and bandwidth was
particularly important back when the standard was developed.)

The purpose of this quiz is to develop a simple parser that can convert a GEDCOM
file to XML.

GEDCOM Format

The GEDCOM file format is very straightforward. Each line represents a node in a
tree. It looks something like this:

0 @I1@ INDI
1 NAME Jamis Gordon /Buck/
2 SURN Buck
2 GIVN Jamis Gordon
1 SEX M
...

In general, each line is formatted thus:

LEVEL TAG-OR-ID [DATA]

The LEVEL is an integer, representing the current depth in the tree. If
subsequent lines have greater levels than the current node, they are children of
the current node.

TAG-OR-ID is either a tag that identifies the type of data in that node, or it
is a unique identifier. Tags are 3- or 4-letter words in uppercase. The unique
identifiers are always text surrounded by "@" characters (e.g., "@I54@"). If an
ID is given, the DATA is the type of the subtree that is identified.

So, to take the example given above apart:

1) "0 @I1@ INDI". This starts a new subtree of type INDI (individual). The id
for this individual is "@I1@".

2) "1 NAME Jamis Gordon /Buck/". This starts a NAME subtree with a value of
"Jamis Gordon /Buck/".

3) "2 SURN Buck". This is a subelement of the NAME subtree, of type SURN
("surname").

4) "2 GIVN Jamis Gordon". As SURN, but specifies the given name of the
individual.

5) "1 SEX M". Creates a new subelement of the INDI element, of type "SEX" (i.e.,
"gender").

And so forth.

Variable whitespace is allowed between the level and the tag. Blank lines are
ignored.
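
As a rough illustration of the line format described above, here is a minimal Ruby sketch that splits one line into level, optional ID, tag, and data. The names GedLine, LINE, and parse_line are invented for this example and are not part of the quiz. It assumes an ID never contains an "@" between its delimiters, and it treats the type that follows an ID as the tag.

```ruby
# Hypothetical helper, not from the quiz: one parsed GEDCOM line.
GedLine = Struct.new(:level, :id, :tag, :data)

# LEVEL, optional @ID@, TAG, optional DATA. Whitespace between fields may vary.
LINE = /\A\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s+(.*?))?\s*\z/

def parse_line(line)
  return nil if line.strip.empty?            # blank lines are ignored
  md = LINE.match(line)
  raise ArgumentError, "bad line: #{line.inspect}" unless md
  GedLine.new(md[1].to_i, md[2], md[3], md[4])
end

indi = parse_line("0 @I1@ INDI")
name = parse_line("1   NAME Jamis Gordon /Buck/\n")
```

Here indi carries the ID "@I1@" with tag "INDI" and no data, while name has no ID and keeps the full "Jamis Gordon /Buck/" string as its data.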

The Challenge

The challenge, then, is to create a parser that takes a GEDCOM file as input and
converts it to XML. The snippet of GEDCOM given above would become:

<gedcom>
  <indi id="@I1@">
    <name value="Jamis Gordon /Buck/">
      <surn>Buck</surn>
      <givn>Jamis Gordon</givn>
    </name>
    <sex>M</sex>
    ...
  </indi>
  ...
</gedcom>
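
One way to read the example output is that a node with children carries its value in a value attribute, while a leaf node carries it as text. The sketch below is a minimal converter built on that reading. GedNode, build_tree, to_xml, and esc are names made up for this illustration, and the code assumes well-nested input whose first record is at level 0.

```ruby
# Hypothetical names throughout; a sketch, not a reference implementation.
GedNode = Struct.new(:tag, :id, :value, :children)

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def esc(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

def build_tree(text)
  root  = GedNode.new("gedcom", nil, nil, [])
  stack = [root]                          # stack[n] is the parent for level n
  text.each_line do |line|
    level, first, rest = line.strip.split(/\s+/, 3)
    next unless first                     # skip blank lines
    node = if first.start_with?("@")
             GedNode.new(rest.to_s, first, nil, [])   # "0 @I1@ INDI"
           else
             GedNode.new(first, nil, rest, [])        # "1 SEX M"
           end
    depth = level.to_i
    stack[depth].children << node         # assumes nesting is consistent
    stack[depth + 1] = node
  end
  root
end

def to_xml(node, indent = 0)
  pad   = "  " * indent
  name  = node.tag.downcase
  attrs = node.id ? %( id="#{esc(node.id)}") : ""
  if node.children.empty?
    # Leaf: the value becomes the element's text.
    "#{pad}<#{name}#{attrs}>#{esc(node.value)}</#{name}>\n"
  else
    # Has children: the value (if any) becomes an attribute.
    attrs += %( value="#{esc(node.value)}") if node.value
    body = node.children.map { |child| to_xml(child, indent + 1) }.join
    "#{pad}<#{name}#{attrs}>\n#{body}#{pad}</#{name}>\n"
  end
end

sample = <<~GED
  0 @I1@ INDI
  1 NAME Jamis Gordon /Buck/
  2 SURN Buck
  2 GIVN Jamis Gordon
  1 SEX M
GED

puts to_xml(build_tree(sample))
```

Keeping one "current parent" per level in an array avoids explicit pop logic: a line at level n always attaches to whatever was last stored at index n.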

Sample Input

There is a large GEDCOM file online containing the lineage of various European
royalty. You may download it from
http://search.cpan.org/src/PJCJ/Gedcom-1.11/royal.ged (yah, it's a CPAN link,
but it had the highest bandwidth of any other URL I found via Google). (This
particular link makes generous use of whitespace to increase the readability of
the file.)

Jim Menard · Nov 5, 2004

said:
<indi id="@I1@">
<name value="Jamis Gordon /Buck/">
<surn>Buck</surn>
<givn>Jamis Gordon</givn>
</name>
<sex>M</sex>
...
</indi>
...
</gedcom>

What determines whether the value should be an attribute or a sub-element?
It's not level because, for example, sex and name are at the same level but
name's value is an attribute where sex's is a text sub-element.

Is the rule that the value should be an attribute only if the element has
children?

Jim

James Edward Gray II · Nov 5, 2004

What determines whether the value should be an attribute or a
sub-element? It's not level because, for example, sex and name are at
the same level but name's value is an attribute where sex's is a text
sub-element.

Is the rule that the value should be an attribute only if the element
has children?

That's how I took it, but maybe we can drag Jamis out to clarify this
for us...

James Edward Gray II

James Britt · Nov 5, 2004

Jamis said:
Yup, that was the rule. If a node has children, its value should be an
attribute. Otherwise, its value should be a sub-element.

Why?

In this example

<name value="Jamis Gordon /Buck/">
<surn>Buck</surn>
<givn>Jamis Gordon</givn>
</name>

it appears that the value attribute can be derived from the immediate
children. Or vice versa. Why the duplication?

And why the embedded pseudo-markup in the value attribute ( e.g., the
use of '/')?

value="Jamis Gordon /Buck/"

Is there a spec for this XML format, or is this deliberately tricky to
make it more challenging?

James

Dave Burt · Nov 5, 2004

Jim Menard said:
What determines whether the value should be an attribute or a sub-element?
It's not level because, for example, sex and name are at the same level
but name's value is an attribute where sex's is a text sub-element.

Is the rule that the value should be an attribute only if the element has
children?

Jim

There's a GEDCOM spec at:
http://homepages.rootsweb.com/~pmcbride/gedcom/55gctoc.htm

It's a bit of a mess, somewhat contradictory...

Anyway, XML.

For the above, I think I prefer the following to Jamis' example XML:
<name>Jamis Gordon /Buck/
<surn>Buck</surn>
....

Interestingly, the sample GED file given doesn't have any names like this,
nor even any SURN elements.

"Those using the optional name pieces should assume that few systems will
process them, and most will not provide the name pieces. "
http://homepages.rootsweb.com/~pmcbride/gedcom/55gcch2.htm#PERSONAL_NAME_STRUCTURE

It does have:
1 NOTE Line 1
2 CONT Line 2
2 CONT Lin
2 CONC e 3
2 CONT Line
2 CONC 4

and:
1 SOUR @S1@
2 PAGE 1
....
0 @S1@ SOUR
1 TEXT Hello

I think these are the two interesting cases converting to XML.
The CONTinuation tag represents just a continuation of the data in the
parent element, as lines are of limited length; it has no semantic value. I
think these tags need to be understood by a GEDCOM->XML parser.

I'm thinking the above should probably turn out something like:
<note>Line 1
Line 2
Line 3
Line 4
</note>
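
A minimal Ruby sketch of that folding, assuming the lines passed in are one parent line followed only by its CONT/CONC children: CONT starts a new line, CONC appends directly to the previous text. The method name fold_continuations is made up for this example. Whitespace at a CONC join is significant, so the sketch does not trim the end of the data.

```ruby
# Hypothetical helper: merge CONT/CONC children into the parent's text.
def fold_continuations(lines)
  text = ""
  lines.each do |line|
    _level, tag, data = line.chomp.split(" ", 3)
    data = data.to_s
    text = case tag
           when "CONT" then text + "\n" + data   # continue on a new line
           when "CONC" then text + data          # continue on the same line
           else data                             # the parent line, e.g. NOTE
           end
  end
  text
end

note = fold_continuations([
  "1 NOTE Line 1",
  "2 CONT Line 2",
  "2 CONT Lin",
  "2 CONC e 3",
])
```

With this input, note ends up as three lines of text: "Line 1", "Line 2", and "Line 3".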

The second of the two fragments shows a tag (SOURce) with a value (the link
to xref-id @S1@) as well as a sub-tree. Same thing as Jamis' NAME example,
also common elsewhere in the spec. The use of the id attribute for ids is
obvious, but I'm not sure the value attribute is ideal, especially
considering that the spec states that source description (the value of the
SOURce tag) may be continued with CONT or CONC, thus may be multi-line.

Thus:
<sour>@S1@
<page>1</page>
</sour>
....
<sour id="@S1@">
<text>Hello</text>
</sour>
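
To tell a pointer such as "1 SOUR @S1@" from the record it points at ("0 @S1@ SOUR"), a converter can index the records by ID in one pass and then check every pointer. The following is only a sketch: XREF and index_and_resolve are names invented here, and it assumes a value is a pointer exactly when the whole value has the form @...@.

```ruby
# Hypothetical pattern for an xref id such as "@S1@".
XREF = /\A@[^@]+@\z/

# Returns [records, pointers, dangling]:
#   records  maps an id to the type of the record defined with it
#   pointers lists [tag, target_id] pairs for values that look like ids
#   dangling lists the pointers whose target id was never defined
def index_and_resolve(lines)
  records  = {}
  pointers = []
  lines.each do |line|
    _level, second, third = line.strip.split(/\s+/, 3)
    next unless second                       # blank line
    if XREF.match?(second)
      records[second] = third                # "0 @S1@ SOUR"
    elsif XREF.match?(third.to_s)
      pointers << [second, third]            # "1 SOUR @S1@"
    end
  end
  dangling = pointers.reject { |_tag, id| records.key?(id) }
  [records, pointers, dangling]
end

records, pointers, dangling = index_and_resolve([
  "1 SOUR @S1@",
  "2 PAGE 1",
  "0 @S1@ SOUR",
  "1 TEXT Hello",
])
```

Collecting pointers first and resolving them afterwards matters because, as in the fragment above, a pointer can appear before the record it refers to.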

Jim Menard · Nov 5, 2004

1) There are some records near the end of the data like this:

0 @F46@ F100 24709 100 24709 0 0 18316 0 0:00:01 0:00:01
0:00:00 21636100 24709 100 24709 0 0 18316 0 0:00:01 0:00:01
0:00:00 21636

It has an ID, but the data value is definitely NOT a legal XML tag name. What
does the Ruby Quiz spec have to say about that?

2) Looking at the royal.ged example, it seems that--except for the special
INDI case--if a node has children then it does not have a value.

0 @I1@ INDI
1 NAME Edward_VII /Wettin/
1 TITL King of England
1 SEX M
1 BIRT
2 DATE Tuesday, 9th November 1841
2 PLAC Buckingham,Palace,London,England
1 DEAT
2 DATE Friday, 6th May 1910
2 PLAC Buckingham,Palace,London,England
1 BURI
2 DATE Friday, 20th May 1910
2 PLAC Windsor,Berkshire,England
1 FAMS @F2@
1 FAMC @F1@
1 RIN 2

Can anybody else confirm this? I'd like to propose that this become

<indi id="@I1@">
<titl>King of England</titl>
<sex>M</sex>
<birt>
<date>Tuesday, 9th November 1841</date>
<plac>Buckingham,Palace,London,England</plac>
</birt>

</indi>

3) A warning: the first element doesn't have to be level zero.

Jim

James Britt · Nov 5, 2004

Jamis said:
Yup, it can. The duplication is because that's the way that the GEDCOM
spec was written. The NAME element includes the fullname as a value,
with the parts of the name specified as subelements. I imagine the spec
was done this way to make it easy to get the fullname without having to
search subtrees and concatenate values.

Again, that's the way GEDCOM does it. I imagine it is to make it easier
to identify the surname in situations where either (a) the surname is
not given (ie, "Jamis Gordon //"), or (b) the surname consists of
multiple words (ie, "Dick /Van Dyke/").

Ah. Still, it comes off as the sort of XML people invent when they want
to show why XML is hard to work with.

(As an aside, I think it's SVG or XUL that has attribute values that
consist of long strings of name=value pairs. Ick. It's like taking a
CSV file, wrapping it in start/end tags, and calling it XML.)

Ideally, attribute values should be semantically atomic.

No, and yes.

Honestly, if you don't like the way I've specified the values, feel free
to invent your own. I won't be hurt. I was just trying to find a way to
represent the GEDCOM-formatted data in XML, and keep the data as close
to the original as possible.

I would suggest just normalizing the format, such that there is no
mixed-markup or data duplication. value='some /thing/' just looks wrong
to me; leave out that attribute and use named elements for each chunk:

<name>
<surn>Buck</surn>
<givn>Jamis Gordon</givn>
</name>

Values then are always child elements, or the text content of the
element if the value does not require any additional semantic demarcation.

FWIW, I believe there IS a standard for representing genealogical data
in XML. But I can guarantee it will not be a one-to-one mapping between
GEDCOM to that format... I figured this would be more fun than poring
over volumes of GEDCOM and XML specifications to "get it right".

Quite true.

Just have fun with it. That's the important thing.

Very much so!

Thanks,

James
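
The slash convention discussed in this post (surname between "/" characters, possibly empty, possibly several words) can be picked apart with a small regular expression. This is a sketch under the assumption that a NAME value contains at most one slash-delimited surname; split_name is a made-up helper and not part of any posted solution.

```ruby
# Hypothetical helper: split a GEDCOM NAME value into given name and surname.
def split_name(value)
  md = %r{\A(.*?)\s*/(.*?)/\s*(.*)\z}.match(value)
  return { given: value.strip, surname: nil } unless md   # no slashes at all
  # Text before and after the slashes both count as given-name parts.
  given = [md[1], md[3]].reject(&:empty?).join(" ")
  { given: given, surname: (md[2].empty? ? nil : md[2]) }
end

split_name("Jamis Gordon /Buck/")
split_name("Jamis Gordon //")
split_name("Dick /Van Dyke/")
```

The second call reports no surname, and the third keeps "Van Dyke" together as a single surname, which are the two situations Jamis mentions.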

Jim Menard · Nov 5, 2004

Jamis said:
I can only assume this got garbled in transmission, because I can't make
heads or tails out of it. I certainly can't find anything in the
royal.ged file that looks like that. Can you please clarify?

Around 16 lines from the bottom of the file, I see that line. There is one
embedded ^M, telling me that the line endings have been screwed up somewhere.
(I used curl to retrieve the file to a Windows box. Emacs says the line
endings are Unix, but every line has a ^M at the end.)

Jim

James Edward Gray II · Nov 5, 2004

Around 16 lines from the bottom of the file, I see that line.

Here's what the actual end of the file looks like:

0 @F45@ FAM
1 HUSB @I72@
1 WIFE @I73@
1 MARR
2 DATE AFT 1989
1 RIN 137

0 @F46@ FAM
1 WIFE @I46@
1 RIN 138

0 @F47@ FAM
1 CHIL @I31@
1 RIN 139
1 SOUR @S1@
2 PAGE 1

0 @S1@ SOUR
1 TEXT Hello

0 TRLR

__END__

Hope that helps.

James

Jim Menard · Nov 5, 2004

Jamis said:
I can only imagine that something went terribly wrong when you
downloaded the file.

I figured it out: it's from curl. To download the file, I did

curl http://whatever >royal.ged

Curl's output got mixed in with the file. (There was a curl line at the top
that I had stripped out manually; that should have clued me in.) When I use
the -o flag instead, like this:

curl -o royal.ged http://whatever

then I don't see any cruft in the file and the line endings are fine.

Jim

Florian Gross · Nov 5, 2004

Is it valid for CONT/CONC tags to use pointers like in this sample?

0 INDI
1 NAME Jamis Gordon /
2 CONC @SURN@
2 CONC /
2 @SURN@ SURN Buck
2 GIVN Jamis Gordon
1 SEX M

James Britt · Nov 6, 2004

Jamis said:
Very true. As long as what your program emits is sensible and accurately
represents the data that was input, it is acceptable. In fact, Florian
Gross just mentioned that his will even output YAML.

I was thinking that, were I to try this, I would likely first read the
data and create an internal object format, then serialize that object as
XML. And, given an object, one could have alternate serialization
formats, including GEDCOM, YAML, CSV, ASN.1, and so on.

An SVG rendering might be nice.

James
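
The idea in this post, parsing into an internal object first and choosing a serializer afterwards, could look roughly like the sketch below. Record, to_gedcom, and to_plain are invented names, and the plain-hash layout used for YAML is just one possible choice.

```ruby
require "yaml"

# Hypothetical internal object: one node of the parsed tree.
Record = Struct.new(:tag, :id, :value, :children) do
  # Serialize back to GEDCOM text, one line per node.
  def to_gedcom(level = 0)
    head = [level, id, tag, value].compact.join(" ")
    [head, *children.map { |child| child.to_gedcom(level + 1) }].join("\n")
  end

  # Reduce to plain hashes and arrays so generic emitters (YAML here) can take over.
  def to_plain
    plain = { "tag" => tag }
    plain["id"] = id if id
    plain["value"] = value if value
    plain["children"] = children.map(&:to_plain) unless children.empty?
    plain
  end
end

indi = Record.new("INDI", "@I1@", nil, [
  Record.new("NAME", nil, "Jamis Gordon /Buck/", []),
  Record.new("SEX", nil, "M", []),
])

gedcom_text = indi.to_gedcom
yaml_text   = indi.to_plain.to_yaml
```

Each output format then needs only its own small method on the object, and the parser never has to know which one will be used.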

Dave Burt · Nov 6, 2004

Hans Fugal said:
Yes, and fun! http://www.fugal.net/fh/hans_pedigree.svg

Hello:
<pedigree:individual id="pedigree_AN1"
xmlns:pedigree="http://rivit.cs.byu.edu/svgpedigree">

Is an XSD or other doctype spec available?

Dave Burt · Nov 7, 2004

Hans Fugal said:
No, I'm afraid not. I don't even have the code that generates this
anymore; I wrote it when working as a research assistant in that lab and
didn't take it with me, and I think they backed out and went another
direction (postscript or pdf, I think).

Is this something you would find useful? Jamis? Others?

My solution so far is not going to be useful (it doesn't grok, it just
rewrites bits), but I'm sure a lot of useful thought is going into this for
the quiz.

Quiz -> Rubyforge? I suppose we have to wait until some answers are
submitted first.

Cheers,
Dave

Florian Gross · Nov 7, 2004

Here's my solution. It builds a tree of the Gedcom nodes.

It supports a broad subset of the Gedcom specification, can output XML,
YAML and pretty-print, has error checks and is reasonably short.

Note that the YAML representation will not reuse the IDs that were
specified in the original Gedcom file, but rather create its own. I
don't know if there is an easy way of making YAML use pre-specified IDs.

The XML representation uses <ref to="@ID@" /> for representing links.

The YAML and pp emitters blow up the stack when given the CPAN sample
data. There's not too much I can do about this.

The XML emitter tries hard to make the output as pretty as possible.
This includes trying to use value="" when appropriate. (It won't get
used when the value contains multi-line data.)

Data is read from ARGF which means either standard input or filenames
that were given on the command line.

I've also attached sample output for the file given on
http://heiner-eichmann.de/gedcom/simple.ged

module Gedcom
  class ParseError < ArgumentError; end

  class Node < Hash
    attr_accessor :value, :origin, :special_type, :id
    def special?() not @special_type.nil? end

    def initialize(origin = nil)
      @value, @origin = nil, origin
      @as_plain_hash_cache = Hash.new

      super() do |hash, key|
        hash[key] = Array.new
      end
    end

    def hash
      [@value.is_a?(Node) ? :recursive : @value, super].hash
    end

    def ==(other) self.hash == other.hash end

    def replace(other)
      super(other)
      @value, @origin = other.value, other.origin
    end

    # YAML detects self-referencing structures by comparing object_ids.
    # as_plain_hash() needs to cache the Hash it creates to make that
    # check work.
    def as_plain_hash
      if @as_plain_hash_cache.include?(self.hash)
        @as_plain_hash_cache[self.hash]
      else
        result = {}.merge(self)

        result.each do |key, values|
          if values.size == 1 then
            result[key] = values.first
          end
        end

        if not @value.nil? then
          result[:value] = @value
        end

        @as_plain_hash_cache[self.hash] = result
      end
    end
    private :as_plain_hash

    def as_value
      if @value.is_a?(String) and empty? then
        @value
      else
        as_plain_hash
      end
    end

    def to_yaml_type() "!map" end

    def to_yaml(opts = {}) as_value.to_yaml(opts) end
    def inspect() as_value.inspect end
    def pretty_print(q) as_value.pretty_print(q) end

    def to_xml(level = 0)
      require 'cgi'
      indent = " " * (level + 1)

      result = if @value.is_a?(Node) then
        "#{indent}<ref to=\"#{@value.id}\" />"
      else
        self.map do |tag, nodes|
          nodes.map do |node|
            escaped_value = if node.value.is_a?(String) then
              CGI.escapeHTML(node.value.to_s)
            end
            id_attr = node.id.nil? ? "" : " id=\"#{node.id}\""
            xml_tag = tag.downcase

            if node.value.nil? and node.empty? then
              "#{indent}<#{xml_tag}#{id_attr} />"
            elsif node.empty? and escaped_value then
              "#{indent}<#{xml_tag}#{id_attr}>" + escaped_value + "</#{xml_tag}>"
            else
              if node.value.is_a?(String) and node.value["\n"] then
                "#{indent}<#{xml_tag}#{id_attr}>\n" +
                "#{indent} #{node.value}\n" +
                node.to_xml(level + 1) + "\n" +
                "#{indent}</#{xml_tag}>"
              else
                val_attr = node.value.is_a?(String) ? " value=\"#{escaped_value}\"" : ""
                "#{indent}<#{xml_tag}#{id_attr}#{val_attr}>\n" +
                node.to_xml(level + 1) + "\n" +
                "#{indent}</#{xml_tag}>"
              end
            end
          end.join("\n")
        end.join("\n")
      end

      if level == 0 then
        result = "<gedcom>\n#{result}\n</gedcom>"
      end

      return result
    end
  end

  # Captures: level, optional id, tag, optional pointer value, optional text value.
  LineRegexp = /^\s*(\d+)\s+(?:(@\w[^@]*@)\s+)?(\w+)(?:\s+(?:(@\w[^@]*@)|(.+)))?\s*$/

  def parse(data)
    nodes = Node.new(1)
    stack = [nodes]
    node_by_id = Hash.new
    nodes_with_refs = Array.new

    data.each_with_index do |line, index|
      line_no = index + 1

      if md = LineRegexp.match(line) then
        level, id, tag, value_id, value = *md.captures
        level = level.to_i
        value.gsub!("@@", "@") if value

        if level > stack.size - 1 then
          raise(ParseError, "Inconsistent nesting at line #{line_no}")
        elsif level != stack.size - 1 then
          (stack.size - level - 1).times { stack.pop }
        end

        if stack.last.special? then
          raise(ParseError, "Can't create sub node for special node " +
            "of type #{stack.last.special_type} " +
            "(defined at #{stack.last.origin}) at #{line_no}")
        end

        new_node = Node.new(line_no)

        if id and not id.empty? then
          node_by_id[id] = new_node
          new_node.id = id
        end

        if value and not value.empty? then
          new_node.value = value
        elsif value_id and not value_id.empty? then
          nodes_with_refs << new_node
          # id is temporarily stored in value
          new_node.value = value_id
        end

        case tag
        when "CONC", "CONT" then
          new_node.special_type = tag

          if id and not id.empty? then
            raise(ParseError, "#{tag} node can't have id at line #{line_no}")
          end

          str_value = (value and not value.empty?) ? value : value_id
          separator = case tag
            when "CONC" then ""
            when "CONT" then "\n"
          end
          stack.last.value = stack.last.value.to_s + separator + str_value.to_s
        end

        unless new_node.special?
          stack.last[tag] << new_node
        end
        stack << new_node
      elsif line.strip.empty? then
        # Ignore, line contains whitespace only
      else
        raise(ParseError, "Parse error at line #{line_no}")
      end
    end

    nodes_with_refs.each do |node|
      id = node.value
      if node_by_id.include?(id) then
        node.value = node_by_id[id]
      else
        raise(ParseError, "Pointer to undefined node `#{id}' at line #{node.origin}")
      end
    end

    return nodes
  end
  module_function :parse
end

if __FILE__ == $0 then
  data = ARGF.read

  require 'pp'
  puts "Pretty-printed:"
  begin
    pp Gedcom.parse(data)
  rescue SystemStackError
    puts "Sorry, pp blowed up the stack."
  end

  require 'yaml'
  puts "", "As YAML:"
  begin
    y Gedcom.parse(data)
  rescue SystemStackError
    puts "Sorry, YAML blowed up the stack."
  end

  puts "", "As XML:"
  puts Gedcom.parse(data).to_xml
end

Florian Gross · Nov 7, 2004

Florian said:
I've also attached sample output for the file given on
http://heiner-eichmann.de/gedcom/simple.ged

....to this follow-up posting.

Pretty-printed:
{"HEAD"=>
{"CHAR"=>"ASCII",
"SUBM"=>
{:value=>
{"NAME"=>"/Submitter/",
"ADDR"=>"Submitters address\naddress continued here"}},
"SOUR"=>"ID_OF_CREATING_FILE",
"GEDC"=>{"FORM"=>"Lineage-Linked", "VERS"=>"5.5"}},
"SUBM"=>
{"NAME"=>"/Submitter/",
"ADDR"=>"Submitters address\naddress continued here"},
"TRLR"=>{},
"FAM"=>
{"HUSB"=>
{:value=>
{"NAME"=>"/Father/",
"FAMS"=>{:value=>{...}},
"SEX"=>"M",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}}},
"CHIL"=>
{:value=>
{"NAME"=>"/Child/",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"31 JUL 1950"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"29 FEB 2000"},
"FAMC"=>{:value=>{...}}}},
"MARR"=>{"PLAC"=>"marriage place", "DATE"=>"1 APR 1950"},
"WIFE"=>
{:value=>
{"NAME"=>"/Mother/",
"FAMS"=>{:value=>{...}},
"SEX"=>"F",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}}}},
"INDI"=>
[{"NAME"=>"/Father/",
"FAMS"=>
{:value=>
{"HUSB"=>{:value=>{...}},
"CHIL"=>
{:value=>
{"NAME"=>"/Child/",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"31 JUL 1950"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"29 FEB 2000"},
"FAMC"=>{:value=>{...}}}},
"MARR"=>{"PLAC"=>"marriage place", "DATE"=>"1 APR 1950"},
"WIFE"=>
{:value=>
{"NAME"=>"/Mother/",
"FAMS"=>{:value=>{...}},
"SEX"=>"F",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}}}}},
"SEX"=>"M",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}},
{"NAME"=>"/Mother/",
"FAMS"=>
{:value=>
{"HUSB"=>
{:value=>
{"NAME"=>"/Father/",
"FAMS"=>{:value=>{...}},
"SEX"=>"M",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}}},
"CHIL"=>
{:value=>
{"NAME"=>"/Child/",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"31 JUL 1950"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"29 FEB 2000"},
"FAMC"=>{:value=>{...}}}},
"MARR"=>{"PLAC"=>"marriage place", "DATE"=>"1 APR 1950"},
"WIFE"=>{:value=>{...}}}},
"SEX"=>"F",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}},
{"NAME"=>"/Child/",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"31 JUL 1950"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"29 FEB 2000"},
"FAMC"=>
{:value=>
{"HUSB"=>
{:value=>
{"NAME"=>"/Father/",
"FAMS"=>{:value=>{...}},
"SEX"=>"M",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}}},
"CHIL"=>{:value=>{...}},
"MARR"=>{"PLAC"=>"marriage place", "DATE"=>"1 APR 1950"},
"WIFE"=>
{:value=>
{"NAME"=>"/Mother/",
"FAMS"=>{:value=>{...}},
"SEX"=>"F",
"BIRT"=>{"PLAC"=>"birth place", "DATE"=>"1 JAN 1899"},
"DEAT"=>{"PLAC"=>"death place", "DATE"=>"31 DEC 1990"}}}}}}]}

As YAML:
---
HEAD:
  CHAR: ASCII
  SUBM:
    :value: &id001
      NAME: "/Submitter/"
      ADDR: >-
        Submitters address

        address continued here

  SOUR: ID_OF_CREATING_FILE
  GEDC:
    FORM: Lineage-Linked
    VERS: "5.5"
SUBM: *id001
TRLR: {}
FAM: &id002
  HUSB:
    :value: &id003
      NAME: "/Father/"
      FAMS:
        :value: *id002
      SEX: M
      BIRT:
        PLAC: birth place
        DATE: 1 JAN 1899
      DEAT:
        PLAC: death place
        DATE: 31 DEC 1990
  CHIL:
    :value: &id005
      NAME: "/Child/"
      BIRT:
        PLAC: birth place
        DATE: 31 JUL 1950
      DEAT:
        PLAC: death place
        DATE: 29 FEB 2000
      FAMC:
        :value: *id002
  MARR:
    PLAC: marriage place
    DATE: 1 APR 1950
  WIFE:
    :value: &id004
      NAME: "/Mother/"
      FAMS:
        :value: *id002
      SEX: F
      BIRT:
        PLAC: birth place
        DATE: 1 JAN 1899
      DEAT:
        PLAC: death place
        DATE: 31 DEC 1990
INDI:
  - *id003
  - *id004
  - *id005

As XML:
<gedcom>
<head>
<char>ASCII</char>
<subm>
<ref to="@SUBMITTER@" />
</subm>
<sour>ID_OF_CREATING_FILE</sour>
<gedc>
<form>Lineage-Linked</form>
<vers>5.5</vers>
</gedc>
</head>
<subm id="@SUBMITTER@">
<name>/Submitter/</name>
<addr>Submitters address
address continued here</addr>
</subm>
<trlr />
<fam id="@FAMILY@">
<husb>
<ref to="@FATHER@" />
</husb>
<chil>
<ref to="@CHILD@" />
</chil>
<marr>
<plac>marriage place</plac>
<date>1 APR 1950</date>
</marr>
<wife>
<ref to="@MOTHER@" />
</wife>
</fam>
<indi id="@FATHER@">
<name>/Father/</name>
<fams>
<ref to="@FAMILY@" />
</fams>
<sex>M</sex>
<deat>
<plac>death place</plac>
<date>31 DEC 1990</date>
</deat>
<birt>
<plac>birth place</plac>
<date>1 JAN 1899</date>
</birt>
</indi>
<indi id="@MOTHER@">
<name>/Mother/</name>
<fams>
<ref to="@FAMILY@" />
</fams>
<sex>F</sex>
<deat>
<plac>death place</plac>
<date>31 DEC 1990</date>
</deat>
<birt>
<plac>birth place</plac>
<date>1 JAN 1899</date>
</birt>
</indi>
<indi id="@CHILD@">
<name>/Child/</name>
<deat>
<plac>death place</plac>
<date>29 FEB 2000</date>
</deat>
<birt>
<plac>birth place</plac>
<date>31 JUL 1950</date>
</birt>
<famc>
<ref to="@FAMILY@" />
</famc>
</indi>
</gedcom>

Dennis Ranke · Nov 7, 2004

Here is a very simple solution. It doesn't try to understand much of the
contents of the .ged file, it just builds a tree of nodes and dumps them
to xml.

#!/usr/bin/env ruby

require 'cgi'

class String
  def indent(width)
    map {|line| ' ' * width + line.chomp}.join("\n")
  end
end

class Node
  def initialize(type, value)
    @type = type
    @value = CGI.escapeHTML(value)
    @children = []
  end

  def <<(child)
    @children << child
  end

  def to_xml
    children = @children.map{|c| c.to_xml}.join("\n").indent(2)
    if @type[0] == ?@
      return "<%s id=\"%s\">\n%s\n</%s>" %
        [@value.downcase, @type, children, @value.downcase]
    elsif children.empty?
      return "<%s>%s</%s>" % [@type.downcase, @value, @type.downcase]
    else
      if @value.empty?
        return "<%s>\n%s\n</%s>" %
          [@type.downcase, children, @type.downcase]
      else
        return "<%s value=\"%s\">\n%s\n</%s>" %
          [@type.downcase, @value, children, @type.downcase]
      end
    end
  end
end

if ARGV.size != 2
  puts "Usage: gedparser.rb [input.ged] [output.xml]"
  exit
end

root = []
stack = [root]
File.readlines(ARGV[0]).each do |line|
  line = line.strip
  depth, type, value = line.split(/\s+/, 3)
  next unless type
  value ||= ''
  node = Node.new(type, value)
  stack[depth.to_i] << node
  stack[depth.to_i + 1] = node
end

xml = root.map {|node| node.to_xml}.join("\n")
xml = "<gedcom>\n" + xml.indent(2) + "\n</gedcom>"

File.open(ARGV[1], 'w') {|f| f.puts xml}

James Edward Gray II · Nov 7, 2004

Here is a very simple solution. It doesn't try to understand much of
the contents of the .ged file, it just builds a tree of nodes and
dumps them to xml.

I took a similar approach, parse and print. My code doesn't really
understand GEDCOM.

James Edward Gray II

#!/usr/bin/env ruby

class GEDCOMTree
  def self.parse( io )
    prev = -1
    root = cur = GEDCOMTree.new("gedcom")
    while line = io.gets
      if md = /^\s*(\d+)\s+(@[^@]+@)\s+(.+?)\s*$/.match(line)
        tree = GEDCOMTree.new(md[3], md[2])
      elsif md = /^\s*(\d+)\s+([A-Z]{3,4})\s*(.*?)\s*$/.match(line)
        tree = GEDCOMTree.new(md[2], md[3])
      else
        next
      end

      if md[1].to_i == prev
        cur = cur.parent
      elsif md[1].to_i < prev
        count = md[1].to_i
        while count <= prev
          cur = cur.parent
          count += 1
        end
      end

      cur << tree
      cur = tree
      prev = md[1].to_i
    end

    root
  end

  attr_accessor :parent

  def initialize( type, value = nil )
    @type = type
    @value = value

    @subtrees = [ ]
  end

  def <<( subtree )
    subtree.parent = self

    @subtrees << subtree
  end

  def to_xml( indent = 0 )
    if @subtrees.size == 0 and (@value.nil? or @value.length == 0)
      return "\t" * indent + "<#{@type.downcase} />\n"
    end

    tag = "\t" * indent + "<#{@type.downcase}"
    if @subtrees.size > 0
      if @value.nil? or @value.length == 0
        tag += ">\n"
      else
        if @value[0] == ?@
          tag += " id=\"#@value\">\n"
        else
          tag += " value=\"#@value\">\n"
        end
      end
      @subtrees.each { |e| tag += e.to_xml(indent + 1) }
    else
      tag += ">#@value"
    end
    if tag[-1, 1] == "\n"
      tag + "\t" * indent + "</#{@type.downcase}>\n"
    else
      tag + "</#{@type.downcase}>\n"
    end
  end
end

if $0 == __FILE__
  puts GEDCOMTree.parse(ARGF).to_xml
end

Dave Burt · Nov 7, 2004

Solution: http://dave.burt.id.au/ruby/gedcom.rb
Sample input: http://dave.burt.id.au/ruby/royal.ged
Sample output: http://dave.burt.id.au/ruby/royal.xml

My solution doesn't build any trees, just an XML string, but it does do its
best to represent IDs and x-refs and things like CONTs and CONCs, and even
indents the XML nicely.

* @s are removed from all IDs
* CONTs and CONCs are aggregated into a single text node, semantically
required whitespace preserved
* cross-references look like <xref>id</xref>
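
The linked solution is not reproduced in the thread, so the following is not Dave's code. It is only a sketch of the general technique of emitting XML while reading, without building a tree first: keep a stack of open tags and close elements whenever the level drops. stream_to_xml and esc are made-up names, the "@" stripping and the xref element follow the bullets above, and CONT/CONC handling and indentation are left out to keep it short.

```ruby
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def esc(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Hypothetical streaming converter: no tree, just a stack of open tag names.
def stream_to_xml(text)
  out  = ["<gedcom>"]
  open = []                                   # open[n] is the open tag at level n
  text.each_line do |line|
    level, first, rest = line.strip.split(/\s+/, 3)
    next unless first                         # blank line
    depth = level.to_i
    out << "</#{open.pop}>" while open.size > depth   # close finished elements
    if first.start_with?("@")                 # record definition: "0 @I1@ INDI"
      tag = rest.to_s.downcase
      out << %(<#{tag} id="#{first.delete('@')}">)
    elsif rest.to_s =~ /\A@([^@]+)@\z/        # pointer value: "1 FAMS @F2@"
      tag = first.downcase
      out << "<#{tag}><xref>#{$1}</xref>"
    else                                      # plain value (possibly empty)
      tag = first.downcase
      out << "<#{tag}>#{esc(rest)}"
    end
    open << tag
  end
  out << "</#{open.pop}>" until open.empty?
  out << "</gedcom>"
  out.join
end

puts stream_to_xml("0 @F1@ FAM\n1 HUSB @I1@\n")
# prints <gedcom><fam id="F1"><husb><xref>I1</xref></husb></fam></gedcom>
```

Because values are written as element text before any children, a node with both a value and children comes out as mixed content rather than with a value attribute.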

Dave Burt · Nov 8, 2004

Hans Fugal said:
Jamis actually wrote a GEDCOM parser once, which you can find online. He
recently rewrote it in response to some questions I had about it (and/or
my prodding?), and he's talked about putting it on rubyforge. It's quite
slick, much slicker than his original one if you've seen it. It's
callback-based, and doesn't do any validation at this point. I've tinkered
along the road to making a validating parser based on his that produces a
populated object model (a la DOM) but it's quite a bit more work than just
parsing. That would also be useful though, so if working on this kind of
thing is interesting to you I say let's collaborate.

Sounds slightly interesting.

I for one am interested in generating interesting wall charts, e.g.
GEDCOM->SVG->PostScript.

Sounds a lot more useful.
I'm not sure how I can be useful to the project, though.
