[SUMMARY] GEDCOM Parser (#6)

Ruby Quiz

This quiz generated some interesting discussion on Ruby Talk. Some details of
the GEDCOM format were talked about and the proper way to handle XML output was
debated. I don't want to fill this summary with that conversation (visit the
archives for the "[QUIZ]" thread, if you missed it), but I do want to point out
the best point made about the XML format. Here's what Hans Fugal had to say
about it:

I take issue with the example XML in the quiz because I am from the
"data in text, metadata in attributes" camp, and the name is not
metadata. Here is a snippet of the output that I am generating:

<INDI id='@I11@'>
  <NAME>Itha /Steele/<SURN>Steele</SURN>
    <GIVN>Itha</GIVN>
  </NAME>
  <SEX>F</SEX>
  <_UID>38CC16658231D511ACB8E07C9CE21378E1AF</_UID>
  <FAMS>@F12@</FAMS>
  <FAMC>@F4@</FAMC>
</INDI>

I have yet to fall in love with XML like the rest of the world, but I think Hans
makes a good point here.

The solution submitted by Florian Gross also supported YAML and "pretty print"
output.

Formats aside, submitted solutions varied in how much they interpreted from the
GEDCOM file as opposed to simple translation. The most common change between
input and output was to build a single entity out of GEDCOM's CONC and CONT
fields.
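
To make that CONC/CONT merging concrete, here is a minimal sketch. It is not taken from any submission: the merge_continuations helper, the CONTINUATION_LINE pattern, and the hash layout are all hypothetical names chosen for illustration. It assumes the usual GEDCOM convention that CONC appends text directly to the preceding value, while CONT appends it after a line break.

```ruby
# Hypothetical sketch: fold CONC/CONT lines into the record they continue.
# Assumes CONC joins text with no separator and CONT joins with a newline.
CONTINUATION_LINE = /^\s*(\d+)\s+(\S+)(?:\s(.*))?$/

def merge_continuations(lines)
  records = []
  lines.each do |line|
    next if line =~ /^\s*$/

    m = CONTINUATION_LINE.match(line.chomp) or
      raise "Invalid GEDCOM: #{line.inspect}"
    level = m[1].to_i
    tag   = m[2]
    data  = String.new(m[3].to_s) # unfrozen copy, so it can be appended to

    if %w[CONC CONT].include?(tag)
      raise "#{tag} with nothing to continue" if records.empty?
      records.last[:data] << "\n" if tag == 'CONT'
      records.last[:data] << data
    else
      records << { level: level, tag: tag, data: data }
    end
  end
  records
end
```

Fed a NOTE line followed by one CONC and one CONT line, this yields a single NOTE record whose data holds the joined text. A fuller version would also check that the continuation sits exactly one level below the line it extends.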

XML generation techniques also varied among submissions. Some of us built up
our own Strings, others used REXML, Cedric Foll used XmlSimple, and Jim
Weirich used his Builder package, described here:

http://onestepback.org/index.cgi/Tech/Ruby/BuilderObjects.rdoc

Now let's look at a solution. Here's the code submitted by Hans Fugal:

#! /usr/bin/ruby
require 'rexml/document'

doc = REXML::Document.new "<gedcom/>"
stack = [doc.root]

ARGF.each_line do |line|
  next if line =~ /^\s*$/

  # parse line
  line =~ /^\s*(\d+)\s+(@\S+@|\S+)(\s(.*))?$/ or raise "Invalid GEDCOM"
  level = $1.to_i
  tag = $2
  data = $4

  # pop off the stack until we get the parent
  while (level+1) < stack.size
    stack.pop
  end
  parent = stack.last

  # create XML tag
  el = nil
  if tag =~ /@.+@/
    el = parent.add_element data
    el.attributes['id'] = tag
  else
    el = parent.add_element tag
    el.text = data
  end

  stack.push el
end
doc.write($stdout,0)
puts

The above starts by creating a REXML document and a stack for managing
parent/child relationships. With setup out of the way, the code reads from
STDIN or files specified as command-line arguments, line by line.

Each line is processed in three stages: parse the line, unwind the stack to
this element's parent, and finally add the element to the parent through the
REXML API and push the new element onto the stack.

When it's all been read, the complete XML is dumped to STDOUT.
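
To see what the parsing stage pulls out of a line, the following sketch wraps the same regular expression from Hans's code in a small method. The parse_gedcom_line name and the PARSE_PATTERN constant are hypothetical, added only for illustration; the pattern itself is copied unchanged.

```ruby
# Hypothetical wrapper around the line pattern used in the solution above.
PARSE_PATTERN = /^\s*(\d+)\s+(@\S+@|\S+)(\s(.*))?$/

# Returns [level, tag, data]; data is nil when the line carries no value.
def parse_gedcom_line(line)
  m = PARSE_PATTERN.match(line) or raise "Invalid GEDCOM"
  [m[1].to_i, m[2], m[4]]
end

p parse_gedcom_line("0 @I11@ INDI")
p parse_gedcom_line("1 SEX F")
p parse_gedcom_line("1 BIRT")
```

Note that for a record line such as "0 @I11@ INDI" the second field is the cross-reference id and the third is the record type, which is why the solution swaps them: the data becomes the element name and the tag becomes the id attribute.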

Obviously, Hans' solution doesn't do any special handling of the GEDCOM format.
It's a simple parse and print solution.

If you aren't going to use a great library like REXML to generate XML output,
remember to handle your own escaping. (See my submission for an example of code
that forgot to do this! Oops.)
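
For anyone building XML out of plain Strings, a minimal sketch of that escaping might look like the following. The escape_xml and element helpers and the XML_ESCAPES table are hypothetical names, not part of any submission; the sketch only covers the five predefined XML entities and assumes tag names are already valid.

```ruby
# Hypothetical helpers for hand-built XML: replace the five characters
# that have predefined XML entities.
XML_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&apos;'
}.freeze

def escape_xml(text)
  # gsub with a Hash looks up each matched character in the table.
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Wraps escaped text in a tag; the tag name itself is not validated.
def element(tag, text)
  "<#{tag}>#{escape_xml(text)}</#{tag}>"
end

puts element('NOTE', 'Smith & Sons <Ltd>')
```

Escaping in one pass matters here: replacing '&' in a separate later step would mangle the entities already produced for the other characters.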

If you are a person who deals with GEDCOM files outside of this quiz, you may
want to check out these links passed to me by Jamis Buck:

If you go to http://www.familysearch.com, you can search for
ancestors/relatives and download their information in GEDCOM format.

There are also a variety of tools available for taking a GEDCOM file
and creating a website from it. In fact, if you go to
http://www.onepagegenealogy.com you can have a wall-chart sized
pedigree chart printed from a GEDCOM file for $20. (That particular
project is a research project here at BYU.)

My thanks go out to Jamis for our second contributed quiz, and a good topic at
that. I've got two more contributed quizzes waiting in the wings, which is
great news. Keep 'em coming!

Stay tuned folks, because it's Game Show time with tomorrow's Ruby Quiz...
 
