REXML libraries and parsing issues

BA · Jun 24, 2005

First off, let me say right up front that I am a newbie wrt Ruby.

I am trying to parse an XML file but am having all kinds of trouble.
I am using the REXML libraries and the sax2parser/listener. In the
sax2listener, I can use the character/text part of the method, but I
cannot for the life of me figure out how to parse out JUST WHAT I WANT.
Here is what the file looks like:

<B110><DNUM><PDAT> this is the text I need </PDAT></DNUM></B110>

If I use :characters, %w{PDAT} {|text| puts text} ... I get the text
"this is the text I need" printed out. If I use B110 or any
combination of the tags instead, I cannot get it to work. Does anyone
know how to get the sax2parser/listener to parse the file and let me be
selective about what I pull out of it? Thanks for any and all help!

-Bob Angell-
(e-mail address removed)

James Britt · Jun 24, 2005

What, exactly, do you want? To extract the text from the PDAT element?

How predictable is the XML?

Are the files as small as your example?

Are regular expressions an option? Or using a DOM and XPath?

How did you decide to use the listener?

James

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys

BA · Jun 24, 2005

Yes, I want to extract the PDAT element, but I want to use the B110
tag to find it. The XML *is* predictable, although there are
variations in the placement of the elements (there could be several
different address fields and/or many paragraphs that need to be
parsed/searched). The files are *extremely* large (some could be as
large as 1-2GB). I would prefer to do all of the processing in Ruby if
possible (I want to use the OO functionality for the text processing I
have in mind) and would also like to incorporate regex if possible. I
started by parsing the file line by line, but ran into malformed XML,
at which point I decided that I needed to use the database
functionality of XML. Not sure if DOM would work. Could not get XPath
to work. The listener was, quite frankly, a SWAG. Thanks.

Bucco · Jun 24, 2005

How about something like:

require 'rexml/document'
doc = REXML::Document.new(File.open('someXMLFile.xml'))
info = doc.elements["//B110/DNUM/PDAT"].text
puts info

SA
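Since several B110 blocks were mentioned, here is a minimal sketch of the same DOM/XPath idea that collects every match rather than only the first. The helper name pdat_texts and the sample string are made up for illustration, and it assumes the whole document fits in memory, which will not hold for the 1-2GB files.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the stripped text of every PDAT element
# found under a B110/DNUM pair. Builds the full DOM tree in memory.
def pdat_texts(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//B110/DNUM/PDAT').map { |el| el.text.to_s.strip }
end

# Made-up sample with two B110 blocks.
sample = '<root>' \
         '<B110><DNUM><PDAT> first </PDAT></DNUM></B110>' \
         '<B110><DNUM><PDAT> second </PDAT></DNUM></B110>' \
         '</root>'
puts pdat_texts(sample)
```

An open File can be passed in place of the string, as in the snippet above.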

James Edward Gray II · Jun 24, 2005

How about something like:

require 'rexml/document'
doc = REXML::Document.new(File.open('someXMLFile.xml'))
info = doc.elements["//B110/DNUM/PDAT"].text
puts info

For 2 Gig files?! Good luck!

James Edward Gray II

James Britt · Jun 24, 2005

OK, I got the picture.

I would suggest the pull parser. Open up a file stream and keep pulling
events. When you get a start_element event, check the element name.
If it is B110, then loop and pull events until you reach the PDAT element.
Then pull until you get a text event.
Grab the text and store it or whatever.
Go back to the main loop, looking again for the next B110 element.

Something like this:

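#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'

include REXML::Parsers

$text = []

# Inside PDAT: grab the first text event, then return to the caller.
def pdat( parser )
  while parser.has_next?
    pull_event = parser.pull
    return if pull_event.end_element? and pull_event[0] == 'PDAT'
    if pull_event.text?
      $text.push( pull_event[0] )
      return
    end
  end
end

# Main loop: look for each B110 start tag.
def get_text( parser )
  while parser.has_next?
    pull_event = parser.pull
    b110( parser ) if pull_event.start_element? and
      pull_event[0] == 'B110'
  end
end

# Inside B110: look for PDAT, and go back to the main loop when B110 closes.
def b110( parser )
  while parser.has_next?
    pull_event = parser.pull
    return if pull_event.end_element? and pull_event[0] == 'B110'
    pdat( parser ) if pull_event.start_element? and
      pull_event[0] == 'PDAT'
  end
end

File.open( "pdat.xml", "r" ) { |f|
  parser = PullParser.new( f )
  get_text( parser )
}

puts $text.join( "\n" )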

James
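Since the original question was about the sax2parser, the same bookkeeping can probably be done there too. What follows is only a sketch under assumptions: the helper name pdat_texts_in_b110 and the sample string are made up for illustration. It keeps a stack of the currently open element names and records text only when the innermost element is PDAT and a B110 is open somewhere above it.

```ruby
require 'rexml/parsers/sax2parser'

# Hypothetical helper: streams through the source and returns the
# stripped text of every PDAT that sits inside a B110 element.
def pdat_texts_in_b110(source)
  parser = REXML::Parsers::SAX2Parser.new(source)
  stack = []   # names of the elements that are currently open
  texts = []

  parser.listen(:start_element) { |uri, localname, qname, attributes| stack.push(qname) }
  parser.listen(:end_element)   { |uri, localname, qname| stack.pop }
  parser.listen(:characters) do |text|
    # Keep the text only when we are directly inside PDAT, within a B110.
    texts << text.strip if stack.last == 'PDAT' && stack.include?('B110')
  end

  parser.parse
  texts
end

# Made-up sample: the PDAT under B120 should be ignored.
sample = '<root>' \
         '<B110><DNUM><PDAT> this is the text I need </PDAT></DNUM></B110>' \
         '<B120><DNUM><PDAT> not wanted </PDAT></DNUM></B120>' \
         '</root>'
puts pdat_texts_in_b110(sample)
```

For the real files, an open File can be passed instead of the string. For 1-2GB inputs the texts array could also be replaced by processing each string as it arrives, so nothing accumulates in memory.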

