REXML libraries and parsing issues

Discussion in 'Ruby' started by BA, Jun 24, 2005.

  1. BA

    BA Guest

    First off, let me say right up front that I am a newbie wrt Ruby.

    I am trying to parse an XML file but am having all kinds of
    trouble. I am using the REXML libraries and the sax2parser/listener.
    In the sax2listener I can use the character/text part of the method,
    but I cannot for the life of me figure out how to parse out JUST
    WHAT I WANT. Here is what the file looks like:

    <B110><DNUM><PDAT> this is the text I need </PDAT></DNUM></B110>

    If I use :character, %w{PDAT} {|text| puts text} ... I get the text
    "this is the text I need" printed out. If I use the B110 or any
    combination, I cannot get it to work. Anyone know how to get the
    sax2parser/listener to parse the file and allow me to be selective
    about what I parse out of the file? Thanks for any/all help in this
    endeavor!!!!!!!!!!

    -Bob Angell-
     
    BA, Jun 24, 2005
    #1

  2. BA

    James Britt Guest

    What, exactly, do you want? To extract the text from the PDAT element?

    How predictable is the XML?

    Are the files as small as your example?

    Are regular expressions an option? Or using a DOM and XPath?

    How did you decide to use the listener?

    James

    --

    http://www.ruby-doc.org - The Ruby Documentation Site
    http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
    http://www.rubystuff.com - The Ruby Store for Ruby Stuff
    http://www.jamesbritt.com - Playing with Better Toys
     
    James Britt, Jun 24, 2005
    #2

  3. BA

    BA Guest

    Yes, I want to extract the PDAT element, but I want to use the
    B110 tag to find it. The XML *is* predictable, although
    there are variations in the placement of the elements (there could be
    several different address fields and/or many paragraphs that need to be
    parsed/searched). The files are *extremely* large (some could be as
    large as 1-2GB). I would prefer to do all of the processing in Ruby if
    possible (I want to use the OO functionality for the text
    processing I have in mind) and would also like to incorporate regex if
    possible. I started by parsing the file line by line, but
    ran into malformed XML, at which point I decided that I needed to use the
    database functionality of XML. Not sure if DOM would work. Could not
    get XPath to work. The listener was, quite frankly, a SWAG. Thanks.
     
    BA, Jun 24, 2005
    #3
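
On the SAX2 question itself: as far as I can tell, the element filter passed to listen is checked against the innermost open element, which would explain why %w{PDAT} fires for the text while B110 does not. One way to stay selective on large files is to track the open elements yourself. What follows is only a minimal sketch under that assumption; texts_under, the target path and the sample string are made up for illustration and are not part of REXML.

```ruby
require 'rexml/parsers/sax2parser'

# Hypothetical helper (not part of REXML): collect the text of every
# element whose chain of open tags ends with +target+,
# e.g. %w[B110 DNUM PDAT].
def texts_under(source, target)
  parser  = REXML::Parsers::SAX2Parser.new(source)
  stack   = []   # names of the currently open elements
  results = []
  buffer  = nil  # non-nil only while inside a matching element

  parser.listen(:start_element) do |_uri, _local, qname, _attrs|
    stack.push(qname)
    buffer = String.new if stack.last(target.size) == target
  end

  parser.listen(:characters) do |text|
    buffer << text if buffer
  end

  parser.listen(:end_element) do |_uri, _local, _qname|
    # Flush only when the matching element itself is closing.
    if buffer && stack.last(target.size) == target
      results << buffer.strip
      buffer = nil
    end
    stack.pop
  end

  parser.parse
  results
end

# Made-up sample data; the second PDAT sits under B120 and is skipped.
sample = '<root><B110><DNUM><PDAT> this is the text I need </PDAT></DNUM></B110>' \
         '<B120><DNUM><PDAT> not this </PDAT></DNUM></B120></root>'
p texts_under(sample, %w[B110 DNUM PDAT])
```

For a real file, an open File object can be passed instead of the string, so the document is streamed rather than loaded whole. Collecting results in an array is a simplification; for very large inputs it may be better to process each match as it is found.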
  4. BA

    Bucco Guest

    How about something like:

    require 'rexml/document'
    doc = REXML::Document.new(File.open('someXMLFile.xml'))
    info = doc.elements["//B110/DNUM/PDAT"].text
    puts info

    SA :)
     
    Bucco, Jun 24, 2005
    #4
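
The snippet above returns only the first match. Since the files may hold several such elements, here is a small sketch that collects all of them with REXML::XPath.match. The sample XML is invented for illustration, and like any DOM approach it loads the whole document into memory, so it only suits small files.

```ruby
require 'rexml/document'

# Made-up sample with two B110 records.
xml = <<~XML
  <root>
    <B110><DNUM><PDAT> first </PDAT></DNUM></B110>
    <B110><DNUM><PDAT> second </PDAT></DNUM></B110>
  </root>
XML

doc = REXML::Document.new(xml)

# XPath.match returns every node matching the path, in document order.
texts = REXML::XPath.match(doc, '//B110/DNUM/PDAT').map { |el| el.text.strip }
puts texts
```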
  5. For 2 Gig files?! Good luck!

    James Edward Gray II
     
    James Edward Gray II, Jun 24, 2005
    #5
  6. BA

    James Britt Guest

    OK, I got the picture.

    I would suggest the pull parser. Open up a file stream and keep pulling
    events. When you get a start_element event, check the element name.
    If it is B110, then, loop and pull events until the PDAT element.
    Then pull until text event.
    Grab text and store it or whatever.
    Go back to main loop, looking again for that B110 element.


    Something like this:

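    #!/usr/bin/env ruby
    require 'rexml/parsers/pullparser'

    include REXML::Parsers

    $text = []

    # Inside PDAT: grab the first text event, then return to the caller.
    def pdat( parser )
      while parser.has_next?
        pull_event = parser.pull
        return if pull_event.end_element?   # empty PDAT, nothing to grab
        if pull_event.text?
          $text.push( pull_event[0] )
          return
        end
      end
    end

    # Main loop: look for each B110 start tag.
    def get_text( parser )
      while parser.has_next?
        pull_event = parser.pull
        b110( parser ) if pull_event.start_element? and
                          pull_event[0] == 'B110'
      end
    end

    # Inside B110: handle every PDAT until B110 closes, then return.
    def b110( parser )
      while parser.has_next?
        pull_event = parser.pull
        return if pull_event.end_element? and pull_event[0] == 'B110'
        pdat( parser ) if pull_event.start_element? and
                          pull_event[0] == 'PDAT'
      end
    end

    File.open( "pdat.xml", "r") { |f|
      parser = PullParser.new( f )
      get_text( parser )
    }

    puts $text.join( "\n" )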




    James

    --

    http://www.ruby-doc.org - The Ruby Documentation Site
    http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
    http://www.rubystuff.com - The Ruby Store for Ruby Stuff
    http://www.jamesbritt.com - Playing with Better Toys
     
    James Britt, Jun 24, 2005
    #6