XML to CSV with REXML - I'm sure this should be easy...

Discussion in 'Ruby' started by Sandy Thomson, Mar 17, 2009.

  1. Ok, I'm taking a fairly simple xml file containing a series of events
    and I want to convert it to csv - nothing new there.

    However, some events have two or more dates listed and I'd like to
    display each as individual lines. My ruby skills are fairly limited but
    from googling around I can extract everything up to the dates, but I'm
    banging my head against a wall to get any further...

    Here's the XML:

    <event id='1234'>
      <title>Event Title</title>
      <category>Event Category</category>
      <venue>
        <name>Venue Name</name>
        <address>
          <address1>1 Some Street</address1>
          <town>Some Town</town>
        </address>
      </venue>
      <performances>
        <performance date='2009-04-01 18:00:00' />
        <performance date='2009-04-03 18:00:00' />
      </performances>
    </event>

    This is my extraction code:

    require 'rexml/document'
    xml = REXML::Document.new(File.open("data.xml"))
    csv_file = File.new("data.csv", "w")
    xml.elements.each("//event") do |e|
      csv_file.puts e.attributes['id'] << "|" <<
        e.elements['title'].text << "|" <<
        e.elements['category'].text << "|" <<
        e.elements['venue/name'].text << "|" <<
        e.elements['venue/address/address1'].text << "|" <<
        e.elements['venue/address/town'].text
    end

    Which gives me:
    1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town

    But what I really want is:
    1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town|2009-04-01 18:00:00
    1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town|2009-04-03 18:00:00

    I'm sure this should be fairly simple but any help would be appreciated.
    Cheers!
    --
    Posted via http://www.ruby-forum.com/.
     
    Sandy Thomson, Mar 17, 2009
    #1

  2. On 17.03.2009 13:03, Sandy Thomson wrote:
    > This is my extraction code:
    >
    > require 'rexml/document'
    > xml = REXML::Document.new(File.open("data.xml"))
    > csv_file = File.new("data.csv", "w")
    > xml.elements.each("//event") do |e|
    > csv_file.puts e.attributes['id'] << "|" <<
    > e.elements['title'].text << "|" <<
    > e.elements['category'].text << "|" <<
    > e.elements['venue/name'].text << "|" <<
    > e.elements['venue/address/address1'].text << "|" <<
    > e.elements['venue/address/town'].text


    Here you need to iterate through all the "performance" elements _below
    the current event_ and concatenate the individual performance's date
    with what you have built so far.

    You should probably also take measures to emit a line without a date in
    case an event has no performances in the input XML.
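
A minimal sketch of how this advice might look, assuming the XML layout from the original post. The inline `XML_DATA` sample (including the second event with no performances) and the `rows` array are made up for illustration, and the shared fields are shortened to id and title; the real script would read data.xml and write to data.csv.

```ruby
require 'rexml/document'

# Hypothetical sample data: the second event has no performances at all.
XML_DATA = <<~XML
  <events>
    <event id='1234'>
      <title>Event Title</title>
      <performances>
        <performance date='2009-04-01 18:00:00' />
        <performance date='2009-04-03 18:00:00' />
      </performances>
    </event>
    <event id='5678'>
      <title>Other Event</title>
      <performances />
    </event>
  </events>
XML

doc = REXML::Document.new(XML_DATA)
rows = []

doc.elements.each('//event') do |e|
  # Fields shared by every output line of this event.
  base = [e.attributes['id'], e.elements['title'].text]

  # Relative path: only the performances below the current event.
  dates = []
  e.elements.each('performances/performance') { |p| dates << p.attributes['date'] }

  if dates.empty?
    rows << (base + ['']).join('|')   # no performances: still emit one line
  else
    dates.each { |d| rows << (base + [d]).join('|') }
  end
end

puts rows
```

Building each line from an array with `join` also avoids appending to a shared string, so one performance's date cannot leak into the next line.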

    > end


    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
     
    Robert Klemme, Mar 17, 2009
    #2

  3. Sandy Thomson wrote:

    > This is my extraction code:
    >
    > require 'rexml/document'
    > xml = REXML::Document.new(File.open("data.xml"))
    > csv_file = File.new("data.csv", "w")
    > xml.elements.each("//event") do |e|
    > csv_file.puts e.attributes['id'] << "|" <<
    > e.elements['title'].text << "|" <<
    > e.elements['category'].text << "|" <<
    > e.elements['venue/name'].text << "|" <<
    > e.elements['venue/address/address1'].text << "|" <<
    > e.elements['venue/address/town'].text
    > end
    >
    > Which gives me:
    > 1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town
    >
    > But what I really want is:
    > 1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town|2009-04-01 18:00:00
    > 1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town|2009-04-03 18:00:00


    What you are really interested in is each performance (each performance
    generates one line of output). So simply deepen your loop:

    xml.elements.each("//event") do |e|
    e.elements.each("//performance") do |p|

    Now do exactly what you're doing and just append the performance date.
    So, for example:

    require 'rexml/document'
    include REXML
    output = ""
    class REXML::Element
      def textof(xpaths_arr); xpaths_arr.map {|x| elements[x].text}; end
    end
    xml = Document.new(s)
    xp = %w{title category venue/name
            venue/address/address1 venue/address/town}
    xml.elements.each("//event") do |e|
      e.elements.each("//performance") do |p|
        output <<
          [e.attributes['id'],
           e.textof(xp),
           p.attributes['date']].flatten.join("|") + "\n"
      end
    end

    m.

    --
    matt neuburg, phd, http://www.tidbits.com/matt/
    Leopard - http://www.takecontrolbooks.com/leopard-customizing.html
    AppleScript - http://www.amazon.com/gp/product/0596102119
    Read TidBITS! It's free and smart. http://www.tidbits.com
     
    matt neuburg, Mar 17, 2009
    #3
  4. Robert Klemme wrote:
    >
    > Here you need to iterate through all the "performance" elements _below
    > the current event_ and concatenate the individual performance's date
    > with what you have built so far.
    >
    > You should probably also take measures to emit a line without a date in
    > case zero performances can be found in input XML.
    >
    >
    > Kind regards
    >
    > robert


    Thank you Robert, I'm getting closer...

    I modified it as below, but for some reason the dates are now stacking
    up on each other like so:
    1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town|2009-04-01 18:00:00
    1234|Event Title|Event Category|Venue Name|1 Some Street|Some Town|2009-04-01 18:00:00|2009-04-03 18:00:00

    So, I'm still missing something - any ideas?

    xml.elements.each("//event") do |e|
      detail =
        (
          e.attributes['id'] << "|" <<
          e.elements['title'].text << "|" <<
          e.elements['category'].text << "|" <<
          e.elements['venue/name'].text << "|" <<
          e.elements['venue/address/address1'].text << "|" <<
          e.elements['venue/address/town'].text
        )

      xml.elements.each("//performances/performance") do |f|
        csv_file.puts detail << "|" << f.attributes['date']
      end
    end
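
For anyone hitting the same stacking: a likely cause is that `String#<<` appends in place, so every pass through the inner loop permanently lengthens `detail`. The sketch below uses made-up strings rather than the real XML (the `stacked` and `separate` names are only for illustration) and compares `<<` with `+`.

```ruby
dates = ['2009-04-01 18:00:00', '2009-04-03 18:00:00']

# String#<< modifies its receiver, so `detail` keeps growing across iterations.
detail = +'1234|Event Title'
stacked = dates.map { |d| (detail << '|' << d).dup }

# String#+ returns a new string and leaves `detail` untouched.
detail = '1234|Event Title'
separate = dates.map { |d| detail + '|' + d }

puts stacked
puts separate
```

Swapping the `<<` in the `puts` line for `+` (or building the line with `[detail, date].join("|")`) should keep each date on its own line.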
     
    Sandy Thomson, Mar 17, 2009
    #4
  5. > xml.elements.each("//performances/performance") do |f|

    That should read e.elements.each.... but the result is the same

    - cheers for the reply matt, just looking at that now
     
    Sandy Thomson, Mar 17, 2009
    #5
  6. Ok, so this is plainly much neater, thanks Matt :)

    ...but, whilst it works for one event with multiple dates, as soon as I
    add a second event it iterates through all of the dates against every
    event, so for 2 events each with 2 dates it outputs 8 lines...

    Here's the code as it now stands:

    require 'rexml/document'
    include REXML
    output = ""
    class REXML::Element
      def textof(xpaths_arr); xpaths_arr.map {|x| elements[x].text}; end
    end
    xml = REXML::Document.new(File.open("data.xml"))
    csv_file = File.new("data.csv", "w")
    xp = %w{title category venue/name venue/address/address1
            venue/address/town}

    xml.elements.each("//event") do |e|
      e.elements.each("//performance") do |p|
        csv_file.puts output + [e.attributes['id'], e.textof(xp),
          p.attributes['date']].flatten.join("|") + "\n"
      end
    end
     
    Sandy Thomson, Mar 17, 2009
    #6
  7. Sandy Thomson wrote:

    > ..but, whilst it works for one event with multiple dates, as soon as I
    > add a second event it iterates through all of the dates against every
    > event, so for 2 events each with 2 dates it outputs 8 lines...


    Cool! :)

    > xml.elements.each("//event") do |e|
    > e.elements.each("//performance") do |p|


    Yeah, sorry about that. I wasn't thinking about the XPath here.
    Obviously "//performance" is wrong. I shoulda said
    "descendant::performance" or "performances/performance" or similar.

    Of course one could also argue that Ruby and REXML are more heavyweight
    than you need; you're just dumpster-diving in simple XML and outputting
    text, so you could write this whole thing as an XSLT template. Choices,
    choices...!
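
Coming back to the XPath fix: the difference between the two forms can be checked with a small sketch. The two-event sample data and the `lines_for` helper are made up for illustration. It runs the same nested loop twice, once with "//performance" and once with the relative "performances/performance".

```ruby
require 'rexml/document'

# Hypothetical sample: two events with two performances each.
XML_DATA = <<~XML
  <events>
    <event id='1'>
      <performances>
        <performance date='2009-04-01 18:00:00' />
        <performance date='2009-04-03 18:00:00' />
      </performances>
    </event>
    <event id='2'>
      <performances>
        <performance date='2009-05-01 18:00:00' />
        <performance date='2009-05-03 18:00:00' />
      </performances>
    </event>
  </events>
XML

doc = REXML::Document.new(XML_DATA)

# Collect "id|date" lines, using the given XPath for the inner loop.
def lines_for(doc, inner_xpath)
  lines = []
  doc.elements.each('//event') do |e|
    e.elements.each(inner_xpath) do |p|
      lines << "#{e.attributes['id']}|#{p.attributes['date']}"
    end
  end
  lines
end

absolute = lines_for(doc, '//performance')             # leading // starts from the document root
relative = lines_for(doc, 'performances/performance')  # only below the current event

puts "absolute: #{absolute.size} lines"
puts "relative: #{relative.size} lines"
```

With the relative path each event is paired only with its own dates, which gives four lines for this sample. The "//" form is the one that produced eight lines for two events with two dates each, as reported earlier in the thread.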

    m.

     
    matt neuburg, Mar 17, 2009
    #7
  8. Choices indeed, but it works perfectly now so I'll go with it :)

    Thanks!
     
    Sandy Thomson, Mar 17, 2009
    #8
