XML Parsing Speed - Ruby libxml & REXML

subimage

Hey all...

I'm working on a massive Rails site that does heavy data import daily.
A lot of this data is in XML files ranging in size from 100KB to 400MB,
totaling around 2GB across all sources. I'd like to keep the entire
project in Ruby.

At first, I wrote my parsers using REXML, but found that to be DOG
SLOW, especially for the large files. I tried REXML's parse_stream but
couldn't find any good documentation for handling parsing that way. It
was taking around 30 minutes to an hour to even _open_ the larger files
on a P4 1.8GHz test machine.

After that exercise I switched to libxml, which is a lot speedier, but
still slow (no numbers to back that up yet; I can just tell by the
speed of data inserts into my DB).

Is there some other lib out there that I'm missing? Can someone point
me in the right direction? Is there anything faster I should be using?

Are there any "gotchas" with using libxml that I should be aware of
speed-wise?

Any and all help is much appreciated...thanks!
 
Robert Klemme

subimage said:
I'm working on a massive Rails site that does heavy data import daily.
[...]
Are there any "gotchas" with using libxml that I should be aware of
speed-wise?

Since you insert data into a DB: are you absolutely positive that it's
the XML parsing part that's slow? Here's what I'd do: use two threads
connected with a bounded queue, one thread for reading XML with REXML's
stream parser and one thread for inserting into the DB. That way you
can use the CPU for parsing XML while your process waits for the DB
call to return. If possible, use bulk insertions. Alternatively, write
out a CSV file and use the DB's bulk loader to pump the data into the
DB. HTH
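
For the CSV route, a minimal sketch (the file name, column list, and
LOAD DATA syntax are assumptions - adjust for your schema and DB):

require 'csv'

# Write one row per parsed record; `records` stands in for whatever
# your parser yields.
CSV.open("products.csv", "w") do |csv|
  records.each do |rec|
    csv << [rec[:unique_id], rec[:name], rec[:price]]
  end
end

# Then bulk-load from the mysql client, e.g.:
#   LOAD DATA INFILE '/path/to/products.csv' INTO TABLE products
#   FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
#   (data_unique_id, name, price);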

Kind regards

robert
 
subimage

Robert, thanks for the response...

I definitely _know_ it's the XML parsing that's slow. As mentioned,
even opening the file with REXML or libxml takes some time, and finding
all of my nodes (and the nodes within them) takes even longer. Could it
be because I'm using doc.root.element.find("path") inside of my loop?
Anyone know a better way to go about grabbing specific nodes within a
document using libxml?

Insertion into the DB is simple and quick, although your idea of a
bounded queue with two threads is interesting. I'll have to look into
that (do you have any example code I might start from?)

Also - I was unable to get stream parsing working properly for REXML so
I just gave up and moved to libxml. Do you have any resources on REXML
stream parsing you can share? A tutorial or reference? Anything would
be helpful.

Everything I've read online says libxml is much faster than REXML, so I
thought I made the best choice available.
 
Robert Klemme

subimage said:
Robert, thanks for the response...

I definitely _know_ it's the XML parsing that's slow. As mentioned,
even opening the file with REXML or libxml takes some time, and finding
all of my nodes (and the nodes within them) takes even longer. Could it
be because I'm using doc.root.element.find("path") inside of my loop?

We would have to see the code. Normally you would use find once at the
top level, with an XPath expression that selects all the nodes you
need. At the moment I'm not sure whether the problem is in the XML libs
or in the way you use them.
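
I haven't used libxml, so treat the API details here as guesses, but
the shape I mean is something like this:

require 'xml/libxml'

doc = XML::Document.file("feed.xml")   # placeholder file name

# One XPath up front selects every record node; the lookups inside
# the loop are then relative to that node instead of hitting the
# whole document again. ("product" and "price/sale" are made up.)
doc.root.find("//product").each do |product|
  id   = product["product_id"]
  sale = product.find("price/sale").to_a[0]
  # ... build your record from id, sale, etc.
end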

subimage said:
Anyone know a better way to go about grabbing specific nodes within a
document using libxml?

Insertion into the DB is simple and quick, although your idea of a
bounded queue with two threads is interesting. I'll have to look into
that (do you have any example code I might start from?)

Not handy. But it's fairly simple: you create the queue (a SizedQueue
from the standard thread library) and then create two threads, one for
reading and one for writing:

require 'thread'
Q = SizedQueue.new 100

Thread.new do
  # open file
  # read XML
  # loop over records:
  Q.enq "something"
  # end loop
  # close file
  # signal finish by enqueueing the queue itself as a sentinel:
  Q.enq Q
end

# open DB
until Q == (task = Q.deq)
  # insert task into DB
end
# commit TX

subimage said:
Also - I was unable to get stream parsing working properly for REXML so
I just gave up and moved to libxml. Do you have any resources on REXML
stream parsing you can share? A tutorial or reference? Anything would
be helpful.

http://www.germane-software.com/software/rexml/doc/classes/REXML/StreamListener.html

If you want to see what happens, you can use this class as the callback:

class Dummy
  def method_missing(s, *a, &b)
    print s, " ", a.inspect, b, "\n"
  end
end

You'll see on the console which methods are called with which arguments.
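
Wire it up like this (the file name is made up):

require 'rexml/document'

# Every parse event is funneled through Dummy#method_missing
REXML::Document.parse_stream(File.new("data.xml"), Dummy.new)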

subimage said:
Everything I've read online says libxml is much faster than REXML, so I
thought I made the best choice available.

I've never used libxml myself. I'd rather start with REXML because it
usually comes preinstalled. If documents are large, I'd use the stream
API.

Kind regards

robert
 
rcoder

libxml will definitely be faster, but with either parser you'll want to
avoid loading the entire file into RAM -- even using the fastest C++ or
Java parsers, wrapping every bit of the XML tree structure in object
instances is going to involve a huge amount of overhead. Using XPath to
traverse the entire document tree will further slow things, as most
XPath implementations (including incomplete ones like REXML's) are
horribly inefficient.

Keep in mind that *every* object value in Ruby uses something like 12
bytes of RAM, so your 400MB XML document probably ends up with a
footprint larger than your system RAM and hits swap, at which point
nothing can save you from, as you put it, "dog-slow" performance.

Can you be a little more specific about the problems you had when you
were "unable to get stream parsing working"? Event-driven parsing can
be somewhat more complex to implement, but especially with large
datasets it offers *huge* performance gains, because it avoids the
memory footprint issues I mentioned above.
 
subimage

I guess that's where I'm going wrong - loading everything into RAM.

RE: Stream parsing

I didn't even know where to begin to get it working. It was very
confusing for me, so I just stuck with what worked... I guess I'm more
of a "learn from existing code or a tutorial" kind of person.

If this is the way to go I guess I need to spend some more time working
on my parser. For reference, here's my entire parse method using libxml
as it's currently working...

def parse_files
  files = Dir['*.xml']
  # Loop through all files
  for file in files
    puts "Parsing #{file}"
    # Open XML file
    doc = XML::Document.file(file)
    puts "...file opened"
    doc_root = doc.root
    # Get merchant information
    merchant_name = doc_root.find("//header/merchantName").to_a[0].to_s
    puts "Merchant: #{merchant_name}"
    # Loop through each product in the document
    puts "...finding products"
    doc_root.find("product").each do |product|
      unique_id = product['product_id']

      # Find by unique product id and data source id
      p = Product.find(:first,
        :conditions => ["data_unique_id = ? AND data_source_id = ?",
                        unique_id, DATA_SOURCE_ID])
      # If we didn't find a product that matches, create a new one
      p = Product.new if !p

      price = product.find("price/sale").to_a[0]

      # Set all object properties
      p.data_unique_id = unique_id
      p.name = product['name']
      p.data_source_id = DATA_SOURCE_ID
      p.link_url = product.find("URL/product").to_a[0].to_s
      p.image_url = product.find("URL/productImage").to_a[0].to_s
      p.short_description = product.find("description/short").to_a[0].to_s
      p.long_description = product.find("description/long").to_a[0].to_s
      p.price = price.content
      p.msrp = product.find("//price/retail").to_a[0].to_s
      p.start_date = price['begin_date']
      p.end_date = price['end_date']
      # Make sure dates are null if we get nothing
      p.start_date = nil if p.start_date.blank?
      p.end_date = nil if p.end_date.blank?

      puts p.name

      # Set the merchant up;
      # create a new merchant if this one doesn't exist
      merchant = Merchant.find_or_create_by_name(merchant_name)
      p.merchant_id = merchant.id

      puts p.inspect

      begin
        p.save!
      rescue ActiveRecord::RecordInvalid => err
        puts "!!!ERROR - #{err} : #{p.errors.full_messages}"
        puts p.inspect
        next
      end

      # Add categories to the product
      category_1 = product.find("category/primary").to_a[0].to_s
      p.add_category_by_name(category_1.strip) if !category_1.blank?
      category_2 = product.find("category/secondary").to_a[0].to_s
      p.add_category_by_name(category_2.strip) if !category_2.blank?

      puts category_1
      puts category_2
    end # end each product
  end
end
 
subimage

WHOAH!

Ok so I finally dug into the stream parser and this is lightning fast!

Thanks everyone for the advice...this is really sweet.

PS: I learned a lot from the tutorial available here:

http://www.rubyxml.com/articles/REXML/Stream_Parsing_with_REXML

I wrote a BasicStreamListener that throws each item into a hash,
complete with pseudo-XPaths...
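
Here's the basic shape of it, trimmed down (the element name "product"
and the file name are just examples; adapt them to your own feed):

require 'rexml/document'
require 'rexml/streamlistener'

# Tracks the current element path and collects each record's text
# into a hash keyed by a pseudo-XPath like "products/product/price".
class BasicStreamListener
  include REXML::StreamListener

  attr_reader :items

  def initialize
    @path    = []
    @items   = []
    @current = nil
  end

  def tag_start(name, attrs)
    @path << name
    @current = { "@attrs" => attrs } if name == "product"
  end

  def text(data)
    return unless @current
    @current[@path.join("/")] = data unless data.strip.empty?
  end

  def tag_end(name)
    if name == "product" && @current
      @items << @current
      @current = nil
    end
    @path.pop
  end
end

listener = BasicStreamListener.new
REXML::Document.parse_stream(File.new("feed.xml"), listener)
# listener.items is now an array of hashes, one per product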

Let me know if anyone would be interested in the full version, or a
tutorial. I might write something up for my blog as well, since there
doesn't seem to be a wealth of information out there on the subject.
 
Kenosis

The book "Enterprise Integration with Ruby" has some information on the
use of REXML and stream parsing, as I recall (I don't have my copy at
hand at the moment).

Ken
 
Adam Sanderson

Yeah, I ran into similar problems earlier using XPath. Streaming the
XML and plucking out what you need is a little more complicated, but it
is more efficient on three counts:

1) It will probably consume less memory, since you will likely only
store a small subset of the data.
2) You either don't need to build a full DOM tree, or you can build a
very lightweight one.
3) Parsing and executing XPath expressions takes some time; if you're
doing a ton of them it can have a noticeable effect.

It might be best to test it out using XPath and such first; if that
works, keep it, but if not, you can always fall back on building a
stream parser. Wish I saw your post earlier ;)

.adam
 
Mathieu Blondel

For large files, stream parsers are faster and have a smaller memory
footprint.

A few months ago, I also tested the expat bindings for Ruby, which
turned out to be up to 20 times faster than the stream parser provided
by REXML.
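
If you want numbers for your own data, a rough harness like this will
do (it assumes libxml is installed as xml/libxml; the file name is a
placeholder):

require 'benchmark'
require 'rexml/document'
require 'rexml/streamlistener'
require 'xml/libxml'

# A listener that does nothing, so we time the parsing alone
class NullListener
  include REXML::StreamListener
end

file = "feed.xml"

Benchmark.bm(8) do |bm|
  bm.report("rexml")  { REXML::Document.parse_stream(File.new(file), NullListener.new) }
  bm.report("libxml") { XML::Document.file(file) }
end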
