Efficient parsing of large Excel documents in Ruby

W

Wes Gamble

All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it's data into a database (in a Rails app.).

When I run it against an Excel file with ~42000 rows and 11 columns (the
spreadsheet is about 10MB in size), it takes the better part of 10
minutes to parse the file and uses about 200 MB of memory. I'm not
actually sure if the parsing completes successfully - the app. stops
right after the parsing appears to finish.

It would seem that I have two issues - memory usage and performance.

Is anyone aware of a much faster way to parse large Excel files?

Is anyone aware of a "windowing" scheme that uses a set amount of memory
to transfer portions of the data to a client requesting the data so as
to keep the memory required at a fixed level?

Should I look into using the Win32OLE libraries with the hope that it
would be more efficient?

I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Any thoughts or advice is appreciated.

Thanks,
Wes Gamble
 
S

Sam Smoot

I agree with Christian, but with a twist.

I like to use Excel Automation (w/ WIN32OLE) to convert the document to
a format that's easier to work with. That way, after the initial
conversion, you can handle CSV, XML, etc, and be working with a fast
native library instead of the slow interop.

Another reason I prefer this method is that Excel interop is sometimes
hard to get "just right". There's processes I've written with it
running for over a year that I've never had to touch, but there's
another that used to require I login to the server and kill orphaned
Excel processes. Someone changes a column, you don't code defensively
enough, and BAM!

So the simplicity of the Load/Convert process, and getting out of Excel
as quickly as possible, conserving resources and avoiding potential
bugs that might result in orphaned processes is a big advantage IMO.

Plus it's almost gauranteed to be the fastest, and simplest to develop.
(As long as you're OK with having to run on Windows and having Excel
installed anyways.)
 
W

Wes Gamble

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I'm currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes
 
M

M. Edward (Ed) Borasky

Wes said:
I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I'm currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It's fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple "server" for them. I think all you have to do is
"publish them to the intranet", but I've never done it. Then you can use
something like Hpricot to parse them.
 
M

M. Edward (Ed) Borasky

Wes said:
I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Yes, there is a native way of doing this in VBA from a macro, which you
can store either in your personal macro workbook on in the spreadsheets
themselves. I have a colleague where I work that built an application
that does it.

But unfortunately, I have no idea what the technique is called. Almost
anyone who's been through some advanced Microsoft training ought to be
able to help you find it, or you might search the Microsoft knowledge
base for "efficient transfer of data to SQL Server".

If you don't have an answer by Tuesday, let me know and I'll ask her
what it's called.

And no, I don't think it involves installing CygWin. :)
 
W

Wes Gamble

Ed,

Not a bad idea. I'm already using unixODBC to connect to SQL Server
from the Linux box. I totally forgot that you can get to Excel via
ODBC.

Don't I need an ODBC driver for Excel though?

What about a DBI driver for Excel - does one exist?

Thanks,
Wes
 
C

Charles Oliver Nutter

Wes said:
All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it's data into a database (in a Rails app.).

Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache's Office-document library.
 
W

Wes Gamble

Charles said:
Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache's Office-document library.

Charles,

I would love to use JRuby for this app., but don't feel it's quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
"Implement support for reading Excel files (easy)" under the list of
TODOs so not sure if I could use POI-Ruby to read...

Anyone have any experience with POI-Ruby?

Wes
 
W

Wes Gamble

M. Edward (Ed) Borasky said:
Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It's fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple "server" for them. I think all you have to do is
"publish them to the intranet", but I've never done it. Then you can use
something like Hpricot to parse them.

Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the "server" you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes
 
W

Wes Gamble

Wes said:
Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the "server" you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

Ah, are you saying that I can "publish" the Excel file into HTML through
some IIS thing and then just parse that with Hpricot? That's an
interesting solution.

Wes
 
C

Charles Oliver Nutter

Wes said:
I would love to use JRuby for this app., but don't feel it's quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
"Implement support for reading Excel files (easy)" under the list of
TODOs so not sure if I could use POI-Ruby to read...

Anyone have any experience with POI-Ruby?

We appreciate the vote of confidence, and you're probably right about
JRuby not quite being production ready ("production ready" in the sense
of having something that won't require a bit of tinkering...there are
people using it for production apps, but they're comfortable on the
frontier :)

Don't forget the Java bridge options...from what I hear they work
reasonably well for neatly-encapsulated libraries like this.

RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
active project, though YAJB (Yet Another Java Bridge) apparently works too

And there's RubyCLR, which can hook into the CLR and any capacity it has
for working with Excel.
 
W

Wes Gamble

Charles said:
RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
active project, though YAJB (Yet Another Java Bridge) apparently works
too

I'm looking at RJB and it appears that it only loads the JVM into memory
once until it is explicitly unloaded.

Can anyone corroborate this (or correct me on it)?

Thanks,
Wes
 
S

Sam Smoot

Charles said:
And there's RubyCLR, which can hook into the CLR and any capacity it has
for working with Excel.

Read: None. :) You'd just reference the Excel COM object in .NET.
Using WIN32OLE is actually simpler than c# since you have access to the
VB-ish interfaces too (such as Range#Value).
 
W

Wes Gamble

Some more info:

Recently, I was confronted with a task in one of the apps. I'm building
that would allow the parsing of data in an Excel spreadsheet where the
number of rows could be on the order of 30000/40000/50000 or higher.

Originally, I was using the parseexcel gem to handle the parsing -
however, it proved to be fairly slow and consumed a lot of memory. When
I presented it with a > 42000 row spreadsheet, it basically cratered.
So I had to figure out another way to handle this problem. Someone
mentioned that there was a nice open source Java - based Excel parser
called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
Java test showed that the performance and memory footprint would be much
much better.

In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
still had problems implementing that (and I didn't want to run this app.
on it yet since it's still so young), so I took a look at some of the
Java - Ruby bridges. I gave one called Rjb
(http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
was very pleasantly surprised - it was really easy to use this to
integrate with the JExcelAPI.

If I understand correctly, Rjb uses JNI to start, and then interact with
an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
basically embed a JVM in your Ruby interpreter and then load classes
into it and start using them. Basic type casting to/from Java types is
done for you. The documentation is terrible but there's just enough of
it to get you started.

Here's what I did:

0) Get the Rjb gem using "gem install rjb"
1) Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
directory.
2) Start the JVM using Rjb::load("#{RAILS_ROOT}/lib/jxl.jar",
['-Xms256M', '-Xmx512M']) - the array is a set of parameters to send to
the JVM for startup.
3) Load classes using Rjb::import(classname)

Here's an example of using it in my app.:

file_class = Rjb::import('java.io.File')
workbook_class = Rjb::import('jxl.Workbook')
workbook = workbook_class.getWorkbook(file_class.new(filename))

Some things to notice:
* filename is a Ruby string - that's being passed to the File.new() Java
method.
* The return of the call to file_class.new is a wrapped Java File object
and can be immediately passed to the getWorkbook method.
* workbook is a Java object that can then be used in other parts of the
app.

The good Once you get past loading a class and/or instantiating an
object, doing method calls is as simple as just calling the methods on
the Java objects you've instantiated or received from other method
calls.

The bad This is so seamless, it would be very easy to forget that
some of the objects that you're dealing with are effectively Java
objects, and then you might forget how to use them correctly.

For production for this app, I may need to change approaches since I'll
prob. be running multiple Mongrel processes and I don't know if I want
to have one embedded JVM per process (if I understand Mongrel deployment
correctly - currently I'm doing Apache/FastCGI so I know it's a problem
there). That may force using DRb in a separate process to host this
Excel parser component and allow it to be used from anywhere (if that
happens - could also do a Web service-y thing on top of a JRuby process
or whatever).

Hope this is useful for someone.

Wes
 
M

M. Edward (Ed) Borasky

Wes said:
Some more info:

Recently, I was confronted with a task in one of the apps. I'm building
that would allow the parsing of data in an Excel spreadsheet where the
number of rows could be on the order of 30000/40000/50000 or higher.

Originally, I was using the parseexcel gem to handle the parsing -
however, it proved to be fairly slow and consumed a lot of memory. When
I presented it with a > 42000 row spreadsheet, it basically cratered.
So I had to figure out another way to handle this problem. Someone
mentioned that there was a nice open source Java - based Excel parser
called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
Java test showed that the performance and memory footprint would be much
much better.

In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
still had problems implementing that (and I didn't want to run this app.
on it yet since it's still so young), so I took a look at some of the
Java - Ruby bridges. I gave one called Rjb
(http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
was very pleasantly surprised - it was really easy to use this to
integrate with the JExcelAPI.

If I understand correctly, Rjb uses JNI to start, and then interact with
an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
basically embed a JVM in your Ruby interpreter and then load classes
into it and start using them. Basic type casting to/from Java types is
done for you. The documentation is terrible but there's just enough of
it to get you started.

Here's what I did:

0) Get the Rjb gem using "gem install rjb"
1) Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
directory.
2) Start the JVM using Rjb::load("#{RAILS_ROOT}/lib/jxl.jar",
['-Xms256M', '-Xmx512M']) - the array is a set of parameters to send to
the JVM for startup.
3) Load classes using Rjb::import(classname)

Here's an example of using it in my app.:

file_class = Rjb::import('java.io.File')
workbook_class = Rjb::import('jxl.Workbook')
workbook = workbook_class.getWorkbook(file_class.new(filename))

Some things to notice:
* filename is a Ruby string - that's being passed to the File.new() Java
method.
* The return of the call to file_class.new is a wrapped Java File object
and can be immediately passed to the getWorkbook method.
* workbook is a Java object that can then be used in other parts of the
app.

The good Once you get past loading a class and/or instantiating an
object, doing method calls is as simple as just calling the methods on
the Java objects you've instantiated or received from other method
calls.

The bad This is so seamless, it would be very easy to forget that
some of the objects that you're dealing with are effectively Java
objects, and then you might forget how to use them correctly.

For production for this app, I may need to change approaches since I'll
prob. be running multiple Mongrel processes and I don't know if I want
to have one embedded JVM per process (if I understand Mongrel deployment
correctly - currently I'm doing Apache/FastCGI so I know it's a problem
there). That may force using DRb in a separate process to host this
Excel parser component and allow it to be used from anywhere (if that
happens - could also do a Web service-y thing on top of a JRuby process
or whatever).

Hope this is useful for someone.

Wes

That seems to me to be an awfully roundabout way of doing things. I
think you could hack something up with ODBC to just take your whole
spreadsheet and upload it to a database, then use Rails or DBI to read
the database.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top