Efficient parsing of large Excel documents in Ruby

Wes Gamble · Oct 28, 2006

All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it's data into a database (in a Rails app.).

When I run it against an Excel file with ~42000 rows and 11 columns (the
spreadsheet is about 10MB in size), it takes the better part of 10
minutes to parse the file and uses about 200 MB of memory. I'm not
actually sure if the parsing completes successfully - the app. stops
right after the parsing appears to finish.

It would seem that I have two issues - memory usage and performance.

Is anyone aware of a much faster way to parse large Excel files?

Is anyone aware of a "windowing" scheme that uses a set amount of memory
to transfer portions of the data to a client requesting the data so as
to keep the memory required at a fixed level?

Should I look into using the Win32OLE libraries with the hope that it
would be more efficient?

I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Any thoughts or advice is appreciated.

Thanks,
Wes Gamble

matt neuburg · Oct 28, 2006

Christian Madsen said:
See this link: http://wiki.rubygarden.org/Ruby/page/show/ScriptingExcel

Interesting. Just as a note, everything that happens on that page can
now be done on Mac using almost the same language, thanks to
rb-appscript (or RubyOSA, which I have not yet tried). It would be
interesting to write a Mac version of the page. m.

Sam Smoot · Oct 28, 2006

I agree with Christian, but with a twist.

I like to use Excel Automation (w/ WIN32OLE) to convert the document to
a format that's easier to work with. That way, after the initial
conversion, you can handle CSV, XML, etc, and be working with a fast
native library instead of the slow interop.

Another reason I prefer this method is that Excel interop is sometimes
hard to get "just right". There's processes I've written with it
running for over a year that I've never had to touch, but there's
another that used to require I login to the server and kill orphaned
Excel processes. Someone changes a column, you don't code defensively
enough, and BAM!

So the simplicity of the Load/Convert process, and getting out of Excel
as quickly as possible, conserving resources and avoiding potential
bugs that might result in orphaned processes is a big advantage IMO.

Plus it's almost gauranteed to be the fastest, and simplest to develop.
(As long as you're OK with having to run on Windows and having Excel
installed anyways.)

Wes Gamble · Oct 28, 2006

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I'm currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

M. Edward (Ed) Borasky · Oct 28, 2006

Wes said:
I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I'm currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It's fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple "server" for them. I think all you have to do is
"publish them to the intranet", but I've never done it. Then you can use
something like Hpricot to parse them.

M. Edward (Ed) Borasky · Oct 28, 2006

Wes said:
I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Yes, there is a native way of doing this in VBA from a macro, which you
can store either in your personal macro workbook on in the spreadsheets
themselves. I have a colleague where I work that built an application
that does it.

But unfortunately, I have no idea what the technique is called. Almost
anyone who's been through some advanced Microsoft training ought to be
able to help you find it, or you might search the Microsoft knowledge
base for "efficient transfer of data to SQL Server".

If you don't have an answer by Tuesday, let me know and I'll ask her
what it's called.

And no, I don't think it involves installing CygWin.

Wes Gamble · Oct 28, 2006

Ed,

Not a bad idea. I'm already using unixODBC to connect to SQL Server
from the Linux box. I totally forgot that you can get to Excel via
ODBC.

Don't I need an ODBC driver for Excel though?

What about a DBI driver for Excel - does one exist?

Thanks,
Wes

Charles Oliver Nutter · Oct 28, 2006

Wes said:
All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it's data into a database (in a Rails app.).

Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache's Office-document library.

Wes Gamble · Oct 29, 2006

Charles said:
Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache's Office-document library.

Charles,

I would love to use JRuby for this app., but don't feel it's quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
"Implement support for reading Excel files (easy)" under the list of
TODOs so not sure if I could use POI-Ruby to read...

Anyone have any experience with POI-Ruby?

Wes

Wes Gamble · Oct 29, 2006

M. Edward (Ed) Borasky said:
Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It's fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple "server" for them. I think all you have to do is
"publish them to the intranet", but I've never done it. Then you can use
something like Hpricot to parse them.

Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the "server" you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

Wes Gamble · Oct 29, 2006

Wes said:
Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the "server" you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

Ah, are you saying that I can "publish" the Excel file into HTML through
some IIS thing and then just parse that with Hpricot? That's an
interesting solution.

Wes

Charles Oliver Nutter · Oct 29, 2006

Wes said:
I would love to use JRuby for this app., but don't feel it's quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
"Implement support for reading Excel files (easy)" under the list of
TODOs so not sure if I could use POI-Ruby to read...

Anyone have any experience with POI-Ruby?

We appreciate the vote of confidence, and you're probably right about
JRuby not quite being production ready ("production ready" in the sense
of having something that won't require a bit of tinkering...there are
people using it for production apps, but they're comfortable on the
frontier

Don't forget the Java bridge options...from what I hear they work
reasonably well for neatly-encapsulated libraries like this.

RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
active project, though YAJB (Yet Another Java Bridge) apparently works too

And there's RubyCLR, which can hook into the CLR and any capacity it has
for working with Excel.

Wes Gamble · Oct 30, 2006

Charles said:
RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
active project, though YAJB (Yet Another Java Bridge) apparently works
too

I'm looking at RJB and it appears that it only loads the JVM into memory
once until it is explicitly unloaded.

Can anyone corroborate this (or correct me on it)?

Thanks,
Wes

Sam Smoot · Oct 30, 2006

Charles said:
And there's RubyCLR, which can hook into the CLR and any capacity it has
for working with Excel.

Read: None.

You'd just reference the Excel COM object in .NET.
Using WIN32OLE is actually simpler than c# since you have access to the
VB-ish interfaces too (such as Range#Value).

Wes Gamble · Nov 2, 2006

Some more info:

Recently, I was confronted with a task in one of the apps. I'm building
that would allow the parsing of data in an Excel spreadsheet where the
number of rows could be on the order of 30000/40000/50000 or higher.

Originally, I was using the parseexcel gem to handle the parsing -
however, it proved to be fairly slow and consumed a lot of memory. When
I presented it with a > 42000 row spreadsheet, it basically cratered.
So I had to figure out another way to handle this problem. Someone
mentioned that there was a nice open source Java - based Excel parser
called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
Java test showed that the performance and memory footprint would be much
much better.

In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
still had problems implementing that (and I didn't want to run this app.
on it yet since it's still so young), so I took a look at some of the
Java - Ruby bridges. I gave one called Rjb
(http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
was very pleasantly surprised - it was really easy to use this to
integrate with the JExcelAPI.

If I understand correctly, Rjb uses JNI to start, and then interact with
an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
basically embed a JVM in your Ruby interpreter and then load classes
into it and start using them. Basic type casting to/from Java types is
done for you. The documentation is terrible but there's just enough of
it to get you started.

Here's what I did:

0) Get the Rjb gem using "gem install rjb"
1) Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
directory.
2) Start the JVM using Rjb::load("#{RAILS_ROOT}/lib/jxl.jar",
['-Xms256M', '-Xmx512M']) - the array is a set of parameters to send to
the JVM for startup.
3) Load classes using Rjb::import(classname)

Here's an example of using it in my app.:

file_class = Rjb::import('java.io.File')
workbook_class = Rjb::import('jxl.Workbook')
workbook = workbook_class.getWorkbook(file_class.new(filename))

Some things to notice:
* filename is a Ruby string - that's being passed to the File.new() Java
method.
* The return of the call to file_class.new is a wrapped Java File object
and can be immediately passed to the getWorkbook method.
* workbook is a Java object that can then be used in other parts of the
app.

The good Once you get past loading a class and/or instantiating an
object, doing method calls is as simple as just calling the methods on
the Java objects you've instantiated or received from other method
calls.

The bad This is so seamless, it would be very easy to forget that
some of the objects that you're dealing with are effectively Java
objects, and then you might forget how to use them correctly.

For production for this app, I may need to change approaches since I'll
prob. be running multiple Mongrel processes and I don't know if I want
to have one embedded JVM per process (if I understand Mongrel deployment
correctly - currently I'm doing Apache/FastCGI so I know it's a problem
there). That may force using DRb in a separate process to host this
Excel parser component and allow it to be used from anywhere (if that
happens - could also do a Web service-y thing on top of a JRuby process
or whatever).

Hope this is useful for someone.

Wes

M. Edward (Ed) Borasky · Nov 2, 2006

Wes said:
Some more info:

Recently, I was confronted with a task in one of the apps. I'm building
that would allow the parsing of data in an Excel spreadsheet where the
number of rows could be on the order of 30000/40000/50000 or higher.

Originally, I was using the parseexcel gem to handle the parsing -
however, it proved to be fairly slow and consumed a lot of memory. When
I presented it with a > 42000 row spreadsheet, it basically cratered.
So I had to figure out another way to handle this problem. Someone
mentioned that there was a nice open source Java - based Excel parser
called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
Java test showed that the performance and memory footprint would be much
much better.

In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
still had problems implementing that (and I didn't want to run this app.
on it yet since it's still so young), so I took a look at some of the
Java - Ruby bridges. I gave one called Rjb
(http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
was very pleasantly surprised - it was really easy to use this to
integrate with the JExcelAPI.

If I understand correctly, Rjb uses JNI to start, and then interact with
an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
basically embed a JVM in your Ruby interpreter and then load classes
into it and start using them. Basic type casting to/from Java types is
done for you. The documentation is terrible but there's just enough of
it to get you started.

Here's what I did:

0) Get the Rjb gem using "gem install rjb"
1) Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
directory.
2) Start the JVM using Rjb::load("#{RAILS_ROOT}/lib/jxl.jar",
['-Xms256M', '-Xmx512M']) - the array is a set of parameters to send to
the JVM for startup.
3) Load classes using Rjb::import(classname)

Here's an example of using it in my app.:

file_class = Rjb::import('java.io.File')
workbook_class = Rjb::import('jxl.Workbook')
workbook = workbook_class.getWorkbook(file_class.new(filename))

Some things to notice:
* filename is a Ruby string - that's being passed to the File.new() Java
method.
* The return of the call to file_class.new is a wrapped Java File object
and can be immediately passed to the getWorkbook method.
* workbook is a Java object that can then be used in other parts of the
app.

The good Once you get past loading a class and/or instantiating an
object, doing method calls is as simple as just calling the methods on
the Java objects you've instantiated or received from other method
calls.

The bad This is so seamless, it would be very easy to forget that
some of the objects that you're dealing with are effectively Java
objects, and then you might forget how to use them correctly.

For production for this app, I may need to change approaches since I'll
prob. be running multiple Mongrel processes and I don't know if I want
to have one embedded JVM per process (if I understand Mongrel deployment
correctly - currently I'm doing Apache/FastCGI so I know it's a problem
there). That may force using DRb in a separate process to host this
Excel parser component and allow it to be used from anywhere (if that
happens - could also do a Web service-y thing on top of a JRuby process
or whatever).

Hope this is useful for someone.

Wes

That seems to me to be an awfully roundabout way of doing things. I
think you could hack something up with ODBC to just take your whole
spreadsheet and upload it to a database, then use Rails or DBI to read
the database.

I want to Display Excel As HTML In js	2	Feb 24, 2023
How to loop in folder through all excel files and all sheets using pandas?	0	Dec 1, 2022
Collect Excel Data from Website	5	Apr 30, 2022
Excel Parsing : extract values of cells with formulae	4	Oct 15, 2007
efficient access to a large int32 array	6	Jan 16, 2007
Excel and Ruby	16	Oct 13, 2010
Optimize write of large file	6	May 12, 2011
Efficient way of looging in python	2	Apr 25, 2013

Efficient parsing of large Excel documents in Ruby

Wes Gamble

matt neuburg

Sam Smoot

Wes Gamble

M. Edward (Ed) Borasky

M. Edward (Ed) Borasky

Wes Gamble

Charles Oliver Nutter

Wes Gamble

Wes Gamble

Wes Gamble

Charles Oliver Nutter

Wes Gamble

Sam Smoot

Wes Gamble

M. Edward (Ed) Borasky

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads