Efficient parsing of large Excel documents in Ruby

Discussion in 'Ruby' started by Wes Gamble, Oct 28, 2006.

  1. Wes Gamble

    Wes Gamble Guest

    All,

    I am currently using the parseexcel gem to parse an Excel file so that I
    can save it's data into a database (in a Rails app.).

    When I run it against an Excel file with ~42000 rows and 11 columns (the
    spreadsheet is about 10MB in size), it takes the better part of 10
    minutes to parse the file and uses about 200 MB of memory. I'm not
    actually sure if the parsing completes successfully - the app. stops
    right after the parsing appears to finish.

    It would seem that I have two issues - memory usage and performance.

    Is anyone aware of a much faster way to parse large Excel files?

    Is anyone aware of a "windowing" scheme that uses a set amount of memory
    to transfer portions of the data to a client requesting the data so as
    to keep the memory required at a fixed level?

    Should I look into using the Win32OLE libraries with the hope that it
    would be more efficient?

    I am transferring this data into a SQL Server database, so perhaps I
    could look into some M$-native method of moving this data.

    Any thoughts or advice is appreciated.

    Thanks,
    Wes Gamble

    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 28, 2006
    #1
    1. Advertising

  2. Wes Gamble

    matt neuburg Guest

    Christian Madsen <> wrote:

    > See this link: http://wiki.rubygarden.org/Ruby/page/show/ScriptingExcel


    Interesting. Just as a note, everything that happens on that page can
    now be done on Mac using almost the same language, thanks to
    rb-appscript (or RubyOSA, which I have not yet tried). It would be
    interesting to write a Mac version of the page. m.

    --
    matt neuburg, phd = , http://www.tidbits.com/matt/
    Tiger - http://www.takecontrolbooks.com/tiger-customizing.html
    AppleScript - http://www.amazon.com/gp/product/0596102119
    Read TidBITS! It's free and smart. http://www.tidbits.com
     
    matt neuburg, Oct 28, 2006
    #2
    1. Advertising

  3. Wes Gamble

    Sam Smoot Guest

    I agree with Christian, but with a twist.

    I like to use Excel Automation (w/ WIN32OLE) to convert the document to
    a format that's easier to work with. That way, after the initial
    conversion, you can handle CSV, XML, etc, and be working with a fast
    native library instead of the slow interop.

    Another reason I prefer this method is that Excel interop is sometimes
    hard to get "just right". There's processes I've written with it
    running for over a year that I've never had to touch, but there's
    another that used to require I login to the server and kill orphaned
    Excel processes. Someone changes a column, you don't code defensively
    enough, and BAM!

    So the simplicity of the Load/Convert process, and getting out of Excel
    as quickly as possible, conserving resources and avoiding potential
    bugs that might result in orphaned processes is a big advantage IMO.

    Plus it's almost gauranteed to be the fastest, and simplest to develop.
    (As long as you're OK with having to run on Windows and having Excel
    installed anyways.)
     
    Sam Smoot, Oct 28, 2006
    #3
  4. Wes Gamble

    Wes Gamble Guest

    I was definitely thinking about using win32ole/Excel to convert this
    spreadsheet to CSV and then process it with a CSV parser.

    I'm currently running on Linux though, so now I have to figure out any
    issues around a Windows production platform.

    Thanks,
    Wes


    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 28, 2006
    #4
  5. Wes Gamble wrote:
    > I was definitely thinking about using win32ole/Excel to convert this
    > spreadsheet to CSV and then process it with a CSV parser.
    >
    > I'm currently running on Linux though, so now I have to figure out any
    > issues around a Windows production platform.
    >
    > Thanks,
    > Wes
    >
    >


    Have you tried unixODBC? It seems to work fairly well, although there
    are some issues with most Linux/Unix tools when the spreadsheet has tabs
    or column headers with spaces in them.

    It's fairly easy to leave the Excel spreadsheets on a Windows system and
    hack together a simple "server" for them. I think all you have to do is
    "publish them to the intranet", but I've never done it. Then you can use
    something like Hpricot to parse them.
     
    M. Edward (Ed) Borasky, Oct 28, 2006
    #5
  6. Wes Gamble wrote:

    > I am transferring this data into a SQL Server database, so perhaps I
    > could look into some M$-native method of moving this data.


    Yes, there is a native way of doing this in VBA from a macro, which you
    can store either in your personal macro workbook on in the spreadsheets
    themselves. I have a colleague where I work that built an application
    that does it.

    But unfortunately, I have no idea what the technique is called. Almost
    anyone who's been through some advanced Microsoft training ought to be
    able to help you find it, or you might search the Microsoft knowledge
    base for "efficient transfer of data to SQL Server".

    If you don't have an answer by Tuesday, let me know and I'll ask her
    what it's called.

    And no, I don't think it involves installing CygWin. :)
     
    M. Edward (Ed) Borasky, Oct 28, 2006
    #6
  7. Wes Gamble

    Wes Gamble Guest

    Ed,

    Not a bad idea. I'm already using unixODBC to connect to SQL Server
    from the Linux box. I totally forgot that you can get to Excel via
    ODBC.

    Don't I need an ODBC driver for Excel though?

    What about a DBI driver for Excel - does one exist?

    Thanks,
    Wes

    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 28, 2006
    #7
  8. Wes Gamble wrote:
    > All,
    >
    > I am currently using the parseexcel gem to parse an Excel file so that I
    > can save it's data into a database (in a Rails app.).


    Another rough option would be using JRuby or one of the Java bridges to
    call out to POI, Apache's Office-document library.

    --
    Charles Oliver Nutter, JRuby Core Developer
    Blogging on Ruby and Java @ headius.blogspot.com
    Help spec out Ruby today! @ www.headius.com/rubyspec
    --
     
    Charles Oliver Nutter, Oct 28, 2006
    #8
  9. Wes Gamble

    Wes Gamble Guest

    Charles Oliver Nutter wrote:
    > Wes Gamble wrote:
    >> All,
    >>
    >> I am currently using the parseexcel gem to parse an Excel file so that I
    >> can save it's data into a database (in a Rails app.).

    >
    > Another rough option would be using JRuby or one of the Java bridges to
    > call out to POI, Apache's Office-document library.


    Charles,

    I would love to use JRuby for this app., but don't feel it's quite ready
    yet (although you guys are moving it forward at an incredible rate and I
    expect it will be a viable production option in less than a year - keep
    up the good work).

    Does anyone know if the POI Ruby bindings are a potential solution? I
    see
    "Implement support for reading Excel files (easy)" under the list of
    TODOs so not sure if I could use POI-Ruby to read...

    Anyone have any experience with POI-Ruby?

    Wes


    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 29, 2006
    #9
  10. Wes Gamble

    Wes Gamble Guest

    M. Edward (Ed) Borasky wrote:
    > Wes Gamble wrote:
    >> I was definitely thinking about using win32ole/Excel to convert this
    >> spreadsheet to CSV and then process it with a CSV parser.
    >>
    >> I'm currently running on Linux though, so now I have to figure out any
    >> issues around a Windows production platform.
    >>
    >> Thanks,
    >> Wes
    >>
    >>

    >
    > Have you tried unixODBC? It seems to work fairly well, although there
    > are some issues with most Linux/Unix tools when the spreadsheet has tabs
    > or column headers with spaces in them.
    >
    > It's fairly easy to leave the Excel spreadsheets on a Windows system and
    > hack together a simple "server" for them. I think all you have to do is
    > "publish them to the intranet", but I've never done it. Then you can use
    > something like Hpricot to parse them.


    Hpricot? I thought that only parsed HTML? Can you say a little more
    about this? Would the "server" you mention above parse the Excel into
    some intermediate format that I would then process with something else
    (like Hpricot or ???)?

    Thanks,
    Wes

    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 29, 2006
    #10
  11. Wes Gamble

    Wes Gamble Guest

    Wes Gamble wrote:
    > M. Edward (Ed) Borasky wrote:
    >> Wes Gamble wrote:
    >>> I was definitely thinking about using win32ole/Excel to convert this
    >>> spreadsheet to CSV and then process it with a CSV parser.
    >>>
    >>> I'm currently running on Linux though, so now I have to figure out any
    >>> issues around a Windows production platform.
    >>>
    >>> Thanks,
    >>> Wes
    >>>
    >>>

    >>
    >> Have you tried unixODBC? It seems to work fairly well, although there
    >> are some issues with most Linux/Unix tools when the spreadsheet has tabs
    >> or column headers with spaces in them.
    >>
    >> It's fairly easy to leave the Excel spreadsheets on a Windows system and
    >> hack together a simple "server" for them. I think all you have to do is
    >> "publish them to the intranet", but I've never done it. Then you can use
    >> something like Hpricot to parse them.

    >
    > Hpricot? I thought that only parsed HTML? Can you say a little more
    > about this? Would the "server" you mention above parse the Excel into
    > some intermediate format that I would then process with something else
    > (like Hpricot or ???)?
    >
    > Thanks,
    > Wes


    Ah, are you saying that I can "publish" the Excel file into HTML through
    some IIS thing and then just parse that with Hpricot? That's an
    interesting solution.

    Wes

    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 29, 2006
    #11
  12. Wes Gamble wrote:
    > I would love to use JRuby for this app., but don't feel it's quite ready
    > yet (although you guys are moving it forward at an incredible rate and I
    > expect it will be a viable production option in less than a year - keep
    > up the good work).
    >
    > Does anyone know if the POI Ruby bindings are a potential solution? I
    > see
    > "Implement support for reading Excel files (easy)" under the list of
    > TODOs so not sure if I could use POI-Ruby to read...
    >
    > Anyone have any experience with POI-Ruby?


    We appreciate the vote of confidence, and you're probably right about
    JRuby not quite being production ready ("production ready" in the sense
    of having something that won't require a bit of tinkering...there are
    people using it for production apps, but they're comfortable on the
    frontier :)

    Don't forget the Java bridge options...from what I hear they work
    reasonably well for neatly-encapsulated libraries like this.

    RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
    active project, though YAJB (Yet Another Java Bridge) apparently works too

    And there's RubyCLR, which can hook into the CLR and any capacity it has
    for working with Excel.

    --
    Charles Oliver Nutter, JRuby Core Developer
    Blogging on Ruby and Java @ headius.blogspot.com
    Help spec out Ruby today! @ www.headius.com/rubyspec
    --
     
    Charles Oliver Nutter, Oct 29, 2006
    #12
  13. Wes Gamble

    Wes Gamble Guest

    Charles Oliver Nutter wrote:
    > RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
    > active project, though YAJB (Yet Another Java Bridge) apparently works
    > too


    I'm looking at RJB and it appears that it only loads the JVM into memory
    once until it is explicitly unloaded.

    Can anyone corroborate this (or correct me on it)?

    Thanks,
    Wes

    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Oct 30, 2006
    #13
  14. Wes Gamble

    Sam Smoot Guest

    Charles Oliver Nutter wrote:
    > And there's RubyCLR, which can hook into the CLR and any capacity it has
    > for working with Excel.


    Read: None. :) You'd just reference the Excel COM object in .NET.
    Using WIN32OLE is actually simpler than c# since you have access to the
    VB-ish interfaces too (such as Range#Value).
     
    Sam Smoot, Oct 30, 2006
    #14
  15. Wes Gamble

    Wes Gamble Guest

    Some more info:

    Recently, I was confronted with a task in one of the apps. I'm building
    that would allow the parsing of data in an Excel spreadsheet where the
    number of rows could be on the order of 30000/40000/50000 or higher.

    Originally, I was using the parseexcel gem to handle the parsing -
    however, it proved to be fairly slow and consumed a lot of memory. When
    I presented it with a > 42000 row spreadsheet, it basically cratered.
    So I had to figure out another way to handle this problem. Someone
    mentioned that there was a nice open source Java - based Excel parser
    called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
    Java test showed that the performance and memory footprint would be much
    much better.

    In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
    still had problems implementing that (and I didn't want to run this app.
    on it yet since it's still so young), so I took a look at some of the
    Java - Ruby bridges. I gave one called Rjb
    (http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
    was very pleasantly surprised - it was really easy to use this to
    integrate with the JExcelAPI.

    If I understand correctly, Rjb uses JNI to start, and then interact with
    an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
    basically embed a JVM in your Ruby interpreter and then load classes
    into it and start using them. Basic type casting to/from Java types is
    done for you. The documentation is terrible but there's just enough of
    it to get you started.

    Here's what I did:

    0) Get the Rjb gem using "gem install rjb"
    1) Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
    directory.
    2) Start the JVM using Rjb::load("#{RAILS_ROOT}/lib/jxl.jar",
    ['-Xms256M', '-Xmx512M']) - the array is a set of parameters to send to
    the JVM for startup.
    3) Load classes using Rjb::import(classname)

    Here's an example of using it in my app.:

    file_class = Rjb::import('java.io.File')
    workbook_class = Rjb::import('jxl.Workbook')
    workbook = workbook_class.getWorkbook(file_class.new(filename))

    Some things to notice:
    * filename is a Ruby string - that's being passed to the File.new() Java
    method.
    * The return of the call to file_class.new is a wrapped Java File object
    and can be immediately passed to the getWorkbook method.
    * workbook is a Java object that can then be used in other parts of the
    app.

    The good news: Once you get past loading a class and/or instantiating an
    object, doing method calls is as simple as just calling the methods on
    the Java objects you've instantiated or received from other method
    calls.

    The bad news: This is so seamless, it would be very easy to forget that
    some of the objects that you're dealing with are effectively Java
    objects, and then you might forget how to use them correctly.

    For production for this app, I may need to change approaches since I'll
    prob. be running multiple Mongrel processes and I don't know if I want
    to have one embedded JVM per process (if I understand Mongrel deployment
    correctly - currently I'm doing Apache/FastCGI so I know it's a problem
    there). That may force using DRb in a separate process to host this
    Excel parser component and allow it to be used from anywhere (if that
    happens - could also do a Web service-y thing on top of a JRuby process
    or whatever).

    Hope this is useful for someone.

    Wes




    --
    Posted via http://www.ruby-forum.com/.
     
    Wes Gamble, Nov 2, 2006
    #15
  16. Wes Gamble wrote:
    > Some more info:
    >
    > Recently, I was confronted with a task in one of the apps. I'm building
    > that would allow the parsing of data in an Excel spreadsheet where the
    > number of rows could be on the order of 30000/40000/50000 or higher.
    >
    > Originally, I was using the parseexcel gem to handle the parsing -
    > however, it proved to be fairly slow and consumed a lot of memory. When
    > I presented it with a > 42000 row spreadsheet, it basically cratered.
    > So I had to figure out another way to handle this problem. Someone
    > mentioned that there was a nice open source Java - based Excel parser
    > called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
    > Java test showed that the performance and memory footprint would be much
    > much better.
    >
    > In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
    > still had problems implementing that (and I didn't want to run this app.
    > on it yet since it's still so young), so I took a look at some of the
    > Java - Ruby bridges. I gave one called Rjb
    > (http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
    > was very pleasantly surprised - it was really easy to use this to
    > integrate with the JExcelAPI.
    >
    > If I understand correctly, Rjb uses JNI to start, and then interact with
    > an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
    > basically embed a JVM in your Ruby interpreter and then load classes
    > into it and start using them. Basic type casting to/from Java types is
    > done for you. The documentation is terrible but there's just enough of
    > it to get you started.
    >
    > Here's what I did:
    >
    > 0) Get the Rjb gem using "gem install rjb"
    > 1) Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
    > directory.
    > 2) Start the JVM using Rjb::load("#{RAILS_ROOT}/lib/jxl.jar",
    > ['-Xms256M', '-Xmx512M']) - the array is a set of parameters to send to
    > the JVM for startup.
    > 3) Load classes using Rjb::import(classname)
    >
    > Here's an example of using it in my app.:
    >
    > file_class = Rjb::import('java.io.File')
    > workbook_class = Rjb::import('jxl.Workbook')
    > workbook = workbook_class.getWorkbook(file_class.new(filename))
    >
    > Some things to notice:
    > * filename is a Ruby string - that's being passed to the File.new() Java
    > method.
    > * The return of the call to file_class.new is a wrapped Java File object
    > and can be immediately passed to the getWorkbook method.
    > * workbook is a Java object that can then be used in other parts of the
    > app.
    >
    > The good news: Once you get past loading a class and/or instantiating an
    > object, doing method calls is as simple as just calling the methods on
    > the Java objects you've instantiated or received from other method
    > calls.
    >
    > The bad news: This is so seamless, it would be very easy to forget that
    > some of the objects that you're dealing with are effectively Java
    > objects, and then you might forget how to use them correctly.
    >
    > For production for this app, I may need to change approaches since I'll
    > prob. be running multiple Mongrel processes and I don't know if I want
    > to have one embedded JVM per process (if I understand Mongrel deployment
    > correctly - currently I'm doing Apache/FastCGI so I know it's a problem
    > there). That may force using DRb in a separate process to host this
    > Excel parser component and allow it to be used from anywhere (if that
    > happens - could also do a Web service-y thing on top of a JRuby process
    > or whatever).
    >
    > Hope this is useful for someone.
    >
    > Wes


    That seems to me to be an awfully roundabout way of doing things. I
    think you could hack something up with ODBC to just take your whole
    spreadsheet and upload it to a database, then use Rails or DBI to read
    the database.
     
    M. Edward (Ed) Borasky, Nov 2, 2006
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Mike
    Replies:
    4
    Views:
    882
    Bob Foster
    Nov 23, 2003
  2. Ruchi Dayal
    Replies:
    1
    Views:
    694
    Peter Gerstbach
    Sep 7, 2004
  3. Replies:
    1
    Views:
    482
    Juan T. Llibre
    Oct 18, 2006
  4. Noozer
    Replies:
    2
    Views:
    795
    Joseph Kesselman
    Sep 21, 2007
  5. =?Utf-8?B?c2hhc2hhbmsga3Vsa2Fybmk=?=

    Problem with Excel reports ::::Excel 2003 Migration To Excel 2007

    =?Utf-8?B?c2hhc2hhbmsga3Vsa2Fybmk=?=, Oct 5, 2007, in forum: ASP .Net
    Replies:
    15
    Views:
    1,525
    =?Utf-8?B?c2hhc2hhbmsga3Vsa2Fybmk=?=
    Oct 24, 2007
Loading...

Share This Page