Fast searching of large files

Discussion in 'Ruby' started by Stuart Clarke, Jul 1, 2010.

  1. Hey all,

    Could anyone advise me on a fast way to search a single, very large
    file (1 GB) for a string of text? Also, is there a library to
    identify the file offset at which the string was found?

    Thanks
    --
    Posted via http://www.ruby-forum.com/.
     
    Stuart Clarke, Jul 1, 2010
    #1

  2. On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
    <> wrote:
    > Hey all,
    >
    > Could anyone advise me on a fast way to search a single, but very large
    > file (1Gb) quickly for a string of text? Also, is there a library to
    > identify the file offset this string was found within the file?


    You can use IO#grep like this:

    File.open('qimo-2.0-desktop.iso', 'r:BINARY') { |io|
      io.grep(/apiKey/) { |m| p io.pos => m }
    }

    The pos is the position where the match ended, so just subtract the
    string length. The above example used a 700 MB file: it took around
    40 seconds the first time and 2 seconds subsequently, so disk I/O is
    the limiting factor in terms of speed (as usual).
    Oh, and don't use binary encoding if you are dealing with a different one ;)

    --
    Michael Fellinger
    CTO, The Rubyists, LLC
     
    Michael Fellinger, Jul 1, 2010
    #2

  3. 2010/7/1 Michael Fellinger <>:
    > On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
    > <> wrote:
    >> Hey all,
    >>
    >> Could anyone advise me on a fast way to search a single, but very large
    >> file (1Gb) quickly for a string of text? Also, is there a library to
    >> identify the file offset this string was found within the file?

    >
    > You can use IO#grep like this:
    > File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
    > io.grep(/apiKey/){|m| p io.pos => m } }
    >
    > The pos is the position the match ended, so just subtract the string length.
    > The above example was a file with 700mb, took around 40s the first
    > time, 2s subsequently, so disk I/O is the limiting factor in terms of
    > speed (as usual).


    If you only need to know whether the string occurs in the file, you can do

    found = File.foreach("foo").any? {|line| /apiKey/ =~ line}

    This will stop searching as soon as the sequence is found.

    "fgrep -l foo" is likely faster.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jul 1, 2010
    #3
  4. Thanks.

    This seems to be pretty much the best logic for me; however, it takes
    a good 20 minutes to scan a 2 GB file.

    Any ideas?

    Thanks

    Michael Fellinger wrote:
    > On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
    > <> wrote:
    >> Hey all,
    >>
    >> Could anyone advise me on a fast way to search a single, but very large
    >> file (1Gb) quickly for a string of text? Also, is there a library to
    >> identify the file offset this string was found within the file?

    >
    > You can use IO#grep like this:
    > File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
    > io.grep(/apiKey/){|m| p io.pos => m } }
    >
    > The pos is the position the match ended, so just subtract the string
    > length.
    > The above example was a file with 700mb, took around 40s the first
    > time, 2s subsequently, so disk I/O is the limiting factor in terms of
    > speed (as usual).
    > Oh, and also don't use binary encoding if you are dealing with another
    > one ;)


    --
    Posted via http://www.ruby-forum.com/.
     
    Stuart Clarke, Jul 1, 2010
    #4
  5. Michael Fellinger wrote:
    > On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
    > <> wrote:
    >> Hey all,
    >>
    >> Could anyone advise me on a fast way to search a single, but very large
    >> file (1Gb) quickly for a string of text? Also, is there a library to
    >> identify the file offset this string was found within the file?

    >
    > You can use IO#grep like this:
    > File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
    > io.grep(/apiKey/){|m| p io.pos => m } }
    >
    > The pos is the position the match ended


    Actually, pos will be the position of the end of the line on which the
    match was found, because #grep works line by line.
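
    Following up on that: to recover the offset of the match itself, you can
    subtract the line's bytesize from io.pos to get the line start, then add
    the needle's index within the line. A minimal sketch, assuming a plain
    string needle (the helper name is illustrative, not from the thread):

```ruby
# Sketch: byte offsets where each occurrence of a plain-string needle
# starts. After each_line yields, io.pos points at the end of that line,
# so the line's start is io.pos - line.bytesize; add the needle's index
# inside the line to get the absolute offset of the match.
def match_offsets(path, needle)
  offsets = []
  File.open(path, 'rb') do |io|
    io.each_line do |line|
      line_start = io.pos - line.bytesize
      pos = 0
      while (i = line.index(needle, pos))
        offsets << line_start + i
        pos = i + 1
      end
    end
  end
  offsets
end
```

    Note this still reads line by line, so it inherits IO#grep's behavior
    (and its cost on files with very long or no newlines).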
     
    Joel VanderWerf, Jul 1, 2010
    #5
  6. On Thu, Jul 1, 2010 at 7:03 AM, Robert Klemme
    <> wrote:
    > 2010/7/1 Michael Fellinger <>:
    >> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
    >> <> wrote:
    >>> Could anyone advise me on a fast way to search a single, but very large
    >>> file (1Gb) quickly for a string of text? Also, is there a library to
    >>> identify the file offset this string was found within the file?

    >>
    >> You can use IO#grep like this:
    >> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
    >> io.grep(/apiKey/){|m| p io.pos => m } }
    >>
    >> The pos is the position the match ended, so just subtract the string length.
    >> The above example was a file with 700mb, took around 40s the first
    >> time, 2s subsequently, so disk I/O is the limiting factor in terms of
    >> speed (as usual).

    >
    > If you only need to know whether the string occurs in the file you can do
    > found = File.foreach("foo").any? {|line| /apiKey/ =~ line}
    > This will stop searching as soon as the sequence is found.
    >
    > "fgrep -l foo" is likely faster.


    irb> `fgrep -l waters /usr/share/dict/words`.size > 0
    => true
    irb> `fgrep -l watershed /usr/share/dict/words`.size > 0
    => true
    irb> `fgrep -l watershedz /usr/share/dict/words`.size > 0
    => false

    irb> `fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
    => ["153088", "153102", "204143", "234643", "472357", "856441",
    "913606", "913613", "913623", "913635", "913646", "913656", "913668",
    "913679", "913690", "913703"]
    irb> `fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
    => ["913613", "913623", "913635"]
    irb> `fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
    => []
     
    , Jul 1, 2010
    #6
  7. Stuart Clarke wrote:
    > Hey all,
    >
    > Could anyone advise me on a fast way to search a single, but very large
    > file (1Gb) quickly for a string of text? Also, is there a library to
    > identify the file offset this string was found within the file?


    a fast way is to do it in C :)

    Here are a few other helpers, though:

    1.9 has faster regexes
    boost regexes: http://github.com/michaeledgar/ruby-boost-regex (you
    could probably optimize it more than it currently is, as well...)

    Rubinius also might help.

    Also make sure to open your file in binary mode if you're on 1.9; that
    reads much faster, if that's an option, anyway.
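
    Building on the binary-mode tip: a pure-Ruby sketch that reads the file
    in large binary chunks and tracks absolute byte offsets, carrying a
    small overlap so a match straddling a chunk boundary is still found.
    The chunk size and helper name are assumptions, not anything from this
    thread:

```ruby
# Sketch: scan a file of any size in fixed-size binary chunks, reporting
# the absolute byte offset of every occurrence of a plain-string needle.
# A tail of needle.bytesize - 1 bytes is carried between chunks so a
# match straddling a chunk boundary is still seen; a full occurrence can
# never fit inside that tail, so nothing is reported twice.
def scan_offsets(path, needle, chunk: 1 << 20)
  needle = needle.b            # compare byte-wise; assumes non-empty needle
  offsets = []
  overlap = needle.bytesize - 1
  base = 0                     # absolute file offset of buf's first byte
  buf = ''.b
  File.open(path, 'rb') do |io|
    while (data = io.read(chunk))
      buf << data
      pos = 0
      while (i = buf.index(needle, pos))
        offsets << base + i
        pos = i + 1
      end
      keep = [buf.bytesize - overlap, 0].max
      base += keep
      buf = buf.byteslice(keep, buf.bytesize - keep)
    end
  end
  offsets
end
```

    Unlike the line-based approaches above, this never builds line strings,
    so it behaves the same on text and on binary files with no newlines.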
    GL.
    -rp
    --
    Posted via http://www.ruby-forum.com/.
     
    Roger Pack, Jul 1, 2010
    #7
