how to quickly find a string towards the end of a large io object

B

bwv549

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!
 
A

ara.t.howard

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!

binary search using seek. in all honestly though, it'd have to be
really big to beat a regex if the entire file contents. define big?

a @ http://codeforpeople.com/
 
S

Sebastian Hungerecker

ara.t.howard said:
binary search using seek.

He didn't say his data was sorted, did he?

=2D-=20
Jabber: (e-mail address removed)
ICQ: 205544826
 
R

Robert Klemme

2008/11/6 bwv549 said:
How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I

What is a "massive IO object"? Are we talking about a large file? A
socket or pipe which yields a lot data?
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Start scanning at the beginning and travel forward in time. :)

No seriously, we need a bit more detail.

Kind regards

robert
 
L

Lloyd Linklater

bwv549 said:
How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!

They are correct that more info is needed. If it is just a huge amount
of text and it absolutely *had* to run faster, in other languages I
would break up the string and have separate threads search.
 
H

Hugh Sasse

bwv549 said:
How do I scan starting at the end of a big io object to find a string
that is always closer to then end? [...]
In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!

They are correct that more info is needed. If it is just a huge amount
of text and it absolutely *had* to run faster, in other languages I
would break up the string and have separate threads search.

Other things to take into account would include:
* Will be big string be kept for some time?
* Will there be multiple searches on it?
* Is it bigger than RAM + swap?

You might find that if you're doing multiple searches that a suffix array
(see Programming Pearls) is useful. You may wish to look at Ferret. You
may want to call out to another program to do this if is is obtained from
a file.

ri String#rindex gives:

---------------------------------------------------------- String#rindex
str.rindex(substring [, fixnum]) => fixnum or nil
str.rindex(fixnum [, fixnum]) => fixnum or nil
str.rindex(regexp [, fixnum]) => fixnum or nil
------------------------------------------------------------------------
Returns the index of the last occurrence of the given _substring_,
character (_fixnum_), or pattern (_regexp_) in _str_. Returns +nil+
if not found. If the second parameter is present, it specifies the
position in the string to end the search---characters beyond this
point will not be considered.

"hello".rindex('e') #=> 1
"hello".rindex('l') #=> 3
"hello".rindex('a') #=> nil
"hello".rindex(101) #=> 1
"hello".rindex(/[aeiou]/, -2) #=> 1

Does this actually search from the end backwards? Not checked the source
(recently enough || at all) to know, but I'd expect so.

HTH
Hugh
 
B

bwv549

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I've decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO
object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Again, the basic layout of the file:

.................................my_param = X........

I need the value 'X' following the 'my_para = '

Thanks again!
 
B

bwv549

Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
halfway = handle.stat.size / 2
handle.seek halfway
last_half = handle.read
params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.
 
A

ara.t.howard

Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
halfway = handle.stat.size / 2
handle.seek halfway
last_half = handle.read
params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.



cfp:~ > cat a.rb
def tail_search io, re, options = {}
io = open io unless io.respond_to?:)read)
percent = Float(options['percent']||options[:percent]||0.10)

buf = ''
size = io.stat.size
pagesize = Integer(size * percent)
pos = 0

loop do
pos -= pagesize
break if pos.abs > size
io.seek(pos, IO::SEEK_END)
buf = buf + io.read(pagesize)
match = re.match(buf)
return match if match
end

return nil
ensure
io.close rescue nil
end


match = tail_search(__FILE__, /(key)=(val)/, :percent => 0.02)
p match.to_a if match

p tail_search(__FILE__, /(key)=(value)/)

__END__
key=val







cfp:~ > ruby a.rb
["key=val", "key", "val"]
nil




you may want to modify to use rindex instead of a regex - up to you.



a @ http://codeforpeople.com/
 
R

Robert Klemme

2008/11/6 bwv549 said:
I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I've decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO
object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Again, the basic layout of the file:

................................my_param = X........

I need the value 'X' following the 'my_para = '

For simplicity reasons I'd start with this:

def get_param(file_name)
size = File.size file_name
File.open file_name do |io|
io.seek(-(size * 0.1).to_i, IO::SEEK_END)
io.read[%r{my_param\s*=\s*(\S+)}, 1]
end
end

Kind regards

robert
 
A

ara.t.howard

Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
halfway = handle.stat.size / 2
handle.seek halfway
last_half = handle.read
params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.



this one never reads more than pagesize into memory and deals with the
fact that a needle could straddle two pages by keeping the current
page plus the previous page's first bit (only need maximum of needle
size bytes) as the search target.

the extra code is just showing you that it does, in fact, find it's
target.

you can up the percent for speed or crank it down to save on memory.


cfp:~ > cat a.rb
def tail_search io, needle, options = {}
io = open io unless io.respond_to?:)read)
percent = Float(options['percent']||options[:percent]||0.10)

buf = ''
size = io.stat.size
pagesize = Integer(size * percent)
pos = 0

loop do
pos -= pagesize
break if pos.abs > size
io.seek(pos, IO::SEEK_END)
buf = io.read(pagesize) + buf[0, needle.size]
relative_index = buf.index(needle)
if relative_index
absolute_index = size + pos + relative_index
return absolute_index
end
end

return nil
ensure
io.close rescue nil
end


needle = 'key=val'
index = tail_search(__FILE__, needle, :percent => 0.02)
if index
open(__FILE__) do |fd|
fd.seek index
puts fd.read(needle.size)
end
end


needle = 'io.close rescue nil'
index = tail_search(__FILE__, needle, :percent => 0.02)
if index
open(__FILE__) do |fd|
fd.seek index
puts fd.read(needle.size)
end
end


__END__
key=val




cfp:~ > ruby a.rb
key=val
io.close rescue nil

a @ http://codeforpeople.com/
 
H

Hugh Sasse

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I've decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO

You can do that with read.
object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Other languages may be faster of course, but first make it work...
Again, the basic layout of the file:

................................my_param = X........
let the length of your data ^^^^^^^^^^^^ be k

First try reading the whole lot. If it's too big you'll get an exception.
Otherwise use rindex on the buffer from the successful read.

If it's a stream you'll have to read it on order anyway,
so read it in chunks with a 2*k overlap so you can definitely
fit the whole string in a chunk, and boundaries between reads
don't matter. That should work on 10GB files. You might want
to read in multiples of 512 bytes on Unix, maybe 4k (cluster size??)
on Windows, not sure.
I need the value 'X' following the 'my_para = '

Thanks again!
Hugh
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,906
Latest member
SkinfixSkintag

Latest Threads

Top