Opening a large file many times / optimisation

Paul Nulty

hello,

I have a method that basically searches through a largish (5 MB) text
file for a word. I need to call this method about 1400 times, and I
care about speed.

If I open the file at the start of the script and then pass the file
object as a parameter to my method each time it's called, the code runs
quite a bit faster than if I open the file inside the method each
time, but this seems ugly to me.

Is there a standard way to do this in Ruby? How much overhead is
involved in opening a large text file?

thanks.
 
James Edward Gray II

Well, if you have enough RAM to support pulling it into memory,
that's certainly going to be faster. However, there are some
techniques you could use to speed up an index-and-query operation.
See this old Ruby Quiz for some ideas:

http://www.rubyquiz.com/quiz54.html

James Edward Gray II
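
One way to get the speed of a single read without passing a file object into every call is to let a small object own the data and load it lazily on first use. The sketch below is only an illustration: the class name `SenseFile` and its methods are made up for this example, and it assumes the whole file fits comfortably in memory.

```ruby
# Hypothetical helper (not from the thread): reads the file once, on first
# use, and keeps the lines in memory so later searches never reopen it.
class SenseFile
  def initialize(path)
    @path = path
    @lines = nil
  end

  # Returns the cached lines, loading them from disk only the first time.
  def lines
    @lines ||= File.readlines(@path)
  end

  # Linear scan for the first line starting with +prefix+; nil if none.
  def find(prefix)
    lines.find { |line| line.start_with?(prefix) }
  end
end
```

A caller would create one `SenseFile` for the index file at the start of the script and call `find` on it as often as needed. The search is still a linear scan, so this only removes the repeated open and read, not the cost of each lookup.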
 
M. Edward (Ed) Borasky

1. You need to define the problem better. Are you searching for a
different word each time, does the file change each time, etc. Why do
you have to call it 1400 times?

2. Searching and indexing are extremely well documented areas of
computer science. Once you've correctly defined your problem, I'm sure
you'll come up with something far more efficient than a brute force
"open a five megabyte file, read the whole enchilada into RAM, and do a
text search for the word, then close the file and wait for the next
request".

3. Do you care about scalability, or is the file *never* going to get
bigger than 5 MBytes? Is the method *always* going to be called "only"
1400 times, or will someone see your success and say, "Great -- here's
20 million words!"?
 
Paul Nulty

OK, here are a few lines from the file I'm searching (it's a WordNet file
that holds different senses of words):

concavity%1:07:00:: 05070032 2 0
concavity%1:25:00:: 13864965 1 0
concavo-concave%5:00:00:concave:00 00536008 1 0
concavo-convex%5:00:00:concave:00 00536416 1 0
conceal%2:39:00:: 02146790 2 1
conceal%2:39:01:: 02144835 1 8
concealed%3:00:00:: 02088404 2 1
concealed%5:00:00:invisible:00 02517817 1 2
concealing%1:04:00:: 01048912 1 0
concealing%3:00:00:: 02091020 1 0


I need to search for the first part (e.g. conceal%2:39:00::) and
return the second-to-last number (e.g. 2). (Getting the sense from the
sense key, if you know WordNet.)

I have 1400 words, and the WordNet file will never change. I'm unlikely to
need to scale up much past 1400.

Here's my code (senseKey is e.g. "conceal%2:39:00::"):

lines=File.readlines("/usr/local/WordNet-3.0/dict/index.sense")

# gets the sense number for a sense key (nil if the key is not found)
def getSense(senseKey,lines)
  for line in lines
    # only lines that start with the key count as a match
    if line.index(senseKey)==0
      words=line.split(" ")
      return words[-2]
    end
  end
  nil
end


thanks again!
 
M. Edward (Ed) Borasky

Isn't there a Ruby/Wordnet interface? Doctor Google recommended
http://www.deveiate.org/projects/Ruby-WordNet/
 
Robert Klemme

Paul said:
Yep, I'm using that; it's great, but as far as I can tell it doesn't use
sense keys, it uses sense numbers. I only have the sense keys, so I
need to get the sense number from the sense key manually.

Try reading the file and storing all combinations in a Hash with sense
key as key and number as value.

robert
 
Brian Candler

If you're searching a 5MB file 1400 times, it's almost certainly worth
reading it in once and building a hash as you go. Remember that on average,
you are reading half the lines in the file on every search, so 1400 searches
cost about 700 full passes over the file, against a single pass to build the
hash. So you should speed up by a factor of nearly 700 just by doing this.

If the wordnet file is too big to fit into RAM, then there are ways of
indexing the file on disk to make it quicker to search (external searching).
Try something like:

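class Wordnet
  # Reads the file once and maps each sense key to the rest of its fields.
  def initialize(filename)
    @words = {}
    File.open(filename) do |f|
      f.each_line do |line|
        fields = line.chomp.split(/ /)
        key = fields.shift
        @words[key] = fields
      end
    end
  end

  # Despite the name, this returns the second-to-last field of the line
  # (the sense number), or nil if the key is not in the file.
  def sysnet(senseKey)
    fields = @words[senseKey]
    fields && fields[1]
  end
end

wn = Wordnet.new("/usr/local/WordNet-3.0/dict/index.sense")
# Now do this 1400 times for different keys
puts wn.sysnet("conceal%2:39:00::")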
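
For the case mentioned above where the file is too big to hold in RAM, one possible approach is to keep only each key's byte offset in memory and seek to the line on demand. This is a rough sketch rather than a recommendation from the thread: the class name `OffsetIndex` and its methods are invented for illustration, and it still keeps every key in memory, so it only saves the space taken by the rest of each line.

```ruby
# Hypothetical on-disk index (not from the thread): remembers where each
# line starts, then seeks straight to that offset when a key is looked up.
class OffsetIndex
  def initialize(path)
    @file = File.open(path)
    @offsets = {}
    until @file.eof?
      pos = @file.pos                          # offset of the line about to be read
      key = @file.readline.split(" ", 2).first # first field is the key
      @offsets[key] = pos if key               # blank lines yield no key
    end
  end

  # Returns the whitespace-separated fields of the line for +key+, or nil.
  def fields_for(key)
    pos = @offsets[key]
    return nil unless pos
    @file.seek(pos)
    @file.readline.chomp.split(" ")
  end

  # Releases the underlying file handle.
  def close
    @file.close
  end
end
```

The file handle stays open between lookups, so the open cost is paid once. If the file is sorted by key, a binary search over byte positions could avoid storing the keys as well, at the price of more complicated code.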
 
Paul Nulty

Thanks!

before:

142.800000 0.100000 142.900000 (156.818797)

after (with hash)

9.900000 0.100000 10.000000 ( 11.259273)

thanks again.
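
The four columns in the timings above look like the output of Ruby's standard Benchmark library (user CPU, system CPU, total, and elapsed real time); that is an assumption, since the thread does not show the timing code. A comparison along those lines could be sketched as follows, using synthetic data in place of the real index file. The `scan` method, the generated keys, and the sizes are all made up for the example.

```ruby
require "benchmark"

# Synthetic stand-in for the index file: "key offset sense_number tag_count"
lines = (1..20_000).map { |i| format("word%05d%%1:00:00:: %08d %d 0\n", i, i, i % 9 + 1) }
keys  = (1..200).map { |i| format("word%05d%%1:00:00::", i * 100) }

# Linear scan over all lines, in the spirit of the original getSense.
def scan(lines, key)
  lines.each do |line|
    return line.split(" ")[-2] if line.start_with?(key)
  end
  nil
end

# One pass over the data to build a key => sense number hash.
table = {}
lines.each do |line|
  key, _offset, sense, _count = line.split(" ")
  table[key] = sense
end

scan_results = nil
hash_results = nil
scan_time = Benchmark.measure { scan_results = keys.map { |k| scan(lines, k) } }
hash_time = Benchmark.measure { hash_results = keys.map { |k| table[k] } }

puts "scan: #{scan_time}"
puts "hash: #{hash_time}"
```

The absolute numbers will depend on the machine and the Ruby version, so only the ratio between the two lines is meaningful.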
 
