slow IO


David King Landrith

Hello everyone,

I'm having an IO problem. The code below is designed to read a series
of text files (most of these are about 8k) and output them into a
single text file separated by form feed characters. This works very
quickly with sets of files that are in directories containing (say) 90K
documents or fewer. But when there are 200k+ documents in the
directories, it begins to take a substantial amount of time.

The main problem is that sysread seems to be painfully slow with files
in very large directories. At this point, it would be faster to do the
following:

system("cat #{input_file} >> #{outputFile}")
system("echo \f >> #{outputFile}")

It seems to me that there is no way that this should be faster than
doing a sysread.

Any help would be appreciated.

Best,

Dave

-- begin code (This has been cleaned a bit and changed to protect the
innocent)

# docInfo object is a wrapper for pages array with some additional info
outputFile = docInfo.outputFile
output = nil

isOpen = false

chunk = (10240 * 2) # 20k
fmode = File::CREAT|File::TRUNC|File::WRONLY

begin

  docInfo.each do |pageInfo|
    pageNo = pageInfo.pageNo
    start = 0
    count = chunk

    begin
      # open source
      input = File.open(pageInfo.inputFile)
      fileSize = input.stat.size

      # open destination if not already open
      unless isOpen
        output = File.open(outputFile, fmode, 0664)
        isOpen = true
      end

      # loop so that no read runs past the end of the file
      while start < fileSize
        count = (fileSize - start) if (start + chunk) > fileSize
        output.syswrite(input.sysread(count))
        start += count
      end
      output.syswrite("\f")

    ensure
      begin
        input.close
      rescue Exception => err
        STDERR << "WARNING: couldn't close #{pageInfo.inputFile}\n"
      end
    end
  end

ensure
  begin
    output.close if isOpen
  rescue Exception
    STDERR << "WARNING: couldn't close #{outputFile}\n"
  end
end
--end code
 

Ara.T.Howard

Date: Sat, 14 Feb 2004 02:16:05 +0900
From: David King Landrith <[email protected]>
Newsgroups: comp.lang.ruby
Subject: slow IO

-- begin code (This has been cleaned a bit and changed to protect the
innocent)

# docInfo object is a wrapper for pages array with some additional info
outputFile = docInfo.outputFile
output = nil

isOpen = false

chunk = (10240 * 2) # 20k
fmode = File::CREAT|File::TRUNC|File::WRONLY

begin

  docInfo.each do |pageInfo|
    pageNo = pageInfo.pageNo
    start = 0
    count = chunk

    input = nil
    begin
      # open source
      input = File.open(pageInfo.inputFile)
      fileSize = input.stat.size

      # open destination if not already open
      unless isOpen
        output = File.open(outputFile, fmode, 0664)
        isOpen = true
      end

      while start < fileSize
        count = (fileSize - start) if (start + chunk) > fileSize

        buf = input.sysread count
        output.syswrite buf
        buf = nil
        start += count
      end
      output.syswrite("\f")

    ensure
      # you probably want to _know_ if the system is having probs
      # closing files
      input.close if input
    end
  end

ensure
  begin
    output.close if isOpen
  rescue Exception
    STDERR << "WARNING: couldn't close #{outputFile}\n"
  end
end
--end code

alternatively you might be able to use the

open(path) do |f|
  ...
end

idiom with 'output'. i suspect that you were grinding to a halt with too many
output files open (they were never closed)...
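
something like this, say (only a sketch; it reuses the docInfo/pageInfo
accessors from the code above and uses plain IO#read/#write in place of
sysread/syswrite, so EOF shows up as a nil return):

File.open(docInfo.outputFile, File::CREAT|File::TRUNC|File::WRONLY, 0664) do |output|
  docInfo.each do |pageInfo|
    File.open(pageInfo.inputFile) do |input|
      while buf = input.read(20 * 1024)   # copy in 20k chunks
        output.write buf
      end
    end                                   # input is closed here, exception or not
    output.write "\f"
  end
end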

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
| URL :: http://www.ngdc.noaa.gov/stp/
| TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================
 

J.Herre

This works very quickly with sets of files that are in directories
containing (say) 90K documents or fewer. But when there are 200k+
documents in the directories, it begins to take a substantial amount of time.

Suffice it to say that few file systems are optimized for the case of >
200k files per directory. From your example, I'm guessing that you're
using Windows, which I don't know much about. But my advice is to
restructure your program to use more, smaller directories.
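
For example (just a sketch; the two-level layout, the bucket count, and
the names here are made up rather than taken from your setup):

require 'digest/md5'
require 'fileutils'

# spread files over 256 buckets, e.g. "page12345.txt" -> "pages/3f/page12345.txt"
def bucketed_path(root, filename)
  bucket = Digest::MD5.hexdigest(filename)[0, 2]
  File.join(root, bucket, filename)
end

path = bucketed_path("pages", "page12345.txt")
FileUtils.mkdir_p(File.dirname(path))     # create the bucket directory on first use

That keeps any single directory down to a few hundred or a few thousand
entries instead of 200k+.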

Seriously, that's a lot of files...

-J
 

YANAGAWA Kazuhisa

In Message-Id: <[email protected]>
David King Landrith said:
output.syswrite(input.sysread(count))

The problem is probably here: the many Strings created and discarded
may be triggering the GC.

Can you rewrite this to

# buf should be allocated before the loop.
input.sysread(count, buf)
output.syswrite(buf)

and test its performance? Here buf is updated in place by sysread and
no extra Strings are created.

# Note that this feature is incorporated from version 1.7.x or later
# where x > ....well, some point :p
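
Applied to the copy loop from the original code, it might look like this
(a sketch; chunk, start, fileSize, input, and output are the variables
from the earlier post):

buf = ""                        # allocated once, before the loop
while start < fileSize
  count = (fileSize - start) if (start + chunk) > fileSize
  input.sysread(count, buf)     # sysread fills buf in place, no new String per read
  output.syswrite(buf)
  start += count
end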
 
