slow IO


David King Landrith

Hello everyone,

I'm having an IO problem. The code below is designed to read a series
of text files (most of these are about 8k) and output them into a
single text file separated by form feed characters. This works very
quickly with sets of files that are in directories containing (say) 90K
documents or fewer. But when there are 200k+ documents in the
directories, it begins to take a substantial amount of time.

The main problem is that sysread seems to be painfully slow with files
in very large directories. At this point, it would be faster to do the
following:

system("cat #{input_file} >> #{outputFile}")
system("echo \f >> #{outputFile}")

It seems to me that there is no way that this should be faster than
doing a sysread.

Any help would be appreciated.

Best,

Dave

-- begin code (This has been cleaned a bit and changed to protect the
innocent)

# docInfo object is a wrapper for pages array with some additional info
outputFile = docInfo.outputFile
output = nil

isOpen = false

chunk = (10240 * 2) # 20k
fmode = File::CREAT|File::TRUNC|File::WRONLY

begin

  docInfo.each do |pageInfo|
    pageNo = pageInfo.pageNo
    start = 0
    count = chunk

    begin
      # open source
      input = File.open(pageInfo.inputFile)
      fileSize = input.stat.size

      # open destination if not already open
      unless isOpen
        output = File.open(outputFile, fmode, 0664)
        isOpen = true
      end

      # loop so that no read runs past the end of the file
      while start < fileSize
        count = (fileSize - start) if (start + chunk) > fileSize
        output.syswrite(input.sysread(count))
        start += count
      end
      output.syswrite("\f")

    ensure
      begin
        input.close
      rescue Exception => err
        STDERR << "WARNING: couldn't close #{pageInfo.inputFile}\n"
      end
    end
  end

ensure
  begin
    output.close if isOpen
  rescue Exception
    STDERR << "WARNING: couldn't close #{outputFile}\n"
  end
end
--end code
 

Ara.T.Howard

Date: Sat, 14 Feb 2004 02:16:05 +0900
From: David King Landrith <[email protected]>
Newsgroups: comp.lang.ruby
Subject: slow IO

-- begin code (This has been cleaned a bit and changed to protect the
innocent)

# docInfo object is a wrapper for pages array with some additional info
outputFile = docInfo.outputFile
output = nil

isOpen = false

chunk = (10240 * 2) # 20k
fmode = File::CREAT|File::TRUNC|File::WRONLY

begin

  docInfo.each do |pageInfo|
    pageNo = pageInfo.pageNo
    start = 0
    count = chunk

    input = nil
    begin
      # open source
      input = File.open(pageInfo.inputFile)
      fileSize = input.stat.size

      # open destination if not already open
      unless isOpen
        output = File.open(outputFile, fmode, 0664)
        isOpen = true
      end

      while start < fileSize
        count = (fileSize - start) if (start + chunk) > fileSize

        buf = input.sysread count
        output.syswrite buf
        buf = nil
        start += count
      end
      output.syswrite("\f")

    ensure
      # you probably want to _know_ if the system is having probs
      # closing files
      input.close if input
    end
  end

ensure
  begin
    output.close if isOpen
  rescue Exception
    STDERR << "WARNING: couldn't close #{outputFile}\n"
  end
end
--end code

alternatively you might be able to use the

open(path) do |f|
  ...
end

idiom with 'output'. i suspect that you were grinding to a halt with too many
output files open (they were never closed)...
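
something like this, say (only a sketch; it reuses the docInfo/pageInfo
accessors from the code above and uses plain IO#read/#write in place of
sysread/syswrite, so EOF shows up as a nil return):

File.open(docInfo.outputFile, File::CREAT|File::TRUNC|File::WRONLY, 0664) do |output|
  docInfo.each do |pageInfo|
    File.open(pageInfo.inputFile) do |input|
      while buf = input.read(20 * 1024)   # copy in 20k chunks
        output.write buf
      end
    end                                   # input is closed here, exception or not
    output.write "\f"
  end
end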

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
| URL :: http://www.ngdc.noaa.gov/stp/
| TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================
 

J.Herre

This works very quickly with sets of files that are in directories
containing (say) 90K documents or fewer. But when there are 200k+
documents in the directories, it begins to take a substantial amount of time.

Suffice it to say that few file systems are optimized for the case of >
200k files per directory. From your example, I'm guessing that you're
using Windows, which I don't know much about. But my advice is to
restructure your program to use more, smaller directories.
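
For example (just a sketch; the two-level layout, the bucket count, and
the names here are made up rather than taken from your setup):

require 'digest/md5'
require 'fileutils'

# spread files over 256 buckets, e.g. "page12345.txt" -> "pages/3f/page12345.txt"
def bucketed_path(root, filename)
  bucket = Digest::MD5.hexdigest(filename)[0, 2]
  File.join(root, bucket, filename)
end

path = bucketed_path("pages", "page12345.txt")
FileUtils.mkdir_p(File.dirname(path))     # create the bucket directory on first use

That keeps any single directory down to a few hundred or a few thousand
entries instead of 200k+.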

Seriously, that's a lot of files...

-J
 

YANAGAWA Kazuhisa

In Message-Id: <[email protected]>
David King Landrith said:
output.syswrite(input.sysread(count))

The problem is probably here: the many Strings created and discarded
may be triggering the GC.

Can you rewrite this to

# buf should be allocated before the loop.
input.sysread(count, buf)
output.syswrite(buf)

and test its performance? Here buf is updated in place by sysread and
no extra Strings are created.

# Note that this feature is incorporated from version 1.7.x or later
# where x > ....well, some point :p
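
Applied to the copy loop from the original code, it might look like this
(a sketch; chunk, start, fileSize, input, and output are the variables
from the earlier post):

buf = ""                        # allocated once, before the loop
while start < fileSize
  count = (fileSize - start) if (start + chunk) > fileSize
  input.sysread(count, buf)     # sysread fills buf in place, no new String per read
  output.syswrite(buf)
  start += count
end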
 
