Too many open files

AMD

Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm; I do this one line at a
time. The problem I have is that opening a file in append mode, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all the writes and then close them
all at the end.
The problem I have under Windows is that as soon as I get to 500 files I
get the "Too many open files" message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes
 
Jeff

Why don't you start around 50 threads at a time to do the file
writes? Threads are effective for IO. You open the source file,
start a queue, and start sending data sets to be written to the
queue. Your source file processing can go on while the writes are
done in other threads.
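
Untested, but something along these lines (the 50-worker figure is just
the one above, and bucket_name() is a stand-in for whatever hash step
picks the target file):

import threading
import queue

NUM_WORKERS = 50                  # tune for your machine
work = queue.Queue(maxsize=1000)

def writer():
    # Pull (filename, line) pairs off the queue and append them.
    while True:
        item = work.get()
        if item is None:          # sentinel: shut this worker down
            break
        filename, line = item
        with open(filename, "a") as f:
            f.write(line)

threads = [threading.Thread(target=writer) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

with open("bigfile.txt") as src:  # hypothetical input name
    for line in src:
        work.put((bucket_name(line), line))

for _ in threads:
    work.put(None)                # one sentinel per worker
for t in threads:
    t.join()

Note that two workers can end up appending to the same file at once, so
lines for a given bucket may come out in a different order than they
appear in the source file.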
 
Christian Heimes

Jeff said:
Why don't you start around 50 threads at a time to do the file
writes? Threads are effective for IO. You open the source file,
start a queue, and start sending data sets to be written to the
queue. Your source file processing can go on while the writes are
done in other threads.

I'm sorry, but you are totally wrong. Threads are a very bad idea for
IO-bound operations. Asynchronous event IO is the best answer for any
IO-bound problem. That is, select, poll, epoll, kqueue or IOCP.

Christian
 
Steven D'Aprano

The problem I have under windows is that as soon as I get to 500 files I
get the Too many open files message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Windows XP has a limit of 512 files opened by any process, including
stdin, stdout and stderr, so your code is probably failing after file
number 509.

http://forums.devx.com/archive/index.php/t-136946.html

It's almost certainly not a Python problem, because under Linux I can
open 1000+ files without blinking.
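
A throwaway way to check the effective limit on whatever platform you're
on is simply to keep opening files until it fails, e.g.:

import tempfile

handles = []
try:
    while True:
        handles.append(tempfile.TemporaryFile())
except OSError as err:
    print("hit the limit after", len(handles), "open files:", err)
finally:
    for h in handles:
        h.close()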

I don't know how Delphi works around that issue. Perhaps one of the
Windows gurus can advise if there's a way to increase that limit from 512?
 
Larry Bates

AMD said:
Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm, I do this one line at a
time. The problem I have is that opening a file using append, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all writes and then close them
all at the end.
The problem I have under windows is that as soon as I get to 500 files I
get the Too many open files message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes

Not quite sure what you mean by "a hash algorithm", but if you sort the file
(with an external sort program) on what you want to split on, then you only
have to have one file open at a time.
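
Roughly like this, assuming the input has already been run through an
external sort on the split key (split_key() is a stand-in for whatever
the hash step is, and the file names are made up):

def split_sorted(path):
    # Because the input is sorted, each bucket's lines arrive together,
    # so only one output file ever needs to be open.
    current_key = None
    out = None
    with open(path) as src:
        for line in src:
            key = split_key(line)
            if key != current_key:
                if out is not None:
                    out.close()
                out = open("bucket_%s.txt" % key, "w")
                current_key = key
            out.write(line)
    if out is not None:
        out.close()

split_sorted("bigfile.sorted")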

-Larry
 
Duncan Booth

Steven D'Aprano said:
Windows XP has a limit of 512 files opened by any process, including
stdin, stdout and stderr, so your code is probably failing after file
number 509.

No, the C runtime has a limit of 512 files; the OS limit is actually 2048.
See http://msdn2.microsoft.com/en-us/library/6e3b887c(VS.71).aspx
I don't know how Delphi works around that issue. Perhaps one of the
Windows gurus can advise if there's a way to increase that limit from
512?

Call the C runtime function _setmaxstdio(n) to set the maximum number of
open files to n, up to 2048. Alternatively, os.open() and os.write()
should bypass the C runtime limit.
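
Something like this untested sketch (the ctypes call assumes your Python
is linked against an MS C runtime that exposes _setmaxstdio, and the file
name is arbitrary):

import os
import ctypes

# Raise the stdio limit; 2048 is the documented ceiling for _setmaxstdio.
ctypes.cdll.msvcrt._setmaxstdio(2048)

# Or skip the stdio FILE* layer entirely with low-level descriptors:
fd = os.open("bucket_0001.txt", os.O_WRONLY | os.O_APPEND | os.O_CREAT)
os.write(fd, b"one line of data\n")
os.close(fd)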

It would probably be better though to implement some sort of caching scheme
in memory and avoid having to mess with the limits at all. Or do it in two
passes: creating 100 files on the first pass and splitting each of those in
a second pass.
 
Gary Herron

AMD said:
Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm, I do this one line at a
time. The problem I have is that opening a file using append, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all writes and then close them
all at the end.
The problem I have under windows is that as soon as I get to 500 files I
get the Too many open files message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes

Try something like this:

Instead of opening several thousand files:

* Create several thousand lists.

* Open the input file and process each line, dropping it into the
correct list.

* Whenever a single list passes some size threshold, open its file,
write the batch, and immediately close the file.

* Similarly at the end (or when the total of all lists passes some size
threshold), loop through the several thousand lists, opening, writing,
and closing.

This will keep the open/write/close operations to a minimum, and you'll
never have more than two files open at a time. Both of those are wins for
you.
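
A minimal sketch of that idea (bucket_for(), the file names and the
threshold are all placeholders):

from collections import defaultdict

FLUSH_AT = 1000                   # lines to buffer per file; pick what fits in RAM
buffers = defaultdict(list)

def flush(bucket):
    # Append a bucket's buffered lines to its file, then drop the buffer.
    with open("bucket_%s.txt" % bucket, "a") as f:
        f.writelines(buffers[bucket])
    del buffers[bucket]

with open("bigfile.txt") as src:  # hypothetical input name
    for line in src:
        bucket = bucket_for(line)
        buffers[bucket].append(line)
        if len(buffers[bucket]) >= FLUSH_AT:
            flush(bucket)

for bucket in list(buffers):      # final pass: write out whatever is left
    flush(bucket)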

Gary Herron
 
Gabriel Genellina

I'm sorry, but you are totally wrong. Threads are a very bad idea for IO
bound operation. Asynchronous event IO is the best answer for any IO
bound problem. That is select, poll, epoll, kqueue or IOCP.

The OP said that he has this problem on Windows. The available methods
that I am aware of are:
- using synchronous (blocking) I/O with multiple threads
- asynchronous I/O using OVERLAPPED and wait functions
- asynchronous I/O using IO completion ports

Python does not (natively) support any of the latter ones, only the first.
I don't have any evidence proving that it's a very bad idea as you claim,
although I wouldn't use 50 threads as suggested above, just a few more than
the number of CPU cores.
 
AMD

Thank you every one,

I ended up using a solution similar to what Gary Herron suggested:
caching the output in a list of lists, one per file, and only doing the
IO when a list reaches a certain threshold.
After playing around with the list threshold I ended up with faster
execution times than the original approach, while never having more than
two files open at a time! It's only a matter of trading memory for open files.
It could be that using this strategy with asynchronous IO or threads
could yield even faster times, but I haven't tested it.
Again, much appreciated thanks for all your suggestions.

Andre M. Descombes
 
