Too many open files

AMD

Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm; I do this one line at a
time. The problem I have is that opening a file in append mode, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all the writes and then close them
all at the end.
The problem I have under Windows is that as soon as I get to 500 files I
get the "Too many open files" message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes
 
Jeff

Why don't you start around 50 threads at a time to do the file
writes? Threads are effective for IO. You open the source file,
start a queue, and start sending data sets to be written to the
queue. Your source file processing can go on while the writes are
done in other threads.
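
Untested, but something along these lines (the 50-worker figure is just
the one above, and bucket_name() is a stand-in for whatever hash step
picks the target file):

import threading
import queue

NUM_WORKERS = 50                  # tune for your machine
work = queue.Queue(maxsize=1000)

def writer():
    # Pull (filename, line) pairs off the queue and append them.
    while True:
        item = work.get()
        if item is None:          # sentinel: shut this worker down
            break
        filename, line = item
        with open(filename, "a") as f:
            f.write(line)

threads = [threading.Thread(target=writer) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

with open("bigfile.txt") as src:  # hypothetical input name
    for line in src:
        work.put((bucket_name(line), line))

for _ in threads:
    work.put(None)                # one sentinel per worker
for t in threads:
    t.join()

Note that two workers can end up appending to the same file at once, so
lines for a given bucket may come out in a different order than they
appear in the source file.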
 
Christian Heimes

Jeff said:
Why don't you start around 50 threads at a time to do the file
writes? Threads are effective for IO. You open the source file,
start a queue, and start sending data sets to be written to the
queue. Your source file processing can go on while the writes are
done in other threads.

I'm sorry, but you are totally wrong. Threads are a very bad idea for
IO-bound operations. Asynchronous event IO is the best answer for any
IO-bound problem. That is, select, poll, epoll, kqueue or IOCP.

Christian
 
Steven D'Aprano

The problem I have under windows is that as soon as I get to 500 files I
get the Too many open files message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Windows XP has a limit of 512 files opened by any process, including
stdin, stdout and stderr, so your code is probably failing after file
number 509.

http://forums.devx.com/archive/index.php/t-136946.html

It's almost certainly not a Python problem, because under Linux I can
open 1000+ files without blinking.
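
A throwaway way to check the effective limit on whatever platform you're
on is simply to keep opening files until it fails, e.g.:

import tempfile

handles = []
try:
    while True:
        handles.append(tempfile.TemporaryFile())
except OSError as err:
    print("hit the limit after", len(handles), "open files:", err)
finally:
    for h in handles:
        h.close()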

I don't know how Delphi works around that issue. Perhaps one of the
Windows gurus can advise if there's a way to increase that limit from 512?
 
Larry Bates

AMD said:
Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm, I do this one line at a
time. The problem I have is that opening a file using append, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all writes and then close them
all at the end.
The problem I have under windows is that as soon as I get to 500 files I
get the Too many open files message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes

Not quite sure what you mean by "a hash algorithm", but if you sort the file
(with an external sort program) on what you want to split on, then you only
have to have one file open at a time.
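
Roughly like this, assuming the input has already been run through an
external sort on the split key (split_key() is a stand-in for whatever
the hash step is, and the file names are made up):

def split_sorted(path):
    # Because the input is sorted, each bucket's lines arrive together,
    # so only one output file ever needs to be open.
    current_key = None
    out = None
    with open(path) as src:
        for line in src:
            key = split_key(line)
            if key != current_key:
                if out is not None:
                    out.close()
                out = open("bucket_%s.txt" % key, "w")
                current_key = key
            out.write(line)
    if out is not None:
        out.close()

split_sorted("bigfile.sorted")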

-Larry
 
Duncan Booth

Steven D'Aprano said:
Windows XP has a limit of 512 files opened by any process, including
stdin, stdout and stderr, so your code is probably failing after file
number 509.

No, the C runtime has a limit of 512 files; the OS limit is actually 2048.
See http://msdn2.microsoft.com/en-us/library/6e3b887c(VS.71).aspx
I don't know how Delphi works around that issue. Perhaps one of the
Windows gurus can advise if there's a way to increase that limit from
512?

Call the C runtime function _setmaxstdio(n) to set the maximum number of
open files to n, up to 2048. Alternatively, os.open() and os.write()
should bypass the C runtime limit.
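
Something like this untested sketch (the ctypes call assumes your Python
is linked against an MS C runtime that exposes _setmaxstdio, and the file
name is arbitrary):

import os
import ctypes

# Raise the stdio limit; 2048 is the documented ceiling for _setmaxstdio.
ctypes.cdll.msvcrt._setmaxstdio(2048)

# Or skip the stdio FILE* layer entirely with low-level descriptors:
fd = os.open("bucket_0001.txt", os.O_WRONLY | os.O_APPEND | os.O_CREAT)
os.write(fd, b"one line of data\n")
os.close(fd)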

It would probably be better though to implement some sort of caching scheme
in memory and avoid having to mess with the limits at all. Or do it in two
passes: creating 100 files on the first pass and splitting each of those in
a second pass.
 
Gary Herron

AMD said:
Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm, I do this one line at a
time. The problem I have is that opening a file using append, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all writes and then close them
all at the end.
The problem I have under windows is that as soon as I get to 500 files I
get the Too many open files message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes

Try something like this:

Instead of opening several thousand files:

* Create several thousand lists.

* Open the input file and process each line, dropping it into the
correct list.

* Whenever a single list passes some size threshold, open its file,
write the batch, and immediately close the file.

* Similarly at the end (or when the total of all lists passes some size
threshold), loop through the several thousand lists, opening, writing,
and closing.

This will keep the open/write/close operations to a minimum, and you'll
never have more than two files open at a time. Both of those are wins for
you.
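
A minimal sketch of that idea (bucket_for(), the file names and the
threshold are all placeholders):

from collections import defaultdict

FLUSH_AT = 1000                   # lines to buffer per file; pick what fits in RAM
buffers = defaultdict(list)

def flush(bucket):
    # Append a bucket's buffered lines to its file, then drop the buffer.
    with open("bucket_%s.txt" % bucket, "a") as f:
        f.writelines(buffers[bucket])
    del buffers[bucket]

with open("bigfile.txt") as src:  # hypothetical input name
    for line in src:
        bucket = bucket_for(line)
        buffers[bucket].append(line)
        if len(buffers[bucket]) >= FLUSH_AT:
            flush(bucket)

for bucket in list(buffers):      # final pass: write out whatever is left
    flush(bucket)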

Gary Herron
 
Gabriel Genellina

I'm sorry, but you are totally wrong. Threads are a very bad idea for IO
bound operation. Asynchronous event IO is the best answer for any IO
bound problem. That is select, poll, epoll, kqueue or IOCP.

The OP said that he has this problem on Windows. The available methods
that I am aware of are:
- using synchronous (blocking) I/O with multiple threads
- asynchronous I/O using OVERLAPPED and wait functions
- asynchronous I/O using IO completion ports

Python does not (natively) support any of the latter ones, only the first.
I don't have any evidence proving that it's a very bad idea as you claim,
although I wouldn't use 50 threads as suggested above, just a few more than
the number of CPU cores.
 
AMD

Thank you every one,

I ended up using a solution similar to what Gary Herron suggested:
caching the output in a list of lists, one per file, and only doing the
IO when a list reaches a certain threshold.
After playing around with the list threshold I ended up with faster
execution times than the original approach, while never having more than
two files open at a time! It's only a matter of trading memory for open files.
It could be that using this strategy with asynchronous IO or threads
could yield even faster times, but I haven't tested it.
Again, much appreciated thanks for all your suggestions.

Andre M. Descombes
 
