Python 3 read() function

Cro

Good day.
I have installed Python 3 and I have a problem with the built-in read()
function.

Code:
huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
import io
vContent = io.StringIO()
vContent = huge.read()  # This line takes hours to process !!!
vSplitContent = vContent.split('BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')  # This one I have never tried...

The same thing in Python 2.5:

Code:
huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
import StringIO
vContent = StringIO.StringIO()
vContent = huge.read()  # This line takes 2 seconds !!!
vSplitContent = vContent.split('BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')  # This takes a few seconds...

My "HUGE_FILE" has about 900 MB ...
I know this is not the best method to open the file and split the
content by that code...
Can anyone please suggest a good method to split the file with that
code very fast, in Python 3 ?
The memory is not important for me, i have 4GB of RAM and i rarely use
more than 300 MB of it.

Thank you very very much.
 
Cro

Do you really mean io.StringIO? I guess you want io.BytesIO()...
Christian

Mmm... I don't know.
I also tried:

Code:
IDLE 3.0

It still waits a lot... I don't have the patience to wait for the file
to load completely... it takes too long!

Thank you for your reply.
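
For reference, a minimal sketch of the distinction Christian is drawing, not code from the thread: io.StringIO holds text, io.BytesIO holds bytes, and a file opened in 'rb' mode yields bytes.

Code:
import io

text_buf = io.StringIO()
text_buf.write('hello')       # accepts str only; writing bytes raises TypeError

byte_buf = io.BytesIO()
byte_buf.write(b'hello')      # accepts bytes only; writing str raises TypeError
print(byte_buf.getvalue())    # b'hello'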
 
skip

Why do you want to disable buffering? From the io.open help:

open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)
Open file and return a stream. Raise IOError upon failure.
...
buffering is an optional integer used to set the buffering policy. By
default full buffering is on. Pass 0 to switch buffering off (only
allowed in binary mode), 1 to set line buffering, and an integer > 1
for full buffering.

I think you will get better performance if you open the file without the
third arg:

huge = io.open("C:\HUGE_FILE.pcl",'r+b')
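
A minimal sketch of that approach, using the delimiter from the original post; this is an illustration, not code from the thread. Note the b'...' literal, since a file opened in binary mode yields bytes:

Code:
import io

# Default (full) buffering instead of passing 0.
with io.open('C:/HUGE_FILE.pcl', 'rb') as huge:
    vContent = huge.read()

# The data is bytes, so the delimiter must be a bytes literal.
vSplitContent = vContent.split(b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')
print(len(vSplitContent))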
 
MRAB

Cro said:
vContent = io.StringIO()
vContent = huge.read() # This line takes hours to process !!!
[snip]
Can anyone please suggest a good method to split the file with that
code very fast, in Python 3?
Can't you read it without StringIO?

huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
vContent = huge.read()
vSplitContent = vContent.split(b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')

vContent will contain a bytestring (bytes), so I think you need to split
on a bytestring b'...' (in Python 3 unmarked string literals are Unicode).
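
An illustrative aside, not from the thread: splitting bytes with a str delimiter fails outright, which is why the b'...' prefix matters.

Code:
data = b'abcBIN;SP1;def'
print(data.split(b'BIN;SP1;'))   # [b'abc', b'def']

try:
    data.split('BIN;SP1;')       # str delimiter on bytes data
except TypeError as err:
    print('TypeError:', err)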
 
Istvan Albert

I can confirm this,

I am getting very slow read performance when reading a smaller 20 MB
file.

- Python 2.5 takes 0.4 seconds
- Python 3.0 takes 62 seconds

fname = "dmel-2R-chromosome-r5.1.fasta"
data = open(fname, 'rt').read()
print(len(data))
 
Istvan Albert

Jerry Hill wrote:

Timing of OS interaction may depend on the OS. I verified the above on
WinXP with a 4 MB Pythonxy.chm file: an eye blink versus 3 seconds,
duplicated. I think something is wrong that needs fixing in 3.0.1.

http://bugs.python.org/issue4533

I believe that the slowdowns are even more substantial when opening
the file in text mode.
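
A quick sketch for checking that claim; the path is a placeholder, not a file from the thread:

Code:
import time

def timed_read(path, mode):
    start = time.time()
    with open(path, mode) as f:
        f.read()
    return time.time() - start

path = 'HUGE_FILE.pcl'  # placeholder: any large local file
print('binary:', timed_read(path, 'rb'))
print('text:  ', timed_read(path, 'rt'))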
 
Дамјан Георгиевски

I don't think it matters. Here's a quick comparison between 2.5 and
3.0 on a relatively small 17 meg file:

C:\>c:\Python30\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 36.8 sec per loop

C:\>c:\Python25\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 33 msec per loop

That's 3 orders of magnitude slower on Python 3.0!

Isn't this because you have the file cached in memory on the second run?

--
дамјан ( http://softver.org.mk/damjan/ )

"The moment you commit and quit holding back, all sorts of unforseen
incidents, meetings and material assistance will rise up to help you.
The simple act of commitment is a powerful magnet for help." -- Napoleon
 
George Sakkis

Isn't this because you have the file cached in memory on the second run?

That's probably it; I see a much more modest slowdown (2-3x) if I
repeat each run many times.

George
 
Terry Reedy

Дамјан Георгиевски said:
Isn't this because you have the file cached in memory on the second run?

In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.

Rereading Python30.chm without closing *is* much faster.
Closing, reopening, and rereading is slower.
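
A sketch of that kind of reread comparison, not Terry's actual test; the path is a placeholder. Note that a second read() without rewinding returns immediately, because the file position is already at EOF:

Code:
import time

def timed_read(f):
    start = time.time()
    data = f.read()
    return time.time() - start, len(data)

f = open('Python30.chm', 'rb')  # placeholder: any large file
print('first read:', timed_read(f))
f.seek(0)  # rewind; without this, read() returns b'' instantly
print('reread:     ', timed_read(f))
f.close()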
 
MRAB

Terry said:
In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.

Rereading Python30.chm without closing *is* much faster.
Closing, reopening, and rereading is slower.
It certainly is faster if you're already at the end of the file. :)
 
Albert Hopkins

Isn't this because you have the file cached in memory on the second run?

Even on different files of identical size it's ~3x slower:

$ dd if=/dev/urandom of=file1 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 14.8693 s, 4.9 MB/s
$ dd if=/dev/urandom of=file2 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 16.1581 s, 4.5 MB/s
$ python2.5 -m timeit -n 1 'open("file1", "rb").read()'
1 loops, best of 3: 5.26 sec per loop
$ python3.0 -m timeit -n 1 'open("file2", "rb").read()'
1 loops, best of 3: 14.8 sec per loop
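
One generic workaround, an assumption on my part rather than something benchmarked in this thread: if read()'s buffer-growth strategy is the bottleneck, reading fixed-size chunks into a bytearray sidesteps it.

Code:
def read_all_chunked(path, chunk_size=2**20):
    """Read a whole file as bytes, one chunk at a time."""
    buf = bytearray()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf.extend(chunk)
    return bytes(buf)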
 
Istvan Albert

Turns out write performance is also slow!

The program below takes

3 seconds on Python 2.5
17 seconds on Python 3.0

Yes, 17 seconds! Tested many times, in various orders. I believe the
slowdowns are not constant (3x) but some sort of nonlinear function
(quadratic?); play with N to see it.

===================================

import time

start = time.time()

N = 10**6
fp = open('testfile.txt', 'wt')
for n in range(N):
    fp.write('%s\n' % n)
fp.close()

end = time.time()

print(end - start)
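
If per-call write() overhead is the culprit (an assumption, not something measured in this thread), batching the output into one string and issuing a single write() avoids most of it:

Code:
import time

start = time.time()

N = 10**6
# Build the whole payload in memory, then write once.
payload = '\n'.join(str(n) for n in range(N)) + '\n'
with open('testfile.txt', 'wt') as fp:
    fp.write(payload)

print(time.time() - start)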
 
