Python 3 read() function

Cro

Good day.
I have installed Python 3 and I have a problem with the built-in read()
function.

Code:
huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
import io
vContent = io.StringIO()
vContent = huge.read()  # This line takes hours to process !!!
vSplitContent = vContent.split('BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')  # This one I have never tried...

The same thing in Python 2.5:

Code:
huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
import StringIO
vContent = StringIO.StringIO()
vContent = huge.read()  # This line takes 2 seconds !!!
vSplitContent = vContent.split('BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')  # This takes a few seconds...

My "HUGE_FILE" has about 900 MB ...
I know this is not the best method to open the file and split the
content by that code...
Can anyone please suggest a good method to split the file with that
code very fast, in Python 3 ?
The memory is not important for me, i have 4GB of RAM and i rarely use
more than 300 MB of it.

Thank you very very much.
 
Cro

Do you really mean io.StringIO? I guess you want io.BytesIO()...
Christian

Mmm... I don't know.
I also tried:

Code:
IDLE 3.0

It still waits a lot... I don't have the patience to wait for the file
to load completely... it takes too long!

Thank you for your reply.
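
For reference, a minimal sketch of the distinction Christian is drawing, not code from the thread: io.StringIO holds text, io.BytesIO holds bytes, and a file opened in 'rb' mode yields bytes.

Code:
import io

text_buf = io.StringIO()
text_buf.write('hello')       # accepts str only; writing bytes raises TypeError

byte_buf = io.BytesIO()
byte_buf.write(b'hello')      # accepts bytes only; writing str raises TypeError
print(byte_buf.getvalue())    # b'hello'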
 
skip

Why do you want to disable buffering? From the io.open help:

open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)
Open file and return a stream. Raise IOError upon failure.
...
buffering is an optional integer used to set the buffering policy. By
default full buffering is on. Pass 0 to switch buffering off (only
allowed in binary mode), 1 to set line buffering, and an integer > 1
for full buffering.

I think you will get better performance if you open the file without the
third arg:

huge = io.open("C:\HUGE_FILE.pcl",'r+b')
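
A minimal sketch of that approach, using the delimiter from the original post; this is an illustration, not code from the thread. Note the b'...' literal, since a file opened in binary mode yields bytes:

Code:
import io

# Default (full) buffering instead of passing 0.
with io.open('C:/HUGE_FILE.pcl', 'rb') as huge:
    vContent = huge.read()

# The data is bytes, so the delimiter must be a bytes literal.
vSplitContent = vContent.split(b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')
print(len(vSplitContent))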
 
MRAB

Cro said:
vContent = io.StringIO()
vContent = huge.read() # This line takes hours to process !!!
[snip]
Can anyone please suggest a good method to split the file with that
code very fast, in Python 3?
Can't you read it without StringIO?

huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
vContent = huge.read()
vSplitContent = vContent.split(b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')

vContent will contain a bytestring (bytes), so I think you need to split
on a bytestring b'...' (in Python 3 unmarked string literals are Unicode).
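
An illustrative aside, not from the thread: splitting bytes with a str delimiter fails outright, which is why the b'...' prefix matters.

Code:
data = b'abcBIN;SP1;def'
print(data.split(b'BIN;SP1;'))   # [b'abc', b'def']

try:
    data.split('BIN;SP1;')       # str delimiter on bytes data
except TypeError as err:
    print('TypeError:', err)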
 
Istvan Albert

I can confirm this,

I am getting very slow read performance when reading a smaller 20 MB
file.

- Python 2.5 takes 0.4 seconds
- Python 3.0 takes 62 seconds

fname = "dmel-2R-chromosome-r5.1.fasta"
data = open(fname, 'rt').read()
print(len(data))
 
Istvan Albert

Jerry Hill wrote:

Timing of OS interaction may depend on the OS. I verified the above on
WinXP with a 4 MB Pythonxy.chm file: an eye blink versus 3 seconds,
duplicated. I think something is wrong that needs fixing in 3.0.1.

http://bugs.python.org/issue4533

I believe that the slowdowns are even more substantial when opening
the file in text mode.
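
A quick sketch for checking that claim; the path is a placeholder, not a file from the thread:

Code:
import time

def timed_read(path, mode):
    start = time.time()
    with open(path, mode) as f:
        f.read()
    return time.time() - start

path = 'HUGE_FILE.pcl'  # placeholder: any large local file
print('binary:', timed_read(path, 'rb'))
print('text:  ', timed_read(path, 'rt'))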
 
Дамјан Георгиевски

I don't think it matters. Here's a quick comparison between 2.5 and
3.0 on a relatively small 17 meg file:

C:\>c:\Python30\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 36.8 sec per loop

C:\>c:\Python25\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 33 msec per loop

That's 3 orders of magnitude slower on Python 3.0!

Isn't this because you have the file cached in memory on the second run?

--
дамјан ( http://softver.org.mk/damjan/ )

"The moment you commit and quit holding back, all sorts of unforseen
incidents, meetings and material assistance will rise up to help you.
The simple act of commitment is a powerful magnet for help." -- Napoleon
 
George Sakkis

Isn't this because you have the file cached in memory on the second run?

That's probably it; I see a much more modest slowdown (2-3x) if I
repeat each run many times.

George
 
Terry Reedy

Дамјан Георгиевски said:
Isn't this because you have the file cached in memory on the second run?

In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.

Rereading Python30.chm without closing *is* much faster.
Closing, reopening, and rereading is slower.
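
A sketch of that kind of reread comparison, not Terry's actual test; the path is a placeholder. Note that a second read() without rewinding returns immediately, because the file position is already at EOF:

Code:
import time

def timed_read(f):
    start = time.time()
    data = f.read()
    return time.time() - start, len(data)

f = open('Python30.chm', 'rb')  # placeholder: any large file
print('first read:', timed_read(f))
f.seek(0)  # rewind; without this, read() returns b'' instantly
print('reread:     ', timed_read(f))
f.close()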
 
MRAB

Terry said:
In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.

Rereading Python30.chm without closing *is* much faster.
Closing, reopening, and rereading is slower.
It certainly is faster if you're already at the end of the file. :)
 
Albert Hopkins

Isn't this because you have the file cached in memory on the second run?

Even on different files of identical size it's ~3x slower:

$ dd if=/dev/urandom of=file1 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 14.8693 s, 4.9 MB/s
$ dd if=/dev/urandom of=file2 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 16.1581 s, 4.5 MB/s
$ python2.5 -m timeit -n 1 'open("file1", "rb").read()'
1 loops, best of 3: 5.26 sec per loop
$ python3.0 -m timeit -n 1 'open("file2", "rb").read()'
1 loops, best of 3: 14.8 sec per loop
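
One generic workaround, an assumption on my part rather than something benchmarked in this thread: if read()'s buffer-growth strategy is the bottleneck, reading fixed-size chunks into a bytearray sidesteps it.

Code:
def read_all_chunked(path, chunk_size=2**20):
    """Read a whole file as bytes, one chunk at a time."""
    buf = bytearray()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf.extend(chunk)
    return bytes(buf)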
 
Istvan Albert

Turns out write performance is also slow!

The program below takes

3 seconds on Python 2.5
17 seconds on Python 3.0

Yes, 17 seconds! Tested many times, in various orders. I believe the
slowdowns are not constant (3x) but some sort of nonlinear function
(quadratic?); play with N to see it.

===================================

import time

start = time.time()

N = 10**6
fp = open('testfile.txt', 'wt')
for n in range(N):
    fp.write('%s\n' % n)
fp.close()

end = time.time()

print(end - start)
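
If per-call write() overhead is the culprit (an assumption, not something measured in this thread), batching the output into one string and issuing a single write() avoids most of it:

Code:
import time

start = time.time()

N = 10**6
# Build the whole payload in memory, then write once.
payload = '\n'.join(str(n) for n in range(N)) + '\n'
with open('testfile.txt', 'wt') as fp:
    fp.write(payload)

print(time.time() - start)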
 
