Python object overhead?

Matt Garman

I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt


Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readlines.py


Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py
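
The comparison can also be made from inside the interpreter instead of watching top. The following is only a minimal sketch, assuming a modern Python 3 (the thread itself uses Python 2) and the standard tracemalloc module. The synthetic record text, the record count N, and the helper names make_lines and measure are made up for illustration.

```python
import tracemalloc

N = 20000  # number of synthetic records (arbitrary)

class FileRecord:
    def __init__(self, line):
        self.line = line

def make_lines(n):
    # Synthetic stand-in for lines read from a pipe-delimited file.
    return ["%d|some|pipe|delimited|record\n" % i for i in range(n)]

def measure(build):
    """Run build() and return (result, bytes still allocated afterwards)."""
    tracemalloc.start()
    result = build()
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, current

# Example 1 equivalent: a plain list of strings.
strings, string_bytes = measure(lambda: make_lines(N))
# Example 2 equivalent: one wrapper object per string.
objects, object_bytes = measure(lambda: [FileRecord(s) for s in make_lines(N)])

print("list of strings: %d bytes" % string_bytes)
print("list of objects: %d bytes" % object_bytes)
print("ratio: %.1fx" % (object_bytes / string_bytes))
```

The exact ratio depends on the interpreter version and on how long the lines are; short lines make the per-object overhead look proportionally larger.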
 
Mark Nenadov

Is this "just the way it is" or am I overlooking something obvious?


Matt,

If you iterate over even the smallest object instantiation a large number
of times, it will be costly compared to a simple list append.

I don't think you can get around some overhead with the objects.

However, in terms of general efficiency not specifically related to
object instantiation, you should look into xreadlines().

I'd suggest doing the following instead of that while loop:

for line in open(sys.argv[1]).xreadlines():
    ..
 
Gabriel Genellina

On Fri, 23 Mar 2007 18:27:25 -0300, Mark Nenadov wrote:
I'd suggest doing the following instead of that while loop:

for line in open(sys.argv[1]).xreadlines():

Poor xreadlines method had a short life: it was born in Python 2.1 and was
deprecated in 2.3 :(
A file is now its own line iterator:

f = open(...)
for line in f:
    ...
 
Mark Nenadov

Poor xreadlines method had a short life: it was born on Python 2.1 and got
deprecated on 2.3 :(
A file is now its own line iterator:

f = open(...)
for line in f:
    ...

Gabriel,

Thanks for pointing that out! I had completely forgotten about
that!

I've tested them before. readlines() is very slow. The deprecated
xreadlines() is close in speed to open() as an iterator. In my particular
test, I found the following:

readlines() -> 32 "time units"
xreadlines() -> 0.7 "time units"
open() iterator -> 0.41 "time units"
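
A comparison along these lines can be re-run with timeit. This is a rough sketch only, assuming Python 3, where xreadlines() no longer exists, so just readlines() and plain file iteration are timed. The temporary file, its contents, and the function names are hypothetical, and the numbers will not match the "time units" above.

```python
import os
import tempfile
import timeit

def count_with_readlines(path):
    # Builds the full list of lines in memory first.
    with open(path) as f:
        return len(f.readlines())

def count_with_iteration(path):
    # Uses the file object as its own line iterator.
    count = 0
    with open(path) as f:
        for _line in f:
            count += 1
    return count

# Write a small synthetic pipe-delimited file to time against.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(50000):
        tmp.write("%d|a|b|c\n" % i)
    path = tmp.name

try:
    n_readlines = count_with_readlines(path)
    n_iteration = count_with_iteration(path)
    t_readlines = timeit.timeit(lambda: count_with_readlines(path), number=5)
    t_iteration = timeit.timeit(lambda: count_with_iteration(path), number=5)
    print("readlines(): %.4f s" % t_readlines)
    print("iteration:   %.4f s" % t_iteration)
finally:
    os.remove(path)
```

Timings vary with the machine, the file size, and the operating system's file cache, so treat any single run as indicative at best.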

--
Mark Nenadov -> skype: marknenadov, web: http://www.marknenadov.com
-> "They need not trust me right away simply because the British say
that I am O.K.; but they are so ridiculous. Microphones everywhere
and planted so obviously. Why, if I bend over to smell a bowl of
flowers, I scratch my nose on a microphone."
-- Tricycle (Dushko Popov) on American Intelligence
 
Bjoern Schliessmann

Matt said:
Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing
this causes the memory overhead to go up two or three times.

(Note that almost everything in Python is an object!)

Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF

"one blank line" == "EOF"? That's strange. Intended?

The most common form for this would be "if not line: (do
something)".

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line

What's this class intended to do?

Regards,


Björn
 
Facundo Batista

Bjoern Schliessmann wrote:

"one blank line" == "EOF"? That's strange. Intended?

The most common form for this would be "if not line: (do
something)".

"not line" and "len(line) == 0" is the same as long as "line" is a
string.

He's checking ok, 'cause a "blank line" has a length > 0 (because of
the newline).
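
A small sketch of that point, assuming Python 3 and using io.StringIO as a stand-in for a real file: readline() returns "\n" for a blank line and an empty string only at end of file, so both tests stop in the same place.

```python
import io

# In-memory stand-in for a file containing one blank line in the middle.
f = io.StringIO("first\n\nlast\n")

lines = []
while True:
    line = f.readline()
    if not line:  # equivalent to len(line) == 0 when line is a string
        break
    lines.append(line)

print(lines)          # the blank line is kept as "\n"
at_eof = f.readline() # further reads at EOF keep returning ""
print(repr(at_eof))
```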

What's this class intended to do?

Unless I understood it wrong, it's just an object that holds the line
inside.

Just OO purity, not practicality...
 
John Nagle

Matt said:
I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

Why do you want all the records in memory at once? Are you
doing some lookup on them, or what? If you're processing files
sequentially, don't keep them all in memory.

You're getting into the size range where it may be time to
use a database.

John Nagle
 
John Machin

I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record.

An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.

However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt


Example 1: read lines into list:
# begin readlines.py

Interesting name for the file :)
How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?

import sys, time
filedata = list()
file = open(sys.argv[1])

You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.

while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top

How about using raw_input('Hit the Any key...') ?
# end readlines.py


Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py

After all that, you still need to split the lines into the more-than-one
fieldS (plural) that one would expect in a record.

A possibly faster alternative to (fastest_line_reader_so_far,
line.split('|')) is to use the csv module, as in the following example,
which also shows one way of making an object out of a row of data.

C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only when
        # writing, otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
            if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Biff|Bloggs|[email protected]
Joseph ("Joe")|Blow|[email protected]
"Joe"|Blow|[email protected]

Santa|Claus|[email protected]
'''))

C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', '(e-mail address removed)'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', '(e-mail address removed)'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', '(e-mail address removed)'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', '(e-mail address removed)'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>
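
The __slots__ declaration in the Contacts class above also bears on the original memory question, since a slotted instance carries no per-instance __dict__. The following is a minimal sketch assuming Python 3; the class names, the shallow_size helper, and the example.com addresses are placeholders invented for illustration, and sys.getsizeof only reports shallow sizes.

```python
import sys

class PlainContact:
    def __init__(self, first, family, email):
        self.first = first
        self.family = family
        self.email = email

class SlottedContact:
    __slots__ = ("first", "family", "email")

    def __init__(self, first, family, email):
        self.first = first
        self.family = family
        self.email = email

def shallow_size(obj):
    """Size of the instance plus its attribute dict, if it has one."""
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size

plain = PlainContact("Biff", "Bloggs", "biff@example.com")
slotted = SlottedContact("Biff", "Bloggs", "biff@example.com")

print("plain instance + dict: %d bytes" % shallow_size(plain))
print("slotted instance:      %d bytes" % shallow_size(slotted))
```

The attribute strings themselves are not counted here; they cost the same in both layouts.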

HTH,
John
 
Felipe Almeida Lessa

(Note that almost everything in Python is an object!)

Could you tell me what in Python isn't an object? Are you counting
old-style classes and instances as "not object"s?
 
Gabriel Genellina

On Sat, 24 Mar 2007 18:07:57 -0300, Felipe Almeida Lessa wrote:
Could you tell me what in Python isn't an object? Are you counting
old-style classes and instances as "not object"s?

The syntax, for example; an "if" statement is not an object.
 
Bjoern Schliessmann

Facundo said:
"not line" and "len(line) == 0" is the same as long as "line" is a
string.

He's checking ok, 'cause a "blank line" has a length > 0 (because
of newline).

Ah, K. Normally, I strip the read line and then test "if not line".
His check /is/ okay, but IMHO it's a little bit weird.

Unless I understood it wrong, it's just an object that holds the
line inside.

A Python string would technically be the same ;)

Just OO purity, not practicality...

:)

Regards,


Björn
 
Bjoern Schliessmann

Felipe said:
Could you tell me what in Python isn't an object?

Difficult ;) All data structures are (CMIIW). Functions and Types
are objects, too.

Are you counting old-style classes and instances as "not object"s?

No, both are.

Regards,


Björn
 
Bruno Desthuilliers

Matt Garman wrote:
I'm trying to use Python to work with large pipe ('|') delimited data
files.

Looks like a job for the csv module (in the standard lib).

The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

Just for the record, *everything* in Python is an object - so the
problem is not about 'using objects'. Now of course, a complex object
might eat up more space than a simple one...

Python has 2 simple types for structured data: tuples (like database
rows), and dicts (associative arrays). You can use the csv module to
parse a csv-like format into either tuples or dicts. If you want to save
memory, tuples may be the best choice.
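
A minimal sketch of both options, assuming Python 3 and the standard csv module; the FIELDS column names and the sample rows are hypothetical. csv.reader yields lists, which are converted to tuples here, and csv.DictReader yields one dict per row.

```python
import csv
import io
import sys

FIELDS = ("first", "family", "email")  # hypothetical column names

data = "Biff|Bloggs|biff@example.com\nSanta|Claus|santa@example.com\n"

# Rows as tuples: compact, fields addressed by position.
reader = csv.reader(io.StringIO(data), delimiter="|", quoting=csv.QUOTE_NONE)
rows_as_tuples = [tuple(row) for row in reader]

# Rows as dicts: fields addressed by name, at a higher per-row cost.
dict_reader = csv.DictReader(
    io.StringIO(data), fieldnames=FIELDS, delimiter="|", quoting=csv.QUOTE_NONE
)
rows_as_dicts = [dict(row) for row in dict_reader]

tuple_size = sys.getsizeof(rows_as_tuples[0])
dict_size = sys.getsizeof(rows_as_dicts[0])
print("tuple row: %d bytes (container only)" % tuple_size)
print("dict row:  %d bytes (container only)" % dict_size)
```

The sizes printed cover only the container, not the field strings, which are the same in both cases.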
This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

What are you doing with your records? Do you *really* need to keep the
whole list in memory? Else you can just work line by line:

source = open(sys.argv[1])
for line in source:
    do_something_with(line)
source.close()

This will avoid building a huge in-memory list.

While we're at it, your snippets are definitely unpythonic and
overcomplicated:


(snip)
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
(snip)

filedata = open(sys.argv[1]).readlines()

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:

class FileRecord(object):
    def __init__(self, line):
        self.line = line

If this is your real code, I don't see any reason why this should eat up
3 times more space than the original version.

records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()

records = map(FileRecord, open(sys.argv[1]).readlines())
 
Bruno Desthuilliers

Bruno Desthuilliers wrote:
Matt Garman wrote: (snip)
class FileRecord(object):


If this is your real code, I don't see any reason why this should eat up
3 times more space than the original version.

Hem... Forget about this comment - not enough caffeine yet I'm afraid.
 
Matt Garman

"one blank line" == "EOF"? That's strange. Intended?

In my case, I know my input data doesn't have any blank lines.
However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.

What's this class intended to do?

Store a line :) I just wanted to post two runnable examples. So the
above class's real intention is just to be a (contrived) example.

In the program I actually wrote, my class structure was a bit more
interesting. After storing the input line, I'd then call split("|")
(to tokenize the line). Each token would then be assigned to a
member variable. Some of the member variables turned into ints or
floats as well.

My input data had three record types; all had a few common attributes.
So I created a parent class and three child classes.

Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set). Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file). (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)

Finally, for what it's worth: the total run-time memory requirement
of my program is roughly 20x the datafile size. A 200 MB file
literally requires 4 GB of RAM to effectively process. Note that, in
addition to the class structure I defined above, I also create two
caches of all the data (two dicts with different keys from the
collection of objects). This is necessary to ensure the program runs
in a semi-reasonable amount of time.
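
For illustration only, here is a sketch of that kind of layout in Python 3. It is not the actual program: the Record class, its three fields, and the sample lines are invented. It shows that two dict caches keyed differently hold references to the same record objects rather than copies, so the extra cost comes from the dicts and keys, not from duplicated records.

```python
class Record:
    # __slots__ avoids a per-instance __dict__ (hypothetical field names).
    __slots__ = ("record_id", "symbol", "price")

    def __init__(self, line):
        record_id, symbol, price = line.rstrip("\n").split("|")
        self.record_id = int(record_id)   # token converted to int
        self.symbol = symbol
        self.price = float(price)         # token converted to float

# Hypothetical pipe-delimited input.
lines = ["1|ABC|10.5\n", "2|XYZ|7.25\n", "3|ABC|11.0\n"]
records = [Record(line) for line in lines]

# Cache 1: unique key -> record.
by_id = {rec.record_id: rec for rec in records}

# Cache 2: non-unique key -> list of records.
by_symbol = {}
for rec in records:
    by_symbol.setdefault(rec.symbol, []).append(rec)

# Both caches point at the very same object.
print(by_id[3] is by_symbol["ABC"][1])
```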

Thanks to all for your input and suggestions. I received many more
responses than I expected!

Matt
 
Bruno Desthuilliers

Matt Garman wrote:
(snip)
Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set). Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file). (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)

Don't know if this could solve your problem, but have you considered using
an intermediate (preferably embedded) SQL database (something like
SQLite)?
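
A rough sketch of what that could look like, assuming Python 3 and its standard sqlite3 module. The table layout, column names, and sample lines are hypothetical; ":memory:" keeps the database in RAM, while a file path would keep it on disk and out of the Python heap.

```python
import sqlite3

# Hypothetical pipe-delimited input lines.
lines = ["1|ABC|10.5\n", "2|XYZ|7.25\n", "3|ABC|11.0\n"]

def parse(line):
    record_id, symbol, price = line.rstrip("\n").split("|")
    return int(record_id), symbol, float(price)

conn = sqlite3.connect(":memory:")  # use a filename to store on disk instead
conn.execute(
    "CREATE TABLE records (record_id INTEGER PRIMARY KEY, symbol TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)", (parse(line) for line in lines)
)
# An index plays the role of a lookup cache keyed on symbol.
conn.execute("CREATE INDEX idx_symbol ON records (symbol)")
conn.commit()

rows = conn.execute(
    "SELECT record_id, price FROM records WHERE symbol = ? ORDER BY record_id",
    ("ABC",),
).fetchall()
print(rows)
conn.close()
```

Looking forward and backward in the record set then becomes a query rather than a walk over in-memory objects.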
 
Bjoern Schliessmann

Matt said:
In my case, I know my input data doesn't have any blank lines.

8)

I work with a (not self-written) Perl script that does funny things
with blank lines in input files. Yeah, blank lines "aren't supposed
to" be in the input data ...

However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.

The principle was okay (to check if the string is totally empty). I
always used readlines so far and didn't have the problem.

Thanks to all for your input and suggestions. I received many
more responses than I expected!

You're welcome. :)

Regards,


Björn
 
