Python object overhead?

Matt Garman · Mar 23, 2007

I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt

Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readlines.py

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py

Mark Nenadov · Mar 23, 2007

Is this "just the way it is" or am I overlooking something obvious?

Matt,

If you repeat even the smallest object instantiation a large number
of times, it will be costly compared to a simple list append.

I don't think you can get around some overhead with the objects.
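
As a rough way to see where that per-object overhead comes from, here is a minimal sketch using sys.getsizeof. Note the assumptions: sys.getsizeof only arrived in Python 2.6, so it was not available on the 2.3/2.4 interpreters mentioned above, and the sketch is written for Python 3. The sample record is made up, and exact byte counts vary by version and platform, so none are shown.

```python
import sys


class FileRecord(object):
    """Same shape as the class in Example 2: one attribute holding the line."""

    def __init__(self, line):
        self.line = line


line = "Biff|Bloggs|someone\n"  # made-up sample record
rec = FileRecord(line)

# Shallow sizes only: getsizeof does not follow references.
string_bytes = sys.getsizeof(line)
instance_bytes = sys.getsizeof(rec)
dict_bytes = sys.getsizeof(rec.__dict__)  # per-instance attribute dict

print("string alone:          ", string_bytes)
print("instance shell:        ", instance_bytes)
print("instance __dict__:     ", dict_bytes)
print("extra bytes per record:", instance_bytes + dict_bytes)
```

For short lines the wrapper and its attribute dict can easily outweigh the string they hold, which is consistent with the kind of multiple reported in the original post.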

However, in terms of general efficiency not specifically related to
object instantiation, you should look into xreadlines().

I'd suggest doing the following instead of that while loop:

for line in open(sys.argv[1]).xreadlines():
    ..

Gabriel Genellina · Mar 23, 2007

En Fri, 23 Mar 2007 18:27:25 -0300, Mark Nenadov

I'd suggest doing the following instead of that while loop:

for line in open(sys.argv[1]).xreadlines():

The poor xreadlines method had a short life: it was born in Python 2.1 and
got deprecated in 2.3.

A file is now its own line iterator:

f = open(...)
for line in f:
    ...

Mark Nenadov · Mar 23, 2007

The poor xreadlines method had a short life: it was born in Python 2.1 and
got deprecated in 2.3.
A file is now its own line iterator:

f = open(...)
for line in f:
    ...

Gabriel,

Thanks for pointing that out! I had completely forgotten about
that!

I've tested them before. readlines() is very slow. The deprecated
xreadlines() is close in speed to open() as an iterator. In my particular
test, I found the following:

readlines() -> 32 "time units"
xreadlines() -> 0.7 "time units"
open() iterator -> 0.41 "time units"
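
For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch. It is not the test that produced the figures above. It is written for Python 3, so xreadlines() is left out because it no longer exists there, and the temporary file and its contents are made up. Results depend heavily on the machine and file size, so no expected timings are shown.

```python
import os
import tempfile
import timeit

# Build a throwaway pipe-delimited file (contents are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(20000):
        tmp.write("%d|field|value\n" % i)
    path = tmp.name


def with_readlines():
    with open(path) as f:
        return len(f.readlines())  # builds the whole list first


def with_iterator():
    with open(path) as f:
        return sum(1 for _ in f)  # one line at a time


count_a = with_readlines()
count_b = with_iterator()
print("readlines():", timeit.timeit(with_readlines, number=5))
print("iterator:   ", timeit.timeit(with_iterator, number=5))
os.remove(path)
```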


Bjoern Schliessmann · Mar 23, 2007

Matt said:
Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing
this causes the memory overhead to go up two or three times.

(Note that almost everything in Python is an object!)

Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF

"one blank line" == "EOF"? That's strange. Intended?

The most common form for this would be "if not line: (do
something)".

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line

What's this class intended to do?

Regards,

Björn

Paul Rubin · Mar 23, 2007

Bjoern Schliessmann said:
"one blank line" == "EOF"? That's strange. Intended?

A blank line would have length 1 (a newline character).

Facundo Batista · Mar 23, 2007

Bjoern Schliessmann wrote:

"one blank line" == "EOF"? That's strange. Intended?

The most common form for this would be "if not line: (do
something)".

"not line" and "len(line) == 0" is the same as long as "line" is a
string.

He's checking ok, 'cause a "blank line" has a length > 0 (because of
the newline).
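
A small sketch of the point, using an in-memory stand-in for a file (Python 3's io.StringIO; the sample text is made up). readline() returns an empty string only at end of file, while a blank line comes back as a lone newline character:

```python
import io

# In-memory stand-in for a file; the middle line is blank.
f = io.StringIO("first\n\nlast\n")

lines = []
while True:
    line = f.readline()
    if not line:  # "" only at EOF; a blank line is "\n"
        break
    lines.append(line)

print(lines)
```

The blank line is kept in the list, so the loop does not stop early.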

What's this class intended to do?

Unless I understood it wrong, it's just an object that holds the line
inside.

Just OO purity, not practicality...

John Nagle · Mar 24, 2007

Matt said:
I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

Why do you want all the records in memory at once? Are you
doing some lookup on them, or what? If you're processing files
sequentially, don't keep them all in memory.

You're getting into the size range where it may be time to
use a database.

John Nagle

John Machin · Mar 24, 2007

I'm trying to use Python to work with large pipe ('|') delimited data
files. The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record.

An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.

However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt

Example 1: read lines into list:
# begin readlines.py

Interesting name for the file

How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?

import sys, time
filedata = list()
file = open(sys.argv[1])

You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.

while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top

How about using raw_input('Hit the Any key...') ?

# end readlines.py

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py

After all that, you still need to split the lines into the more-than-one
fieldS (plural) that one would expect in a record.

A possibly faster alternative to (fastest_line_reader_so_far,
line.split('|')) is to use the csv module, as in the following example,
which also shows one way of making an object out of a row of data.

C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only when
        # writing, otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
            if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Biff|Bloggs|[email protected]
Joseph ("Joe")|Blow|[email protected]
"Joe"|Blow|[email protected]

Santa|Claus|[email protected]
'''))

C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', '(e-mail address removed)'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', '(e-mail address removed)'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', '(e-mail address removed)'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', '(e-mail address removed)'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', '(e-mail address removed)'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>

HTH,
John

Felipe Almeida Lessa · Mar 24, 2007

(Note that almost everything in Python is an object!)

Could you tell me what in Python isn't an object? Are you counting
old-style classes and instances as "not object"s?

Gabriel Genellina · Mar 24, 2007

En Sat, 24 Mar 2007 18:07:57 -0300, Felipe Almeida Lessa

Could you tell me what in Python isn't an object? Are you counting
old-style classes and instances as "not object"s?

The syntax, for example: an "if" statement is not an object.

Bjoern Schliessmann · Mar 25, 2007

Facundo said:
"not line" and "len(line) == 0" is the same as long as "line" is a
string.

He's checking ok, 'cause a "blank line" has a length > 0 (because
of the newline).

Ah, K. Normally, I strip the read line and then test "if not line".
His check /is/ okay, but IMHO it's a little bit weird.

Unless I understood it wrong, it's just an object that holds the
line inside.

A Python string would technically be the same

Just OO purity, not practicality...

Regards,

Björn

Bjoern Schliessmann · Mar 25, 2007

Felipe said:
Could you tell me what in Python isn't an object?

Difficult

All data structures are (CMIIW). Functions and Types
are objects, too.

Are you counting old-style classes and instances as "not object"s?

No, both are.

Regards,

Björn

Bruno Desthuilliers · Mar 26, 2007

Matt Garman a écrit :

I'm trying to use Python to work with large pipe ('|') delimited data
files.

Looks like a job for the csv module (in the standard lib).

The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record. However, it seems that doing this
causes the memory overhead to go up two or three times.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2. (Memory usage is
checked using top.)

Just for the record, *everything* in Python is an object - so the
problem is not about 'using objects'. Now, of course, a complex object
might eat up more space than a simple one...

Python has 2 simple types for structured data : tuples (like database
rows), and dicts (associative arrays). You can use the csv module to
parse a csv-like format into either tuples or dicts. If you want to save
memory, tuples may be the best choice.
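
A minimal sketch of the tuple approach, written for Python 3 rather than the 2.x used in this thread. The sample data is made up and stands in for a real pipe-delimited file; the delimiter and quoting arguments are standard csv.reader parameters.

```python
import csv
import io

# Made-up sample data standing in for a pipe-delimited file.
sample = io.StringIO("Biff|Bloggs|42\n\nSanta|Claus|7\n")

reader = csv.reader(sample, delimiter="|", quoting=csv.QUOTE_NONE)

# One tuple per record; blank lines come back as empty lists and are skipped.
rows = [tuple(row) for row in reader if row]

print(rows)
```

Every field is still a string at this point; converting some of them to int or float would be a separate step.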

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

What are you doing with your records ? Do you *really* need to keep the
whole list in memory ? Else you can just work line by line:

source = open(sys.argv[1])
for line in source:
    do_something_with(line)
source.close()

This will avoid building a huge in-memory list.

While we're at it, your snippets are definitely unpythonic and
overcomplicated:

(snip)

filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()

(snip)

filedata = open(sys.argv[1]).readlines()

Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:

class FileRecord(object):

    def __init__(self, line):
        self.line = line

If this is your real code, I don't see any reason why this should eat up
3 times more space than the original version.

records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()

records = map(FileRecord, open(sys.argv[1]).readlines())

Bruno Desthuilliers · Mar 26, 2007

Could you tell me what in Python isn't an object?

statements and expressions ?-)

Bruno Desthuilliers · Mar 26, 2007

Bruno Desthuilliers a écrit :

Matt Garman a écrit : (snip)
class FileRecord(object):

If this is your real code, I don't see any reason why this should eat up
3 times more space than the original version.

Hem... Forget about this comment - not enough caffeine yet, I'm afraid.

Matt Garman · Mar 26, 2007

"one blank line" == "EOF"? That's strange. Intended?

In my case, I know my input data doesn't have any blank lines.
However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.

What's this class intended to do?

Store a line

I just wanted to post two runnable examples. So the
above class's real intention is just to be a (contrived) example.

In the program I actually wrote, my class structure was a bit more
interesting. After storing the input line, I'd then call split("|")
(to tokenize the line). Each token would then be assigned to a
member variable. Some of the member variables were converted to ints
or floats as well.

My input data had three record types; all had a few common attributes.
So I created a parent class and three child classes.
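
The actual classes are not shown in the thread, so the following is only a guess at the general shape: a parent class for the common attributes and one child class, with tokens converted to int or float as they are assigned. All class and field names are hypothetical. It also uses __slots__ (as in the Contacts example earlier in the thread) so that instances carry no per-instance attribute dict; that is a swapped-in memory-saving technique, not something the original program is known to do. Written for Python 3.

```python
class Record(object):
    """Hypothetical parent: fields common to every record type."""
    __slots__ = ("rec_type", "rec_id")

    def __init__(self, tokens):
        self.rec_type = tokens[0]
        self.rec_id = int(tokens[1])


class PriceRecord(Record):
    """Hypothetical child: adds one float field."""
    __slots__ = ("price",)

    def __init__(self, tokens):
        Record.__init__(self, tokens)
        self.price = float(tokens[2])


def parse(line):
    # Split once; the original line string is not kept on the object.
    tokens = line.rstrip("\n").split("|")
    return PriceRecord(tokens)


rec = parse("P|17|3.5\n")
print(rec.rec_type, rec.rec_id, rec.price)
```

Not storing the raw line alongside the parsed fields avoids holding each record's data twice.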

Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set). Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file). (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)

Finally, for what it's worth: the total run-time memory requirement
of my program is roughly 20x the datafile size. A 200 MB file
literally requires 4 GB of RAM to process effectively. Note that, in
addition to the class structure I defined above, I also create two
caches of all the data (two dicts with different keys from the
collection of objects). This is necessary to ensure the program runs
in a semi-reasonable amount of time.
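
The caches themselves are not shown, so this sketch assumes they are plain dicts mapping a key to the record object. Under that assumption, both dicts hold references to the same objects, so the extra cost is the dict tables and keys rather than a second and third copy of the data. The records and key choices here are made up.

```python
# Hypothetical records: (id, name, amount) tuples parsed from the file.
records = [
    (1, "alpha", 10.0),
    (2, "beta", 20.0),
    (3, "gamma", 30.0),
]

# Two lookup tables over the same objects, keyed differently.
by_id = dict((rec[0], rec) for rec in records)
by_name = dict((rec[1], rec) for rec in records)

# Both dicts point at the very same tuple; nothing was copied.
print(by_id[2] is by_name["beta"])
```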

Thanks to all for your input and suggestions. I received many more
responses than I expected!

Matt

Bruno Desthuilliers · Mar 26, 2007

Matt Garman a écrit :
(snip)

Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set). Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file). (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)

Don't know if this could solve your problem, but have you considered
using an intermediate (preferably embedded) SQL database (something
like SQLite)?
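
A rough sketch of what that could look like with the sqlite3 module (in the standard library since Python 2.5; written here for Python 3). The table layout, column names, and sample lines are all made up. Storing each record's position in the file is one possible way to support looking backward and forward from a given record.

```python
import sqlite3

# Made-up sample lines standing in for the vendor file.
lines = ["A|1|10.5\n", "B|2|20.0\n", "A|3|30.25\n"]


def parse(pos, line):
    # Hypothetical layout: kind | ident | amount
    kind, ident, amount = line.rstrip("\n").split("|")
    return (pos, kind, int(ident), float(amount))


conn = sqlite3.connect(":memory:")  # pass a filename to keep data on disk
conn.execute(
    "CREATE TABLE records (pos INTEGER PRIMARY KEY, kind TEXT, "
    "ident INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO records (pos, kind, ident, amount) VALUES (?, ?, ?, ?)",
    (parse(pos, line) for pos, line in enumerate(lines)),
)

# Look backward and forward from the record at position 1.
neighbours = conn.execute(
    "SELECT pos, kind FROM records WHERE pos IN (?, ?) ORDER BY pos",
    (0, 2),
).fetchall()
print(neighbours)
conn.close()
```

With an on-disk database file, only the rows being examined need to be in memory at any one time.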

Bjoern Schliessmann · Mar 26, 2007

Matt said:
In my case, I know my input data doesn't have any blank lines.

8)

I work with a (not self-written) perl script that does funny things
with blank lines in input files. Yeah, blank lines "aren't supposed
to" be in the input data ...

However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.

The principle was okay (to check if the string is totally empty). I
always used readlines so far and didn't have the problem.

Thanks to all for your input and suggestions. I received many
more responses than I expected!

You're welcome.

Regards,

Björn
