Finding size of Variable


Ayushi Dalmia

Hello,

I have 10 files and I need to merge them (using k-way merging). The size of each file is around 200 MB. Suppose I keep the merged data in a variable named mergedData. I had thought of checking the size of mergedData using sys.getsizeof(), but it somehow doesn't give the actual value of the memory occupied.

For example, a file in my file system occupies 4 KB, but if I read all of its lines into a list, the size of the list is only around 2100 bytes.

Where am I going wrong? What are the alternatives I can try?
 

Peter Otten

Ayushi said:
I have 10 files and I need to merge them (using k-way merging). The size
of each file is around 200 MB. Suppose I keep the merged data in a
variable named mergedData. I had thought of checking the size of
mergedData using sys.getsizeof(), but it somehow doesn't give the actual
value of the memory occupied.

For example, a file in my file system occupies 4 KB, but if I read all of
its lines into a list, the size of the list is only around 2100 bytes.

Where am I going wrong? What are the alternatives I can try?

getsizeof() gives you the size of the list only; to complete the picture you
have to add the sizes of the lines.

However, why do you want to keep track of the actual memory used by
variables in your script? You should instead concentrate on the algorithm,
and as long as either the size of the dataset is manageable or you can limit
the amount of data accessed at a given time you are golden.
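
To see the difference Peter describes, here is a minimal sketch (not from the thread; the sample list is just a stand-in for lines read from a file, and the exact byte counts vary by platform and Python version):

import sys

words = ["the", "a", "an", "of"]   # stand-in for lines read from a file

list_only = sys.getsizeof(words)                                # the list object only
with_items = list_only + sum(sys.getsizeof(w) for w in words)   # list plus its strings

print(list_only)    # size of the list header and pointer array
print(with_items)   # closer to the real memory footprint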
 

Ayushi Dalmia

Peter Otten wrote:
getsizeof() gives you the size of the list only; to complete the picture you
have to add the sizes of the lines.

However, why do you want to keep track of the actual memory used by
variables in your script? You should instead concentrate on the algorithm,
and as long as either the size of the dataset is manageable or you can limit
the amount of data accessed at a given time you are golden.

As I said, I need to merge large files and I cannot afford more I/O operations, so in order to minimise the I/O I am writing in chunks. Also, I need to use the merged files as indexes later, which should be loaded into memory for fast access. Hence the concern.

Can you please elaborate on the point of taking lines into consideration?
 

Asaf Las

Ayushi Dalmia wrote:
As I said, I need to merge large files and I cannot afford more I/O
operations, so in order to minimise the I/O I am writing in chunks. Also,
I need to use the merged files as indexes later, which should be loaded
into memory for fast access. Hence the concern.
Can you please elaborate on the point of taking lines into consideration?

have you tried os.sendfile()?

http://docs.python.org/dev/library/os.html#os.sendfile
 

Dave Angel

Ayushi Dalmia said:
As I said, I need to merge large files and I cannot afford more I/O operations, so in order to minimise the I/O I am writing in chunks. Also, I need to use the merged files as indexes later, which should be loaded into memory for fast access. Hence the concern.

Can you please elaborate on the point of taking lines into consideration?

Please don't doublespace your quotes. If you must use
googlegroups, fix its bugs before posting.

There's usually no net gain in trying to 'chunk' your output to a
text file. The python file system already knows how to do that
for a sequential file.

For a list of strings, just add the getsizeof of the list to the sum
of the getsizeof of all the list items.
 

Ayushi Dalmia

Dave Angel wrote:
Please don't doublespace your quotes. If you must use
googlegroups, fix its bugs before posting.

There's usually no net gain in trying to 'chunk' your output to a
text file. The python file system already knows how to do that
for a sequential file.

For a list of strings, just add the getsizeof of the list to the sum
of the getsizeof of all the list items.

Hey!

I need to chunk the output, otherwise it will give a MemoryError. I need to do some postprocessing on the data read from the file too. If I do not stop before the MemoryError, I won't be able to perform any more operations on it.
 

Dennis Lee Bieber

Ayushi Dalmia wrote:
I need to chunk the output, otherwise it will give a MemoryError. I need to do some postprocessing on the data read from the file too. If I do not stop before the MemoryError, I won't be able to perform any more operations on it.

10 200MB files is only 2GB... Most any 64-bit processor these days can
handle that. Even some 32-bit systems could handle it (WinXP booted with
the server option gives 3GB to user processes -- if the 4GB was installed
in the machine).

However, you speak of an n-way merge. The traditional merge operation
only reads one record from each file at a time, examines them for "first",
writes that "first", reads next record from the file "first" came from, and
then reassesses the set.

You mention needing to chunk the data -- that implies performing a merge
sort in which you read a few records from each file into memory, sort them,
and write them out to newFile1; then read the same number of records from
each file, sort, and write them to newFile2, up to however many files you
intend to work with -- at that point you go back and append the next chunk
to newFile1. When done, each file contains chunks of n*r records. You now
make newFilex the inputs, read/merge the records from those chunks
outputting to another file1; when you reach the end of the first chunk in
the files you then read/merge the second chunk into another file2. You
repeat this process until you end up with only one chunk in one file.
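
For the traditional one-record-at-a-time merge described above, the standard library's heapq.merge already does the bookkeeping. A minimal, untested sketch (the file names are placeholders; it assumes each input file is already sorted line by line and ends with a newline):

import heapq

def kway_merge(in_paths, out_path):
    # heapq.merge lazily interleaves already-sorted iterables, so only one
    # "current" line per input file needs to be in memory at a time.
    files = [open(path) for path in in_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()

kway_merge(["part0.txt", "part1.txt"], "merged.txt")   # hypothetical file names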
 

Dave Angel

Ayushi Dalmia said:
Where am I going wrong? What are the alternatives I can try?

You've rejected all the alternatives so far without showing your
code, or even properly specifying your problem.

To get the "total" size of a list of strings, try (untested):

a = sys.getsizeof(mylist)
for item in mylist:
    a += sys.getsizeof(item)

This can be high if some of the strings are interned and get
counted twice. But you're not likely to get closer without some
knowledge of the data objects and where they come
from.
 

Tim Golden

You've rejected all the alternatives so far without showing your
code, or even properly specifying your problem.

To get the "total" size of a list of strings, try (untested):

a = sys.getsizeof(mylist)
for item in mylist:
    a += sys.getsizeof(item)

The documentation for sys.getsizeof:

http://docs.python.org/dev/library/sys#sys.getsizeof

warns about the limitations of this function when applied to a
container, and even points to a recipe by Raymond Hettinger which
attempts to do a more complete job.

TJG
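
As a rough illustration of the kind of "deep" size accounting that recipe performs, here is a simplified sketch (not the recipe itself; it only descends into a few built-in container types and ignores object __dict__s):

import sys

def total_size(obj):
    # Recursively add getsizeof() for an object and everything it references,
    # remembering ids so shared objects are not counted twice.
    seen = set()

    def sizeof(o):
        if id(o) in seen:
            return 0
        seen.add(id(o))
        size = sys.getsizeof(o)
        if isinstance(o, dict):
            size += sum(sizeof(k) + sizeof(v) for k, v in o.items())
        elif isinstance(o, (list, tuple, set, frozenset)):
            size += sum(sizeof(i) for i in o)
        return size

    return sizeof(obj)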
 

Tim Chase

To get the "total" size of a list of strings, try (untested):

a = sys.getsizeof(mylist)
for item in mylist:
    a += sys.getsizeof(item)

I always find this sort of accumulation weird (well, at least in
Python; it's the *only* way in many other languages) and would write
it as

a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)

-tkc
 

Ayushi Dalmia

Dennis Lee Bieber wrote:
10 200MB files is only 2GB... Most any 64-bit processor these days can
handle that. Even some 32-bit systems could handle it (WinXP booted with
the server option gives 3GB to user processes -- if the 4GB was installed
in the machine).

However, you speak of an n-way merge. The traditional merge operation
only reads one record from each file at a time, examines them for "first",
writes that "first", reads the next record from the file "first" came from,
and then reassesses the set.

You mention needing to chunk the data -- that implies performing a merge
sort in which you read a few records from each file into memory, sort them,
and write them out to newFile1; then read the same number of records from
each file, sort, and write them to newFile2, up to however many files you
intend to work with -- at that point you go back and append the next chunk
to newFile1. When done, each file contains chunks of n*r records. You now
make newFilex the inputs, read/merge the records from those chunks
outputting to another file1; when you reach the end of the first chunk in
the files you then read/merge the second chunk into another file2. You
repeat this process until you end up with only one chunk in one file.


The way you mentioned for merging the files is an option, but that will involve a lot of I/O operations. Also, I do not want the size of a file to increase beyond a certain point; when a file reaches a certain size limit, I want to start writing to a new file. This is because I want to load them into memory again later.
 

Ayushi Dalmia

Dave Angel wrote:
You've rejected all the alternatives so far without showing your
code, or even properly specifying your problem.

To get the "total" size of a list of strings, try (untested):

a = sys.getsizeof(mylist)
for item in mylist:
    a += sys.getsizeof(item)

This can be high if some of the strings are interned and get
counted twice. But you're not likely to get closer without some
knowledge of the data objects and where they come from.

Hello Dave,

I just thought that saving others' time is better and hence I explained only a subset of my problem. Here is what I am trying to do:

I am trying to index the current Wikipedia dump without using databases and create a search engine for Wikipedia documents. Note, I CANNOT USE DATABASES.

My approach:

I am parsing the Wikipedia pages using a SAX parser, and then I am dumping the words along with their posting lists (a list of doc ids in which the word is present) into different files after reading 'X' number of pages. Now these files may contain the same word, and hence I need to merge them and write the final index again. These final index files must be of limited size. This is where I am stuck: I need to know how to determine the size of the content in a variable before I write it to a file.

Here is the code for my merging:

def mergeFiles(pathOfFolder, countFile):
    listOfWords = {}
    indexFile = {}
    topOfFile = {}
    flag = [0] * countFile
    data = defaultdict(list)
    heap = []
    countFinalFile = 0
    for i in xrange(countFile):
        fileName = pathOfFolder + '\index' + str(i) + '.txt.bz2'
        indexFile[i] = bz2.BZ2File(fileName, 'rb')
        flag[i] = 1
        topOfFile[i] = indexFile[i].readline().strip()
        listOfWords[i] = topOfFile[i].split(' ')
        if listOfWords[i][0] not in heap:
            heapq.heappush(heap, listOfWords[i][0])

    while any(flag) == 1:
        temp = heapq.heappop(heap)
        for i in xrange(countFile):
            if flag[i] == 1:
                if listOfWords[i][0] == temp:
                    # This is where I am stuck. I cannot wait until MemoryError,
                    # as I need to do some postprocessing too.
                    try:
                        data[temp].extend(listOfWords[i][1:])
                    except MemoryError:
                        writeFinalIndex(data, countFinalFile, pathOfFolder)
                        data = defaultdict(list)
                        countFinalFile += 1

                    topOfFile[i] = indexFile[i].readline().strip()
                    if topOfFile[i] == '':
                        flag[i] = 0
                        indexFile[i].close()
                        os.remove(pathOfFolder + '\index' + str(i) + '.txt.bz2')
                    else:
                        listOfWords[i] = topOfFile[i].split(' ')
                        if listOfWords[i][0] not in heap:
                            heapq.heappush(heap, listOfWords[i][0])
    writeFinalIndex(data, countFinalFile, pathOfFolder)

countFile is the number of files, and the writeFinalIndex method writes into the file.
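
One way to flush on an approximate size instead of waiting for a MemoryError would look roughly like this (an untested sketch; MAX_BYTES, add_posting, and the flush callback are made-up names used only to illustrate the idea, and the getsizeof-based estimate ignores dict overhead, so it is approximate):

import sys
from collections import defaultdict

MAX_BYTES = 100 * 1024 * 1024   # assumed size budget per output file; tune to taste

def add_posting(data, used_bytes, word, postings, flush):
    # Accumulate one word's postings; once the rough in-memory size passes
    # the budget, hand the dict to `flush` (e.g. a writeFinalIndex-style
    # function) and start a fresh one. Unlike MemoryError, this threshold
    # is predictable.
    data[word].extend(postings)
    used_bytes += sys.getsizeof(word) + sum(sys.getsizeof(p) for p in postings)
    if used_bytes >= MAX_BYTES:
        flush(data)
        return defaultdict(list), 0
    return data, used_bytes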
 

Ayushi Dalmia

Tim Chase wrote:
I always find this sort of accumulation weird (well, at least in
Python; it's the *only* way in many other languages) and would write
it as

a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)

-tkc

This also doesn't give the true size. I did the following:

import sys

data = []
f = open('stopWords.txt', 'r')

for line in f:
    line = line.split()
    data.extend(line)

print sys.getsizeof(data)

where stopWords.txt is a file of size 4KB
 

Rustom Mody

Ayushi Dalmia wrote:
This also doesn't give the true size. I did the following:

import sys
data = []
f = open('stopWords.txt', 'r')
for line in f:
    line = line.split()
    data.extend(line)
print sys.getsizeof(data)

where stopWords.txt is a file of size 4 KB

Try getsizeof("".join(data))

General advice:
- You have been recommended (by Chris??) that you should use a database
- You say you can't use a database (for whatever reason)

Now the fact is you NEED database (functionality).
How to escape this catch-22 situation?
In computer science it's called, somewhat sardonically, "Greenspun's 10th rule".

And the best way out is to:

1 isolate those aspects of database functionality you need
2 temporarily forget about your original problem and implement the
  (subset of) DBMS functionality you need
3 use 2 above to implement 1
 

Ayushi Dalmia

Rustom Mody wrote:
Try getsizeof("".join(data))

General advice:
- You have been recommended (by Chris??) that you should use a database
- You say you can't use a database (for whatever reason)

Now the fact is you NEED database (functionality).
How to escape this catch-22 situation?
In computer science it's called, somewhat sardonically, "Greenspun's 10th rule".

And the best way out is to:

1 isolate those aspects of database functionality you need
2 temporarily forget about your original problem and implement the
  (subset of) DBMS functionality you need
3 use 2 above to implement 1

Hello Rustom,

Thanks for the enlightenment. I did not know about Greenspun's Tenth Rule; it is interesting to know that. However, it is an academic project and not a research one, hence I do not have the liberty to choose what to work with. Life is easier with databases, but I am not allowed to use them. Thanks for the tip. I will try to replicate that functionality.
 

Peter Otten

Ayushi said:
Dave Angel wrote:
You've rejected all the alternatives so far without showing your
code, or even properly specifying your problem.

To get the "total" size of a list of strings, try (untested):

a = sys.getsizeof(mylist)
for item in mylist:
    a += sys.getsizeof(item)

This can be high if some of the strings are interned and get
counted twice. But you're not likely to get closer without some
knowledge of the data objects and where they come from.

Hello Dave,

I just thought that saving others' time is better and hence I explained
only a subset of my problem. Here is what I am trying to do:

I am trying to index the current Wikipedia dump without using databases
and create a search engine for Wikipedia documents. Note, I CANNOT USE
DATABASES. My approach:

I am parsing the Wikipedia pages using a SAX parser, and then I am dumping
the words along with their posting lists (a list of doc ids in which the
word is present) into different files after reading 'X' number of pages.
Now these files may contain the same word, and hence I need to merge them
and write the final index again. These final index files must be of
limited size. This is where I am stuck: I need to know how to determine
the size of the content in a variable before I write it to a file.

Here is the code for my merging:

def mergeFiles(pathOfFolder, countFile):
    listOfWords = {}
    indexFile = {}
    topOfFile = {}
    flag = [0] * countFile
    data = defaultdict(list)
    heap = []
    countFinalFile = 0
    for i in xrange(countFile):
        fileName = pathOfFolder + '\index' + str(i) + '.txt.bz2'
        indexFile[i] = bz2.BZ2File(fileName, 'rb')
        flag[i] = 1
        topOfFile[i] = indexFile[i].readline().strip()
        listOfWords[i] = topOfFile[i].split(' ')
        if listOfWords[i][0] not in heap:
            heapq.heappush(heap, listOfWords[i][0])


At this point you have already done it wrong as your heap contains the
complete data and you have done a lot of O(N) tests on the heap.
This is both slow and consumes a lot of memory. See

http://code.activestate.com/recipes/491285-iterator-merge/

for a sane way to merge sorted data from multiple files. Your code becomes
(untested)

with open("outfile.txt", "wb") as outfile:

infiles = []
for i in xrange(countFile):
filename = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
infiles.append(bz2.BZ2File(filename, "rb"))

outfile.writelines(imerge(*infiles))

for infile in infiles:
infile.close()

Once you have your data in a single file you can read from that file and do
the postprocessing you mention below.

    while any(flag) == 1:
        temp = heapq.heappop(heap)
        for i in xrange(countFile):
            if flag[i] == 1:
                if listOfWords[i][0] == temp:
                    # This is where I am stuck. I cannot wait until MemoryError,
                    # as I need to do some postprocessing too.
                    try:
                        data[temp].extend(listOfWords[i][1:])
                    except MemoryError:
                        writeFinalIndex(data, countFinalFile, pathOfFolder)
                        data = defaultdict(list)
                        countFinalFile += 1

                    topOfFile[i] = indexFile[i].readline().strip()
                    if topOfFile[i] == '':
                        flag[i] = 0
                        indexFile[i].close()
                        os.remove(pathOfFolder + '\index' + str(i) + '.txt.bz2')
                    else:
                        listOfWords[i] = topOfFile[i].split(' ')
                        if listOfWords[i][0] not in heap:
                            heapq.heappush(heap, listOfWords[i][0])
    writeFinalIndex(data, countFinalFile, pathOfFolder)

countFile is the number of files, and the writeFinalIndex method writes into
the file.
 

Steven D'Aprano

Ayushi Dalmia wrote:
This also doesn't give the true size. I did the following:


What do you mean by "true size"?

Do you mean the amount of space a certain amount of data will take in
memory? With or without the overhead of object headers? Or do you mean
how much space it will take when written to disk? You have not been clear
what you are trying to measure.

If you are dealing with one-byte characters, you can measure the amount
of memory they take up (excluding object overhead) by counting the number
of characters: 23 one-byte characters requires 23 bytes. Plus the object
overhead gives:

py> sys.getsizeof('a'*23)
44

44 bytes (23 bytes for the 23 single-byte characters, plus 21 bytes
overhead). One thousand such characters takes:

py> sys.getsizeof('a'*1000)
1021

If you write such a string to disk, it will take 1000 bytes (or 1KB),
unless you use some sort of compression.

import sys
data = []
f = open('stopWords.txt', 'r')

for line in f:
    line = line.split()
    data.extend(line)

print sys.getsizeof(data)

This will give you the amount of space taken by the list object. It will
*not* give you the amount of space taken by the individual strings.

A Python list looks like this:

+--------+-------------------+
| header | array of pointers |
+--------+-------------------+

The header is of constant or near-constant size; the array depends on the
number of items in the list. It may be bigger than the list, e.g. a list
with 1000 items might have allocated space for 2000 items. It will never
be smaller.

getsizeof(list) only counts the direct size of that list, including the
array, but not the things which the pointers point at. If you want the
total size, you need to count them as well.

where stopWords.txt is a file of size 4KB

My guess is that if you split a 4K file into words, then put the words
into a list, you'll probably end up with 6-8K in memory.
 

Chris Angelico

My guess is that if you split a 4K file into words, then put the words
into a list, you'll probably end up with 6-8K in memory.

I'd guess rather more; Python strings have a fair bit of fixed
overhead, so with a whole lot of small strings, it will get more
costly.
[interpreter session showing sys.version as '3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan 5 2014, 16:23:43) [MSC v.1600 32 bit (Intel)]' and a sys.getsizeof() result of 29]

"Stop words" tend to be short, rather than long, words, so I'd look at
an average of 2-3 letters per word. Assuming they're separated by
spaces or newlines, that means there'll be roughly a thousand of them
in the file, for about 25K of overhead. A bit less if the words are
longer, but still quite a bit. (Byte strings have slightly less
overhead, 17 bytes apiece, but still quite a bit.)

ChrisA
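
As a back-of-the-envelope check of that estimate (assuming the 4 KB file mentioned earlier, roughly three letters plus a separator per word, and about 25 bytes of per-string overhead on a 32-bit 3.3+ build):

file_bytes = 4 * 1024                    # the 4 KB stop-word file
bytes_per_word = 4                       # ~3 letters plus a newline or space
n_words = file_bytes // bytes_per_word   # roughly a thousand words
overhead = n_words * 25                  # ~25 KB in string headers alone
print(overhead)                          # 25600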
 

Dave Angel

Ayushi Dalmia said:
Tim Chase wrote:
I always find this sort of accumulation weird (well, at least in
Python; it's the *only* way in many other languages) and would write
it as

a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)

-tkc

This also doesn't give the true size. I did the following:

import sys
data = []
f = open('stopWords.txt', 'r')

for line in f:
    line = line.split()
    data.extend(line)

print sys.getsizeof(data)

Did you actually READ either of my posts or Tim's? For a
container, you can't just use getsizeof on the container.


a = sys.getsizeof(data)
for item in data:
    a += sys.getsizeof(item)
print a
 
