why is my hash being weird??

pycraze

Hi,

I am using Fedora Core 3 with Python 2.4 (kernel 2.6.9-1.667smp).
Something strange happens with Python hashes (dictionaries) when I run a
Python script. My Python file is basically:

myhash = {}
def summa():
    global myhash
    myhash[0] = 0
    myhash[1] = 1
    myhash[2] = 2
    myhash[3] = 3

I generate it with this C program:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i = atoi(argv[1]), j;

    printf("myhash = {}\n");
    printf("def summa():\n");
    printf("    global myhash\n");

    for (j = 0; j < i; j++)
        printf("    myhash[%d] = %d\n", j, j);

    printf("\nsumma()\n");
    return 0;
}

The output of this C program is redirected to a .py file. I create and run
test.py with the following steps:

1. cc -o s s.c
2. ./s (input) >> test.py
3. python test.py
When I run Python on this .py file, the process eats a lot of virtual
memory. Here is how high the virtual memory goes (as reported by top) for
different inputs to the C program:

1. input 100000  -> VIRT 119m
2. input 300000  -> VIRT 470m
3. input 700000  -> VIRT 1098m
4. input 1000000 -> VIRT 1598m

where VIRT is the virtual memory reported by top, in MB (megabytes).

These results are alarming, as they imply that each hash entry requires
approximately 1 KB of space.

I would like to know why this happens and how to solve it.

I also tried changing the .c file so that it generates multiple functions
that divide the work of building the hash, but the results are the same.

Please assist me with this problem!
 

Steven D'Aprano

pycraze said:
I create and run test.py with the following steps:

1. cc -o s s.c
2. ./s (input) >> test.py
3. python test.py

You are appending to the test file. How many times have
you appended to it? Once? Twice? A dozen times? Just
what is in the file test.py after all this time?

> When I run Python on this .py file, the process eats a lot of virtual
> memory. Here is how high the virtual memory goes (as reported by top)
> for different inputs to the C program:
>
> 1. input 100000 -> VIRT 119m

The dictionary you create is going to be quite small:
at least 780KB. Call it a megabyte. (What's a dozen or
two K between friends?) Heck, call it 2MB.
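A quick way to check the dictionary's own footprint (a minimal sketch,
assuming a modern Python where sys.getsizeof is available; it measures
the dict structure only, not the key and value objects):

import sys

# build a dict like the generated script does and measure its own size
n = 100000
d = dict((i, i) for i in range(n))
print("dict structure: %d bytes, roughly %.0f bytes per entry"
      % (sys.getsizeof(d), sys.getsizeof(d) / float(n)))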

That still leaves a mysterious 117MB unaccounted for.
Some of that will be the Python virtual machine and
various other overhead. What else is there?

Simple: you created a function summa with 100000 lines
of code. That's a LOT of code to go into one object.
Normally, 100,000 lines of code would be split across
dozens or hundreds of functions and multiple modules. But
you've created one giant lump of code that needs to be
paged in and out of memory in one piece. Ouch!

>>> def summa():
...     global hash
...     hash[0] = 0
...     hash[1] = 1
...
>>> import dis
>>> dis.dis(summa)
  3           0 LOAD_CONST               1 (0)
              3 LOAD_GLOBAL              0 (hash)
              6 LOAD_CONST               1 (0)
              9 STORE_SUBSCR

  4          10 LOAD_CONST               2 (1)
             13 LOAD_GLOBAL              0 (hash)
             16 LOAD_CONST               2 (1)
             19 STORE_SUBSCR
             20 LOAD_CONST               0 (None)
             23 RETURN_VALUE

That's how much bytecode you get for two keys. Now
imagine how much you'll need for 100,000 keys.
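To put a rough number on it, here is a minimal sketch (assuming Python 3;
bytecode sizes vary between versions, and the generated source simply
mirrors what your C program emits) that compiles a summa() with n
assignments and reports how large its bytecode is:

# generate the same kind of source the C program emits, then compile it
n = 1000
src = "myhash = {}\ndef summa():\n    global myhash\n"
src += "".join("    myhash[%d] = %d\n" % (j, j) for j in range(n))

ns = {}
exec(compile(src, "<generated>", "exec"), ns)
print("bytecode for %d keys: %d bytes" % (n, len(ns["summa"].__code__.co_code)))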

You don't need to write the code from C, just do it all
in Python:

hash = {}

def summa(i):
    global hash
    for j in range(i):
        hash[j] = j

import sys
summa(int(sys.argv[1]))


Now run the script:


python test.py 100000
 

pycraze

You are appending to the test file. How many times have
you appended to it? Once? Twice? A dozen times? Just
what is in the file test.py after all this time?
When the input is 4:

./s 4 > test.py

test.py contains:

myhash = {}
def summa():
    global myhash
    myhash[0] = 0
    myhash[1] = 1
    myhash[2] = 2
    myhash[3] = 3

If the input is 100, then test.py will be:

myhash = {}
def summa():
    global myhash
    myhash[0] = 0
    myhash[1] = 1
    myhash[2] = 2
    myhash[3] = 3
    ...
    myhash[99] = 99
I append only once, and I do this exercise only to build a very large hash.


This result came as a bit of a surprise: when I construct large hashes, my
system stalls.

I wanted to understand why, so I came up with this exercise.

Anyway thanks
Dennis
 

Steve Holden

pycraze said:
You are appending to the test file. How many times have
you appended to it? Once? Twice? A dozen times? Just
what is in the file test.py after all this time?

When the input is 4:

./s 4 > test.py

test.py contains:

myhash = {}
def summa():
    global myhash
    myhash[0] = 0
    myhash[1] = 1
    myhash[2] = 2
    myhash[3] = 3

If the input is 100, then test.py will be:

myhash = {}
def summa():
    global myhash
    myhash[0] = 0
    myhash[1] = 1
    myhash[2] = 2
    myhash[3] = 3
    ...
    myhash[99] = 99

I append only once, and I do this exercise only to build a very large hash.



This result came as a bit of a surprise: when I construct large hashes, my
system stalls.

I wanted to understand why, so I came up with this exercise.

OK. Now examine a similar program:

myhash = {}

def summa(n):
    global myhash
    for i in range(n):
        myhash[i] = i

summa(1000000)

and see how this compares with your program having a million lines of
source. Then repeat, changing the argument to summa. Then draw some
conclusions. Then report those results back.
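One way to make the comparison concrete (a minimal sketch, assuming Linux,
where resource.getrusage reports ru_maxrss in kilobytes):

import resource

myhash = {}

def summa(n):
    global myhash
    for i in range(n):
        myhash[i] = i

summa(1000000)
print("peak RSS: %d kB" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)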

regards
Steve
 

pycraze

Surely the method above is much better than my earlier approach. The main
reason I did this exercise is that when I have to marshal a 20-40 MB
test.py file like the one above to disk, simply loading test.py skyrockets
my virtual memory consumption.

I was a bit troubled by this bottleneck, so I wanted to do some research
before coming to any conclusions.

Anyway, I really appreciate your enthusiasm in helping me come to a
conclusion.

Dennis
 

Steve Holden

pycraze said:
Surely the method above is much better than my earlier approach. The main
reason I did this exercise is that when I have to marshal a 20-40 MB
test.py file like the one above to disk, simply loading test.py skyrockets
my virtual memory consumption.

I was a bit troubled by this bottleneck, so I wanted to do some research
before coming to any conclusions.

Anyway, I really appreciate your enthusiasm in helping me come to a
conclusion.

Dennis
I'd be interested to know what other languages you have tested with
20-40MB source files, and what conclusions you have arrived at about
them. Particularly since your initial conclusion about Python was
"dictionaries are weird and each entry uses approximately 1kB" :)

regards
Steve
 

Steven D'Aprano

Surely the method above is much better than my earlier approach. The main
reason I did this exercise is that when I have to marshal a 20-40 MB
test.py file like the one above to disk, simply loading test.py skyrockets
my virtual memory consumption.

Why do you have to marshal a 20MB test.py file? That's just crazy.

The entire standard Python distribution, object files and source files
combined, is about 90MB. There are 185 *.py modules just in the top level
of the library directory, which is less than half a megabyte per module
(and in reality, much less than that).

If your source file is more than 100K in size, you really should be
breaking it into separate modules. If any one function is more than one
page when printed out, you probably should be splitting it into two or
more functions.
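If the goal is simply to persist a large dictionary, one option (a minimal
sketch, assuming the keys and values are plain marshallable objects; the
file name is illustrative) is to dump the dictionary itself instead of
generating and executing a multi-megabyte .py file:

import marshal

myhash = dict((i, i) for i in range(1000000))

# write the dictionary object directly; no generated source code involved
with open("myhash.dat", "wb") as f:
    marshal.dump(myhash, f)

# read it back
with open("myhash.dat", "rb") as f:
    restored = marshal.load(f)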
 
