Large Amount of Data

Jack

I need to process a large amount of data. The data structure fits well
in a dictionary, but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory, or will it fail?

Thanks.
 
Matimus

> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory, or will it fail?
>
> Thanks.

The OS will take care of memory swapping. It might get slow, but I
don't think it should fail.

Matt
 
Marc 'BlackJack' Rintsch

> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory, or will it fail?

What about putting the data into a database? If the keys are strings, the
`shelve` module might be a solution.
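
Roughly like this, just as a sketch (the 'bigdata.db' file name and the
stored value are made up):

    import shelve

    # Entries live on disk and are fetched by key, so the whole data set
    # never has to fit in physical memory at once.
    db = shelve.open('bigdata.db')        # made-up file name

    db['some-key'] = {'anything': 'picklable'}
    print('some-key' in db)               # membership test, like a dict
    print(db['some-key'])                 # lookup, like a dict
    db.close()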

Ciao,
Marc 'BlackJack' Rintsch
 
kaens

> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory, or will it fail?
>
> Thanks.

Could you process it in chunks, instead of reading in all the data at once?
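
Something along these lines, say - assuming the data can be read record
by record from a file ('records.txt', parse_record and handle_chunk are
just placeholders for the real format and processing):

    def parse_record(line):
        # Placeholder: split a tab-separated "key<TAB>value" line.
        key, _, value = line.rstrip('\n').partition('\t')
        return key, value

    def handle_chunk(chunk):
        # Placeholder: do the real per-chunk processing here.
        pass

    def process_in_chunks(path, chunk_size=100000):
        chunk = {}
        with open(path) as f:
            for line in f:
                key, value = parse_record(line)
                chunk[key] = value
                if len(chunk) >= chunk_size:
                    handle_chunk(chunk)
                    chunk = {}            # free memory before the next chunk
        if chunk:
            handle_chunk(chunk)

    process_in_chunks('records.txt')      # made-up file name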
 
Vyacheslav Maslov

Larry said:
> Purchase more memory. It is REALLY cheap these days.

Not a solution at all. What if the amount of data exceeds the
architecture's memory limits, i.e. 4 GB on 32-bit?

A better solution is to use a database for data storage/processing.
 
Dennis Lee Bieber

> Thanks for the replies!
>
> Database will be too slow for what I want to do.

Slower than having every process on the computer potentially slowed
down due to page swapping (and, for really huge data, still running the
risk of exceeding the single-process address space)?
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 
John Nagle

Jack said:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large - close to or more than the size
> of physical memory. I wonder what will happen if I try to load the data
> into a dictionary. Will Python use swap memory, or will it fail?
>
> Thanks.

What are you trying to do? At one extreme, you're implementing something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and need to talk seriously
about 64-bit address space machines. At the other, you have no idea how
to either use a database or do sequential processing. Tell us more.

John Nagle
 
Jack

I have tens of millions (could be more) of documents in files. Each of
them has other properties in separate files. I need to check if they
exist, update and merge properties, etc. And this is not a one-time job.
Because of the quantity of the files, I think querying and updating a
database will take a long time...

Let's say I want to do something a search engine needs to do in terms of
the amount of data to be processed on a server. I doubt any serious
search engine would use a database for indexing and searching. A hash
table is what I need, not powerful queries.
 
Jack

I suppose I could, but it won't be very efficient. I could have a
smaller hash table, process the records that are in it, and save the
ones that are not in the hash table for another round of processing.
But a chunked hash table won't work that well, because you don't know
whether a key exists in the other chunks. In order to do this, I'll
need a rule to partition the data into chunks, so it's more work in
general.
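
One rule that would at least keep each key in a single chunk is to hash
it into a fixed number of buckets, so the same key always lands in the
same chunk and never has to be looked for anywhere else. A rough sketch,
assuming string keys (bucket count and file names are made up):

    import hashlib

    NUM_BUCKETS = 64                      # pick so one bucket fits in memory

    def bucket_for(key):
        # The same key always maps to the same bucket, so it never has to
        # be looked for in any other chunk.
        digest = hashlib.md5(key.encode('utf-8')).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    def partition(records):
        # records: iterable of (key, value) pairs; writes one file per bucket.
        files = [open('bucket-%02d.txt' % i, 'a') for i in range(NUM_BUCKETS)]
        try:
            for key, value in records:
                files[bucket_for(key)].write('%s\t%s\n' % (key, value))
        finally:
            for f in files:
                f.close()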
 
Jack

If swap memory cannot handle this efficiently, I may need to partition
the data across multiple servers and use RPC to communicate.
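
Sketching what that might look like with the standard library's XML-RPC
modules (host names, port, and the hashing rule are all made up; the two
pieces below are separate programs):

    # --- server.py: holds one partition of the data in memory ---
    from xmlrpc.server import SimpleXMLRPCServer

    table = {}

    def put(key, value):
        table[key] = value
        return True

    def get(key):
        return table.get(key, '')

    server = SimpleXMLRPCServer(('0.0.0.0', 8000))
    server.register_function(put)
    server.register_function(get)
    server.serve_forever()

    # --- client.py: picks the server that owns a key by hashing it ---
    import hashlib
    import xmlrpc.client

    servers = [xmlrpc.client.ServerProxy('http://node%d:8000' % i)
               for i in range(4)]

    def server_for(key):
        h = int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)
        return servers[h % len(servers)]

    server_for('doc-00001').put('doc-00001', 'lang=en')
    print(server_for('doc-00001').get('doc-00001'))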
 
Marc 'BlackJack' Rintsch

> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc. And this is not a one-time job.
> Because of the quantity of the files, I think querying and updating a
> database will take a long time...

But databases are built and optimized exactly to handle large amounts of
data.

> Let's say I want to do something a search engine needs to do in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.

You are not forced to use complex queries, and an index is much like a
hash table, often even implemented as one. And a database doesn't have
to be an SQL database; the `shelve` module or an object DB like ZODB or
Durus are databases too.

Maybe you should try it and measure before claiming it's going to be too
slow and spending time implementing something like a database yourself.
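
For instance, a quick-and-dirty timing of dict lookups against `shelve`
lookups (the sizes and the 'bench.db' file name are arbitrary):

    import random
    import shelve
    import time

    N = 100000
    keys = ['doc-%d' % i for i in range(N)]

    # The same data in memory and in a shelve file on disk.
    data = dict((k, i) for i, k in enumerate(keys))
    db = shelve.open('bench.db')
    for k, v in data.items():
        db[k] = v

    sample = random.sample(keys, 10000)

    t0 = time.time()
    for k in sample:
        data[k]
    print('dict lookups:   %.3f s' % (time.time() - t0))

    t0 = time.time()
    for k in sample:
        db[k]
    print('shelve lookups: %.3f s' % (time.time() - t0))
    db.close()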

Ciao,
Marc 'BlackJack' Rintsch
 
Steve Holden

Jack said:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc. And this is not a one-time job.
> Because of the quantity of the files, I think querying and updating a
> database will take a long time...
>
And I think you are wrong. But of course the only way to find out who's
right and who's wrong is to do some experiments and get some benchmark
timings.

All I *would* say is that it's unwise to proceed with a memory-only
architecture when you only have assumptions about the limitations of
particular architectures, and your problem might actually grow to exceed
the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to
take a real nose-dive. Then where do you go? Much better to architect
the application so that you anticipate exceeding memory limits from the
start, I'd hazard.
> Let's say I want to do something a search engine needs to do in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.
>
You might be surprised. Google, for example, use a widely-distributed
and highly-redundant storage format, but they certainly don't keep the
whole Internet in memory :)

Perhaps you need to explain the problem in more detail if you still need
help.

regards
Steve


--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
 
John Machin

> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.

And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?



> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...

Don't think, benchmark.

> Let's say I want to do something a search engine needs to do in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.

Having a single hash table permits two not very powerful query
methods: (1) return the data associated with a single hash key; (2)
trawl through the whole hash table, applying various conditions to the
data. If that is all you want, then comparisons with a serious search
engine are quite irrelevant.

What is relevant is that the whole hash table has to be in virtual
memory before you can start either type of query. This is not the case
with a database. Type 1 queries (with a suitable index on the primary
key) should use only a fraction of the memory that a full hash table
would.
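
For example, with the sqlite3 module that ships with Python (the docs
table and the doc_id key are invented here):

    import sqlite3

    conn = sqlite3.connect('docs.sqlite')
    conn.execute('CREATE TABLE IF NOT EXISTS docs'
                 ' (doc_id TEXT PRIMARY KEY, props TEXT)')
    conn.execute('INSERT OR REPLACE INTO docs VALUES (?, ?)',
                 ('doc-00001', 'lang=en;length=1234'))
    conn.commit()

    # A "type 1" query: only the matching row is read into memory,
    # found via the index on the primary key.
    row = conn.execute('SELECT props FROM docs WHERE doc_id = ?',
                       ('doc-00001',)).fetchone()
    print(row)
    conn.close()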

What is the primary key of your data?
 
Jack

I'll save them in a file for further processing.

John Machin said:
> And then save the results where?
> Option (0) retain it in memory
> Option (1) a file
> Option (2) a database
>
> And why are you doing this agglomeration of information? Presumably so
> that it can be queried. Do you plan to load the whole file into memory
> in order to satisfy a simple query?
 
Jack

John, thanks for your reply. I will then use the files as input to
generate an index. So the files are temporary, and provide some
attributes in the index. So I do this multiple times to gather
different attributes, merge, etc.
 
