I need ideas on how to sort 350 million lines of data

chadda

I have roughly 350 million lines of data in the following form

name, price, weight, brand, sku, upc, size

sitting on my home PC.

Is there some kind of sane way to sort this without taking up too much
ram or jacking up my limited CPU time?
 
chadda

What operating system?

I would throw it into MySQL

At the risk of sounding like a total dumba--, is it possible to
upload a .csv file directly into MySQL?
 
chadda

At the risk of sounding like a total dumba--, is it possible to
upload a .csv file directly into MySQL?

Never mind. I can google the answer. Thanks.
 
xhoster

I have roughly 350 million lines of data in the following form

name, price, weight, brand, sku, upc, size

Name, in particular, seems like it might be able to contain embedded
punctuation and might be escaped in some way. That could complicate
things.

sitting on my home PC.

What kind of PC is your home PC?

Is there some kind of sane way to sort this without taking up too much
ram

As long as you have plenty of scratch space, Linux's system sort will
use temp files to sort things much larger than main memory. For all I
know, the sort in Windows' DOS emulator will as well. But it is a matter of
whether you can get the system sort command to sort on the field and
collation sequence you want sorted. If not, you could use Perl to
transform the data into something more acceptable, use the system sort,
then transform it back.
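
A rough sketch of that transform/sort/transform-back idea, written in Python rather than Perl purely for illustration. The function names, the choice of the second field (price) as the key, and the tab separator are all assumptions, not anything from the original data. It also assumes one record per line, prices that are non-negative, and no tab characters in the key.

```python
import csv

KEY_COLUMN = 1  # assumption: sort on the second field (price)
SEP = "\t"      # assumption: the sort key never contains a tab


def decorate(line, key_column=KEY_COLUMN):
    """Prefix a CSV line with its sort key so a plain text sort can order it."""
    fields = next(csv.reader([line], skipinitialspace=True))
    key = fields[key_column]
    try:
        # Zero-pad numeric keys so byte-wise order matches numeric order
        # (only valid for non-negative numbers).
        key = f"{float(key):015.2f}"
    except ValueError:
        pass  # non-numeric keys are left as plain text
    return f"{key}{SEP}{line}"


def undecorate(line):
    """Strip the key prefix that decorate() added."""
    return line.split(SEP, 1)[1]
```

The decorated lines would be written to a file, sorted on the first tab-separated field by the system sort (something like `LC_ALL=C sort -t "$(printf '\t')" -k1,1`), and then passed through `undecorate`. Using the `csv` module for the split is what copes with a quoted name that contains commas.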

or jacking up my limited CPU time?

Sorting 350 million records will take some CPU time. I don't know what
you consider to be "jacking up" or how limited you think your CPU time is.
My CPUs are limited to about 86,400 seconds per day, whether I am using
them or not.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
Bill H

I have roughly 350 million lines of data in the following form

name, price, weight, brand, sku, upc, size

sitting on my home PC.

Is there some kind of sane way to sort this without taking up too much
ram or jacking up my limited CPU time?

Just out of curiosity, I would like to know how someone has a file
containing 350 million lines of product information sitting on a home
PC in the first place. I mean, it had to have come from some sort of
database to start with, and with those numbers we aren't talking about
a second-hand store.

Bill H
 
Ted Zlatanov

On Sat, 17 May 2008 08:21:21 -0700 (PDT) (e-mail address removed) wrote:

c> I have roughly 350 million lines of data in the following form
c> name, price, weight, brand, sku, upc, size

c> sitting on my home PC.

c> Is there some kind of sane way to sort this without taking up too much
c> ram or jacking up my limited CPU time?

One simple way, without using databases, is to take smaller pieces (say,
10K lines each) and sort them individually by whatever field you need.
Then you take the top or bottom of each piece, make a new set, and sort
that set for the final result.

If you need to sort the whole list and not just get the max/min, apply
the same algorithm except you keep each sorted piece open and keep
taking the smallest/largest element from the top/bottom of the piece
that contains it.

For more information and if my explanation doesn't make sense, look up
the "merge sort" algorithm.
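
The steps above can be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not a tuned implementation: `external_sort`, `sort_key`, and the default of sorting on the first field (name) are made-up names and choices, and the input is assumed to hold one record per line with no blank lines.

```python
import csv
import heapq
import itertools
import os
import tempfile


def sort_key(line, column=0):
    """Return the field to sort on (assumption: one CSV record per line)."""
    return next(csv.reader([line], skipinitialspace=True))[column]


def external_sort(in_path, out_path, chunk_lines=10_000, key=sort_key):
    """Sort a large text file via sorted chunk files and a k-way merge."""
    chunk_paths = []
    with open(in_path, newline="") as src:
        while True:
            # Read one piece that fits comfortably in memory.
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            # Make sure the final line of the file ends with a newline.
            chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(suffix=".chunk")
            with os.fdopen(fd, "w", newline="") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # Keep every sorted piece open and repeatedly take the smallest head line.
    files = [open(p, newline="") for p in chunk_paths]
    try:
        with open(out_path, "w", newline="") as dst:
            dst.writelines(heapq.merge(*files, key=key))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Note that 350 million lines in 10K-line pieces means 35,000 chunk files, which is more than most systems will let one process hold open at once. In practice the pieces would need to be larger, or the merge done in several passes.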

Ted
 
Ted Zlatanov

b> IIRC Linux/Unix sort used quicksort for in-RAM sorting
b> and merge sort (via disc) if the data size exceeds RAM size,
b> again using quicksort in RAM when the portion to be
b> merged fit in RAM.

Yes, but a) it writes them in /tmp (unless you use -T in newer sort
implementations), b) it's not as flexible as what I described, and c) it
only works on Unix-like systems (on Windows you have to install cygwin
or other packages, etc.).

(b) is particularly important IMO for anything but simple sorting.

Ted
 
chadda

Name, in particular, seems like it might be able to contain embedded
punctuation and might be escaped in some way. That could complicate
things


What kind of PC is your home PC?

My home PC is a 700 MHz Intel with 256 MB of RAM, running Fedora Core Linux 6.
 
Bill H

In an earlier thread* you'll see the OP is planning to download 350
million records one at a time from the doba.com website. Sinan pointed
out this would take 3.7 years of continuous scraping (at 3 pages/sec).

Perhaps the OP is planning ahead.

--
RGB
* "Need ideas on how to make this code faster than a speeding turtle"

Well, if he was downloading them individually, he should have sorted
them at the same time and killed two birds with one stone in those 3.7
years.

Bill H

BTW - what's up with Google now using CAPTCHA in their posting?
 
