I need ideas on how to sort 350 million lines of data

Discussion in 'Perl Misc' started by chadda@lonemerchant.com, May 17, 2008.

  1. Guest

    I have roughly 350 million lines of data in the following form

    name, price, weight, brand, sku, upc, size

    sitting on my home PC.

    Is there some kind of sane way to sort this without taking up too much
    ram or jacking up my limited CPU time?
    , May 17, 2008
    #1
  2. Andrew Rich Guest

    What operating system ?

    I would throw it into MySQL


    <> wrote in message
    news:...
    >I have roughly 350 million lines of data in the following form
    >
    > name, price, weight, brand, sku, upc, size
    >
    > sitting on my home PC.
    >
    > Is there some kind of sane way to sort this without taking up too much
    > ram or jacking up my limited CPU time?
    Andrew Rich, May 17, 2008
    #2

  3. Guest

    On May 17, 8:49 am, "Andrew Rich" <> wrote:
    > What operating system ?
    >
    > I would throw it into MySQL



    At the risk of sounding like a total dumba--, is it possible to
    upload a .csv file directly into mysql?
    , May 17, 2008
    #3
  4. Guest

    On May 17, 8:56 am, wrote:
    > At the risk of sounding like a total dumba--, is it possible to
    > upload a .csv file directly into mysql?


    Never mind. I can google the answer. Thanks.
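
    For the record, the answer is yes: MySQL's LOAD DATA INFILE statement imports a
    CSV straight into a table. A minimal sketch, with the caveat that the database
    name, table name, and column types here are assumptions, and the server must
    have local-infile enabled:

```shell
# Sketch only: needs a running MySQL server with local-infile enabled.
# The database "catalog" and table "products" are illustrative names.
mysql --local-infile=1 catalog <<'SQL'
CREATE TABLE IF NOT EXISTS products (
  name   VARCHAR(255),
  price  DECIMAL(10,2),
  weight DECIMAL(10,3),
  brand  VARCHAR(100),
  sku    VARCHAR(64),
  upc    VARCHAR(32),
  size   VARCHAR(32)
);
LOAD DATA LOCAL INFILE 'data.csv'
INTO TABLE products
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
SQL
```

    Once loaded, the sort becomes a plain ORDER BY, ideally backed by an index on
    the sort column.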
    , May 17, 2008
    #4
  5. Guest

    wrote:
    > I have roughly 350 million lines of data in the following form
    >
    > name, price, weight, brand, sku, upc, size


    The name field, in particular, might contain embedded punctuation
    and might be escaped in some way. That could complicate things.

    > sitting on my home PC.


    What kind of PC is your home PC?

    > Is there some kind of sane way to sort this without taking up too much
    > ram


    As long as you have plenty of scratch space, Linux's system sort will
    use temp files to sort data much larger than main memory. For all I
    know, Windows' DOS-emulator sort will as well. The question is whether
    you can get the system sort command to sort on the field and collation
    sequence you want. If not, you could use Perl to transform the data
    into something more acceptable, run the system sort, then transform it
    back.
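
    A minimal sketch of that approach with GNU sort, assuming price (field 2) is
    the sort key; the three sample rows, the field number, and the scratch path
    are made up for illustration:

```shell
# Sketch: sort a CSV by its price field with GNU sort, keeping RAM use
# bounded by pointing temp files at a scratch directory (-T).
printf '%s\n' \
  'widget,9.99,1.2,Acme,A1,000111,small' \
  'gadget,2.50,0.5,Bolt,B2,000222,large' \
  'gizmo,5.00,0.8,Cogs,C3,000333,medium' > data.csv
mkdir -p scratch
# -t, : comma is the field delimiter; -k2,2g : general-numeric key on field 2
sort -t, -k2,2g -T scratch data.csv > sorted.csv
cat sorted.csv
```

    This falls over if the name field can contain quoted embedded commas, which is
    exactly the complication mentioned above; in that case a Perl pre-pass to
    normalize the delimiter is the safer route.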

    > or jacking up my limited CPU time?


    Sorting 350 million records will take some CPU time. I don't know what
    you consider to be "jacking up" or how limited you think your CPU time
    is. My CPUs are limited to about 86,400 seconds per day, whether I am
    using them or not.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , May 18, 2008
    #5
  6. Bill H Guest

    On May 17, 11:21 am, wrote:
    > I have roughly 350 million lines of data in the following form
    >
    > name, price, weight, brand, sku, upc, size
    >
    > sitting on my home PC.
    >
    > Is there some kind of sane way to sort this without taking up too much
    > ram or jacking up my limited CPU time?


    Just out of curiosity, I would like to know how someone has a file
    containing 350 million lines of product information sitting on a home
    PC in the first place. I mean, it had to have come from some sort of
    database to start with, and with those numbers we aren't talking about
    a second-hand store.

    Bill H
    Bill H, May 19, 2008
    #6
  7. Ted Zlatanov Guest

    On Sat, 17 May 2008 08:21:21 -0700 (PDT) wrote:

    c> I have roughly 350 million lines of data in the following form
    c> name, price, weight, brand, sku, upc, size

    c> sitting on my home PC.

    c> Is there some kind of sane way to sort this without taking up too much
    c> ram or jacking up my limited CPU time?

    One simple way, without using databases, is to take smaller pieces (say,
    10K lines each) and sort them individually by whatever field you need.
    Then you take the top or bottom of each piece, make a new set, and sort
    that set for the final result.

    If you need to sort the whole list and not just get the max/min, apply
    the same algorithm except you keep each sorted piece open and keep
    taking the smallest/largest element from the top/bottom of the piece
    that contains it.

    For more information and if my explanation doesn't make sense, look up
    the "merge sort" algorithm.
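
    The split/sort/merge idea above can be sketched with plain coreutils; the
    two-line chunk size and six-line sample are illustrative stand-ins (in
    practice each piece would be ~10K lines):

```shell
# Sketch: external merge sort with coreutils.
# 1) split the big file into pieces, 2) sort each piece,
# 3) merge the sorted pieces; sort -m only holds one line per piece in RAM.
printf '%s\n' d b e a c f > big.txt
split -l 2 big.txt chunk.        # makes chunk.aa, chunk.ab, chunk.ac
for f in chunk.*; do
  sort "$f" -o "$f"              # sort each piece in place
done
sort -m chunk.* > merged.txt
cat merged.txt
```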

    Ted
    Ted Zlatanov, May 19, 2008
    #7
  8. Ted Zlatanov Guest

    On Tue, 20 May 2008 11:42:55 +0100 bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

    b> Ted Zlatanov wrote:
    >> One simple way, without using databases, is to take smaller pieces (say,
    >> 10K lines each) and sort them individually by whatever field you need.
    >> Then you take the top or bottom of each piece, make a new set, and sort
    >> that set for the final result.
    >>
    >> If you need to sort the whole list and not just get the max/min, apply
    >> the same algorithm except you keep each sorted piece open and keep
    >> taking the smallest/largest element from the top/bottom of the piece
    >> that contains it.
    >>
    >> For more information and if my explanation doesn't make sense, look up
    >> the "merge sort" algorithm.


    b> IIRC Linux/Unix sort uses quicksort in RAM and merge sort (via
    b> disc) if the data size exceeds RAM size, again using quicksort in
    b> RAM when the portions to be merged fit in RAM.

    Yes, but a) it writes them in /tmp (unless you use -T in newer sort
    implementations), b) it's not as flexible as what I described, and c) it
    only works on Unix-like systems (on Windows you have to install cygwin
    or other packages, etc.).

    (b) is particularly important IMO for anything but simple sorting.

    Ted
    Ted Zlatanov, May 20, 2008
    #8
  9. Guest

    On May 18, 1:51 pm, wrote:
    > What kind of PC is your home PC?


    My home PC is a 700MHz Intel with 256MB RAM, running Fedora Core 6.
    , May 20, 2008
    #9
  10. Bill H Guest

    On May 20, 2:51 pm, RedGrittyBrick <> wrote:
    > Bill H wrote:
    > > Just out of curiosity I would like to know how someone has a file
    > > containing 350 million lines of product information sitting on a
    > > home pc in the first place.
    >
    > In an earlier thread* you'll see the OP is planning to download 350
    > million records one at a time from the doba.com website. Sinan pointed
    > out this would take 3.7 years of continuous scraping (at 3 pages/sec).
    >
    > Perhaps the OP is planning ahead.
    >
    > --
    > RGB
    > * "Need ideas on how to make this code faster than a speeding turtle"


    Well if he was downloading them individually he should have sorted
    them at the same time and killed 2 birds with one stone in those 3.7
    years.

    Bill H

    BTW - what's up with Google now using captcha in their posting??
    Bill H, May 20, 2008
    #10
