Large amount of files to parse/organize, tips on algorithm?

Discussion in 'Python' started by cnb, Sep 2, 2008.

  1. cnb

    cnb Guest

    I have a bunch of files consisting of movie reviews.

    For each file I construct a list of reviews, and then for each new file
    I merge the reviews, so that in the end I have a list of reviewers and,
    for each reviewer, all their reviews.

    What is the fastest way to do this?

    1. Create one file with reviews, open the next file, and for each review
    see if the reviewer exists; if so add the review, else create a new
    reviewer.

    2. Create all the separate files with reviews, then mergesort them?
    cnb, Sep 2, 2008
    #1
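
A minimal sketch of option 1 above, assuming each file has already been parsed into a list of (customer_id, date, rating) tuples. That tuple layout and the name `merge_reviews` are illustrative assumptions, not something given in the post. Keeping reviewers in a dict avoids scanning a list of known reviewers for every review.

```python
from collections import defaultdict

def merge_reviews(files):
    """Group reviews by reviewer across many per-file review lists.

    `files` is an iterable of lists of (customer_id, date, rating)
    tuples; this layout is an assumption made for illustration.
    """
    by_reviewer = defaultdict(list)
    for reviews in files:
        for customer_id, date, rating in reviews:
            # Dict lookup is O(1) on average, so "does this reviewer
            # exist yet?" costs the same for 10 or 450,000 reviewers.
            by_reviewer[customer_id].append((date, rating))
    return by_reviewer

# Two tiny made-up "files" of parsed reviews.
file_a = [(1001, "2005-09-06", 3), (1002, "2005-05-13", 5)]
file_b = [(1001, "2005-10-19", 4)]
merged = merge_reviews([file_a, file_b])
```

With this approach the merge is a single pass over all reviews, and no intermediate per-file output has to be written or sorted.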

  2. On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:

    > I have a bunch of files consisting of movie reviews.
    >
    > For each file I construct a list of reviews, and then for each new file
    > I merge the reviews, so that in the end I have a list of reviewers and,
    > for each reviewer, all their reviews.
    >
    > What is the fastest way to do this?


    Use the timeit module to find out.


    > 1. Create one file with reviews, open the next file, and for each review
    > see if the reviewer exists; if so add the review, else create a new
    > reviewer.
    >
    > 2. Create all the separate files with reviews, then mergesort them?


    The answer will depend on whether you have three reviews or three
    million, whether each review is twenty words or twenty thousand words,
    and whether you have to do the merging once only or over and over again.


    --
    Steven
    Steven D'Aprano, Sep 2, 2008
    #2
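
A rough sketch of the timeit suggestion above, comparing a dict-based grouping against a sort-then-group pass on synthetic data. The function names, the (customer_id, date, rating) tuple layout, and the data sizes are all assumptions chosen for illustration; real timings depend on the actual data.

```python
import random
import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def make_reviews(n, n_reviewers=1000, seed=0):
    """Build n synthetic (customer_id, date, rating) tuples."""
    rng = random.Random(seed)
    return [(rng.randrange(n_reviewers), "2005-01-01", rng.randint(1, 5))
            for _ in range(n)]

def group_with_dict(reviews):
    """One pass, appending each review to its reviewer's list."""
    grouped = defaultdict(list)
    for customer_id, date, rating in reviews:
        grouped[customer_id].append((date, rating))
    return dict(grouped)

def group_with_sort(reviews):
    """Sort by reviewer, then collect each run of equal ids."""
    ordered = sorted(reviews, key=itemgetter(0))
    return {cid: [(d, r) for _, d, r in group]
            for cid, group in groupby(ordered, key=itemgetter(0))}

reviews = make_reviews(10_000)
for func in (group_with_dict, group_with_sort):
    seconds = timeit.timeit(lambda: func(reviews), number=10)
    print(f"{func.__name__}: {seconds:.3f}s for 10 runs")
```

Changing the argument to `make_reviews` gives a quick feel for how each approach scales before committing to one.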

  3. cnb

    cnb Guest

    On Sep 2, 7:06 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
    > On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:
    > > I have a bunch of files consisting of movie reviews.
    > >
    > > For each file I construct a list of reviews, and then for each new file
    > > I merge the reviews, so that in the end I have a list of reviewers and,
    > > for each reviewer, all their reviews.
    > >
    > > What is the fastest way to do this?
    >
    > Use the timeit module to find out.
    >
    > > 1. Create one file with reviews, open the next file, and for each review
    > > see if the reviewer exists; if so add the review, else create a new
    > > reviewer.
    > >
    > > 2. Create all the separate files with reviews, then mergesort them?
    >
    > The answer will depend on whether you have three reviews or three
    > million, whether each review is twenty words or twenty thousand words,
    > and whether you have to do the merging once only or over and over again.
    >
    > --
    > Steven

    I merge once. Each review has 3 fields: date, rating, customerid. In
    total I'll be parsing between 10K and 100K, eventually 450K reviews.
    cnb, Sep 2, 2008
    #3
  4. cnb

    cnb Guest

    Over 17,000 files...

    It's for the Netflix Prize.
    cnb, Sep 2, 2008
    #4
  5. Eric Wertman

    Eric Wertman Guest

    Eric Wertman, Sep 2, 2008
    #5
  6. Paul Rubin

    Paul Rubin Guest

    cnb writes:
    > For each file I construct a list of reviews, and then for each new file
    > I merge the reviews, so that in the end I have a list of reviewers and,
    > for each reviewer, all their reviews.
    >
    > What is the fastest way to do this?


    Scan through all the files sequentially, emitting records like

    (movie, reviewer, review)

    Then use an external sort utility to sort/merge that output file
    on each of the 3 columns. Beats writing code.
    Paul Rubin, Sep 2, 2008
    #6
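
A small sketch of the emit-then-sort idea above. It writes one flat row per review and then sorts the rows by reviewer. The in-memory `list.sort` here is only a stand-in for the external sort utility suggested in the post, and the row layout (movie, reviewer, date, rating), the name `emit_records`, and the sample data are assumptions for illustration.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

def emit_records(movies, out):
    """Write one (movie, reviewer, date, rating) row per review."""
    writer = csv.writer(out)
    for movie_id, reviews in movies.items():
        for customer_id, date, rating in reviews:
            writer.writerow([movie_id, customer_id, date, rating])

# Made-up input: movie id -> list of (customer_id, date, rating).
movies = {
    1: [(1002, "2005-05-13", 5), (1001, "2005-09-06", 3)],
    2: [(1001, "2005-10-19", 4)],
}
buffer = io.StringIO()  # a real run would write to a file on disk
emit_records(movies, buffer)

# Stand-in for the external sort: order rows by reviewer, then movie.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
rows.sort(key=lambda row: (int(row[1]), int(row[0])))

# After sorting, each reviewer's rows are adjacent and easy to group.
by_reviewer = {
    int(reviewer): [(int(m), d, int(r)) for m, _, d, r in group]
    for reviewer, group in groupby(rows, key=itemgetter(1))
}
```

Because every record is a single flat line, the same output file can be re-sorted on a different column without re-parsing the original files.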
  7. jay graves

    jay graves Guest

    jay graves, Sep 2, 2008
    #7