deduping

Discussion in 'Python' started by dirknbr, Jun 21, 2010.

  1. dirknbr

    dirknbr Guest

    Hi

    I have 2 files (done and outf), and I want to choose unique elements
    from the 2nd column in outf which are not in done. This code works but
    is not efficient, can you think of a quicker way? The a=1 is just a
    redundant task obviously, I put it this way around because I think
    'in' is quicker than 'not in' - is that true?

    done_={}
    for line in done:
        done_[line.strip()]=0

    print len(done_)

    universe={}
    for line in outf:
        if line.split(',')[1].strip() in universe.keys():
            a=1
        else:
            if line.split(',')[1].strip() in done_.keys():
                a=1
            else:
                universe[line.split(',')[1].strip()]=0

    Dirk
     
    dirknbr, Jun 21, 2010
    #1

  2. > universe={}
    > for line in outf:
    >     if line.split(',')[1].strip() in universe.keys():
    >         a=1
    >     else:
    >         if line.split(',')[1].strip() in done_.keys():
    >             a=1
    >         else:
    >             universe[line.split(',')[1].strip()]=0
    >


    I can't say much because I can't see what data is being processed,
    but what I can say is that "line.split(',')[1].strip()" may be
    called up to three times per line, so I would call it only once.
    I would write it like this:

    for line in outf:
        key = line.split(',')[1].strip()
        if not (key in universe.keys()):
            if not (key in done_.keys()):
                universe[key] = 0
     
    Thomas Lehmann, Jun 21, 2010
    #2

  3. dirknbr

    Peter Otten Guest

    dirknbr wrote:

    > Hi
    >
    > I have 2 files (done and outf), and I want to choose unique elements
    > from the 2nd column in outf which are not in done. This code works but
    > is not efficient, can you think of a quicker way? The a=1 is just a
    > redundant task obviously, I put it this way around because I think
    > 'in' is quicker than 'not in' - is that true?
    >
    > done_={}
    > for line in done:
    >     done_[line.strip()]=0
    >
    > print len(done_)
    >
    > universe={}
    > for line in outf:
    >     if line.split(',')[1].strip() in universe.keys():
    >         a=1
    >     else:
    >         if line.split(',')[1].strip() in done_.keys():
    >             a=1
    >         else:
    >             universe[line.split(',')[1].strip()]=0


    Instead of

    if key in some_dict.keys():
        #...

    which converts the keys in the dictionary to a list and then performs an
    O(N) lookup on that list you should use

    if key in some_dict:
        #...

    which doesn't build a list and looks up the key in constant time.
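To make the difference concrete, here is a minimal sketch of the two membership tests. The names some_dict and key follow the snippets above; the sample data is made up for illustration. The list-building behaviour described above is that of Python 2 (in Python 3, keys() returns a view instead), so the explicit list() call is used here to reproduce it in either version.

```python
# Hypothetical sample data: 100,000 string keys, all mapped to 0.
some_dict = dict((str(i), 0) for i in range(100000))
key = "99999"

# Slow: materialise every key in a list, then scan it linearly.
# (In Python 2, some_dict.keys() builds such a list on every call.)
found_in_list = key in list(some_dict.keys())

# Fast: hash lookup directly on the dict, no intermediate list.
found_in_dict = key in some_dict

# Both tests give the same answer; only the cost differs.
print(found_in_list == found_in_dict)
```

Inside a loop over many lines, the slow form repeats the list build and the scan for every line, which is where the time goes.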

    Peter
     
    Peter Otten, Jun 21, 2010
    #3
  4. dirknbr

    Guest

    Use a set instead of a dictionary for done keys?

    Malcolm
     
    , Jun 21, 2010
    #4
  5. dirknbr

    Dave Angel Guest

    dirknbr wrote:
    > Hi
    >
    > I have 2 files (done and outf), and I want to choose unique elements
    > from the 2nd column in outf which are not in done. This code works but
    > is not efficient, can you think of a quicker way? The a=1 is just a
    > redundant task obviously, I put it this way around because I think
    > 'in' is quicker than 'not in' - is that true?
    >
    > done_={}
    > for line in done:
    >     done_[line.strip()]=0
    >
    > print len(done_)
    >
    > universe={}
    > for line in outf:
    >     if line.split(',')[1].strip() in universe.keys():
    >         a=1
    >     else:
    >         if line.split(',')[1].strip() in done_.keys():
    >             a=1
    >         else:
    >             universe[line.split(',')[1].strip()]=0
    >
    > Dirk
    >
    >

    Where you have a=1, one would normally use the "pass" statement. But
    you're wrong that 'not in' is less efficient than 'in'. If there's a
    difference, it's probably negligible, and almost certainly less than the
    extra else clause you're forcing here.

    When doing an 'in', do *not* use the keys() method, as you're replacing
    a fast lookup with a slow one, not to mention the time it takes to build
    the keys() list each time.

    In both these cases, you can use a set, rather than a dict. And there's
    no need to test whether the item is already in the set, just put it in
    again.

    Changing all that, you'll wind up with something like (untested)

    done_set = set()
    universe = set()
    for line in done:
        done_set.add(line.strip())
    for line in outf:
        item = line.split(',')[1].strip()
        if item not in done_set:
            universe.add(item)
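For anyone who wants to try the set-based approach without the original files, here is a small self-contained sketch. The function name new_items and the sample lines are hypothetical; plain lists of strings stand in for the open files done and outf, and skipping lines that have no second column is an assumption, not something stated in the question.

```python
def new_items(done_lines, outf_lines):
    """Return second-column values from outf_lines that are not in done_lines."""
    done_set = set(line.strip() for line in done_lines)
    universe = set()
    for line in outf_lines:
        fields = line.split(',')
        if len(fields) < 2:
            continue  # assumption: ignore lines without a second column
        item = fields[1].strip()
        if item not in done_set:
            universe.add(item)  # adding a duplicate is harmless in a set
    return universe


# Hypothetical sample data standing in for the two open files.
done = ["a\n", "b\n"]
outf = ["1,a,x\n", "2,c,y\n", "3,c,z\n", "4, d ,w\n"]
result = new_items(done, outf)
print(sorted(result))  # ['c', 'd']
```

The value 'a' is dropped because it is already done, 'c' appears twice in the input but only once in the result, and ' d ' is stripped before it is stored.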


    DaveA
     
    Dave Angel, Jun 21, 2010
    #5
  6. dirknbr

    Paul Rubin Guest

    dirknbr writes:
    > done_={}
    > for line in done:
    > done_[line.strip()]=0
    > ...


    Maybe you mean:

    done_ = set(line.strip() for line in done)
    outf_ = set(line.split(',')[1].strip() for line in outf)
    universe = done_ & outf_ # this finds the set intersection
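The original question asks for second-column values that are not in done, which corresponds to a set difference rather than an intersection. A minimal sketch of both operations side by side, with made-up sample lines standing in for the two files (the name already_done is hypothetical):

```python
# Hypothetical sample data standing in for the two open files.
done = ["a\n", "b\n"]
outf = ["1,a,x\n", "2,c,y\n", "3,b,z\n", "4,c,w\n"]

done_ = set(line.strip() for line in done)
outf_ = set(line.split(',')[1].strip() for line in outf)

# Intersection: values that appear in both files.
already_done = done_ & outf_

# Difference: values in outf's second column that are NOT in done.
universe = outf_ - done_

print(sorted(already_done))  # ['a', 'b']
print(sorted(universe))      # ['c']
```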
     
    Paul Rubin, Jun 21, 2010
    #6