Best way to handle large lists?

Discussion in 'Python' started by Chaz Ginger, Oct 3, 2006.

  1. Chaz Ginger

    Chaz Ginger Guest

    I have a system that has a few lists that are very large (thousands or
    tens of thousands of entries) and some that are rather small. Many times
    I have to produce the difference between a large list and a small one,
    without destroying the integrity of either list. I was wondering if
    anyone has any recommendations on how to do this and keep performance
    high? Is there a better way than

    [ i for i in bigList if i not in smallList ]

    Thanks.
    Chaz
     
    Chaz Ginger, Oct 3, 2006
    #1

  2. Duncan Booth

    Duncan Booth Guest

    Chaz Ginger <> wrote:

    > I have a system that has a few lists that are very large (thousands or
    > tens of thousands of entries) and some that are rather small. Many times
    > I have to produce the difference between a large list and a small one,
    > without destroying the integrity of either list. I was wondering if
    > anyone has any recommendations on how to do this and keep performance
    > high? Is there a better way than
    >
    > [ i for i in bigList if i not in smallList ]


    How about:

    smallSet = set(smallList)
    something = [ i for i in bigList if i not in smallSet ]

    Use timeit.py on some representative data to see what difference that
    makes.
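
    For example, a rough timing sketch along those lines (the sample data
    below is invented; substitute something representative of your lists):

    import timeit

    setup = ("import random; random.seed(0); "
             "bigList = [random.randint(0, 100000) for i in range(10000)]; "
             "smallList = bigList[:100]; "
             "smallSet = set(smallList)")

    t_list = timeit.Timer("[i for i in bigList if i not in smallList]", setup)
    t_set = timeit.Timer("[i for i in bigList if i not in smallSet]", setup)

    print(min(t_list.repeat(3, 10)))   # seconds for 10 runs, list membership
    print(min(t_set.repeat(3, 10)))    # seconds for 10 runs, set membership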
     
    Duncan Booth, Oct 3, 2006
    #2

  3. Paul Rubin

    Paul Rubin Guest

    Chaz Ginger <> writes:
    > I have a system that has a few lists that are very large (thousands or
    > tens of thousands of entries) and some that are rather small. Many times
    > I have to produce the difference between a large list and a small one,
    > without destroying the integrity of either list. I was wondering if
    > anyone has any recommendations on how to do this and keep performance
    > high? Is there a better way than
    >
    > [ i for i in bigList if i not in smallList ]


    diff = list(set(bigList) - set(smallList))

    Note that doesn't get you the elements in any particular order.
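
    For instance, with some made-up values (the exact output order of the set
    difference will vary):

    bigList = ['c', 'a', 'b', 'd']
    smallList = ['b']

    print(list(set(bigList) - set(smallList)))    # e.g. ['a', 'c', 'd'], order arbitrary

    # if the original order matters, filter the list against a set instead
    smallSet = set(smallList)
    print([i for i in bigList if i not in smallSet])    # ['c', 'a', 'd']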
     
    Paul Rubin, Oct 3, 2006
    #3
  4. durumdara

    durumdara Guest

    Chaz Ginger wrote:
    > I have a system that has a few lists that are very large (thousands or
    > tens of thousands of entries) and some that are rather small. Many times
    > I have to produce the difference between a large list and a small one,
    > without destroying the integrity of either list. I was wondering if
    > anyone has any recommendations on how to do this and keep performance
    > high? Is there a better way than
    >
    > [ i for i in bigList if i not in smallList ]
    >
    > Thanks.
    > Chaz
    >

    Hi !

    If you have a big list, you can use a dbm-style database. They are very
    quick: BSDDB, flashdb, etc. See SleepyCat, or see the Python help.

    'in' is very slow on large lists, but bsddb uses hash values, so lookups
    are very quick. The SleepyCat database has many extras; you can set the
    cache size and many other parameters.

    Or, if you don't like dbm-style databases, you can use SQLite. It is also
    quick, and you can use SQL commands. It is a little slower than bsddb,
    but it works like a SQL server, and you can improve the speed with
    special parameters.
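
    For example, a minimal sqlite3 sketch (assuming Python 2.5 or later, where
    sqlite3 ships in the standard library; the data and table name below are
    only illustrative):

    import sqlite3

    # stand-in data for the real lists
    bigList = ['host%d.example.com' % i for i in range(10000)]
    smallList = bigList[:50]

    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE small (name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO small (name) VALUES (?)",
                     [(n,) for n in smallList])
    conn.commit()

    def not_in_small(name):
        # indexed lookup instead of a linear scan over a Python list
        row = conn.execute("SELECT 1 FROM small WHERE name = ?", (name,)).fetchone()
        return row is None

    diff = [n for n in bigList if not_in_small(n)]
    print(len(diff))   # 9950 with this made-up data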

    dd
     
    durumdara, Oct 3, 2006
    #4
  5. I don't know enough about Python internals, but the suggested solutions
    all seem to involve scanning bigList. Can this presumably linear
    operation be avoided by using a dict or similar to find all occurrences of
    smallList items in bigList and then deleting those occurrences?

    Bill Williams



    In article <prrUg.1662$We.477@trndny08>,
    Chaz Ginger <> wrote:

    > I have a system that has a few lists that are very large (thousands or
    > tens of thousands of entries) and some that are rather small. Many times
    > I have to produce the difference between a large list and a small one,
    > without destroying the integrity of either list. I was wondering if
    > anyone has any recommendations on how to do this and keep performance
    > high? Is there a better way than
    >
    > [ i for i in bigList if i not in smallList ]
    >
    > Thanks.
    > Chaz
     
    Bill Williams, Oct 3, 2006
    #5
  6. Hari Sekhon

    Hari Sekhon Guest

    I don't know much about the Python internals either, so this may be the
    blind leading the blind, but aren't dicts much slower to work with than
    lists, and therefore wouldn't your suggestion to use dicts be much
    slower? I think it has something to do with the comparative overhead of
    using keys in dicts rather than positional indexes in lists/arrays...

    At least that is what I thought.

    Can anyone confirm this?

    -h

    Hari Sekhon



    Bill Williams wrote:
    > I don't know enough about Python internals, but the suggested solutions
    > all seem to involve scanning bigList. Can this presumably linear
    > operation be avoided by using dict or similar to find all occurrences of
    > smallist items in biglist and then deleting those occurrences?
    >
    > Bill Williams
    >
    >
    >
    > In article <prrUg.1662$We.477@trndny08>,
    > Chaz Ginger <> wrote:
    >
    >
    >> I have a system that has a few lists that are very large (thousands or
    >> tens of thousands of entries) and some that are rather small. Many times
    >> I have to produce the difference between a large list and a small one,
    >> without destroying the integrity of either list. I was wondering if
    >> anyone has any recommendations on how to do this and keep performance
    >> high? Is there a better way than
    >>
    >> [ i for i in bigList if i not in smallList ]
    >>
    >> Thanks.
    >> Chaz
    >>
     
    Hari Sekhon, Oct 3, 2006
    #6
  7. Bill Williams enlightened us with:
    > I don't know enough about Python internals, but the suggested
    > solutions all seem to involve scanning bigList. Can this presumably
    > linear operation be avoided by using dict or similar to find all
    > occurrences of smallist items in biglist and then deleting those
    > occurrences?


    And how would that beat O(n)? Every element of bigList has to be
    scanned at some point, either to compare it to every earlier element in
    bigList and eliminate it, or to compare it to every element in
    smallList.

    Run benchmarks on the suggestions, and see which is fastest for
    yourself.

    Sybren
    --
    Sybren Stüvel
    Stüvel IT - http://www.stuvel.eu/
     
    Sybren Stuvel, Oct 3, 2006
    #7
  8. Chaz Ginger

    Chaz Ginger Guest

    I've done that and decided that Python's list comprehension isn't the
    way to go. I was hoping that perhaps someone had experience with a
    C or C++ library with a Python interface that would make a
    difference.

    Chaz

    Sybren Stuvel wrote:
    > Bill Williams enlightened us with:
    >> I don't know enough about Python internals, but the suggested
    >> solutions all seem to involve scanning bigList. Can this presumably
    >> linear operation be avoided by using dict or similar to find all
    >> occurrences of smallist items in biglist and then deleting those
    >> occurrences?

    >
    > And how would that beat O(n)? Every element of bigList has to be
    > scanned at one point, either to compare it to every earlier element in
    > bigList and eliminate it, or to compare it to every element in
    > smallList.
    >
    > Run benchmarks on the suggestions, and see which is fastest for
    > yourself.
    >
    > Sybren
     
    Chaz Ginger, Oct 3, 2006
    #8
  9. Paul Rubin

    Paul Rubin Guest

    Sybren Stuvel <> writes:
    > > I don't know enough about Python internals, but the suggested
    > > solutions all seem to involve scanning bigList. Can this presumably
    > > linear operation be avoided by using dict or similar to find all
    > > occurrences of smallist items in biglist and then deleting those
    > > occurrences?

    >
    > And how would that beat O(n)? Every element of bigList has to be
    > scanned at one point, either to compare it to every earlier element in
    > bigList and eliminate it, or to compare it to every element in
    > smallList.


    Maybe the application should use sets instead of lists for these
    collections.
     
    Paul Rubin, Oct 3, 2006
    #9
  10. Chaz Ginger

    Chaz Ginger Guest

    Paul Rubin wrote:
    > Sybren Stuvel <> writes:
    >>> I don't know enough about Python internals, but the suggested
    >>> solutions all seem to involve scanning bigList. Can this presumably
    >>> linear operation be avoided by using dict or similar to find all
    >>> occurrences of smallist items in biglist and then deleting those
    >>> occurrences?

    >> And how would that beat O(n)? Every element of bigList has to be
    >> scanned at one point, either to compare it to every earlier element in
    >> bigList and eliminate it, or to compare it to every element in
    >> smallList.

    >
    > Maybe the application should use sets instead of lists for these
    > collections.

    What would sets do for me over lists?

    Chaz
     
    Chaz Ginger, Oct 3, 2006
    #10
  11. Larry Bates

    Larry Bates Guest

    Chaz Ginger wrote:
    > I have a system that has a few lists that are very large (thousands or
    > tens of thousands of entries) and some that are rather small. Many times
    > I have to produce the difference between a large list and a small one,
    > without destroying the integrity of either list. I was wondering if
    > anyone has any recommendations on how to do this and keep performance
    > high? Is there a better way than
    >
    > [ i for i in bigList if i not in smallList ]
    >
    > Thanks.
    > Chaz



    IMHO the only way to speed things up is to know more about the
    actual data in the lists (e.g. are the elements unique, can they
    be sorted, etc.) and take advantage of all that information to
    come up with a "faster" algorithm. If they are unique, sets
    might be a good choice. If they are sorted, the bisect module
    might help. The specifics about the list(s) may yield a faster
    method.
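
    For example, a rough sketch of the bisect idea (made-up data; it only
    pays off if the small list is kept sorted):

    import bisect

    bigList = list(range(10000))
    smallList = sorted(range(0, 10000, 200))   # 50 values, kept sorted

    def in_sorted(seq, item):
        # binary search, O(log n) per lookup instead of a linear scan
        pos = bisect.bisect_left(seq, item)
        return pos < len(seq) and seq[pos] == item

    diff = [i for i in bigList if not in_sorted(smallList, i)]
    print(len(diff))   # 9950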

    -Larry
     
    Larry Bates, Oct 3, 2006
    #11
  12. Chaz Ginger wrote:

    > What would sets do for me over lists?


    It's faster to tell whether something is in a set or dict than in a list
    (above some minimum size).

    Jeremy

    --
    Jeremy Sanders
    http://www.jeremysanders.net/
     
    Jeremy Sanders, Oct 3, 2006
    #12
  13. Chaz Ginger

    Chaz Ginger Guest

    Larry Bates wrote:
    > Chaz Ginger wrote:
    >> I have a system that has a few lists that are very large (thousands or
    >> tens of thousands of entries) and some that are rather small. Many times
    >> I have to produce the difference between a large list and a small one,
    >> without destroying the integrity of either list. I was wondering if
    >> anyone has any recommendations on how to do this and keep performance
    >> high? Is there a better way than
    >>
    >> [ i for i in bigList if i not in smallList ]
    >>
    >> Thanks.
    >> Chaz

    >
    >
    > IMHO the only way to speed things up is to know more about the
    > actual data in the lists (e.g are the elements unique, can they
    > be sorted, etc) and take advantage of all that information to
    > come up with a "faster" algorithm. If they are unique, sets
    > might be a good choice. If they are sorted, bisect module
    > might help. The specifics about the list(s) may yield a faster
    > method.
    >
    > -Larry

    Each item in the list is a fully qualified domain name, e.g.
    foo.bar.com. The order in the list has no importance. That is about all
    there is to the list other than to say the number of items in a list can
    top out at about 10,000.

    Chaz
     
    Chaz Ginger, Oct 3, 2006
    #13
  14. "Chaz Ginger" <> wrote in message
    news:...

    > Each item in the list is a fully qualified domain name, e.g.
    > foo.bar.com. The order in the list has no importance.


    So you don't actually need to use lists at all, then.
    You can just use sets and write:

    newSet = bigSet - littleSet
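
    For instance, with a couple of made-up names:

    bigSet = set('host%d.bar.com' % i for i in range(10000))
    littleSet = set(['host3.bar.com', 'host7.bar.com'])

    newSet = bigSet - littleSet   # neither operand is modified
    print(len(newSet))            # 9998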
     
    Richard Brodie, Oct 3, 2006
    #14
  15. Jeremy Sanders wrote:

    > Chaz Ginger wrote:
    >
    >> What would sets do for me over lists?

    >
    > It's faster to tell whether something is in a set or dict than in a list
    > (for some minimum size).


    As a footnote, this program

    import random
    num = 100000

    a = set( range(num) )
    for i in range(100000):
        x = random.randint(0, num-1) in a

    completes in less than a second, whereas

    import random
    num = 100000

    a = range(num)
    for i in range(100000):
        x = random.randint(0, num-1) in a

    takes a long time on my computer.

    Jeremy

    --
    Jeremy Sanders
    http://www.jeremysanders.net/
     
    Jeremy Sanders, Oct 3, 2006
    #15
  16. Chaz Ginger

    Chaz Ginger Guest

    Jeremy Sanders wrote:
    > Jeremy Sanders wrote:
    >
    >> Chaz Ginger wrote:
    >>
    >>> What would sets do for me over lists?

    >> It's faster to tell whether something is in a set or dict than in a list
    >> (for some minimum size).

    >
    > As a footnote, this program
    >
    > import random
    > num = 100000
    >
    > a = set( range(num) )
    > for i in range(100000):
    >     x = random.randint(0, num-1) in a
    >
    > completes in less than a second, whereas
    >
    > import random
    > num = 100000
    >
    > a = range(num)
    > for i in range(100000):
    >     x = random.randint(0, num-1) in a
    >
    > takes a long time on my computer.
    >
    > Jeremy
    >

    Thanks Jeremy. I am in the process of converting my stuff to use sets! I
    wouldn't have thought it would have made that big a deal! I guess it is
    live and learn.

    Peace,
    Chaz
     
    Chaz Ginger, Oct 3, 2006
    #16
  17. Hari Sekhon wrote:

    > That is surprising since I read on this list recently that lists were
    > faster than dicts


    depends on what you're doing with them, of course.

    > It was one reason that was cited as to why local vars are better than
    > global vars.


    L[int] is indeed a bit faster than D[string] (but not much), but that
    doesn't mean that you can loop over *all* items in a list faster than
    you can look up a single key in a dictionary.

    </F>
     
    Fredrik Lundh, Oct 3, 2006
    #17
  18. durumdara

    durumdara Guest

    Hi !
    > Thanks Jeremy. I am in the process of converting my stuff to use sets! I
    > wouldn't have thought it would have made that big a deal! I guess it is
    > live and learn.
    >

    If you have simple records with a big amount of data, you can try
    dbhash. With this you won't run out of memory...

    dd


    import dbhash
    import time
    import random
    import gc
    import sys

    itemcount = 250000

    db = dbhash.open('test.dbh', 'c')   # 'c' creates the file if it does not exist
    for i in range(itemcount):
        db[str(i)] = str(i)

    littlelist = []
    littleset = set()

    while len(littlelist) < 1000:
        x = str(random.randint(0, itemcount-1))
        if not (x in littlelist):
            littlelist.append(x)
            littleset.add(x)

    def DBHash():
        gc.collect()
        hk = db.has_key
        st = time.time()
        newlist = []
        for val in littlelist:
            if hk(val):
                newlist.append(val)
        et = time.time()
        print "Size", len(newlist)
        newlist.sort()
        print "Hash", hash(str(newlist))
        print "Time", "%04f" % (et - st)
        print

    def Set():
        gc.collect()
        largeset = set()
        for i in range(itemcount):
            largeset.add(str(i))

        st = time.time()
        newset = largeset.intersection(littleset)
        newsetlist = []
        while newset:
            newsetlist.append(newset.pop())
        et = time.time()
        print "Size", len(newsetlist)
        newsetlist.sort()
        print "Hash", hash(str(newsetlist))
        print "Time", "%04f" % (et - st)

    DBHash()
    Set()
     
    durumdara, Oct 4, 2006
    #18
  19. GHUM

    GHUM Guest

    > > Maybe the application should use sets instead of lists for these
    > > collections.

    > What would sets do for me over lists?


    searching for an element in a list is O(n);
    searching for an element in a set is O(1) (for reasonably distributed
    elements)
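
    A crude way to see that difference (timings are machine-dependent, and the
    data here is just made up):

    import timeit

    for n in (1000, 10000, 100000):
        setup = "n = %d; lst = list(range(n)); st = set(lst); probe = n - 1" % n
        t_list = timeit.Timer("probe in lst", setup).timeit(number=1000)
        t_set = timeit.Timer("probe in st", setup).timeit(number=1000)
        print("n=%d  list: %.4fs  set: %.4fs" % (n, t_list, t_set))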

    Harald
     
    GHUM, Oct 4, 2006
    #19
  20. Hari Sekhon wrote:

    > So are you saying that using a dict means a faster search since you only
    > need to look up one value?
    >
    > I would think that you would have to look through the keys and stop at
    > the first key that matches since each key has to be uniq, so perhaps if
    > it is nearer the front of the set of keys then perhaps it would be a
    > quicker lookup?


    http://en.wikipedia.org/wiki/Hash_table
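
    A quick way to convince yourself that a dict lookup does not walk the keys
    in insertion order (exact numbers will vary; both lookups should cost about
    the same):

    import timeit

    setup = "d = dict((i, None) for i in range(100000))"
    first = timeit.Timer("0 in d", setup).timeit(100000)
    last = timeit.Timer("99999 in d", setup).timeit(100000)
    print("first-inserted key: %.4fs   last-inserted key: %.4fs" % (first, last))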

    </F>
     
    Fredrik Lundh, Oct 4, 2006
    #20