Re: clarification

Discussion in 'Python' started by Thomas Jollans, Aug 17, 2007.

  1. On Friday 17 August 2007, Beema shafreen wrote:
    > hi everybody,
    > i have a file with data separated by tab
    > mydata:
    > fhl1 fkh2
    > dfp1 chk1
    > mal3 alp14
    > mal3 moe1
    > mal3 spi1
    > mal3 bub1
    > mal3 bub3
    > mal3 mph1
    > mal3 mad3
    > hob1 nak1
    > hob1 wsp1
    > hob1 rad3
    > cdr2 cdc13
    > cdr2 cdc2
    > shows these two are separated by tab, represented as columns
    > i have to check the common data between these two, column1 and column2
    > my code:
    > data = []
    > data1 = []
    > result = []
    > fh = open('sheet1','r')
    > for line in fh.readlines():
    >     splitted = line.strip().split('\t')
    >     data.append(splitted[0])
    >     data1.append(splitted[1])
    > for k in data:
    >     if k in data1:
    >         result.append(k)
    > print result
    > fh.close()
    >
    > can you tell me the problem with my script and what should i do for this


    No, I have not tested it. You tell us the problem, and we might understand the
    situation better than you do.

    --
    Regards, Thomas Jollans
    GPG key: 0xF421434B may be found on various keyservers, eg pgp.mit.edu
    Hacker key <http://hackerkey.com/>:
    v4sw6+8Yhw4/5ln3pr5Ock2ma2u7Lw2Nl7Di2e2t3/4TMb6HOPTen5/6g5OPa1XsMr9p-7/-6
     
    Thomas Jollans, Aug 17, 2007
    #1

  2. Thomas Jollans wrote:
    > On Friday 17 August 2007, Beema shafreen wrote:
    >> hi everybody,
    >> i have a file with data separated by tab
    >> mydata:
    >> fhl1 fkh2

    <zip>
    >> shows these two are separated by tab represented as columns
    >> i have to check the common data between these two coloumn1 an coloumn2
    >> my code:
    >> data = []
    >> data1 = []
    >> result = []
    >> fh = open('sheet1','r')
    >> for line in fh.readlines():
    >>     splitted = line.strip().split('\t')
    >>     data.append(splitted[0])
    >>     data1.append(splitted[1])
    >> for k in data:
    >>     if k in data1:
    >>         result.append(k)
    >> print result
    >> fh.close()


    Use set data type for data and data1 (you fill them with an algo like the
    one you wrote - just use add() in place of append()) then use set
    intersection to get common data.

    See doc for set data type:
    http://docs.python.org/lib/types-set.html

    Would look like (not tested):

    data = set()
    data1 = set()
    fh = open('sheet1','r')
    for line in fh.readlines():
        splitted = line.strip().split('\t')
        data.add(splitted[0])
        data1.add(splitted[1])

    result = data.intersection(data1)
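    For a reader following along, the whole approach can be put together as a
    runnable sketch in today's Python syntax (the filename 'sheet1' is taken from
    the original post; the function name is invented for illustration):

    ```python
    def common_names(path):
        """Return values that appear in both tab-separated columns of `path`."""
        lefts, rights = set(), set()
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines, e.g. a trailing newline
                left, right = line.split('\t')
                lefts.add(left)
                rights.add(right)
        return lefts & rights
    ```

    Note that on the exact sample data shown above the intersection happens to be
    empty, since no name occurs in both columns.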
     
    Laurent Pointal, Aug 17, 2007
    #2

  3. Laurent Pointal wrote:

    [cleaning]
    fh = open('sheet1')
    for line in fh:
     
    Laurent Pointal, Aug 17, 2007
    #3
  4. Laurent Pointal wrote:
    > Thomas Jollans wrote:
    >> On Friday 17 August 2007, Beema shafreen wrote:
    >>> hi everybody,
    >>> i have a file with data separated by tab
    >>> mydata:
    >>> fhl1 fkh2

    > <zip>
    >>> shows these two are separated by tab represented as columns
    >>> i have to check the common data between these two coloumn1 an coloumn2
    >>> my code:
    >>> data = []
    >>> data1 = []
    >>> result = []
    >>> fh = open('sheet1','r')
    >>> for line in fh.readlines():
    >>>     splitted = line.strip().split('\t')
    >>>     data.append(splitted[0])
    >>>     data1.append(splitted[1])
    >>> for k in data:
    >>>     if k in data1:
    >>>         result.append(k)
    >>> print result
    >>> fh.close()

    >
    > Use set data type for data and data1 (you fill them with an algo like the
    > one you wrote - just use add() in place of append()) then use set
    > intersection to get common data.
    >
    > See doc for set data type:
    > http://docs.python.org/lib/types-set.html
    >
    > Would look like (not tested):
    > data = set()
    > data1 = set()
    > fh = open('sheet1','r')
    > for line in fh.readlines():
    >     splitted = line.strip().split('\t')
    >     data.add(splitted[0])
    >     data1.add(splitted[1])
    >
    > result = data.intersection(data1)


    lefts = set()
    rights = set()
    with open('sheet1', 'r') as fh:
        for line in fh:
            trimmed = line.strip()
            if trimmed:  # Skip blanks (file end often looks like that)
                left, right = line.strip().split('\t')
                lefts.add(left)
                rights.add(right)
    result = lefts & rights

    -Scott
     
    Scott David Daniels, Aug 17, 2007
    #4
  5. samwyse

    Scott David Daniels wrote:
    > lefts = set()
    > rights = set()
    > with open('sheet1', 'r') as fh:
    >     for line in fh:
    >         trimmed = line.strip()
    >         if trimmed:  # Skip blanks (file end often looks like that)
    >             left, right = line.strip().split('\t')
    >             lefts.add(left)
    >             rights.add(right)
    > result = lefts & rights
    >
    > -Scott


    # change to however many columns may later exist
    cols = [set() for i in range(0, 2)]
    with open('sheet1', 'r') as fh:
        for line in fh:
            for col, data in zip(cols, line.strip().split('\t')):
                col.add(data)
    result = cols[0] & cols[1]
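    A variant of the many-columns idea, using the stdlib csv module for the tab
    splitting (not from the thread; a sketch in Python 3 syntax, under the same
    assumptions about the file, with an invented helper name):

    ```python
    import csv

    def column_sets(path, ncols=2):
        """Collect each tab-separated column of `path` into its own set."""
        cols = [set() for _ in range(ncols)]
        with open(path, newline='') as fh:
            for row in csv.reader(fh, delimiter='\t'):
                # zip stops at the shorter sequence, so ragged rows are tolerated
                for col, value in zip(cols, row):
                    col.add(value)
        return cols
    ```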
     
    samwyse, Aug 18, 2007
    #5
  6. samwyse

    samwyse wrote:
    > Scott David Daniels wrote:
    >
    >> lefts = set()
    >> rights = set()
    >> with open('sheet1', 'r') as fh:
    >>     for line in fh:
    >>         trimmed = line.strip()
    >>         if trimmed:  # Skip blanks (file end often looks like that)
    >>             left, right = line.strip().split('\t')
    >>             lefts.add(left)
    >>             rights.add(right)
    >> result = lefts & rights
    >>
    >> -Scott
    >
    > # change to however many columns may later exist
    > cols = [set() for i in range(0, 2)]
    > with open('sheet1', 'r') as fh:
    >     for line in fh:
    >         for col, data in zip(cols, line.strip().split('\t')):
    >             col.add(data)
    > result = cols[0] & cols[1]


    My laptop started complaining about low power before I was really
    finished with this last night, so I just hit Send and went to bed. The
    last line really should be (with an "import operator" at the top):

        result = reduce(operator.and_, cols)

    I'll note that using 'zip' transparently deals with handling those cases
    where there are either more or fewer columns of data than expected.

    Finally, does anyone familiar with P3K know how best to do the reduction
    without using 'reduce'? Right now, sets don't support the 'add' and
    'multiply' operators, so 'sum' and (the currently fictitious) 'product'
    won't work at all; while 'any' and 'all' don't work as one might hope.
    Are there 'intersection' and 'union' built-ins anywhere?
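    One answer that holds up in later Pythons: set.intersection and set.union
    accept any number of arguments, so the list of sets can be unpacked directly,
    with no reduce at all (a sketch, not from the thread; the data is made up, and
    this form requires a non-empty list of sets):

    ```python
    cols = [{'mal3', 'hob1', 'cdr2'}, {'mal3', 'hob1', 'cdc2'}, {'mal3', 'hob1'}]

    result = set.intersection(*cols)   # intersection of every set in cols
    combined = set.union(*cols)        # union of every set in cols

    print(result)   # only mal3 and hob1 appear in all three sets
    ```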
     
    samwyse, Aug 18, 2007
    #6
  7. samwyse <> wrote:
    ...
    > Finally, does anyone familar with P3K know how best to do the reduction
    > without using 'reduce'? Right now, sets don't support the 'add' and
    > 'multiply' operators, so 'sum' and (the currently ficticious) 'product'
    > won't work at all; while 'any' and 'all' don't work as one might hope.
    > Are there an 'intersection' and 'union' built-ins anywhere?


    For intersection and union of a sequence of sets, I'd use:

    def union_all(sos):
        result = set()
        for s in sos: result.update(s)
        return result

    def intersect_all(sos):
        it = iter(sos)
        result = set(it.next())
        for s in it: result.intersection_update(s)
        return result

    The latter will raise an exception if sos is empty -- I don't think the
    "intersection of no sets at all" has a single natural interpretation
    (while the "union of no sets at all" appears to be naturally interpreted
    as an empty set)... if you disagree, just wrap a try/except around the
    initialization of result, and return whatever in the except clause.

    Of course, hoisting the unbound method out of the loops can afford the
    usual small optimization. But my point is that, in Python, these
    operations (like, say, the concatenation of a sequence of lists, etc)
    are best performed "in place" via loops calling mutator methods such as
    update and intersection_update (or a list's extend, etc), rather than
    "functionally" (building and tossing away many intermediate results).
    E.g., consider:

    brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
    range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
    100000 loops, best of 3: 18.8 usec per loop

    brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
    range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
    10000 loops, best of 3: 87.2 usec per loop

    Even in a case as tiny as this one, "reduce" is taking over 4 times
    longer than the loop with the in-place mutator -- and it only gets
    worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
    this is part of what makes reduce an "attractive nuisance"...;-). [[And
    so is sum, if used OTHERWISE than for the documented purpose, computing
    "the sum of a sequence of numbers": a loop with r.extend is similarly
    faster, to concatenate a sequence of lists, when compared to sum(sol,
    [])...!!!]]
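    The bracketed point about sum versus a loop with extend can be sketched the
    same way (the helper names here are invented for the example):

    ```python
    def concat_extend(sol):
        """Concatenate a sequence of lists in place: amortized O(total length)."""
        r = []
        for x in sol:
            r.extend(x)
        return r

    def concat_sum(sol):
        """Functional concatenation: builds a fresh list per step, O(N squared)."""
        return sum(sol, [])

    sol = [list(range(5)) for _ in range(100)]
    assert concat_extend(sol) == concat_sum(sol)  # same result, different cost
    ```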


    Alex
     
    Alex Martelli, Aug 18, 2007
    #7
  8. samwyse

    Alex Martelli wrote:

    > Of course, hoisting the unbound method out of the loops can afford the
    > usual small optimization. But my point is that, in Python, these
    > operations (like, say, the concatenation of a sequence of lists, etc)
    > are best performed "in place" via loops calling mutator methods such as
    > update and intersection_update (or a list's extend, etc), rather than
    > "functionally" (building and tossing away many intermediate results).
    > E.g., consider:
    >
    > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
    > range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
    > 100000 loops, best of 3: 18.8 usec per loop
    >
    > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
    > range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
    > 10000 loops, best of 3: 87.2 usec per loop
    >
    > Even in a case as tiny as this one, "reduce" is taking over 4 times
    > longer than the loop with the in-place mutator -- and it only gets
    > worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
    > this is part of what makes reduce an "attractive nuisance"...;-). [[And
    > so is sum, if used OTHERWISE than for the documented purpose, computing
    > "the sum of a sequence of numbers": a loop with r.extend is similarly
    > faster, to concatenate a sequence of lists, when compared to sum(sol,
    > [])...!!!]]


    Curiously, when I attempt the actual problem at hand (set intersection)
    rather than something altogether different (set union) on my Dell laptop
    running Win2K Pro, I get the following results:

    stmt = 'r=reduce(set.intersection, los)'
    best of 3: 21.9 usec per loop
    stmt = 'r=intersect_all(los)'
    best of 3: 29.3 usec per loop

    So, the "nuisance" is more attractive than you thought. Apparently, one
    can't really depend on "in place" mutator methods being more efficient
    than using functional techniques. BTW, the above figures reflect 10,000
    iterations; using larger counts makes the difference even larger. I
    suspect that this is because the Python interpreter has fewer chances to
    be interrupted by external processes when it's using 'reduce'. This in
    turn indicates that both implementations actually work about the same and
    your "O(n squared)" argument is irrelevant.

    P.S. Here's the source I used for the timings:

    from timeit import Timer

    setup = """
    def intersect_all(los):
        it = iter(los)
        result = set(it.next())
        for s in it: result.intersection_update(s)
        return result

    not7 = range(2, 11)
    not7.remove(7)
    los = [set(range(0, 360, step)) for step in not7]
    """

    number = 10000
    repeat = 3
    precision = 3

    for stmt in 'r=reduce(set.intersection, los)', 'r=intersect_all(los)':
        t = Timer(stmt, setup)
        r = t.repeat(repeat, number)
        best = min(r)
        usec = best * 1e6 / number
        print "stmt =", repr(stmt)
        print "best of %d: %.*g usec per loop" % (repeat, precision, usec)
     
    samwyse, Aug 19, 2007
    #8
  9. samwyse <> wrote:
    ...
    > > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
    > > range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
    > > 100000 loops, best of 3: 18.8 usec per loop
    > >
    > > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
    > > range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
    > > 10000 loops, best of 3: 87.2 usec per loop
    > >
    > > Even in a case as tiny as this one, "reduce" is taking over 4 times
    > > longer than the loop with the in-place mutator -- and it only gets
    > > worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
    > > this is part of what makes reduce an "attractive nuisance"...;-). [[And


    The set-union case, just like the list-catenation case, is O(N squared)
    (when approached in a functional way) because the intermediate result
    often _grows_ [whenever a new set or list operand adds items], and thus
    a new temporary value must be allocated, and the K results-so-far
    "copied over" (as part of constructing the new temp value) from the
    previous temporary value; and sum(range(N)) grows quadratically in N.
    The in-place approach avoids that fate by a strategy of proportional
    over-allocation (used by both set and lists) that makes in-place
    operations such as .update(X) and .extend(X) amortized O(K) where K is
    len(X).
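    The proportional over-allocation described above can be observed directly with
    sys.getsizeof on a growing list (a CPython implementation detail; the exact
    numbers vary by version):

    ```python
    import sys

    lst = []
    sizes = []
    for i in range(32):
        lst.append(i)
        sizes.append(sys.getsizeof(lst))

    # The reported size stays flat across several appends, then jumps:
    # capacity grows geometrically, so each append is amortized O(1).
    jumps = [i for i in range(1, len(sizes)) if sizes[i] > sizes[i - 1]]
    print(len(jumps), "resizes for", len(sizes), "appends")
    ```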

    In the set-intersection case, the intermediate result _shrinks_ rather
    than growing, so the amount of data "copied over" is a diminishing
    quantity at each step, and so the analysis showing quadratic behavior
    for the functional approach does not hold; behavior may be roughly
    linear, influenced in minor ways by accidental regularities in the sets'
    structure and order (especially likely for short sequences of small
    sets, as in your case). Using a slightly longer sequence of slightly
    larger sets, with little structure to each, e.g.:

    random.seed(12345) # set seed to ensure total repeatability
    los=[set(random.sample(range(1000), 990)) for x in range(200)]

    at the end of the setup (the intersection of these 200 sets happens to
    contain 132 items), I measure (as usual on my 18-months-old Macbook Pro
    laptop):

    stmt = 'reduce(set.intersection,los)'
    best of 3: 1.66e+04 usec per loop
    stmt = 'intersect_all(los)'
    best of 3: 1.66e+04 usec per loop

    and occasionally 1.65 or 1.67 instead of 1.66 for either or both,
    whether with 10,000 or 100,000 loops. (Not sure whether your
    observations about the reduce-based approach becoming faster with more
    loops may be due to anomalies in Windows' scheduler, or the accidental
    regularities mentioned above; my timings are probably more regular since
    I have two cores, one of which probably ends up dedicated to whatever
    task I'm benchmarking while the other one runs all "background" stuff).

    > turn indicates that both implementations actually work about same and
    > your "O(n squared)" argument is irrelevant.


    It's indeed irrelevant when the behavior _isn't_ quadratic (as in the
    case of intersections) -- but unfortunately it _is_ needlessly quadratic
    in most interesting cases involving containers (catenation of sequences,
    union of sets, merging of dictionaries, merging of priority-queues,
    ....), because in those cases the intermediate temporary values tend to
    grow, as I tried to explain in more detail above.


    Alex
     
    Alex Martelli, Aug 19, 2007
    #9
  10. Paul Rubin

    Alex Martelli writes:
    > > turn indicates that both implementations actually work about same and
    > > your "O(n squared)" argument is irrelevant.

    >
    > It's indeed irrelevant when the behavior _isn't_ quadratic (as in the
    > case of intersections) -- but unfortunately it _is_ needlessly quadratic
    > in most interesting cases involving containers (catenation of sequences,
    > union of sets, merging of dictionaries, merging of priority-queues,
    > ...), because in those cases the intermediate temporary values tend to
    > grow, as I tried to explain in more detail above.


    Mostly directed to samwyse: Alex is correct here and I'd add that
    Python and its libraries are written for an imperative, mutating
    approach though there are some situations in which you can write in
    functional style without a big efficiency hit. In particular, the
    quadratic behavior Alex points out is because of Python's hash-based
    implementation of sets, so you can't make a new set that adds or
    removes one element from the old set, without copying all the elements
    around.

    A more purist functional approach would probably implement sets with
    some other data structure such as AVL trees, so you can make a new one
    that adds or deletes some node in O(log n) time (it would share almost
    all of its structure with the old one). So instead of O(n) for the
    hash-based mutating implementation or O(n**2) for the hash/copying
    based functional implementation, you get O(n log n) for the functional
    tree implementation. Haskell's Data.Map module does something like
    this. There's a book "Purely Functional Data Structures" by Chris
    Okasaki that shows how to implement dicts, priority queues, and
    fancier objects in functional style based on similar principles.

    If you want to try out functional programming in a lightweight
    language (much easier to grok than Haskell) you might look at Hedgehog Lisp:

    http://hedgehog.oliotalo.fi/

    It includes a functional dictionary implementation based on AVL trees
    but with a Python dictionary-like API.
     
    Paul Rubin, Aug 20, 2007
    #10
