Comparing lists

Discussion in 'Python' started by Odd-R., Oct 10, 2005.

  1. Odd-R. Guest

    I have two lists, A and B, that may or may not be equal. If they are not
    identical, I want the output to be three new lists, X,Y and Z where X has
    all the elements that are in A, but not in B, and Y contains all the
    elements that are in B but not in A. Z will then have the elements that are
    in both A and B.

    One way of doing this is of course to iterate through the lists and compare
    each of the elements, but is there a more efficient way?

    Thanks in advance!



    --
    If you have a fridge and you have a TV,
    then you have everything you need to live

    -Jokke & Valentinerne
    Odd-R., Oct 10, 2005
    #1

  2. Odd-R. wrote:

    >I have two lists, A and B, that may or may not be equal. If they are not
    >identical, I want the output to be three new lists, X,Y and Z where X has
    >all the elements that are in A, but not in B, and Y contains all the
    >elements that are in B but not in A. Z will then have the elements that are
    >in both A and B.
    >
    >

    These are set operations.

    >One way of doing this is of course to iterate through the lists and compare
    >each of the elements, but is there a more efficient way?
    >
    >

    Maybe, using sets?

    L1 = [1,2,3,4]
    L2=[3,4,5,6]
    diff1 = list(set(L1)-set(L2)) # [1,2]
    diff2 = list(set(L2)-set(L1)) # [5,6]
    symdiff = diff1+diff2 # Symmetric difference [1,2,5,6]
    intersect = set(L1+L2) - set(symdiff) # Intersect [3,4]

    Best,

    Les
    Laszlo Zsolt Nagy, Oct 10, 2005
    #2

  3. Guest

    try to use set.
    L1 = [1,1,2,3,4]
    L2 = [1,3, 99]
    A = set(L1)
    B = set(L2)

    X = A-B
    print X

    Y = B-A
    print Y

    Z = A | B
    print Z

    Cheers,
    pujo
    , Oct 10, 2005
    #3
  4. <> wrote in message
    news:...
    > try to use set.
    > L1 = [1,1,2,3,4]
    > L2 = [1,3, 99]
    > A = set(L1)
    > B = set(L2)
    >
    > X = A-B
    > print X
    >
    > Y = B-A
    > print Y
    >
    > Z = A | B
    > print Z


    But how "efficient" is this? Could you be a bit
    more explicit on that point? What is the order
    of complexity of set([...]) or of A-B, B-A,
    A | B, A ^ B and A & B? - The Python Documentation
    leaves me completely in the dark in this regard.

    Sorting the two lists and then extracting
    A-B, B-A, A|B, A & B and A ^ B in one single
    pass seems to me very likely to be much faster
    for large lists.
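
    A rough sketch of what I have in mind (assuming both lists are already
    sorted and free of duplicates; the function name is only illustrative):

    def merge_compare(A, B):
        # one simultaneous walk over two sorted, duplicate-free lists
        only_A, only_B, both = [], [], []
        i = j = 0
        while i < len(A) and j < len(B):
            if A[i] < B[j]:
                only_A.append(A[i])
                i += 1
            elif A[i] > B[j]:
                only_B.append(B[j])
                j += 1
            else:
                both.append(A[i])
                i += 1
                j += 1
        only_A.extend(A[i:])
        only_B.extend(B[j:])
        return only_A, only_B, both

    Sorting costs O(N log N) up front; after that, the walk itself is a
    single O(|A| + |B|) pass.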

    Regards,
    Christian
    Christian Stapfer, Oct 10, 2005
    #4
  5. "Christian Stapfer" <> wrote:

    > <> wrote:
    > > try to use set.

    >
    > Sorting the two lists and then extracting
    > A-B, B-A, A|B, A & B and A ^ B in one single
    > pass seems to me very likely to be much faster
    > for large lists.


    Why don't you implement it, test it and time it to be more convincing about your intuition ?

    George
    George Sakkis, Oct 10, 2005
    #5
  6. On Mon, 10 Oct 2005 14:34:35 +0200, Christian Stapfer wrote:

    > Sorting the two lists and then extracting
    > A-B, B-A, A|B, A & B and A ^ B in one single
    > pass seems to me very likely to be much faster
    > for large lists.


    Unless you are running a Python compiler in your head, chances are your
    intuition has no connection to the actual performance of Python except by
    pure coincidence.

    Write some code, and time it in action.

    In fact, here is some code I've written for you, complete with timer. Do
    yourself a favour, and run it and see for yourself which is faster.

    Then, if you think you can write a list version that is more efficient
    than my (admittedly very inefficient) version, please do so. Then time the
    difference again. Then think about how much time it took you to write,
    test and debug your list version, compared to my four lines using sets.


    from sets import Set
    import time

    def compare_and_separate(A, B):
        only_A = []
        only_B = []
        both_AB = []
        for item in A:
            # ignore items we've already seen before
            if item in only_A or item in both_AB:
                continue
            if item in B:
                both_AB.append(item)
            else:
                only_A.append(item)
        for item in B:
            # ignore items we've already seen before
            if item in only_B or item in both_AB:
                continue
            only_B.append(item)
        return (only_A, only_B, both_AB)

    def compare_and_separate_with_sets(A, B):
        only_A = list(Set(A)-Set(B))
        only_B = list(Set(B)-Set(A))
        both_AB = list(Set(A+B) - Set(only_A+only_B))
        return (only_A, only_B, both_AB)

    A = range(199) + range(44, 300, 3) + range(250, 500) + range(2000, 2100)
    B = range(1000) + range(3000, 4000)

    def tester():
        # confirm that the two functions give the same results
        r1 = compare_and_separate(range(5), range(3,8))
        r2 = compare_and_separate_with_sets(range(5), range(3,8))
        print r1
        print r2
        if r1 == r2:
            print " Same."
        else:
            print " Warning: results are different."
        loopit = range(20) # repeat the test 20 times for timing purposes
        t1 = time.time()
        for i in loopit:
            results = compare_and_separate(A, B)
        t1 = time.time() - t1
        t2 = time.time()
        for i in loopit:
            results = compare_and_separate_with_sets(A, B)
        t2 = time.time() - t2
        print "Time with loops:", t1
        print "Time with sets:", t2



    --
    Steven.
    Steven D'Aprano, Oct 10, 2005
    #6
  7. "George Sakkis" <> wrote in message
    news:1128952417.0544cc73190bbbd4e683448c137969c8@teranews...
    > "Christian Stapfer" <> wrote:
    >
    >> <> wrote:
    >> > try to use set.

    >>
    >> Sorting the two lists and then extracting
    >> A-B, B-A, A|B, A & B and A ^ B in one single
    >> pass seems to me very likely to be much faster
    >> for large lists.

    >
    > Why don't you implement it, test it and time it
    > to be more convincing about your intuition ?


    The problem is in the generation of the test data.
    Even merely generating a set of (suitably "average",
    "random", and suitably "worst case") datasets might
    turn out to be a major undertaking.
    If the documentation stated the order-of-magnitude
    behavior of those basic operations up front, then
    I (and *anyone* else who ever wanted to use those
    operations on large lists / large sets) could do
    a quick order-of-magnitude estimation of how
    a certain program design will behave, performance
    wise.
    *Experimenting* is not necessarily as easy to
    do as you seem to believe. How do you, for example,
    hit upon the worst-case behavior with your test
    data? - Without knowing *anything* about the
    implementation it might be a matter of sheer luck.
    If you *know* something about the implementation
    then, of course, you might be able to figure it
    out. (But note that if you know *that* much about
    the implementation, you usually have an order-of-
    magnitude estimate anyway and don't need to do
    *any* experimenting in order to answer my question.)

    Regards,
    Christian
    Christian Stapfer, Oct 10, 2005
    #7
  8. Steve Holden Guest

    Christian Stapfer wrote:
    > "George Sakkis" <> wrote in message
    > news:1128952417.0544cc73190bbbd4e683448c137969c8@teranews...
    >
    >>"Christian Stapfer" <> wrote:
    >>
    >>
    >>><> wrote:
    >>>
    >>>>try to use set.
    >>>
    >>> Sorting the two lists and then extracting
    >>>A-B, B-A, A|B, A & B and A ^ B in one single
    >>>pass seems to me very likely to be much faster
    >>>for large lists.

    >>
    >>Why don't you implement it, test it and time it
    >>to be more convincing about your intuition ?

    >
    >
    > The problem is in the generation of the test data.
    > Even merely generating a set of (suitably "average",
    > "random", and suitably "worst case") datasets might
    > turn out to be a major undertaking.
    > If the documentation stated the order-of-magnitude
    > behavior of those basic operations up front, then
    > I (and *anyone* else who ever wanted to use those
    > operations on large lists / large sets) could do
    > a quick order-of-magnitude estimation of how
    > a certain program design will behave, performance
    > wise.
    > *Experimenting* is not necessarily as easy to
    > do as you seem to believe. How do you, for example,
    > hit upon the worst-case behavior with your test
    > data? - Without knowing *anything* about the
    > implementation it might be a matter of sheer luck.
    > If you *know* something about the implementation
    > then, of course, you might be able to figure it
    > out. (But note that if you know *that* much about
    > the implementation, you usually have an order-of-
    > magnitude estimate anyway and don't need to do
    > *any* experimenting in order to answer my question.)
    >

    You are, of course, either assuming that there's a single implementation
    of Python, or that all implementations have the same behaviour. Or
    alternatively you are asking all implementers to do what you seem to
    consider so difficult (i.e. identify worst-case scenarios and then
    estimate order-of-magnitude behaviour for them).

    Test results with known test data are relatively easy to extrapolate
    from, and if your test data are reasonably representative of live data
    then so will your performance estimates.

    Anyway, aren't you more interested in average behaviour than worst-case?
    Most people are.

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
    Steve Holden, Oct 10, 2005
    #8
  9. "Steve Holden" <> wrote in message
    news:...
    > Christian Stapfer wrote:
    >> "George Sakkis" <> wrote in message
    >> news:1128952417.0544cc73190bbbd4e683448c137969c8@teranews...
    >>
    >>>"Christian Stapfer" <> wrote:
    >>>><> wrote:
    >>>>
    >>>>>try to use set.
    >>>>
    >>>> Sorting the two lists and then extracting
    >>>>A-B, B-A, A|B, A & B and A ^ B in one single
    >>>>pass seems to me very likely to be much faster
    >>>>for large lists.
    >>>
    >>>Why don't you implement it, test it and time it
    >>>to be more convincing about your intuition ?

    >>
    >> The problem is in the generation of the test data.
    >> Even merely generating a set of (suitably "average",
    >> "random", and suitably "worst case") datasets might
    >> turn out to be a major undertaking.
    >> If the documentation stated the order-of-magnitude
    >> behavior of those basic operations up front, then
    >> I (and *anyone* else who ever wanted to use those
    >> operations on large lists / large sets) could do
    >> a quick order-of-magnitude estimation of how
    >> a certain program design will behave, performance
    >> wise.
    >> *Experimenting* is not necessarily as easy to
    >> do as you seem to believe. How do you, for example,
    >> hit upon the worst-case behavior with your test
    >> data? - Without knowing *anything* about the
    >> implementation it might be a matter of sheer luck.
    >> If you *know* something about the implementation
    >> then, of course, you might be able to figure it
    >> out. (But note that if you know *that* much about
    >> the implementation, you usually have an order-of-
    >> magnitude estimate anyway and don't need to do
    >> *any* experimenting in order to answer my question.)
    >>

    > You are, of course, either assuming that there's a
    > single implementation of Python,


    Of course not!

    > or that all implementations have the same behaviour.


    Of course not!

    But it is reasonable, anyway, to ask for information
    about a specific implementation (that one is *forced*
    to use). Performance *will* depend on it, whether we
    like it or not, whether that information is given by
    the implementer or not.
    And if the implementer wants to complain that
    giving such information would break his wonderful
    abstraction then I can only answer: It is *reality*
    that *will* break your abstraction as regards
    performance! It is, therefore, absolutely *no*
    use to *pretend* you *can* avoid this form of "breaking
    the abstraction" by simply avoiding to mention it
    in the documentation...

    > Or alternatively you are asking all implementers
    > to do what you seem to consider so difficult
    > (i.e. identify worst-case scenarios and then estimate order-of-magnitude
    > behaviour for them).


    I consider it the job of the implementer to know
    about the trade-offs that he has been making in
    choosing one particular implementation, and to
    know what computational complexity therefore
    attaches to the various operations exposed in
    its interface. Why should *every* user of his
    module be forced to either read the source
    code of his implementation or try to figure
    it out experiment-wise on his own?

    > Test results with known test data are relatively
    > easy to extrapolate from, and if your test data
    > are reasonably representative of live data then so will your performance
    > estimates.


    How reasonable is it to ask me, or anyone else
    for that matter, to extract, experiment-wise
    (if it can be done at all with reasonable effort)
    information that properly belongs to the implementer
    and really should have been exposed in the
    documentation in the first place?

    > Anyway, aren't you more interested in average
    > behaviour than worst-case? Most people are.


    It depends. *I* am NOT the OP of this thread.
    *The*OP* had asked for an "efficient" way
    to do what he needed to do. There are many
    measures of efficiency. Even "programmer efficiency"
    may be one of them. But I assumed that, since
    the OP took the time to ask his question in this
    NG, it really was about "computational
    efficiency". Then he was offered solutions -
    but they were offered just matter-of-factly.
    *Surely* those who had posted those solutions
    would be able to answer my question *why*
    it was, that they *assumed* conversion into
    sets to be more efficient than operating
    on the lists in the first place. I was *not*
    at all criticizing anyone (except, perhaps,
    the Python documentation for its lack of
    *insightful* information): I was just asking
    for a *small* clarification that I *assumed*
    (mistakenly it now appears) others could
    *easily* provide.

    Regards,
    Christian
    --
    "Those who cannot remember the past are
    condemned to repeat it."
    - George Santayana
    Christian Stapfer, Oct 11, 2005
    #9
  10. Christian Stapfer wrote:
    > "Steve Holden" <> wrote in message
    > news:...
    >
    >>Christian Stapfer wrote:
    >>
    >>>"George Sakkis" <> wrote in message
    >>>news:1128952417.0544cc73190bbbd4e683448c137969c8@teranews...
    >>>
    >>>
    >>>>"Christian Stapfer" <> wrote:
    >>>>
    >>>>><> wrote:
    >>>>>>try to use set.

    A reasonable suggestion. A set must have the trade-offs involved.
    The abstraction itself brings to mind the issues, and the performance
    can, at least in theory, be handled there. If that is true (that
    the "set" abstraction "sees" the problem), then you can rely on the
    Python implementation of "set" to either now, or eventually, have
    a "good" implementation -- one not too far off efficient. The Python
    gang is good at this stuff; a "better" set implementation will win if
    it can show better performance without related down-sides.

    As to the "either now, or eventually;" if you _must_ have performance
    now, not in some abstract future, then it behooves you to _test_,
    _test_, _test_!

    >>> If the documentation stated the order-of-magnitude
    >>>behavior of those basic operations up front, then
    >>>I (and *anyone* else who ever wanted to use those
    >>>operations on large lists / large sets) could do
    >>>a quick order-of-magnitude estimation of how
    >>>a certain program design will behave, performance
    >>>wise.

    And, if the proper documentation is in place, and it
    says "dictionary lookup is O(N)" (and you avoid using
    it for exactly that reason), how surprised will you be
    to discover that the O(N) is only reached if the hash
    values of the keys are all equal?
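
    A deliberately pathological sketch of that case (the class name is made
    up, and this is illustrative only):

    class ConstantHashKey(object):
        # every instance hashes to the same bucket, so dict lookup has to
        # walk the collision chain, comparing keys one by one
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42
        def __eq__(self, other):
            return self.value == other.value

    d = dict([(ConstantHashKey(i), i) for i in range(1000)])
    x = d[ConstantHashKey(999)] # now an O(N) lookup, despite being a "dict"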

    Oh, maybe you expect "O(N)" to really mean "\Theta(N)".
    Then, if you are a dweeb like me, you will respond that
    "This is not possible, a dictionary of size N must take at
    least 'O(lg N)' to read the key, never mind processing it."
    But, it turns out, that given a bound on the size of a
    process, processing an address is "O(1)", not "O(lg N)".
    Is that too practical for you, or not enough?

    >>> *Experimenting* is not necessarily as easy to
    >>>do as you seem to believe.



    >>>How do you, for example, hit upon the worst-case
    >>>behavior with your test data?

    Are you saying the documentation should characterize the
    cases that achieve worst-case behavior? This is a stiff
    burden indeed, not one I am used to in even the most rigorous
    classes I've seen. If there is such a characterization,
    how burned will you feel if a case is overlooked and that
    case is the one that you sold to your customer? Are you
    willing to provide the same guaranteed performance and
    documentation of performance to your customer that you
    expect of the Python system? Would you mind if the
    quality is proportional to the price you paid?

    >>You are, of course, either assuming that there's a
    >>single implementation of Python,

    > Of course not!
    >
    >>or that all implementations have the same behaviour.

    > Of course not!


    > But it is reasonable, anyway, to ask for information
    > about a specific implementation (that one is *forced*
    > to use).

    You are not _forced_ to use any implementation of Python.
    You are free to implement your own Python system.

    > And if the implementer wants to complain that
    > giving such information would break his wonderful
    > abstraction then I can only answer: It is *reality*
    > that *will* break your abstraction as regards
    > performance! It is, therefore, absolutely *no*
    > use to *pretend* you *can* avoid this form of "breaking
    > the abstraction" by simply avoiding to mention it
    > in the documentation...

    I simply repeat: I have never seen a paper characterizing
    the performance of an algorithm as "O(expr(N))" that described
    in details _all_ cases that reached that limit. At most such
    papers describe _one_ such case. Nor have I ever seen papers
    describing the performance of an algorithm as "\Theta(expr(N))"
    that characterized the cases that broke the "\Theta" performance.
    If you provide me the papers, provide me a C compiler with
    equivalent docs on all C expressions, and provide me the funding
    to update the Python docs, I will be happy to do so for a single
    version of Python and a single version of CPython. I expect I
    will have an easier time of it than the IronPython people will
    have.

    > I consider it the job of the implementer to know
    > about the trade-offs that he has been making in
    > choosing one particular implementation, and to
    > know what computational complexity therefore
    > attaches to the various operations exposed in
    > its interface.

    Am I to take it you provide this to all of your customers?

    > How reasonable is it to ask me, or anyone else
    > for that matter, to extract, experiment-wise
    > (if it can be done at all with reasonable effort)
    > information that properly belongs to the implementer
    > and really should have been exposed in the
    > documentation in the first place?

    Not at all reasonable. How reasonable is it to ask
    me to provide you support information for free?

    --Scott David Daniels
    Scott David Daniels, Oct 11, 2005
    #10
  11. "Scott David Daniels" <> wrote in message
    news:...
    > Christian Stapfer wrote:
    >> "Steve Holden" <> wrote in message
    >> news:...
    >>
    >>>Christian Stapfer wrote:
    >>>
    >>>>"George Sakkis" <> wrote in message
    >>>>news:1128952417.0544cc73190bbbd4e683448c137969c8@teranews...
    >>>>
    >>>>
    >>>>>"Christian Stapfer" <> wrote:
    >>>>>
    >>>>>><> wrote:
    >>>>>>>try to use set.

    > A reasonable suggestion. A set must have the trade-offs involved.
    > The abstraction itself brings to mind the issues, and the performance
    > can, at least in theory, be handled there. If that is true (that
    > the "set" abstraction "sees" the problem), then you can rely on the
    > Python implementation of "set" to either now, or eventually, have
    > a "good" implementation -- one not too far off efficient.


    It is, unfortunately, not infrequently the case
    that there are different implementations for
    a given data type that are efficient in different
    ways - but that there is no one single implementation
    that is the most efficient in *all* regards.
    Thus the implementer must make a trade-off,
    and that trade-off has consequences, performance
    wise, that he might want to tell the users of his
    module about...

    > The Python gang is good at this stuff;


    I'm not denying it. The question is about
    telling users upfront what the consequences
    are (roughly) of the trade-offs that have
    been made so that they can use the modules
    that have been provided more effectively.

    > a "better" set implementation will win if
    > it can show better performance without
    > related down-sides.


    Is there ever such a thing? I always
    thought that, most of the time, there
    is no such thing as a free lunch in
    this field.

    > As to the "either now, or eventually;" if you _must_ have performance
    > now, not in some abstract future, then it behooves you to _test_,
    > _test_, _test_!


    Well, I might want to test: but *first* I want
    to design, preferably in "armchair style"
    - at least as regards the basic approach
    that I want to take. This "armchair stage"
    involves not infrequently the use of rough
    complexity measures to exclude the clearly
    unusable and choose, instead, an approach
    that is very likely efficient enough for
    what I need.

    >>>> If the documentation stated the order-of-magnitude
    >>>>behavior of those basic operations up front, then
    >>>>I (and *anyone* else who ever wanted to use those
    >>>>operations on large lists / large sets) could do
    >>>>a quick order-of-magnitude estimation of how
    >>>>a certain program design will behave, performance
    >>>>wise.

    > And, if the proper documentation is in place, and it
    > says "dictionary lookup is O(N)" (and you avoid using
    > it for exactly that reason), how surprised will you be
    > to discover that the O(N) is only reached if the hash
    > values of the keys are all equal?


    It's not that difficult to distinguish
    *average* case and *worst* case scenarios.
    It might even be a good idea to state,
    if that can easily be done, what the
    *best* case happens to be...

    > Oh, maybe you expect "O(N)" to really mean "\Theta(N)".
    > Then, if you are a dweeb like me, you will respond that
    > "This is not possible, a dictionary of size N must take at
    > least 'O(lg N)' to read the key, never mind processing it."
    > But, it turns out, that given a bound on the size of a
    > process, processing an address is "O(1)", not "O(lg N)".
    > Is that too practical for you, or not enough?


    I do not expect a developer to expend *inordinate*
    amounts of work to figure out the computational
    complexity of what he has implemented. But, as I
    wrote, he *must* have thought about the matter, and
    is thus, by definition, in a rather good position
    to provide what he can frequently simply pull from
    memory when it comes to documenting the interface
    of his module.

    >>>> *Experimenting* is not necessarily as easy to
    >>>>do as you seem to believe.

    >
    >
    >>>>How do you, for example, hit upon the worst-case behavior with your test
    >>>>data?

    > Are you saying the documentation should characterize the
    > cases that achieve worst-case behavior? This is a stiff
    > burden indeed, not one I am used to in even the most rigorous
    > classes I've seen.


    If it happens to be particularly difficult, for a
    given implementation (of whatever abstract data type),
    then, of course, state this upfront and leave it at
    that.

    > If there is such a characterization,
    > how burned will you feel if a case is overlooked and that
    > case is the one that you sold to your customer?


    I am not at all a lawyer type, as you seem to
    imagine. I just want to suggest that some
    (to the implementer - but not the average user)
    *easily* available information about computational
    complexity (especially for the most basic data types)
    would be a good thing to have. Nothing more.

    > Are you
    > willing to provide the same guaranteed performance and
    > documentation of performance to your customer that you
    > expect of the Python system? Would you mind if the


    I'm using python mainly as a very handy language to
    write whatever (usually relatively small) utility
    programs I happen to require for my main job. I am
    not (currently) developing any "serious" programs
    that are targeted for sale. (Though the proper
    functioning and adequate efficiency of my personal
    utility programs really does matter, but it mainly
    matters for *me* - and for my "customers", if I
    can be said to have any, which seems doubtful,
    it only matters indirectly at best.)
    Even so, I frequently need to have a reasonable
    idea of what computational complexity attaches
    to certain operations on certain data types. But
    I hardly ever have the time to go into an extended
    period of first comparing several different approaches
    by way of experiment.

    > Would you mind if the quality is proportional to
    > the price you paid?


    There is really not need to indulge in polemics like
    this. I am just making a suggestion as to what information
    might *help* improve the usability of Python modules.
    If making such suggestions regularly only led to negative
    exchanges of this sort, there would be very little room
    for learning in this community indeed...

    >>>You are, of course, either assuming that there's a
    >>>single implementation of Python,

    >> Of course not!
    >>
    >>>or that all implementations have the same behaviour.

    >> Of course not!

    >
    >> But it is reasonable, anyway, to ask for information
    >> about a specific implementation (that one is *forced*
    >> to use).

    > You are not _forced_ to use any implementation of Python.


    What I wanted to say is *not*, that it appears to be
    a terrible burden to my inflated consumer-ego to
    have to use something not *absolutely* perfect in
    *all* respects, such as Python. What I wanted to
    say, here, is just this: *whenever* a Python program
    is run, it will run on some specific implementation
    of Python. That much should be obvious.

    > You are free to implement your own Python system.


    Don't be ridiculous.

    >> And if the implementer wants to complain that
    >> giving such information would break his wonderful
    >> abstraction then I can only answer: It is *reality*
    >> that *will* break your abstraction as regards
    >> performance! It is, therefore, absolutely *no*
    >> use to *pretend* you *can* avoid this form of "breaking
    >> the abstraction" by simply avoiding to mention it
    >> in the documentation...

    > I simply repeat: I have never seen a paper characterizing
    > the performance of an algorithm as "O(expr(N))" that described
    > in details _all_ cases that reached that limit.


    That's what's usually called "throwing out the
    baby with the bathwater". It is you (and only
    you) here, who is arguing that if perfection is
    not attainable (with reasonable effort), nothing
    should be done at all. I am as willing to live
    with imperfect module interface specifications as
    I am with imperfect tools more generally.
    But that will not stop me from making *suggestions*
    for improvement (not arrogant demands, mind you).

    > At most such
    > papers describe _one_ such case. Nor have I ever seen papers
    > describing the performance of an algorithm as "\Theta(expr(N))"
    > that characterized the cases that broke the "\Theta" performance.


    To repeat: I was not asking for the impossible, not
    even asking for the difficult, but simply asking for the
    dumping of the (to the implementer, but not the average
    user) obvious and reasonably certain information
    about computational complexity of what operations
    are exposed in the interface of a module.
    I have seen *many* books on algorithms that specify
    computational complexity measures for the algorithms
    described. Any really good library of algorithms will
    at least try to reach that standard, too.

    > If you provide me the papers, provide me a C compiler with
    > equivalent docs on all C expressions, and provide me the funding
    > to update the Python docs, I will be happy to do so for a single
    > version of Python and a single version of CPython.


    I am not asking you, nor anyone else, to do
    any amount of boring or otherwise hard work
    here.

    > I expect I
    > will have an easier time of it than the IronPython people will
    > have.
    >> I consider it the job of the implementer to know
    >> about the trade-offs that he has been making in
    >> choosing one particular implementation, and to
    >> know what computational complexity therefore
    >> attaches to the various operations exposed in
    >> its interface.

    > Am I to take it you provide this to all of your customers?


    I have no customers (lucky me). And I did not
    want to pose as the all too frequently encountered
    species of the arrogant (and equally ignorant)
    customer who expects nothing but perfection
    from *others* (but not himself, of course).
    I am not at all like that: you are attacking
    a figment of your imagination.

    >> How reasonable is it to ask me, or anyone else
    >> for that matter, to extract, experiment-wise
    >> (if it can be done at all with reasonable effort)
    >> information that properly belongs to the implementer
    >> and really should have been exposed in the
    >> documentation in the first place?

    > Not at all reasonable. How reasonable is it to ask
    > me to provide you support information for free?


    Even providing the most basic information about
    the interface of a module, even providing the
    mere names of member functions, could be denied
    on *that* basis. Thus, this argument fails.
    It fails simply because it "proves" too much...

    Regards,
    Christian
    --
    »In the long run men hit only what they aim at.
    Therefore, though they should fail immediately,
    they had better aim at something high.«
    - Henry David Thoreau: 'Walden'
    Christian Stapfer, Oct 11, 2005
    #11
  12. jon Guest

    To take the heat out of the discussion:

    sets are blazingly fast.
    jon, Oct 13, 2005
    #12
  13. Let me begin by apologizing to Christian as I was too snippy in
    my reply, and sounded even snippier than I meant to.

    Christian Stapfer wrote:
    > "Scott David Daniels" <> wrote in message
    > news:...
    >>a "better" set implementation will win if
    >>it can show better performance without
    >>related down-sides.

    >
    > Is there ever such a thing? I always
    > thought that, most of the time, there
    > is no such thing as a free lunch in

    If you look at the history of Python's sort, it has steadily gotten
    better. The list implementation has been tweaked to produce better
    performance when appending and popping. There are a number of such cases.
    In fact, as Python rolls along, code keeps getting improved. Usually
    the requirement is that the change not degrade current benchmarks and
    provide a substantial improvement in at least some practical cases.

    >>As to the "either now, or eventually;" if you _must_ have performance
    >>now, not in some abstract future, then it behooves you to _test_,
    >>_test_, _test_!

    >
    > Well, I might want to test: but *first* I want
    > to design, preferably in "armchair style" ... [using]
    > rough complexity measures to ....


    >>>>>If the documentation stated the order-of-magnitude
    >>>>>behavior of those basic operations up front, then
    >>>>>I (and *anyone* else who ever wanted to use those
    >>>>>operations on large lists / large sets) could do
    >>>>>a quick order-of-magnitude estimation of how
    >>>>>a certain program design will behave, performance
    >>>>>wise.


    I think this is where I started over-reacting. There are
    a number of requests here over time by people who state
    that things would be so much better for every one if only
    someone (never themselves) did some more work that they
    might not otherwise want to do. The people who implement
    the code often do so on their own time. I find the Python
    docs surprisingly good for even commercial documentation.
    For work that is done gratis, it is phenomenal. I hesitate
    to ask any more of it (although I have pointed out bugs in
    docs as I've found them).

    >>And, if the proper documentation is in place, and it
    >>says "dictionary lookup is O(N)" (and you avoid using
    >>it for exactly that reason), how surprised will you be
    >>to discover that the O(N) is only reached if the hash
    >>values of the keys are all equal?

    >
    > It's not that difficult to distinguish
    > *average* case and *worst* case scenarios.
    > It might even be a good idea to state,
    > if that can easily be done, what the
    > *best* case happens to be...
    >
    >>Oh, maybe you expect "O(N)" to really mean "\Theta(N)".
    >>Then, if you are a dweeb like me, you will respond that
    >>"This is not possible, a dictionary of size N must take at
    >>least 'O(lg N)' to read the key, never mind processing it."
    >>But, it turns out, that given a bound on the size of a
    >>process, processing an address is "O(1)", not "O(lg N)".
    >>Is that too practical for you, or not enough?

    >
    > I do not expect a developer to expend *inordinate*
    > amounts of work to figure out the computational
    > complexity of what he has implemented. But, as I
    > wrote, he *must* have thought about the matter, and
    > is thus, by definition, in a rather good position
    > to provide what he can frequently simply pull from
    > memory when it comes to documenting the interface
    > of his module.


    I talked about Big-O (worst case) or Big-Theta (average case)
    just to point out that no simple characterization like "O(N)"
    tells enough of the story to be practically useful. Once you
    decide that isn't good enough, the burden on creating the
    documentation is getting substantial, especially given that
    you've already spent the effort to write the code and tests
    for it. In fact I'd hesitate to raise the documentation bar
    any higher -- I'd hate to think someone thought, "Oh, forget
    it, I'll just use this code myself."

    >>>>> *Experimenting* is not necessarily as easy to
    >>>>>do as you seem to believe.

    No, "experimenting" is not always easy (though often it is
    easy enough). However, "experimenting" puts the cost on the
    person who derives the benefit, and is thus likely to not be
    done in a slipshod way.

    > I am not at all a lawyer type, as you seem to
    > imagine. I just want to suggest that some
    > (to the implementer - but not the average user)
    > *easily* available information about computational
    > complexity (especially for the most basic data types)
    > would be a good thing to have. Nothing more.

    My point is that simple performance characterization is not
    good enough to be useful, and fully accurate characterization
    is an onerous documentation burden. Something in between
    will be fraught with complaints about "surprising" worst
    cases. Whether you would approach middle-ground documentation
    with the spirit of "this is enough to go on" or not, rest
    assured that a number of people will complain in a "language-
    lawyer-like" way about any perceived imperfections.

    >>Would you mind if the quality is proportional to
    >>the price you paid?

    Here I was too snippy. Sorry.

    >>>>You are, of course, either assuming that there's a
    >>>>single implementation of Python,
    >>>>or that all implementations have the same behaviour.

    (Each line denied with "Of course not.")
    But realistically, most users will learn the performance
    characteristics once, and continue to use the trade-offs
    that were characteristics of that version of Python going
    forward. Now maybe you behave differently, but most
    programmers I know (and I explicitly include myself) do
    have that tendency. Lots of python code moves around
    operating systems and implementations (and the number
    of implementations is growing).

    >>>But it is reasonable, anyway, to ask for information
    >>>about a specific implementation (that one is *forced*
    >>>to use).

    >>You are not _forced_ to use any implementation of Python.

    > What I wanted to say is ... this: *whenever* a Python
    > program is run, it will run on some specific
    > implementation of Python. That much should be obvious.

    If I were still in a snippy mood, I'd point out that you
    might look at your original statement and see how it might
    be read by someone who accidentally read what you wrote,
    rather than what you wanted to write.

    >>You are free to implement your own Python system.

    > Don't be ridiculous.

    If Pypy gets going this may not be as unlikely as you think.
    You will be able to (relatively easily) replace implementations
    of parts of Python and re-generate a system.

    > It is you (and only you) here, who is arguing that if
    > perfection is not attainable (with reasonable effort),
    > nothing should be done at all.

    I am trying to say the request you make is substantially larger
    than you understand (or would appear at first blush). Further
    it is a request that asks for work from others for your (and
    other's) benefit, not work that you offer to pitch in and help on.

    > I am as willing to live with imperfect module interface
    > specifications as I am with imperfect tools more
    > generally. But that will not stop me from making
    > *suggestions* for improvement (not arrogant demands, mind you).

    Try writing something that would fit your standards for all of
    the basic types, punting where you don't know details. If it is
    enough to be useful, others will chip in and help fix it where it
    is broken.

    > I have seen *many* books on algorithms that specify
    > computational complexity measures for the algorithms
    > described. Any really good library of algorithms will
    > at least try to reach that standard, too.

    How many "really good libraries of algorithms" do you know of?
    Could you name a couple that have these characterizations?
    Practical code is rife with issues of "finite address space,"
    low-order effects swamping the higher-order terms for most
    practical sizes and such. It is this collection of nasty little
    details that makes characterizing libraries frustratingly hard.

    >>>I consider it the job of the implementer to ...

    >>Am I to take it you provide this to all of your customers?

    > ...I did not want to pose as the customer who expects nothing
    > but perfection from *others* (but not himself, of course).
    > I am not at all like that: you are attacking
    > a figment of your imagination.

    Actually, I was not attacking at all. I was trying to suggest
    that it is a harder task than you imagine. I am assuming you won't
    take me up on the suggestion of starting a flawed document, but I'd
    be happy to be proved wrong. If not, try writing a characterization
    of the performance of a dozen or so of your own programs. Does the
    result seem useful in proportion to the work it took you to write it?

    >>>How reasonable is it to ask me, or anyone else
    >>>for that matter, to extract, experiment-wise
    >>>(if it can be done at all with reasonable effort)
    >>>information that properly belongs to the implementer
    >>>and really should have been exposed in the
    >>>documentation in the first place?


    >>Not at all reasonable. How reasonable is it to ask
    >>me to provide you support information for free?


    OK, here I was _very_ snippy. Sorry again.

    I will assert that often the experiment results in a single bit of
    information: "it is fast enough." Experimenting will get you reliable
    answers like that, and only when run-time is an issue will you need to
    go into it much further.

    --Scott David Daniels
    Scott David Daniels, Oct 14, 2005
    #13
  14. "jon" <> wrote in message
    news:...
    >
    > To take the heat out of the discussion:
    >
    > sets are blazingly fast.


    I'd prefer a (however) rough characterization
    of computational complexity in terms of Big-Oh
    (or Big-whatever) *anytime* to marketing-type
    characterizations like this one...

    Regards,
    Christian
    Christian Stapfer, Oct 15, 2005
    #14
  15. On Sat, 15 Oct 2005 06:31:53 +0200, Christian Stapfer wrote:

    > "jon" <> wrote in message
    > news:...
    >>
    >> To take the heat out of the discussion:
    >>
    >> sets are blazingly fast.

    >
    > I'd prefer a (however) rough characterization
    > of computational complexity in terms of Big-Oh
    > (or Big-whatever) *anytime* to marketing-type
    > characterizations like this one...


    Oh how naive.

    The marketing department says: "It's O(N), so it is blindingly fast."

    Translation: the amount of computation it does is linearly proportional
    to N. The constant of proportionality is 1e10.

    The marketing department says: "Our competitor's product is O(N**2), so it
    runs like a three-legged dog."

    Translation: the amount of computation it does is linearly proportional to
    N squared. The constant of proportionality is 1e-100.

    You do the maths.

    Big O notation is practically useless for judging how fast a single
    algorithm will be, or how one algorithm compares to another. It is only
    useful for telling you how a single algorithm will scale as the input
    increases.

    It is very common for sensible programmers to fall back on a "less
    efficient" O(N**2) or even O(2**N) algorithm for small amounts of data, if
    that algorithm runs faster than the "more efficient" O(N) or O(log N)
    algorithm. In fact, that's exactly what the sort() method does in Python:
    for small enough lists, say, under 100 elements, it is quicker to run an
    O(N**2) algorithm (shell sort I believe) than it is to perform the
    complex set up for the merge-sort variant used for larger lists.

    As for sets, they are based on dicts, which are effectively hash tables.
    Hash tables are O(1), unless there are collisions, in which case the more
    common algorithms degenerate to O(N). So, a very rough and ready estimate
    of the complexity of the algorithm using sets would be somewhere between
    O(1) and O(N) depending on the details of Python's algorithms.

    So, let's do an experiment, shall we?

    from sets import Set
    import time

    def compare_and_separate_with_sets(A, B):
        AB = Set(A+B)
        A = Set(A)
        B = Set(B)
        only_A = list(A-B)
        only_B = list(B-A)
        both_AB = list(AB - Set(only_A+only_B))
        return (only_A, only_B, both_AB)

    def timeit(f, args, n):
        """Time function f when called with *args. For timing purposes,
        does n calls of f. Returns the average time used per call in seconds.
        """
        loopit = range(n)
        t = time.time()
        for i in loopit:
            results = f(*args)
        t = time.time() - t
        return t/n

    def single_test(N):
        print ("N = %-8d" % N),
        A = range(N)
        B = range(N/2, 3*N/2)
        return timeit(compare_and_separate_with_sets, (A, B), 20)

    def test_Order():
        # test how compare_and_separate_with_sets scales with size
        for i in range(7):
            print single_test(10**i)


    Now run the test code:

    py> test_Order()
    N = 1 0.000219106674194
    N = 10 0.000135183334351
    N = 100 0.000481128692627
    N = 1000 0.0173740386963
    N = 10000 0.103679180145
    N = 100000 0.655336141586
    N = 1000000 8.12827801704

    In my humble opinion, that's not bad behaviour. It looks O(log N) to me,
    and quite fast too: about 8 seconds to compare and separate two lists of
    one million items each.

    The craziest thing is, the amount of time it took to write and test two
    different algorithms was probably 1% of the time it would take to hunt up
    theoretical discussions of what the big O behaviour of the algorithms
    would be.



    --
    Steven.
    Steven D'Aprano, Oct 15, 2005
    #15
  16. "Steven D'Aprano" <> wrote in message
    news:p...
    > On Sat, 15 Oct 2005 06:31:53 +0200, Christian Stapfer wrote:
    >
    >> "jon" <> wrote in message
    >> news:...
    >>>
    >>> To take the heat out of the discussion:
    >>>
    >>> sets are blazingly fast.

    >>
    >> I'd prefer a (however) rough characterization
    >> of computational complexity in terms of Big-Oh
    >> (or Big-whatever) *anytime* to marketing-type
    >> characterizations like this one...

    >
    > Oh how naive.


    Why is it that even computer science undergrads
    are required to learn the basics of Big-Oh and
    all that? Are computer scientists really morons,
    as Xah Lee suggests? I can't believe it, but
    maybe you have a point...

    > The marketing department says: "It's O(N), so it is blindingly fast."


    I might as well interpret "blindingly fast"
    as meaning O(1). - Why not?
    Surely marketing might also have reasoned like
    this: "It's O(1), so its blindingly fast".
    But I *want*, nay, I *must* know whether it is
    O(N) or O(1). So forget about marketingspeak,
    it's a deadly poison for developers. It might
    be ok to induce grandiose feelings in childish
    users - but developers are grown ups: they must
    face reality...

    > Translation: the amount of computation it does is linearly proportional
    > to N. The constant of proportionality is 1e10.
    >
    > The marketing department says: "Our competitor's product is O(N**2), so it
    > runs like a three-legged dog."
    >
    > Translation: the amount of computation it does is linearly proportional to
    > N squared. The constant of proportionality is 1e-100.
    >
    > You do the maths.
    >
    > Big O notation is practically useless for judging how fast a single
    > algorithm will be, or how one algorithm compares to another.


    That's why Knuth liked it so much?
    That's why Aho, Hopcroft and Ullman liked it so much?
    That's why Gonnet and Baeza-Yates liked it so much?

    > It is only useful for telling you how a single algorithm
    > will scale as the input increases.


    And that's really very useful information indeed.
    Since, given such information for the basic data types
    and operations, as implemented by the language and
    its standard libraries, I stand a real chance of
    being able to determine the computational complexity
    of the *particular*combination* of data types and
    algorithms of my own small utility or of a
    critical piece of my wonderful and large application,
    on which the future of my company depends, with some
    confidence and accuracy.

    > It is very common for sensible programmers to fall back on a "less
    > efficient" O(N**2) or even O(2**N) algorithm for small amounts of data, if
    > that algorithm runs faster than the "more efficient" O(N) or O(log N)
    > algorithm. In fact, that's exactly what the sort() method does in Python:
    > for small enough lists, say, under 100 elements, it is quicker to run an
    > O(N**2) algorithm (shell sort I believe) than it is to perform the
    > complex set up for the merge-sort variant used for larger lists.
    >
    > As for sets, they are based on dicts, which are effectively hash tables.
    > Hash tables are O(1), unless there are collisions,


    Depending on the "load factor" of the hash tables.
    So we would want to ask, if we have very large
    lists indeed, how much space needs to be invested
    to keep the load factor so low that we can say
    that the membership test is O(1). Do A-B and A&B
    have to walk the entire hash table (which must be
    larger than the sets, because of a load factor
    < 1)? Also: the conversion of lists to sets needs
    the insertion of N elements into those hash tables.
    That alone already makes the overall algorithm
    *at*least* O(N). So forget about O(log N).

    > in which case the more
    > common algorithms degenerate to O(N).


    So here, indeed, we have the kind of reasoning that
    one ought to be able to deliver, based on what's in
    the Python documentation. Luckily, you have that
    kind of knowledge of both: how sets are implemented
    and what Big-Oh attaches to the hash table operation
    of "look up".
    In order to *enable* SUCH reasoning for *everyone*,
    starting from the module interface documentation only,
    one clearly needs something along the lines that
    I was suggesting...

    > So, a very rough and ready estimate
    > of the complexity of the algorithm using sets would be somewhere between
    > O(1) and O(N) depending on the details of Python's algorithms.
    >
    > So, let's do an experiment, shall we?
    >
    > from sets import Set
    > import time
    >
    > def compare_and_separate_with_sets(A, B):
    >     AB = Set(A+B)
    >     A = Set(A)
    >     B = Set(B)
    >     only_A = list(A-B)
    >     only_B = list(B-A)
    >     both_AB = list(AB - Set(only_A+only_B))
    >     return (only_A, only_B, both_AB)
    >
    > def timeit(f, args, n):
    >     """Time function f when called with *args. For timing purposes,
    >     does n calls of f. Returns the average time used per call in seconds.
    >     """
    >     loopit = range(n)
    >     t = time.time()
    >     for i in loopit:
    >         results = f(*args)
    >     t = time.time() - t
    >     return t/n
    >
    > def single_test(N):
    >     print ("N = %-8d" % N),
    >     A = range(N)
    >     B = range(N/2, 3*N/2)
    >     return timeit(compare_and_separate_with_sets, (A, B), 20)
    >
    > def test_Order():
    >     # test how compare_and_separate_with_sets scales with size
    >     for i in range(7):
    >         print single_test(10**i)
    >
    >
    > Now run the test code:
    >
    > py> test_Order()
    > N = 1 0.000219106674194
    > N = 10 0.000135183334351


    Curious: N=10 takes less time than N=1?

    > N = 100 0.000481128692627


    Why do we have such a comparatively large jump
    here, from N=100 to N=1000? Did hash tables
    overflow during conversion or something like
    that?

    > N = 1000 0.0173740386963
    > N = 10000 0.103679180145
    > N = 100000 0.655336141586
    > N = 1000000 8.12827801704


    Doesn't look quite O(n). Not yet...

    > In my humble opinion, that's not bad behaviour.
    > It looks O(log N) to me,


    How could that be? *Every* element of A and B must touched,
    if only to be copied: that can't make it O(log(N)).
    Also, the conversion of lists to sets must be at least
    O(N). And N isn't the right measure anyway. It would probably
    have to be in terms of |A| and |B|. For example, if |A| is very
    small, as compared to |B|, then A-B and A & B can be determined
    rather quickly by only considering elements of A.
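
    For what it's worth, just taking ratios of the successive timings
    quoted above (rounded; a back-of-the-envelope check, nothing more):

    # each step is a tenfold increase in N; for O(log N) the ratios should
    # approach 1, for O(N) they should approach 10
    timings = [0.000219, 0.000135, 0.000481, 0.0174, 0.104, 0.655, 8.13]
    ratios = [b / a for a, b in zip(timings, timings[1:])]
    # -> roughly [0.6, 3.6, 36, 6.0, 6.3, 12.4]: fixed overhead dominates
    # at small N, and the large-N end looks linear-ish, not logarithmic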

    > and quite fast too: about 8 seconds to compare and separate two lists of
    > one million items each.
    >
    > The craziest thing is, the amount of time it took to write and test two
    > different algorithms was probably 1% of the time it would take to hunt up
    > theoretical discussions of what the big O behaviour of the algorithms
    > would be.


    You must distinguish questions of principle
    and questions of muddling through like this
    testing bit you've done. It would take me some
    time to even be *sure* how to interpret the
    result. I would never want to say "it looks
    O(log N) to me", as you do, and leave it at
    that. Rather, I might say, as you do, "it
    looks O(log N) to me", *but* then try to figure
    out, given my knowledge of the implementation
    (performance wise, based on information that
    is sadly missing in the Python documentation),
    *why* that might be. Then, if my experiments say
    "it looks like O(log N)" AND if my basic
    knowledge of the implementation of set and
    list primitives says "it should be O(log N)"
    as well, I would venture, with some *confidence*,
    to claim: "it actually IS O(log N)"....

    You do not compare the convert-it-to-sets-approach
    to the single list-walk either. Supposing the OP
    had actually sorted lists to begin with, then a
    single, simultaneous walk of the lists would be
    about as fast as it can get. Very, very likely
    *faster* than conversion to sets would be...

    Regards,
    Christian
    Christian Stapfer, Oct 15, 2005
    #16
  17. On Sat, 15 Oct 2005 18:17:36 +0200, Christian Stapfer wrote:

    >>> I'd prefer a (however) rough characterization
    >>> of computational complexity in terms of Big-Oh
    >>> (or Big-whatever) *anytime* to marketing-type
    >>> characterizations like this one...

    >>
    >> Oh how naive.

    >
    > Why is it that even computer science undergrads
    > are required to learn the basics of Big-Oh and
    > all that?


    So that they know how to correctly interpret what Big O notation means,
    instead of misinterpreting it. Big O notation doesn't tell you everything
    you need to know to predict the behaviour of an algorithm. It doesn't even
    tell you most of what you need to know about its behaviour. Only actual
    *measurement* will tell you what you need to know.

    Perhaps you should actually sit down and look at the assumptions,
    simplifications, short-cuts and trade-offs that computer scientists make
    when they estimate an algorithm's Big O behaviour. It might shock you out
    of your faith that Big O is the be all and end all of algorithm planning.

    For all but the simplest algorithm, it is impractical to actually count
    all the operations -- and even if you did, the knowledge wouldn't help
    you, because you don't know how long each operation takes to get executed.
    That is platform specific.

    So you simplify. You pretend that paging never happens. That memory
    allocations take zero time. That set up and removal of data structures
    take insignificant time. That if there is an N**2 term, it always swamps
    an N term. You assume that code performance is independent of the CPUs.
    You assume that some operations (e.g. comparisons) take no time, and
    others (e.g. moving data) are expensive.

    Those assumptions sometimes are wildly wrong. I've been seriously bitten
    following text book algorithms written for C and Pascal: they assume that
    comparisons are cheap and swapping elements is expensive. But in Python,
    swapping elements is cheap and comparisons are expensive, because of all
    the heavy object-oriented machinery used. Your classic text book algorithm
    is not guaranteed to survive contact with the real world: you have to try
    it and see.

    Given all the assumptions, it is a wonder that Big O estimates are ever
    useful, not that they sometimes are misleading.



    [snip]
    >> The marketing department says: "It's O(N), so it is blindingly fast."

    >
    > I might as well interpret "blindingly fast" as meaning O(1). - Why not?
    > Surely marketing might also have reasoned like
    > this: "It's O(1), so its blindingly fast". But I *want*, nay, I *must*
    > know whether it is O(N) or O(1).


    You might _want_, but you don't _need_ to know which it is, not in every
    case. In general, there are many circumstances where it makes no
    sense to worry about Big O behaviour. What does your expected data look like?
    If your data never gets above N=2, then who cares whether it is O(1)=1,
    O(N)=2, O(N**2)=4 or O(2**N)=4? They are all about as fast.

    Even bubble sort will sort a three-element list fast enough -- and
    probably faster than more complicated sorts. Why spend all the time
    setting up the machinery for a merge sort for three elements?


    [snip]

    >> Big O notation is practically useless for judging how fast a single
    >> algorithm will be, or how one algorithm compares to another.

    >
    > That's why Knuth liked it so much?
    > That's why Aho, Hopcroft and Ullman liked it so much? That's why Gonnet
    > and Baeza-Yates liked it so much?


    Two reasons: it is useful for telling you how a single algorithm will
    scale as the input increases, just as I said.

    And, unlike more accurate ways of calculating the speed of an algorithm
    from first principles, it is actually possible to do Big O calculations.

    No doubt the state of the art of algorithm measurements has advanced since
    I was an undergraduate, but certain fundamental facts remain: in order to
    calculate that Big O, you have to close your eyes to all the realities
    of practical code execution, and only consider an idealised calculation.
    Even when your calculation gives you constants of proportionality and
    other coefficients, Big O notation demands you throw that information away.

    But by doing so, you lose valuable information. An O(N**2) algorithm that
    scales like 1e-6 * N**2 will be faster than an O(N) algorithm that scales
    as 1e6 * N, until N reaches one million million. By tossing away those
    coefficients, you wrongly expect the first algorithm to be slower than the
    second, and choose the wrong algorithm.
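
    (A quick sanity check of that arithmetic, using the illustrative constants
    above - this is a sketch of the example, not a benchmark of any real code:)

    # Cost models using the example coefficients above; purely illustrative.
    def cost_quadratic(n):
        return 1e-6 * n * n     # O(N**2) with a tiny constant of proportionality

    def cost_linear(n):
        return 1e6 * n          # O(N) with a huge constant of proportionality

    n = 1
    while cost_quadratic(n) < cost_linear(n):
        n *= 10
    print("the O(N) algorithm only wins once N reaches about %g" % n)  # 1e+12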



    >> It is only useful for telling you how a single algorithm will scale as
    >> the input increases.

    >
    > And that's really very useful information indeed.


    Yes it is. Did I ever say it wasn't?


    > Since, given such
    > information for the basic data types and operations, as implemented by
    > the language and its standard libraries, I stand a real chance of being
    > able to determine the computational complexity of the
    > *particular*combination* of data types and algorithms of my own small
    > utility or of a critical piece of my wonderful and large application, on
    > which the future of my company depends, with some confidence and
    > accuracy.


    Yes, zero is a real chance.


    [snip]

    >> As for sets, they are based on dicts, which are effectively hash
    >> tables. Hash tables are O(1), unless there are collisions,

    >
    > Depending on the "load factor" of the hash tables. So we would want to
    > ask, if we have very large lists indeed, how much space needs to be
    > invested to keep the load factor so low that we can say that the
    > membership test is O(1).


    And knowing that hash tables are O(1) will not tell you that, will it?

    There is only one practical way of telling: do the experiment. Keep
    loading up that hash table until you start getting lots of collisions.


    > Do A-B and A&B have to walk the entire hash
    > table (which must be larger than the sets, because of a load factor <
    > 1)? Also: the conversion of lists to sets needs the insertion of N
    > elements into those hash tables. That alone already makes the overall
    > algorithm *at*least* O(N). So forget about O(log N).


    Yes, inserting N items into a hash table takes at least N inserts. But if
    those inserts are fast enough, you won't even notice the time it takes to
    do it, compared to the rest of your algorithm. In many algorithms, you
    don't even care about the time it takes to put items in your hash table,
    because that isn't part of the problem you are trying to solve.

    So in a real, practical sense, it may be that your algorithm gets dominated
    by the O(log N) term even though there is technically an O(N) term in
    there. Are Python dicts like that? I have no idea. But I know how to find
    out: choose a problem domain I care about ("dicts with less than one
    million items") and do the experiment.


    >> in which case the more
    >> common algorithms degenerate to O(N).

    >
    > So here, indeed, we have the kind of reasoning that one ought to be able
    > to deliver, based on what's in the Python documentation. Luckily, you
    have that kind of knowledge of both how sets are implemented and what
    > Big-Oh attaches to the hash table operation of "look up".
    > In order to *enable* SUCH reasoning for *everyone*,
    > starting from the module interface documentation only, one clearly needs
    > something along the lines that I was suggesting...


    I don't object to having that Big O information available, except
    insofar as it can be misleading, but I take issue with your position that
    such information is necessary.


    [snip]

    >> Now run the test code:
    >>
    >> py> test_Order()
    >> N = 1 0.000219106674194
    >> N = 10 0.000135183334351

    >
    > Curious: N=10 takes less time than N=1?


    Yes, funny how real-world results aren't as clean and neat as they are in
    theory. There are those awkward assumptions coming to bite you again. I've
    done two additional tests, and get:

    N = 1 0.000085043907166
    N = 10 0.000106656551361

    N = 1 0.000497949123383
    N = 10 0.000124049186707

    Remember, these results are averaged over twenty trials. So why is it
    quicker to work with sets of size 10 than sets of size 1? Big O
    notation will never tell you, because it ignores the implementation
    details that really make a difference.



    >> N = 100 0.000481128692627

    >
    > Why do we have such a comparatively large jump here, from N=100 to
    > N=1000? Did hash tables overflow during conversion or something like
    > that?


    Who knows? Maybe Python was doing some garbage collection the first time I
    ran it. I've modified my code to print a scale factor, and here is another
    run:

    N = 1 0.00113509893417
    N = 10 0.000106143951416 (x 0.093511)
    N = 100 0.00265134572983 (x 24.978774)
    N = 1000 0.0057701587677 (x 2.176313)
    N = 10000 0.0551437973976 (x 9.556721)
    N = 100000 0.668345856667 (x 12.120055)
    N = 1000000 8.6285964489 (x 12.910376)

    An increase from N=1 to 1000000 (that's a factor of one million) leads to
    an increase in execution time by a factor of about 7600.

    You will notice that the individual numbers vary significantly from trial
    to trial, but the over-all pattern is surprisingly consistent.
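
    (For anyone reading without the code from the earlier post, a harness in
    roughly the same spirit - a reconstruction, not the code that produced the
    numbers above - looks something like this:)

    import time

    def compare_with_sets(A, B):
        # The approach under test: convert both lists to sets, then take the
        # two differences and the intersection.
        sA, sB = set(A), set(B)
        return list(sA - sB), list(sB - sA), list(sA & sB)

    def test_order(trials=20):
        prev = None
        for exponent in range(0, 7):               # N = 1 .. 1,000,000
            n = 10 ** exponent
            A = list(range(0, 2 * n, 2))           # n even numbers
            B = list(range(0, 3 * n, 3))           # n multiples of three
            start = time.time()
            for _ in range(trials):
                compare_with_sets(A, B)
            elapsed = (time.time() - start) / trials
            scale = " (x %f)" % (elapsed / prev) if prev else ""
            print("N = %-8d %s%s" % (n, elapsed, scale))
            prev = elapsed

    test_order()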


    >> N = 1000 0.0173740386963
    >> N = 10000 0.103679180145
    >> N = 100000 0.655336141586
    >> N = 1000000 8.12827801704

    >
    > Doesn't look quite O(n). Not yet...


    No it doesn't.

    >
    >> In my humble opinion, that's not bad behaviour. It looks O(log N) to
    >> me,


    That's a mistake -- it is nowhere near O(log N). My bad. Closer to
    O(sqrt N), if anything.


    > How could that be? *Every* element of A and B must be touched, if only to
    > be copied: that can't make it O(log(N)).


    And, no doubt, if you had *really enormous* lists, oh, I don't know, maybe
    a trillion items, you would see that O(N) behaviour. But until then, the
    overall performance is dominated by the smaller-order terms with larger
    coefficients.


    > Also, the conversion of lists
    > to sets must be at least O(N). And N isn't the right measure anyway. It
    > would probably have to be in terms of |A| and |B|. For example, if |A|
    > is very small, as compared to |B|, then A-B and A & B can be determined
    > rather quickly by only considering elements of A.



    Both lists have the same number of elements, so double N.


    [snip]

    > You must distinguish questions of principle and questions of muddling
    > through like this testing bit you've done.


    Your "question of principle" gives you completely misleading answers.
    Remember, you were the one who predicted that lists would have to be
    faster than sets. Your prediction failed miserably.


    > It would take me some time to
    > even be *sure* how to interpret the result.


    What's to interpret? I know exactly how fast the function will run, on
    average, on my hardware. I can even extrapolate to larger sizes of N,
    although I would be very careful to not extrapolate too far. (I predict
    less than 10 minutes to work on a pair of 10,000,000 element lists, and
    less than two hours to work on 100,000,000 element lists.)

    > I would never want to say
    > "it looks O(log N) to me", as you do, and leave it at that. Rather, I
    > might say, as you do, "it looks O(log N) to me", *but* then try to
    > figure out, given my knowledge of the implementation (performance wise,
    > based on information that is sadly missing in the Python documentation),
    > *why* that might be.


    Fine. You have the source code, knock yourself out.

    > Then, if my experiments say "it looks like O(log
    > N)" AND if my basic knowledge of the implementation of set and list
    > primitives says "it should be O(log N)" as well, I would venture, with
    > some *confidence*, to claim: "it actually IS O(log N)"....
    >
    > You do not compare the convert-it-to-sets-approach
    > to the single list-walk either.


    No I did not, because I didn't have a function to do it. You've got my
    source code. Knock yourself out to use it to test any function you like.

    > Supposing the OP had actually sorted
    > lists to begin with, then a single, simultaneous walk of the lists would
    > be about as fast as it can get. Very, very likely *faster* than
    > conversion to sets would be...


    Please let us know how you go with that. It should be really interesting
    to see how well your prediction copes with the real world.

    (Hint: another of those awkward little implementation details... how much
    work is being done in C code, and how much in pure Python? Just something
    for you to think about. And remember, an O(N) algorithm in Python will be
    faster than an O(N**2) algorithm in C... or is that slower?)
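
    (For concreteness, the single simultaneous walk over two *sorted* lists
    that is being proposed can be sketched as follows. This is an illustration
    of the idea only - neither poster's code - and it assumes the inputs are
    already sorted and free of duplicates:)

    def sorted_walk(A, B):
        # One pass over two sorted, duplicate-free lists, producing the
        # elements only in A, only in B, and in both.
        only_a, only_b, both = [], [], []
        i = j = 0
        while i < len(A) and j < len(B):
            if A[i] == B[j]:
                both.append(A[i])
                i += 1
                j += 1
            elif A[i] < B[j]:
                only_a.append(A[i])
                i += 1
            else:
                only_b.append(B[j])
                j += 1
        only_a.extend(A[i:])      # whatever remains occurs in one list only
        only_b.extend(B[j:])
        return only_a, only_b, both

    # Example: ([1, 2], [5, 6], [3, 4])
    print(sorted_walk([1, 2, 3, 4], [3, 4, 5, 6]))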



    --
    Steven.
    Steven D'Aprano, Oct 15, 2005
    #17
  18. "Steven D'Aprano" <> wrote in message
    news:p...
    > On Sat, 15 Oct 2005 18:17:36 +0200, Christian Stapfer wrote:
    >
    >>>> I'd prefer a (however) rough characterization
    >>>> of computational complexity in terms of Big-Oh
    >>>> (or Big-whatever) *anytime* to marketing-type
    >>>> characterizations like this one...
    >>>
    >>> Oh how naive.

    >>
    >> Why is it that even computer science undergrads
    >> are required to learn the basics of Big-Oh and
    >> all that?

    >
    > So that they know how to correctly interpret what Big O notation means,
    > instead of misinterpreting it. Big O notation doesn't tell you everything
    > you need to know to predict the behaviour of an algorithm.


    Well, that's right. I couldn't agree more:
    it doesn't tell you *everything* but it does
    tell you *something*. And that *something*
    is good to have.

    > It doesn't even tell you most of what you need to know about its
    > behaviour.
    > Only actual *measurement* will tell you what you need to know.


    Well, that's where you err: Testing doesn't
    tell you everything *either*. You need *both*:
    a reasonable theory *and* experimental data...
    If theory and experimental data disagree,
    we would want to take a closer look at both,
    the theory (which may be mistaken or inadequate)
    *and* the experiment (which may be inadequate or
    just plain botched as well).

    > Perhaps you should actually sit down and look at the assumptions,
    > simplifications, short-cuts and trade-offs that computer scientists make
    > when they estimate an algorithm's Big O behaviour. It might shock you out
    > of your faith that Big O is the be all and end all of algorithm planning.
    >
    > For all but the simplest algorithm, it is impractical to actually count
    > all the operations -- and even if you did, the knowledge wouldn't help
    > you, because you don't know how long each operation takes to get executed.
    > That is platform specific.
    >
    > So you simplify. You pretend that paging never happens. That memory
    > allocations take zero time. That set up and removal of data structures
    > take insignificant time. That if there is an N**2 term, it always swamps
    > an N term. You assume that code performance is independent of the CPUs.
    > You assume that some operations (e.g. comparisons) take no time, and
    > others (e.g. moving data) are expensive.
    >
    > Those assumptions are sometimes wildly wrong. I've been seriously bitten
    > following textbook algorithms written for C and Pascal: they assume that
    > comparisons are cheap and swapping elements is expensive. But in Python,
    > swapping elements is cheap and comparisons are expensive, because of all
    > the heavy object-oriented machinery used. Your classic textbook algorithm
    > is not guaranteed to survive contact with the real world: you have to try
    > it and see.


    Still: the expensiveness of those operations (such
    as swapping elements vs. comparisons) will only
    affect the constant of proportionality, not the
    asymptotic behavior of the algorithm. Sooner or
    later the part of your program that has the
    worst asymptotic behavior will determine the speed
    (or memory requirements) of your program.

    > Given all the assumptions, it is a wonder that Big O
    > estimates are ever useful, not that they sometimes
    > are misleading.
    >
    > [snip]
    >>> The marketing department says: "It's O(N), so it is blindingly fast."

    >>
    >> I might as well interpret "blindingly fast" as meaning O(1). - Why not?
    >> Surely marketing might also have reasoned like
    >> this: "It's O(1), so its blindingly fast". But I *want*, nay, I *must*
    >> know whether it is O(N) or O(1).

    >
    > You might _want_, but you don't _need_ to know which it is, not in every
    > case. In general, there are many circumstances where it makes no
    > sense to worry about Big O behaviour. What does your expected data look like?
    > If your data never gets above N=2, then who cares whether it is O(1)=1,
    > O(N)=2, O(N**2)=4 or O(2**N)=4? They are all about as fast.
    >
    > Even bubble sort will sort a three-element list fast enough -- and
    > probably faster than more complicated sorts. Why spend all the time
    > setting up the machinery for a merge sort for three elements?


    Since you have relapsed into a fit of mere polemics,
    I assume I have made my point as regards marketing-type
    characterizations of algorithms ("blazingly fast")
    vs. (however rough) measures of asymptotic
    complexity, like Big-Oh. - Which really
    was the root of this sub-thread, which went like this:

    ...>> To take the heat out of the discussion:
    ... >> sets are blazingly fast.

    ... > I'd prefer a (however) rough characterization
    ... > of computational complexity in terms of Big-Oh
    ... > (or Big-whatever) *anytime* to marketing-type
    ... > characterizations like this one...

    > [snip]
    >
    >>> Big O notation is practically useless for judging how fast a single

    ^^^^^^^^^^^^^^^^^^^^^^
    >>> algorithm will be, or how one algorithm compares to another.

    >>
    >> That's why Knuth liked it so much?
    >> That's why Aho, Hopcroft and Ullman liked it so much? That's why Gonnet
    >> and Baeza-Yates liked it so much?

    >
    > Two reasons: it is useful for telling you how a single algorithm will
    > scale as the input increases, just as I said.


    Curiously, just a few lines before writing this,
    you have polemically denied any "practical" use
    for Big-Oh notation.

    > And, unlike more accurate ways of calculating the speed of an algorithm
    > from first principles, it is actually possible to do Big O calculations.


    Right. It's a compromise: being somewhat precise
    - without getting bogged down trying to solve major
    combinatorial research problems...

    > No doubt the state of the art of algorithm measurements has advanced since
    > I was an undergraduate, but certain fundamental facts remain: in order to
    > calculate that Big O, you have to close your eyes to all the realities
    > of practical code execution, and only consider an idealised calculation.


    That's right. Nothing stops you from then opening
    your eyes and testing some code, of course. *But*
    always try to relate what you see there with what
    theoretical grasp of the situation you have.
    If experimental data and theory *disagree*: try to
    fix the experiment and/or the theory.

    > Even when your calculation gives you constants of proportionality and
    > other coefficients, Big O notation demands you throw that information
    > away.
    >
    > But by doing so, you lose valuable information. An O(N**2) algorithm that
    > scales like 1e-6 * N**2 will be faster than an O(N) algorithm that scales
    > as 1e6 * N, until N reaches one million million. By tossing away those
    > coefficients, you wrongly expect the first algorithm to be slower than the
    > second, and choose the wrong algorithm.
    >
    >>> It is only useful for telling you how a single algorithm will scale as
    >>> the input increases.

    >>
    >> And that's really very useful information indeed.

    >
    > Yes it is. Did I ever say it wasn't?


    Well, yes: given the way you attacked Big-Oh notation
    as "practically useless" (see above), I assumed you
    did.

    >> Since, given such
    >> information for the basic data types and operations, as implemented by
    >> the language and its standard libraries, I stand a real chance of being
    >> able to determine the computational complexity of the
    >> *particular*combination* of data types and algorithms of my own small
    >> utility or of a critical piece of my wonderful and large application, on
    >> which the future of my company depends, with some confidence and
    >> accuracy.

    >
    > Yes, zero is a real chance.
    >
    >
    > [snip]
    >
    >>> As for sets, they are based on dicts, which are effectively hash
    >>> tables. Hash tables are O(1), unless there are collisions,

    >>
    >> Depending on the "load factor" of the hash tables. So we would want to
    >> ask, if we have very large lists indeed, how much space needs to be
    >> invested to keep the load factor so low that we can say that the
    >> membership test is O(1).

    >
    > And knowing that hash tables are O(1) will not tell you that, will it?
    >
    > There is only one practical way of telling: do the experiment. Keep
    > loading up that hash table until you start getting lots of collisions.
    >
    >> Do A-B and A&B have to walk the entire hash
    >> table (which must be larger than the sets, because of a load factor <
    >> 1)? Also: the conversion of lists to sets needs the insertion of N
    >> elements into those hash tables. That alone already makes the overall
    >> algorithm *at*least* O(N). So forget about O(log N).

    >
    > Yes, inserting N items into a hash table takes at least N inserts. But if
    > those inserts are fast enough, you won't even notice the time it takes to
    > do it, compared to the rest of your algorithm. In many algorithms, you
    > don't even care about the time it takes to put items in your hash table,
    > because that isn't part of the problem you are trying to solve.
    >
    > So in a real, practical sense, it may be that your algorithm gets dominated
    > by the O(log N) term even though there is technically an O(N) term in
    > there. Are Python dicts like that? I have no idea. But I know how to find
    > out: choose a problem domain I care about ("dicts with less than one
    > million items") and do the experiment.
    >
    >
    >>> in which case the more
    >>> common algorithms degenerate to O(N).

    >>
    >> So here, indeed, we have the kind of reasoning that one ought to be able
    >> to deliver, based on what's in the Python documentation. Luckily, you
    >> have that kind of knowledge of both how sets are implemented and what
    >> Big-Oh attaches to the hash table operation of "look up".
    >> In order to *enable* SUCH reasoning for *everyone*,
    >> starting from the module interface documentation only, one clearly needs
    >> something along the lines that I was suggesting...

    >
    > I don't object to having that Big O information available, except
    > insofar as it can be misleading, but I take issue with your position that
    > such information is necessary.


    *Blindly* testing, that is, testing *without* being
    able to *relate* the outcomes of those tests (even
    the *design* of those tests) to some suitably
    simplified, but not completely nonsensical,
    theory (via Big-Oh notation, for example), is *not*
    really good enough.

    >>> Now run the test code:
    >>>
    >>> py> test_Order()
    >>> N = 1 0.000219106674194
    >>> N = 10 0.000135183334351

    >>
    >> Curious: N=10 takes less time than N=1?

    >
    > Yes, funny how real-world results aren't as clean and neat as they are in
    > theory. There are those awkward assumptions coming to bite you again. I've
    > done two additional tests, and get:
    >
    > N = 1 0.000085043907166
    > N = 10 0.000106656551361
    >
    > N = 1 0.000497949123383
    > N = 10 0.000124049186707
    >
    > Remember, these results are averaged over twenty trials. So why is it
    > quicker to work with sets of size 10 than sets of size 1? Big O
    > notation will never tell you, because it ignores the implementation
    > details that really make a difference.
    >
    >
    >
    >>> N = 100 0.000481128692627

    >>
    >> Why do we have such a comparatively large jump here, from N=100 to
    >> N=1000? Did hash tables overflow during conversion or something like
    >> that?

    >
    > Who knows? Maybe Python was doing some garbage collection the first time I
    > ran it. I've modified my code to print a scale factor, and here is another
    > run:
    >
    > N = 1 0.00113509893417
    > N = 10 0.000106143951416 (x 0.093511)
    > N = 100 0.00265134572983 (x 24.978774)
    > N = 1000 0.0057701587677 (x 2.176313)
    > N = 10000 0.0551437973976 (x 9.556721)
    > N = 100000 0.668345856667 (x 12.120055)
    > N = 1000000 8.6285964489 (x 12.910376)
    >
    > An increase from N=1 to 1000000 (that's a factor of one million) leads to
    > an increase in execution time by a factor of about 7600.
    >
    > You will notice that the individual numbers vary significantly from trial
    > to trial, but the over-all pattern is surprisingly consistent.
    >
    >
    >>> N = 1000 0.0173740386963
    >>> N = 10000 0.103679180145
    >>> N = 100000 0.655336141586
    >>> N = 1000000 8.12827801704

    >>
    >> Doesn't look quite O(n). Not yet...

    >
    > No it doesn't.
    >
    >>
    >>> In my humble opinion, that's not bad behaviour. It looks O(log N) to
    >>> me,

    >
    > That's a mistake -- it is nowhere near O(log N). My bad. Closer to
    > O(sqrt N), if anything.
    >
    >
    >> How could that be? *Every* element of A and B must be touched, if only to
    >> be copied: that can't make it O(log(N)).

    >
    > And, no doubt, if you had *really enormous* lists, oh, I don't know, maybe
    > a trillion items, you would see that O(N) behaviour. But until then, the
    > overall performance is dominated by the smaller-order terms with larger
    > coefficients.
    >
    >
    >> Also, the conversion of lists
    >> to sets must be at least O(N). And N isn't the right measure anyway. It
    >> would probably have to be in terms of |A| and |B|. For example, if |A|
    >> is very small, as compared to |B|, then A-B and A & B can be determined
    >> rather quickly by only considering elements of A.

    >
    >
    > Both lists have the same number of elements, so double N.
    >
    >
    > [snip]
    >
    >> You must distinguish questions of principle and questions of muddling
    >> through like this testing bit you've done.

    >
    > Your "question of principle" gives you completely misleading answers.
    > Remember, you were the one who predicted that lists would have to be
    > faster than sets.


    I didn't say they would *have* to be faster
    - I was mainly asking for some *reasoned*
    argument why (and in what sense) conversion
    to sets would be an "efficient" solution
    of the OP's problem.

    > Your prediction failed miserably.


    Interestingly, you admit that you did not
    really compare the two approaches that
    were under discussion. So your experiment
    does *not* (yet) prove what you claim it
    proves.

    >> It would take me some time to
    >> even be *sure* how to interpret the result.

    >
    > What's to interpret? I know exactly how fast the function will run, on
    > average, on my hardware. I can even extrapolate to larger sizes of N,
    > although I would be very careful to not extrapolate too far. (I predict
    > less than 10 minutes to work on a pair of 10,000,000 element lists, and
    > less than two hours to work on 100,000,000 element lists.)


    For a start: you have chosen a very particular type
    of element for those lists / sets: integers. So the
    complexity of comparisons for the OP's application
    might get *seriously* underestimated.
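
    (To illustrate the point: a deliberately expensive - and entirely made-up -
    key class makes the same set operations markedly slower than they are for
    plain integers. A sketch, not a claim about any particular real type:)

    import timeit

    class SlowKey(object):
        # Stand-in for an object whose hashing and comparison are costly.
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            h = 0
            for _ in range(100):                  # artificially expensive
                h = (h * 31 + self.value) % 1000003
            return h
        def __eq__(self, other):
            return self.value == other.value

    n = 10000
    ints = list(range(n))
    slow = [SlowKey(v) for v in range(n)]

    t_int = timeit.timeit(lambda: set(ints) - set(ints[:n // 2]), number=5)
    t_slow = timeit.timeit(lambda: set(slow) - set(slow[:n // 2]), number=5)
    print("ints: %.4f s   slow keys: %.4f s" % (t_int, t_slow))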

    >
    >> I would never want to say
    >> "it looks O(log N) to me", as you do, and leave it at that. Rather, I
    >> might say, as you do, "it looks O(log N) to me", *but* then try to
    >> figure out, given my knowledge of the implementation (performance wise,
    >> based on information that is sadly missing in the Python documentation),
    >> *why* that might be.

    >
    > Fine. You have the source code, knock yourself out.


    That's just what I do *not* consider a particularly
    reasonable approach. Instead, I propose stating upfront
    some information (*easily* available to the implementer)
    about the asymptotic behavior of the operations
    that the module interface exposes.

    >> Then, if my experiments say "it looks like O(log
    >> N)" AND if my basic knowledge of the implementation of set and list
    >> primitives says "it should be O(log N)" as well, I would venture, with
    >> some *confidence*, to claim: "it actually IS O(log N)"....
    >>
    >> You do not compare the convert-it-to-sets-approach
    >> to the single list-walk either.

    >
    > No I did not, because I didn't have a function to do it.


    Here we see one of the problems of a purely
    experimentalist approach to computational complexity:
    you need an implementation (of the algorithm and the
    test harness) *before* you can get your wonderfully
    decisive experimental data.
    This is why we would like to have a way of (roughly)
    estimating the reasonableness of the outlines of a
    program's design in "armchair fashion" - i.e. without
    having to write any code and/or test harness.

    > You've got my
    > source code. Knock yourself out to use it to test any function you like.
    >
    >> Supposing the OP had actually sorted
    >> lists to begin with, then a single, simultaneous walk of the lists would
    >> be about as fast as it can get. Very, very likely *faster* than
    >> conversion to sets would be...

    >
    > Please let us know how you go with that. It should be really interesting
    > to see how well your prediction copes with the real world.
    >
    > (Hint: another of those awkward little implementation details... how much
    > work is being done in C code, and how much in pure Python? Just something
    > for you to think about. And remember, an O(N) algorithm in Python will be
    > faster than an O(N**2) algorithm in C... or is that slower?)


    This discussion begins to sound like the recurring
    arguments one hears between theoretical and
    experimental physicists. Experimentalists tend
    to overrate the importance of experimental data
    (setting up a useful experiment, how to interpret
    the experimental data one then gathers, and whether
    one stands any chance of detecting systematic errors
    of measurement, all depend on having a good *theory*
    in the first place). Theoreticians, on the other hand,
    tend to overrate the importance of the coherence of
    theories. In truth, *both* are needed: good theories
    *and* carefully collected experimental data.

    Regards,
    Christian
    --
    »When asked how he would have reacted if Eddington's
    *measurements* had come out differently, Einstein
    replied: "Then I would have been sorry for him
    - the *theory* is correct."«
    - Paul B. Armstrong: 'Conflicting Readings'
    Christian Stapfer, Oct 16, 2005
    #18
  19. Odd-R.

    Ron Adam Guest

    Christian Stapfer wrote:

    > This discussion begins to sound like the recurring
    > arguments one hears between theoretical and
    > experimental physicists. Experimentalists tend
    > to overrate the importance of experimental data
    > (setting up a useful experiment, how to interpret
    > the experimental data one then gathers, and whether
    > one stands any chance of detecting systematic errors
    > of measurement, all depend on having a good *theory*
    > in the first place). Theoreticians, on the other hand,
    > tend to overrate the importance of the coherence of
    > theories. In truth, *both* are needed: good theories
    > *and* carefully collected experimental data.
    >
    > Regards,
    > Christian


    An interesting parallel can be made concerning management of production
    vs. management of creativity.

    In general, production needs checks and feedback to ensure quality, but
    will often come to a standstill if incomplete resources are available.

    Whereas creativity needs checks to ensure production, but in many cases
    can still be productive even with incomplete or questionable resources.
    The quality may vary quite a bit in both directions, but in creative
    tasks, that is to be expected.

    In many ways programmers are a mixture of these two. I think Steven and
    I use a style that is closer to the creative approach. I get the
    feeling your background may be closer to the production style.

    Both are good and needed for different types of tasks. And I think most
    programmers can switch styles to some degree if they need to.

    Cheers,
    Ron
    Ron Adam, Oct 16, 2005
    #19
  20. "Ron Adam" <> wrote in message
    news:cTp4f.16180$...
    > Christian Stapfer wrote:
    >
    >> This discussion begins to sound like the recurring
    >> arguments one hears between theoretical and
    >> experimental physicists. Experimentalists tend
    >> to overrate the importance of experimental data
    >> (setting up a useful experiment, how to interpret
    >> the experimental data one then gathers, and whether
    >> one stands any chance of detecting systematic errors
    >> of measurement, all depend on having a good *theory*
    >> in the first place). Theoreticians, on the other hand,
    >> tend to overrate the importance of the coherence of
    >> theories. In truth, *both* are needed: good theories
    >> *and* carefully collected experimental data.
    >>
    >> Regards,
    >> Christian

    >
    > An interesting parallel can be made concerning management of production vs.
    > management of creativity.
    >
    > In general, production needs checks and feedback to ensure quality, but
    > will often come to a standstill if incomplete resources are available.
    >
    > Whereas creativity needs checks to ensure production, but in many cases
    > can still be productive even with incomplete or questionable resources.
    > The quality may vary quite a bit in both directions, but in creative
    > tasks, that is to be expected.
    >
    > In many ways programmers are a mixture of these two. I think Steven and I
    > use a style that is closer to the creative approach. I get the feeling
    > your background may be closer to the production style.


    This diagnosis reminds me of C.G. Jung, the psychologist,
    who, after having introduced the concepts of extra- and
    introversion, came to the conclusion that Freud was
    an extravert whereas Adler was an introvert. The point is
    that he got it exactly wrong...

    As to the value of complexity theory for creativity
    in programming (even though you seem to believe that
    a theoretical bent of mind can only serve to stifle
    creativity), the story of the discovery of an efficient
    string-searching algorithm by D. E. Knuth provides an
    interesting case in point. Knuth based himself on
    seemingly quite "uncreatively theoretical work" (from
    *your* point of view) that gave a *better* bound for
    the computational complexity of string searching
    than any of the then-known algorithms could provide.

    Regards,
    Christian
    --
    »It is no paradox to say that in our most theoretical
    moods we may be nearest to our most practical applications.«
    - Alfred North Whitehead

    [and those "practical applications" will likely be most
    "creative" ones..]
    Christian Stapfer, Oct 16, 2005
    #20