Re: "Newbie" questions - "unique" sorting ?

Discussion in 'Python' started by Kim Petersen, Jun 24, 2003.

  1. Kim Petersen

    Kim Petersen Guest

    Cousin Stanley wrote:
    > John ....
    >
    > { 1. Good News | 2. Bad News | 3. Good News } ....
    >
    > 1. Good News ....
    >
    > The last version of word_list.py that I up-loaded
    > works as expected with your input file producing
    > an indexed word list with no duplicates ...
    >
    >
    > 2. Bad News ....
    >
    > It was S L O W E R than the proverbial turtle
    > in the tar pit ...
    >

    How about the simple approach?

    total=0
    unique={}
    for line in open(infile,"rb").xreadlines():
    for word in line.strip().split():
    unique[word]=unique.get(word,0)+1
    total+=1
    uwords=unique.keys()
    uwords.sort()
    open(outfile,"wb").write('\n'.join(uwords))

    --
    Med Venlig Hilsen / Regards

    Kim Petersen - Kyborg A/S (Udvikling)
    IT - Innovationshuset
    Havneparken 2
    7100 Vejle
    Tlf. +4576408183 || Fax. +4576408188
     
    Kim Petersen, Jun 24, 2003
    #1
    1. Advertising

  2. | How about the simple approach?
    | ...

    Kim ...

    The approach I used is fairly simple and similar
    to the one you posted, basically just
    stuffing words from split lines into
    a dictionary ...

    The following script produces an indexed list
    fairly quickly for relatively small files,
    but dogged out on the input file John supplied
    which yielded ...

    Total Words .... 467381
    Unique Words .... 47122

    Perhaps skipping the dictionary word count update
    in the following line might speed things up ...

    else :

    dict_words[ this_word ] += 1

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona

    -------------------------------------------------------------------

    '''

    Module ........... word_list.py

    Usage ............ python word_list.py File_In.txt File_Out.txt

    NewsGroup ........ comp.lang.python

    Date ............. 2003-06-18

    Posted_By ........ John Fitzsimmons

    Replies_From ..... [ kpop , Erik Max Francis ]

    Coded_By ......... Stanley C. Kitching

    '''

    import math
    import sys
    import time

    time_in = time.time()

    NL = '\n'

    module_name = sys.argv[ 0 ]

    print '%s %s ' % ( NL , module_name )

    path_in = sys.argv[ 1 ]
    path_out = sys.argv[ 2 ]

    file_in = file( path_in , 'r' )
    file_out = file( path_out , 'w' )

    word_total = 0


    dict_words = {}

    print
    print ' Indexing Words .... ' ,

    for iLine in file_in :

    if math.fmod( word_total , 1000 ) == 0 :

    print '.' ,

    list_words = iLine.strip().split()

    for this_word in list_words :

    if this_word not in dict_words.keys() :

    dict_words[ this_word ] = 1

    else :

    dict_words[ this_word ] += 1

    word_total += 1


    list_words = dict_words.keys()

    list_words.sort( lambda x , y : cmp( x.lower() , y.lower() ) )


    print NL
    print ' Writing Output File ....' ,

    for this_word in list_words :

    word_count = dict_words[ this_word ]

    str_out = '%6d %s %s' % ( word_count , this_word , NL )

    file_out.write( str_out )


    word_str = '%s Total Words .... %d %s' % ( NL , word_total , NL )

    keys_total = len( dict_words.keys() )

    keys_str = '%s Unique Words .... %d %s' % ( NL , keys_total , NL )


    file_out.write( word_str )

    file_out.write( keys_str )

    print NL
    print ' Complete .................'
    print
    print ' Total Words ....' , word_total
    print
    print ' Unique Words ....' , keys_total


    file_in.close()

    file_out.close()


    time_out = time.time()

    time_diff = time_out - time_in

    print NL
    print ' Process Time ........ %-6.2f Seconds' % ( time_diff )
     
    Cousin Stanley, Jun 24, 2003
    #2
    1. Advertising

  3. Kim Petersen

    Bryan Guest

    > totally skipping the word count should speed it up - and i believe that
    > the approach of dict_words[this_word]=dict_words.get(this_word,0) btw.
    > is a bit faster than doing has_key() or the least of doing:
    >
    > if this_word not in dict_words.keys() :
    >
    > which aught to be extremely slow on a large dictionary (creating and
    > dropping lists of thousands + doing a O(n) search over it). And that may
    > very well be the culprit of the slow run you see....
    >
    > >
    > > else :
    > >
    > > dict_words[ this_word ] += 1
    > >

    >
    >



    this is how i check if a key is in a dictionary:

    if this_word not in dict_words:

    i believe this doesn't do an O(n) search over it. it's supposed to constant
    time for every lookup by way of a hash lookup.

    bryan
     
    Bryan, Jun 25, 2003
    #3
  4. Kim Petersen

    Kim Petersen Guest

    Bryan wrote:
    >>totally skipping the word count should speed it up - and i believe that
    >>the approach of dict_words[this_word]=dict_words.get(this_word,0) btw.
    >>is a bit faster than doing has_key() or the least of doing:
    >>
    >> if this_word not in dict_words.keys() :
    >>
    >>which aught to be extremely slow on a large dictionary (creating and
    >>dropping lists of thousands + doing a O(n) search over it). And that may
    >>very well be the culprit of the slow run you see....


    > this is how i check if a key is in a dictionary:
    >
    > if this_word not in dict_words:
    >
    > i believe this doesn't do an O(n) search over it. it's supposed to constant
    > time for every lookup by way of a hash lookup.


    Which is probably correct since that would be equivalent to:
    not dict_words.has_key(this_word)

    But if you look at the above code mentioned - it doesn't do that - it
    calls .keys() [generating a list] and then looks if the string is in it...

    >
    > bryan
    >
    >


    --
    Med Venlig Hilsen / Regards

    Kim Petersen - Kyborg A/S (Udvikling)
    IT - Innovationshuset
    Havneparken 2
    7100 Vejle
    Tlf. +4576408183 || Fax. +4576408188
     
    Kim Petersen, Jun 25, 2003
    #4
  5. | Which is probably correct since that would be equivalent to:
    | not dict_words.has_key(this_word)
    |
    | But if you look at the above code mentioned - it doesn't do that -
    |
    | it calls .keys() [generating a list]
    | and then looks if the string is in it...

    Kim ...

    I changed one line in the script ....

    from .... if this_word not in dict_words.keys() :

    to ...... if not dict_words.has_key( this_word ) :

    The results are a little MORE than satisfactory,
    running in 48 seconds instead of 7 hours !!!!

    Complete .................

    Total Words .... 467381

    Unique Words .... 47122

    Process Time ........ 48.11 Seconds


    I failed to consider here
    that a NEW keys list might be created
    on each pass through a loop ....

    Thanks VERY MUCH for the suggestion ....

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona
     
    Cousin Stanley, Jun 25, 2003
    #5
  6. Kim Petersen

    Bryan Guest

    i just did a test comparing:

    if key not in mydict:

    and

    if not mydict.has_key(key):

    and the 1st way cameout 28% faster. (on python 2.3)

    bryan

    "Erik Max Francis" <> wrote in message
    news:...
    > Cousin Stanley wrote:
    >
    > > I changed one line in the script ....
    > >
    > > from .... if this_word not in dict_words.keys() :
    > >
    > > to ...... if not dict_words.has_key( this_word ) :

    > ...
    > > I failed to consider here
    > > that a NEW keys list might be created
    > > on each pass through a loop ....

    >
    > Not only does it create a new keys list (D.keys()), but it searches
    > linearly through it (x in L)! Those are two separate O(n) operations,
    > whereas D.has_key is effectively O(1).
    >
    > --
    > Erik Max Francis && && http://www.alcyone.com/max/
    > __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    > / \ The multitude of books is making us ignorant.
    > \__/ Voltaire
     
    Bryan, Jun 26, 2003
    #6
  7. | Not only does it create a new keys list (D.keys()),
    | but it searches linearly through it (x in L)!
    |
    | Those are two separate O(n) operations,
    | whereas D.has_key is effectively O(1).

    Erik ...

    Incurring the 7 hour wrath of 2 * O( n ) in this test
    was certainly informative for me, but not much fun ....

    I could have interrupted the process,
    but knowing it ran to completion for smaller inputs,
    I was curious as to the eventual outcome ....

    Since n increases as keys are added to the dictionary,
    2 * O( n ) for the current pass should also be a bit slower
    than for the previous one making it even slower and slower
    as more and more keys are added ....

    Thanks for the info ....

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona
     
    Cousin Stanley, Jun 26, 2003
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John Fitzsimons

    Re: "Newbie" questions - "unique" sorting ?

    John Fitzsimons, Jun 25, 2003, in forum: Python
    Replies:
    8
    Views:
    1,571
    John Fitzsimons
    Jul 2, 2003
  2. Ali Syed
    Replies:
    3
    Views:
    565
    Mark McIntyre
    Oct 13, 2004
  3. ToshiBoy
    Replies:
    6
    Views:
    858
    ToshiBoy
    Aug 12, 2008
  4. Replies:
    2
    Views:
    1,442
    James Kanze
    Jul 6, 2010
  5. Token Type
    Replies:
    9
    Views:
    365
    Chris Angelico
    Sep 9, 2012
Loading...

Share This Page