Re: "Newbie" questions - "unique" sorting ?

Discussion in 'Python' started by John Fitzsimons, Jun 25, 2003.

  1. On Mon, 23 Jun 2003 20:35:59 -0700, "Cousin Stanley"
    <> wrote:

    Hi Cousin Stanley,

    >{ 1. Good News | 2. Bad News | 3. Good News } ....


    > 1. Good News ....


    > The last version of word_list.py that I up-loaded
    > works as expected with your input file producing
    > an indexed word list with no duplicates ...


    < snip >

    > That's 6.56 HOURS and un-acceptable performance !!!!


    I agree. :) Very clever of you to have worked out how long it would
    take. I hope you didn't wait over 6 hours to find out !!!

    > word_list.py works quickly on smaller files,
    > but as coded, is an absolute dog for indexing
    > larger files ....


    Good. I was hoping it wasn't something that I had done wrong. :)

    > 3. Good News ....


    > Since I FINALLY figured out that you're mostly interested
    > in just the URLs and not a general word list,
    > I coded a pre-process script to extract just the URLs
    > from the original input file ....


    > python url_list.py JF_In.txt JF_URLs.txt


    Unless I missed something it handles lines starting
    with ftp or http, BUT not lines that start with www.
    Is that correct? Or did I give you a file with
    no lines starting with www?

    < snip >

    >Let me know if this output looks closer to what you are after ....


    Very very good......and fast. If I can work out what happened to the
    www lines, and fix it, then everything will be great. I then hope to
    try this exercise using a different method to see if the numbers come
    up the same.

    Thank you for such excellent programming. :)


    Regards, John.
     
    John Fitzsimons, Jun 25, 2003
    #1

  2. | ...
    | I hope you didn't wait over 6 hours to find out !!!
    |

    John ...

    Actually, I did wait ....

    Since I'd run the program successfully a number of times,
    but on smaller files, I wanted to know just how long
    it would take to run to completion ....

    The numbers in the output I posted
    were an actual copy/paste directly
    from the DOS window that it ran in ...

    | Unless I missed something it handles lines
    | starting with ftp or http, BUT not lines that start with www.
    |

    You didn't miss anything ....

    The version of url_list.py that you ran
    only looks for ....

    [ 'http://' ,
    'https://' ,
    'ftp://' ,
    'news://' ,
    'res://' ,
    'fido://' ]

    However, I added a bit of code to url_list.py
    to also extract lines starting with www. ...
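
    In case you want to see the idea without
    downloading, the prefix test might look
    something like the following ....

    ( a rough sketch only -- the real url_list.py
    may differ, and the names here are guesses )

    import sys

    list_prefixes = [ 'http://' , 'https://' ,
                      'ftp://' , 'news://' ,
                      'res://' , 'fido://' , 'www.' ]

    file_in = file( sys.argv[ 1 ] , 'r' )

    file_out = file( sys.argv[ 2 ] , 'w' )

    for this_line in file_in :

        this_line = this_line.strip()

        for this_prefix in list_prefixes :

            if this_line.startswith( this_prefix ) :

                file_out.write( this_line + '\n' )

                break

    file_in.close()

    file_out.close()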

    Download newest versions ....

    http://fastq.com/~sckitching/Python/word_list.zip

    Run as before ....

    python url_list.py JF_In.txt JF_URLs.txt

    python word_list.py JF_URLs.txt JF_URLs_Index.txt

    | Thank you for such excellent programming.

    You're welcome ....

    Thanks also to ....

    Erik Max Francis for suggesting
    the lambda sort for Mixed-Case sorting ....

    Kim Petersen for suggesting usage of ....

    dict_words.has_key( this_word ) instead of

    this_word in dict_words.keys()

    which made an incredible difference in processing time ....
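
    Roughly, the two ideas look like this ....

    ( just sketches, not the exact code )

    # Erik's idea -- Mixed-Case sorting via
    # a case-insensitive comparison function ....

    list_words.sort( lambda a , b : cmp( a.lower() , b.lower() ) )

    # Kim's idea -- the slow test builds and then
    # scans a full list of keys for every word ....

    if this_word in dict_words.keys() :

        dict_words[ this_word ] += 1

    # .... while has_key() is a single
    # constant-time hash look-up ....

    if dict_words.has_key( this_word ) :

        dict_words[ this_word ] += 1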

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona
     
    Cousin Stanley, Jun 25, 2003
    #2

  3. "Cousin Stanley" <> wrote:
    >| Thank you for such excellent programming.
    >
    >You're welcome ....
    >
    >Thanks also to ....
    >
    > Erik Max Francis for suggesting
    > the lambda sort for Mixed-Case sorting ....
    >
    > Kim Petersen for suggesting usage of ....
    >
    > dict_words.has_key( this_word ) instead of
    >
    > this_word in dict_words.keys()
    >
    > which made an incredible difference in processing time ....


    I've been playing a little with the script and managed to double the
    speed by using a trick that was posted here some time ago by someone
    called "Lulu". The trick was originally from someone else, but I lost
    the attribution somewhere. This bumps Erik's idea from the list, I'm
    afraid, because it translates all letters into lowercase and the rest
    into spaces. This speeds up both sorting and splitting.

    It's probably possible to shave off a few percent more, but I think
    doubling the speed once again would cost four times the programmer
    effort, or maybe twice as much money for computer equipment.

    It's all in the script below; I hope I didn't introduce any new
    errors. By the way, I don't like an empty line for every other line,
    as in your script, and using "\n" is easier than what you did. Other
    than that, nice job!

    Anton


    import sys
    import time

    time_in = time.time()
    module_name = sys.argv[ 0 ]
    print '\n %s ' % (module_name )
    #to get the file below:
    #http://sailor.gutenberg.org/etext97/1donq10.zip
    path_in = '1donq10.txt'
    path_out = 'words.out'
    file_in = file( path_in , 'r' )
    file_out = file( path_out , 'w' )
    word_total = 0
    dict_words = {}
    #start lulu magic
    i_r = map(chr, range(256))
    trans = [' '] * 256
    o_a, o_z = ord('a'), (ord('z')+1)
    trans[ord('A'):(ord('Z')+1)] = i_r[o_a:o_z]
    trans[o_a:o_z] = i_r[o_a:o_z]
    trans = ''.join(trans)
    #end lulu magic
    print
    print ' Indexing Words ....\n ' ,
    for iLine in file_in :
        if (word_total+1) % 10000 == 0 :
            sys.stdout.write('.')
        #use lulu magic in the line below here:
        list_words = iLine.translate(trans).split()
        for this_word in list_words :
            if not dict_words.has_key(this_word) :
                dict_words[this_word] = 1
            else :
                dict_words[this_word] += 1
            word_total += 1
    list_words = dict_words.keys()
    #lulu magic turned all words into lowercase,
    #so standard sort is possible:
    list_words.sort()
    print '\n\n Writing Output File ....' ,
    for this_word in list_words :
        word_count = dict_words[this_word]
        str_out = '%6d %s\n' % (word_count ,this_word)
        file_out.write(str_out)
    word_str = '\n Total Words .... %d\n' % (word_total)
    keys_total = len(dict_words.keys())
    keys_str = '\n Unique Words .... %d\n' % (keys_total)
    file_out.write(word_str)
    file_out.write(keys_str)
    print '\n Complete .................\n'
    print ' Total Words ....' , word_total
    print
    print ' Unique Words ....' , keys_total
    file_in.close()
    file_out.close()
    time_out = time.time()
    time_diff = time_out - time_in
    print '\n Process Time ........ %-6.2f Seconds' % (time_diff)
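
    To see what the "lulu" table does in isolation,
    here is a quick interactive check, assuming trans
    was built as in the script above (the sample
    string is made up):

    >>> 'Don Quixote, by Miguel de CERVANTES'.translate(trans)
    'don quixote  by miguel de cervantes'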
     
    Anton Vredegoor, Jun 26, 2003
    #3
  4. Anton ....

    Thanks for the feedback
    and the __LuLu_Magic__ coding sample ....

    I'll have to ponder the __LuLu_Magic__ a bit
    to try and understand it ...

    The reasons that I use NL instead of '\n'
    are that ...

    o it seems easier for me to type
    since it's 2 fewer characters

    o string and print statements
    seem to read a bit easier for me

    o only need to change 1 line of code
    if a different End-of-Line character sequence
    is needed ( see the sketch below )
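
    For example, the convention amounts to
    a single assignment ( just a sketch ) ....

    NL = '\n'

    # NL = '\r\n' # if a different EOL sequence were ever needed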

    The reasons that I use vertically double-spaced code
    with a lot of horizontal white-space are that my eyes
    are old and tired and my feeble brain is just able
    to parse it more easily ....

    A familiar example is from several months ago
    when you posted a link to your screensaver.py module ....

    Edited version of screensaver.py ....

    http://fastq.com/~sckitching/Python/scr_av.py

    I'm sure you will hate it,
    but it's much easier for me to read ....

    Thanks for making your screen saver available
    and thanks again for your comments and suggestions
    regarding the word_list script ...

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona
     
    Cousin Stanley, Jun 27, 2003
    #4
  5. "Cousin Stanley" <> wrote:

    > Edited version of screensaver.py ....
    >
    > http://fastq.com/~sckitching/Python/scr_av.py
    >
    >I'm sure you will hate it,
    >but it's much easier for me to read ....


    On the contrary, I am very glad someone reads my code and makes
    changes to it, for better or worse! The more "eyeball" inspection
    code gets, the more chances it has of evolving into something better,
    even if sometimes newer versions of the code are worse than earlier
    versions. It works like a genetic algorithm improving one's code
    snippets :)

    The other thing is that while using Python it seems to be common to
    read one's previous code and discover that code from only a few
    months ago would be done very differently now. For example, my
    screensaver module imports a "sequencer.py" file that could now be
    rewritten in probably a fourth of the number of lines, because
    someone here on c.l.py made a comment on a newer version of it that
    was already half the length of "sequencer.py". Also, the
    "Transformer" class in the screensaver needlessly recomputes a lot
    of things at every call that could be done during initialization;
    later versions of this class do this better.

    My personal observation is that *everything* I write in Python is a
    candidate for improvement in only a few months' time because of my
    changing perspectives on the matter. For another perspective on the
    "lulu" code, for example, try this (it needs string imported):

    import string

    trans = [string.lower(chr(i)) for i in range(256)]
    for i in range(256):
        if not trans[i] in string.letters: trans[i] = ' '
    trans = ''.join(trans)

    I think this is both a line or so shorter than the original code and
    is probably also a bit clearer. Because IMO everything written in
    Python is improved sooner or later -according to how many people look
    at it- I expect *this* code fragment to be updated once again soon.

    This peculiar aspect of Python (and probably other high level
    languages) is probably caused by the fact that Python code comes
    closer to one's thoughts than other code, and -at least for me-
    thoughts are the most volatile elements in the world.

    So better get used to it (if your experience is anything like mine of
    course) and don't let yourself be distracted by the code-reusers,
    unit-testers, and static typers that are still trying to get a grip on
    this elusive aspect of Python coding.

    Anton
     
    Anton Vredegoor, Jun 28, 2003
    #5
  6. Anton Vredegoor wrote:

    > My personal observation is that *everything* I write in Python is a
    > candidate for improvement in only a few months' time because of my
    > changing perspectives on the matter. For another perspective on the
    > "lulu" code, for example, try this:
    >
    > trans = [string.lower(chr(i)) for i in range(256)]
    > for i in range(256):
    >     if not trans[i] in string.letters: trans[i] = ' '
    > trans = ''.join(trans)
    >
    > I think this is both a line or so shorter than the original code and
    > is probably also a bit clearer.


    shorter, but not necessarily clearer (code that contains "".join is
    never clear, in my experience, and the and/or trick doesn't make
    things better):

    trans = "".join([chr(x).isalpha() and chr(x).lower() or " " for x in range(256)])

    but on the other hand, this gives you room for two lines of
    comments explaining the intent of this piece of code.

    you can trade performance for a source code character or two:

    trans = "".join([(" ", chr(x).lower())[chr(x).isalpha()] for x in range(256)])

    fwiw, I'd probably spell it all out, to make it all obvious:

    # map letters to lowercase, and everything else to spaces
    # (needs "import string" for the join at the end)
    import string

    trans = range(256)
    for i in trans:
        ch = chr(i)
        if ch.isalpha():
            trans[i] = ch.lower()
        else:
            trans[i] = " "
    trans = string.join(trans, "")

    but that's probably because I have a two-dimensional brain and a
    working return key ;-)

    </F>
     
    Fredrik Lundh, Jun 28, 2003
    #6
  7. On Wed, 25 Jun 2003 09:17:56 -0700, "Cousin Stanley"
    <> wrote:

    < snip >

    >The version of url_list.py that you ran
    >only looks for ....


    > [ 'http://' ,
    > 'https://' ,
    > 'ftp://' ,
    > 'news://' ,
    > 'res://' ,
    > 'fido://' ]


    >However, I added a bit of code to url_list.py
    >to also extract lines starting with www. ...


    >Download newest versions ....
    >
    > http://fastq.com/~sckitching/Python/word_list.zip


    >Run as before ....
    >
    > python url_list.py JF_In.txt JF_URLs.txt
    >
    > python word_list.py JF_URLs.txt JF_URLs_Index.txt


    Yes, that works now.

    >| Thank you for such excellent programming.


    >You're welcome ....


    >Thanks also to ....


    > Erik Max Francis for suggesting
    > the lambda sort for Mixed-Case sorting ....


    > Kim Petersen for suggesting usage of ....


    > dict_words.has_key( this_word ) instead of
    >
    > this_word in dict_words.keys()


    > which made an incredible difference in processing time ....


    Yes, the whole process is very good now.

    Thanks to everyone who helped with this. It was very much
    appreciated. :)

    By the way, Cousin Stanley, I assume the numbers in the final result
    were how many times that string appeared in the input file? A very
    interesting number to have.

    When I get time though I might "rem" (is that # ?) out that/those
    line/lines so that I have a second URL sorting python executable that
    doesn't include the numbers. Assuming I can work out which one(s) to
    disable ! :)


    Regards, John.
     
    John Fitzsimons, Jul 1, 2003
    #7
  8. | ...
    | By the way, Cousin Stanley, I assume the numbers
    | in the final result were how many times that string
    | appeared in the input file?
    |
    | A very interesting number to have.
    |
    | When I get time though I might "rem" (is that # ?) out
    | that/those line/lines so that I have a second URL sorting
    | python executable that doesn't include the numbers.
    |
    | Assuming I can work out which one(s) to disable !

    John ...

    Getting rid of the word_count
    isn't too bad ....

    Find the following code block in word_list.py ....

    for this_word in list_words :

        word_count = dict_words[ this_word ]

        str_out = '%6d %s %s' % ( word_count , this_word , NL )

        file_out.write( str_out )


    Change the above 4 lines to the following 2 ....

    for this_word in list_words :

        file_out.write( this_word + NL )

    Save the changed file to word_list2.py ....


    As an alternative, since the words you're interested in
    are actually URLs in this application, you might want
    to generate clickable HTML links ....

    for this_word in list_words :

        file_out.write( '<a href="' + this_word + '">' + NL )

        file_out.write( this_word + NL )

        file_out.write( '</a>' + NL + NL )

    Save the changed file to word_list3.py ....

    I haven't tested either of the above changes,
    but this should provide some ideas ....
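
    The HTML version should then write something
    along these lines for each URL ....

    ( again untested, just sketched by hand )

    <a href="http://fastq.com/~sckitching/Python/word_list.zip">
    http://fastq.com/~sckitching/Python/word_list.zip
    </a>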

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona
     
    Cousin Stanley, Jul 1, 2003
    #8
  9. On Mon, 30 Jun 2003 22:17:43 -0700, "Cousin Stanley"
    <> wrote:

    < snip >

    >| When I get time though I might "rem" (is that # ?) out
    >| that/those line/lines so that I have a second URL sorting
    >| python executable that doesn't include the numbers.


    >Getting rid of the word_count
    >isn't too bad ....


    < snip >

    Thanks for the additional info. :)

    Regards, John.
     
    John Fitzsimons, Jul 2, 2003
    #9