Re: Newbie with sort text file question

Discussion in 'Python' started by Bob Gailer, Jul 12, 2003.

  1. Bob Gailer

    Bob Gailer Guest

    At 12:46 PM 7/12/2003 -0700, stuartc wrote:

    >Hi:
    >
    >I'm not a total newbie, but I'm pretty green. I need to sort a text
    >file and then get a total for the number of occurrences for a part of
    >the string. Hopefully, this will explain it better:
    >
    >Here's the text file:
    >
    >banana_c \\yellow
    >apple_a \\green
    >orange_b \\yellow
    >banana_d \\green
    >orange_a \\orange
    >apple_w \\yellow
    >banana_e \\green
    >orange_x \\yellow
    >orange_y \\orange
    >
    >I would like two output files:
    >
    >1) Sorted like this, by the fruit name (the name before the underscore)
    >
    >apple_a \\green
    >apple_w \\yellow
    >banana_c \\yellow
    >banana_d \\green
    >banana_e \\green
    >orange_a \\orange
    >orange_b \\yellow
    >orange_x \\yellow
    >orange_y \\orange
    >
    >2) Then summarized like this, ordered with the highest occurrences
    >first:
    >
    >orange occurs 4
    >banana occurs 3
    >apple occurs 2
    >
    >Total occurrences is 9


    I am developing a Python version of IBM's CMS Pipelines, which is designed
    for this kind of task. If you'd like to be an early recipient (read beta
    tester) of this product, let me know.

    You would invoke this program:
    Pipe("""
    < c:\input.txt
    | split /_/
    | nlocate -//-
    | pad 10
    | sort count
    | spec 11-* 1 / occurs / 11 1-10 19
    | > c:\output1.txt
    | count
    | spec /Total occurrences is / 1 1-* 21
    | > c:\output2.txt""")

    Explanation:
    | == separates each stage of the pipe
    < == read records from file
    split == split each record into 2 records at first _
    nlocate == select records that do not contain //
    pad 10 == ensure each record has 10 characters (or whatever the longest
    fruit name is)
    sort count == sort; group by unique key and prepend count
    spec ... == select cols 11-end of input, append literal, append cols
    1-10
    > == write records to file

    spec ... == start with literal, append rest of record
    > == write records to file
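
    For readers without Pipe at hand, the first few stages explained above
    can be approximated in plain Python with generators. This is only a
    sketch of the idea, not how Pipe itself works: the function names
    (split_stage, nlocate, sort_count) and the four sample records are
    made up for illustration, and the marker that nlocate filters on is
    passed in as an argument (here the two backslashes seen in the data).

```python
from collections import Counter

def split_stage(records, sep="_"):
    """Split each record into two records at the first separator."""
    for record in records:
        for part in record.split(sep, 1):
            yield part

def nlocate(records, needle):
    """Pass through only the records that do NOT contain needle."""
    for record in records:
        if needle not in record:
            yield record

def sort_count(records):
    """Group identical records; yield (count, key) pairs sorted by key."""
    counts = Counter(records)
    for key in sorted(counts):
        yield counts[key], key

# Hypothetical sample input, same shape as the file in the question.
sample = [
    r"banana_c \\yellow",
    r"apple_a \\green",
    r"orange_b \\yellow",
    r"banana_d \\green",
]

# Chain the stages the way the "|" characters chain them in the spec.
result = list(sort_count(nlocate(split_stage(sample), r"\\")))
for n, fruit in result:
    print("%s occurs %s" % (fruit, n))
```

    Each stage only sees the records handed to it by the previous one,
    which is the property that makes the spec readable top to bottom.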


    Or it can be run as a DOS Command:
    C>python pipe.py spec.txt
    where spec.txt contains the pipe specification

    An enhancement to the IBM Pipeline specification for SPLIT will be to route
    the 2nd part of each record to the secondary output, effectively discarding
    it in this example, and eliminating the need for the NLOCATE stage.

    This particular task can also be done fairly easily in Python. The appeal
    of Pipe is that you focus on the specification rather than writing Python
    code that is specific to the task. This shortens development time, and
    enhances readability and maintainability.

    The Python version:

    input = file('c:\input.txt')
    fruits = {} # a dictionary to hold each fruit and its count
    lines = input.readlines()
    for line in lines:
        fruit = line.split('_', 1)[0]
        if fruit in fruits:
            fruits[fruit] += 1 # increment count
        else:
            fruits[fruit] = 1 # add to dictionary with count of 1
    output1 = file('c:\output1.txt', 'w')
    for key, value in fruits.items():
        output1.write("%s occurs %s\n" % (key, value))
    output1.close()
    output2 = file('c:\output2.txt', 'w')
    output2.write("Total occurrences is %s\n" % len(lines))
    output2.close()

    Bob Gailer

    303 442 2625


    ---
    Outgoing mail is certified Virus Free.
    Checked by AVG anti-virus system (http://www.grisoft.com).
    Version: 6.0.500 / Virus Database: 298 - Release Date: 7/10/2003
     
    Bob Gailer, Jul 12, 2003
    #1

  2. Andrew Dalke

    Andrew Dalke Guest

    Bob Gailer:
    > [Pipeline]


    Huh. Hadn't heard of that one before. Thanks for the pointer.
    (And overall, nice post!)

    > The Python version:


    Some stylistic comments

    > input = file('c:\input.txt')


    Since 'input' is a builtin, I use 'infile'. That's only a preference of
    mine.
    For the OP, you'll need 'c:\\input.txt' because the '\' has special meaning
    inside of a string so must be escaped.

    > fruits = {} # a dictionary to hold each fruit and its count
    > lines = input.readlines()
    > for line in lines:


    Since you are using Python 2.2 (later you use "if fruit in fruits",
    and "in" support for dicts, via __contains__, wasn't added until
    Python 2.2, I think, and the 'file' usage is also new), this is best
    written as

    for line in input:

    > fruit = line.split('_', 1)[0]


    > if fruit in fruits:
    >     fruits[fruit] += 1 # increment count
    > else:
    >     fruits[fruit] = 1 # add to dictionary with count of 1


    Here's a handy idiom for what you want

    fruits[fruit] = fruits.get(fruit, 0) + 1
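
    As a small illustration of that idiom (the sample list is made up for
    the example): dict.get returns its second argument when the key is
    missing, so the first sighting of a fruit starts counting from 0.

```python
# Hypothetical sample data: fruit names already split off from the records.
names = ["banana", "apple", "orange", "banana", "orange", "orange"]

fruits = {}
for fruit in names:
    # get() returns 0 for an unseen key instead of raising KeyError
    fruits[fruit] = fruits.get(fruit, 0) + 1

print(fruits)
```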

    > output1 = file('c:\output1.txt', 'w')
    > for key, value in fruits.items():
    >     output1.write("%s occurs %s\n" % (key, value))
    > output1.close()
    > output2 = file('c:\output2.txt', 'w')
    > output2.write("Total occurrences is %s\n" % len(lines))
    > output2.close()


    That's missing some sorts, so I don't think it meets the OP's
    requirements.

    How about this?

    infile = open("input.txt")
    lines = []
    counts = {}
    for line in infile:
        lines.append(line)
        fruit = line.split("_", 1)[0]
        # Default to 0 so the first occurrence of a fruit does not fail
        counts[fruit] = counts.get(fruit, 0) + 1
    infile.close()

    # Sort by name. Since "_" sorts after any uppercase letter (though
    # before any lowercase one), "PLUM_" would be placed *after*
    # "PLUMBAGO_", which is probably not what you want. Left as an
    # exercise :)
    lines.sort()
    outfile = open("output1.txt", "w")
    for line in lines:
        outfile.write(line)
    outfile.close()

    # Print counts from highest count to lowest
    count_data = [(n, fruit) for (fruit, n) in counts.items()]
    count_data.sort()
    count_data.reverse()
    outfile = open("output2.txt", "w")
    total = 0
    for n, fruit in count_data:
        outfile.write("%s occurs %s\n" % (fruit, n))
        total += n
    outfile.write("\nTotal occurrences: %s\n" % total)
    outfile.close()
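
    On the sorting exercise: in ASCII the underscore (95) falls after the
    uppercase letters and before the lowercase ones, so a plain sort of
    whole lines can misplace a name that is a prefix of another when the
    names are upper case. One possible answer, as a sketch: sort on the
    pair (name, rest of line) obtained by splitting at the first
    underscore, so the underscore never takes part in the comparison. The
    helper name fruit_key and the sample lines are made up for
    illustration, and key= and str.partition need a Python newer than the
    2.2 discussed here.

```python
def fruit_key(line):
    """Return (name, remainder) so the name is compared on its own."""
    name, _, rest = line.partition("_")
    return (name, rest)

# Hypothetical upper-case sample where a plain sort goes wrong.
lines = ["PLUM_b\n", "PLUMBAGO_a\n", "APPLE_z\n", "PLUM_a\n"]

plain = sorted(lines)                  # compares "_" against "B"
by_name = sorted(lines, key=fruit_key)  # compares "PLUM" against "PLUMBAGO"

print(plain)
print(by_name)
```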

    Andrew
     
    Andrew Dalke, Jul 13, 2003
    #2

  3. Bob Gailer

    Bob Gailer Guest

    At 03:12 PM 7/13/2003 -0600, Andrew Dalke wrote:

    >Bob Gailer:
    > > [Pipeline]

    >
    >Huh. Hadn't heard of that one before. Thanks for the pointer.


    Since I am developing the Python version of Pipeline, I wonder if you have
    any interest in it. Would you like to be an early recipient?

    >(And overall, nice post!)
    >
    > > The Python version:

    >
    >Some stylistic comments
    >
    > > input = file('c:\input.txt')

    >
    >Since 'input' is a builtin, I use 'infile'.


    Agree. When I'm in a hurry I let details slip.

    >For the OP, you'll need 'c:\\input.txt' because the '\' has special meaning
    >inside of a string so must be escaped.


    Agree. When I'm in a hurry I let details slip.

    > > fruits = {} # a dictionary to hold each fruit and its count
    > > lines = input.readlines()
    > > for line in lines:

    >
    >Since you are using Python 2.2 (later you use "if fruit in fruits",
    >and "in" support for dicts, via __contains__, wasn't added until
    >Python 2.2, I think, and the 'file' usage is also new), this is best
    >written as
    >
    > for line in input:


    Agree.

    > > fruit = line.split('_', 1)[0]

    >
    > > if fruit in fruits:
    > >     fruits[fruit] += 1 # increment count
    > > else:
    > >     fruits[fruit] = 1 # add to dictionary with count of 1

    >
    >Here's a handy idiom for what you want
    >
    > fruits[fruit] = fruits.get(fruit, 0) + 1


    Don't you want setdefault() instead of get()?

    > > output1 = file('c:\output1.txt', 'w')
    > > for key, value in fruits.items():
    > >     output1.write("%s occurs %s\n" % (key, value))
    > > output1.close()
    > > output2 = file('c:\output2.txt', 'w')
    > > output2.write("Total occurrences is %s\n" % len(lines))
    > > output2.close()

    >
    >That's missing some sorts, so I don't think it meets the OP's requirements.


    The only reason for sort that I could see was to group things for counting.
    The output appears sorted descending, but that order was not specified, so
    I assumed random output.

    [snip]

    Bob Gailer

    303 442 2625


     
    Bob Gailer, Jul 14, 2003
    #3
  4. Bengt Richter

    Bengt Richter Guest

    On Mon, 14 Jul 2003 07:10:39 -0600, Bob Gailer wrote:
    [...]
    >>
    >>Here's a handy idiom for what you want
    >>
    >> fruits[fruit] = fruits.get(fruit, 0) + 1

    >
    >Don't you want setdefault() instead of get()?
    >

    In this case that would set the original value twice, once on either side of the '='.
    Setdefault is more useful when you are maintaining a mutable, such as a list of things
    that you append to, as the key's associated value. You could use a length-1 list here,
    (initialized by the default to hold the count starting value of 0), e.g.,

    fruits.setdefault(fruit,[0])[0]+=1

    and then later retrieve the actual count as

    fruits[fruit][0]
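
    To make the contrast concrete, a short sketch with made-up sample
    data: get() suits an immutable counter that is rebound on every pass,
    while setdefault() pays off when the value is a mutable container
    that is updated in place, such as a list of suffixes per fruit.

```python
# Hypothetical sample records in the "name_suffix" shape from the question.
records = ["banana_c", "apple_a", "banana_d", "orange_a"]

counts = {}
suffixes = {}
for record in records:
    fruit, suffix = record.split("_", 1)
    # Immutable value: read with a default, then rebind the key.
    counts[fruit] = counts.get(fruit, 0) + 1
    # Mutable value: setdefault stores the list once and returns it.
    suffixes.setdefault(fruit, []).append(suffix)

print(counts)
print(suffixes)
```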

    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 14, 2003
    #4