Newbie with sort text file question

Discussion in 'Python' started by stuartc, Jul 12, 2003.

  1. stuartc

    stuartc Guest

    Hi:

    I'm not a total newbie, but I'm pretty green. I need to sort a text
    file and then get a total for the number of occurrences of a part of
    each line. Hopefully, this will explain it better:

    Here's the text file:

    banana_c \\yellow
    apple_a \\green
    orange_b \\yellow
    banana_d \\green
    orange_a \\orange
    apple_w \\yellow
    banana_e \\green
    orange_x \\yellow
    orange_y \\orange

    I would like two output files:

    1) Sorted like this, by the fruit name (the name before the underscore):

    apple_a \\green
    apple_w \\yellow
    banana_c \\yellow
    banana_d \\green
    banana_e \\green
    orange_a \\orange
    orange_b \\yellow
    orange_x \\yellow
    orange_y \\orange

    2) Then summarized like this, ordered with the highest occurrences
    first:

    orange occurs 4
    banana occurs 3
    apple occurs 2

    Total occurances is 9

    Thanks for any help !
     
    stuartc, Jul 12, 2003
    #1

  2. Max M

    Max M Guest

    stuartc wrote:

    > Hi:
    >
    > I'm not a total newbie, but I'm pretty green. I need to sort a text
    > file and then get a total for the number of occurances for a part of
    > the string. Hopefully, this will explain it better:
    >
    > Here's the text file:
    >
    > banana_c \\yellow
    > apple_a \\green
    > orange_b \\yellow
    > banana_d \\green
    > orange_a \\orange
    > apple_w \\yellow
    > banana_e \\green
    > orange_x \\yellow
    > orange_y \\orange
    >
    > I would like two output files:
    >
    > 1) Sorted like this, by the fruit name (the name before the dash)
    > 2) Then summarized like this, ordered with the highest occurances
    > first:
    >
    > orange occurs 4
    > banana occurs 3
    > apple occurs 2
    >
    > Total occurances is 9



    fruity = r"""banana_c \\yellow
    apple_a \\green
    orange_b \\yellow
    banana_d \\green
    orange_a \\orange
    apple_w \\yellow
    banana_e \\green
    orange_x \\yellow
    orange_y \\orange"""

    # print sorted list
    fruits = fruity.split('\n')
    fruits.sort()
    print '\n'.join(fruits)
    print ''

    # count occurrences: the fruit name is the part before the '_'
    counter = {}
    for fruit in fruits:
        sort_of, appendix = fruit.split('_')
        counter[sort_of] = counter.get(sort_of, 0) + 1

    # sort by occurrences, highest first (decorate with the count, sort, reverse)
    decorated = [(counter[key], key) for key in counter.keys()]
    decorated.sort()
    decorated.reverse()

    # print result
    total = 0
    for count, sort_of in decorated:
        print sort_of, 'occurs', count
        total += count

    print ''
    print 'Total occurances is', total


    regards Max M
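
    For readers on Python 3, the same sort-then-count idea can be sketched
    with collections.Counter. This is only an illustration, not the code
    posted above: the names SAMPLE and sort_and_count are made up for this
    example, and the data is held in a string rather than read from a file.

```python
from collections import Counter

# Sample data from the question, kept in a raw string so the
# backslashes stay exactly as they appear in the text file.
SAMPLE = r"""banana_c \\yellow
apple_a \\green
orange_b \\yellow
banana_d \\green
orange_a \\orange
apple_w \\yellow
banana_e \\green
orange_x \\yellow
orange_y \\orange"""


def sort_and_count(text):
    """Return (sorted lines, [(fruit, count), ...]) with the highest count first.

    Hypothetical helper: assumes every non-blank line starts with the
    fruit name followed by an underscore.
    """
    lines = sorted(line for line in text.splitlines() if line.strip())
    # The fruit name is everything before the first underscore.
    counts = Counter(line.split('_', 1)[0] for line in lines)
    return lines, counts.most_common()


if __name__ == "__main__":
    lines, summary = sort_and_count(SAMPLE)
    print('\n'.join(lines))
    print()
    for fruit, count in summary:
        print(fruit, 'occurs', count)
    print()
    print('Total occurances is', sum(count for _, count in summary))
```

    Counter.most_common() returns the pairs ordered from the highest count
    down, which replaces the decorate, sort and reverse steps.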
     
    Max M, Jul 13, 2003
    #2

  3. Behrang Dadsetan

    stuartc wrote:
    > Hi:
    >
    > Here's the text file:
    >
    > banana_c \\yellow
    > apple_a \\green
    > orange_b \\yellow
    > banana_d \\green
    > orange_a \\orange
    > apple_w \\yellow
    > banana_e \\green
    > orange_x \\yellow
    > orange_y \\orange
    >
    > I would like two output files:
    >
    > 1) Sorted like this, by the fruit name (the name before the dash)
    >
    > 2) Then summarized like this, ordered with the highest occurances
    > first:

    Here is some mostly tested code ;)

    import re

    file = open("textfile.txt")  # your file name instead of textfile.txt
    alllines = file.readlines()
    file.close()

    alllines.sort()

    # the fruit name is the run of lowercase letters at the start of the line
    fruitre = re.compile('^[a-z]+')
    fruits = {}
    for line in alllines:
        fruitresult = fruitre.search(line)
        print line,  # trailing comma: the line already ends with a newline
        if fruitresult:
            fruit = fruitresult.group(0)
            fruits.setdefault(fruit, 0)
            fruits[fruit] += 1

    totalamount = 0
    for fruit, amount in fruits.items():
        print fruit, " occurs ", amount
        totalamount += amount

    print "Total amount of fruits ", totalamount

    Regards, Ben.
    PS: It looks a little unoptimized to me but it works. Hopefully others
    will reply to you as well so I can learn how to make the above better.
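
    One possible tidy-up of the regular-expression approach, sketched for
    Python 3 and under the assumption that fruit names are lowercase
    letters at the start of the line. The names FRUIT_RE, count_fruits and
    by_count_descending are hypothetical. The second function adds the
    "highest occurrences first" ordering that the question asks for.

```python
import re

# Leading run of lowercase letters, the same pattern as '^[a-z]+' above.
FRUIT_RE = re.compile(r'^[a-z]+')


def count_fruits(lines):
    """Count lines per leading fruit name; lines with no match are skipped."""
    counts = {}
    for line in lines:
        match = FRUIT_RE.match(line)
        if match:
            name = match.group(0)
            counts[name] = counts.get(name, 0) + 1
    return counts


def by_count_descending(counts):
    """Return (name, count) pairs, highest count first, ties ordered by name."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = [
        r"banana_c \\yellow",
        r"apple_a \\green",
        r"orange_b \\yellow",
        r"banana_d \\green",
    ]
    for name, count in by_count_descending(count_fruits(sample)):
        print(name, "occurs", count)
```

    Because the pattern stops at the first character that is not a
    lowercase letter, it does not matter whether the name is followed by
    an "_" or by a space.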
     
    Behrang Dadsetan, Jul 13, 2003
    #3
  4. stuartc

    stuartc Guest

    Hi Bengt:

    Thank you. Your code worked perfectly based on the text file I
    provided.

    Unfortunately for me, my real text file has one slight variation that
    I did not account for: the fruit name does not always have an "_"
    after it. For example, apple below does not have an "_" attached to
    it.

    banana_c \\yellow
    apple \\green
    orange_b \\yellow


    This variation in my text file caused a problem with the program.
    Here's the error.

    Traceback (most recent call last):
      File "G:/Python22/Sort_Fruit.py", line 47, in ?
        for fruit, dummyvar in fruitlist: fruitfreq[fruit] = fruitfreq.get(fruit, 0)+1
    ValueError: unpack list of wrong size

    I tried to debug and fix this variation, but I wasn't able to. I did
    notice that your split breaks each line in the file into two fields,
    as long as there's an "_" with the fruit name. If the fruit name does
    not have an "_", then the split does not occur. I think this is
    related to the problem, but I couldn't figure out how to fix it.

    Any help will be greatly appreciated. Thanks.

    - Stuart
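
    The error follows from how split works: line.split('_', 1) returns a
    list with a single element when the line has no "_", so the loop
    "for fruit, dummyvar in fruitlist" has nothing to unpack into
    dummyvar. A minimal Python 3 sketch of one way around it is shown
    here. The function names are hypothetical, and it assumes the fruit
    name is always the first whitespace-separated word on the line.

```python
def fruit_name(line):
    """Return the fruit name: the first word, cut at the first "_" if there is one."""
    first_word = line.split()[0]
    # str.partition always returns a 3-tuple, so this works with or without "_".
    return first_word.partition('_')[0]


def fruit_frequencies(lines):
    """Return (name, count) pairs for the non-blank lines, highest count first."""
    freq = {}
    for line in lines:
        if line.strip():
            name = fruit_name(line)
            freq[name] = freq.get(name, 0) + 1
    return sorted(freq.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = [
        r"banana_c \\yellow",
        r"apple \\green",
        r"orange_b \\yellow",
    ]
    for name, count in fruit_frequencies(sample):
        print(name, "occurs", count)
```

    Taking the first word before looking for the "_" means an underscore
    later in the line (in the color, say) cannot be mistaken for the
    separator.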



    (Bengt Richter) wrote in message news:<beq357$thj$0@216.39.172.122>...
    > On 12 Jul 2003 12:46:51 -0700, (stuartc) wrote:
    >
    > >Hi:
    > >
    > >I'm not a total newbie, but I'm pretty green. I need to sort a text
    > >file and then get a total for the number of occurances for a part of
    > >the string. Hopefully, this will explain it better:
    > >
    > >Here's the text file:
    > >
    > >banana_c \\yellow
    > >apple_a \\green
    > >orange_b \\yellow
    > >banana_d \\green
    > >orange_a \\orange
    > >apple_w \\yellow
    > >banana_e \\green
    > >orange_x \\yellow
    > >orange_y \\orange
    > >
    > >I would like two output files:
    > >
    > >1) Sorted like this, by the fruit name (the name before the dash)
    > >
    > >apple_a \\green
    > >apple_w \\yellow
    > >banana_c \\yellow
    > >banana_d \\green
    > >banana_e \\green
    > >orange_a \\orange
    > >orange_b \\yellow
    > >orange_x \\yellow
    > >orange_y \\orange
    > >
    > >2) Then summarized like this, ordered with the highest occurances
    > >first:
    > >
    > >orange occurs 4
    > >banana occurs 3
    > >apple occurs 2
    > >
    > >Total occurances is 9
    > >
    > >Thanks for any help !

    >
    > ===< stuartc.py >========================================================
    > import StringIO
    > textf = StringIO.StringIO(r"""
    > banana_c \\yellow
    > apple_a \\green
    > orange_b \\yellow
    > banana_d \\green
    > orange_a \\orange
    > apple_w \\yellow
    > banana_e \\green
    > orange_x \\yellow
    > orange_y \\orange
    > """)
    >
    > # I would like two output files:
    > # (actually two files ?? Ok)
    >
    > # 1) Sorted like this, by the fruit name (the name before the dash)
    >
    > fruitlist = [line.split('_',1) for line in textf if line.strip()]
    > fruitlist.sort()
    >
    > # apple_a \\green
    > # apple_w \\yellow
    > # banana_c \\yellow
    > # banana_d \\green
    > # banana_e \\green
    > # orange_a \\orange
    > # orange_b \\yellow
    > # orange_x \\yellow
    > # orange_y \\orange
    >
    > outfile_1 = StringIO.StringIO()
    > outfile_1.write(''.join(['_'.join(pair) for pair in fruitlist]))
    >
    > # 2) Then summarized like this, ordered with the highest occurances
    > # first:
    >
    > # orange occurs 4
    > # banana occurs 3
    > # apple occurs 2
    >
    > outfile_2 = StringIO.StringIO()
    > fruitfreq = {}
    > for fruit, dummyvar in fruitlist: fruitfreq[fruit] = fruitfreq.get(fruit, 0)+1
    > fruitfreqlist = [(occ,name) for name,occ in fruitfreq.items()]
    > fruitfreqlist.sort()
    > fruitfreqlist.reverse()
    > outfile_2.write('\n'.join(['%s occurs %s'%(name,occ) for occ,name in fruitfreqlist]+['']))
    >
    > # Total occurances is 9
    > print >> outfile_2,"Total occurances [sic] is [sic] %s" % reduce(int.__add__, fruitfreq.values())
    >
    > ## show results
    > print '\nFile 1:\n------------\n%s------------' % outfile_1.getvalue()
    > print '\nFile 2:\n------------\n%s------------' % outfile_2.getvalue()
    > =========================================================================
    > executed:
    >
    > [15:52] C:\pywk\clp>stuartc.py
    >
    > File 1:
    > ------------
    > apple_a \\green
    > apple_w \\yellow
    > banana_c \\yellow
    > banana_d \\green
    > banana_e \\green
    > orange_a \\orange
    > orange_b \\yellow
    > orange_x \\yellow
    > orange_y \\orange
    > ------------
    >
    > File 2:
    > ------------
    > orange occurs 4
    > banana occurs 3
    > apple occurs 2
    > Total occurances [sic] is [sic] 9
    > ------------
    >
    > Is that what you wanted?
    >
    > Regards,
    > Bengt Richter
     
    stuartc, Jul 13, 2003
    #4
