Newbie with sort text file question

stuartc

Hi:

I'm not a total newbie, but I'm pretty green. I need to sort a text
file and then get a total for the number of occurrences of a part of
each line. Hopefully, this will explain it better:

Here's the text file:

banana_c \\yellow
apple_a \\green
orange_b \\yellow
banana_d \\green
orange_a \\orange
apple_w \\yellow
banana_e \\green
orange_x \\yellow
orange_y \\orange

I would like two output files:

1) Sorted like this, by the fruit name (the name before the underscore)

apple_a \\green
apple_w \\yellow
banana_c \\yellow
banana_d \\green
banana_e \\green
orange_a \\orange
orange_b \\yellow
orange_x \\yellow
orange_y \\orange

2) Then summarized like this, ordered with the highest occurrences
first:

orange occurs 4
banana occurs 3
apple occurs 2

Total occurances is 9

Thanks for any help!
 
Max M

stuartc said:
Hi:

I'm not a total newbie, but I'm pretty green. I need to sort a text
file and then get a total for the number of occurrences of a part of
each line. Hopefully, this will explain it better:

Here's the text file:

banana_c \\yellow
apple_a \\green
orange_b \\yellow
banana_d \\green
orange_a \\orange
apple_w \\yellow
banana_e \\green
orange_x \\yellow
orange_y \\orange

I would like two output files:

1) Sorted like this, by the fruit name (the name before the underscore)

2) Then summarized like this, ordered with the highest occurrences
first:

orange occurs 4
banana occurs 3
apple occurs 2

Total occurances is 9


fruity = """banana_c \\yellow
apple_a \\green
orange_b \\yellow
banana_d \\green
orange_a \\orange
apple_w \\yellow
banana_e \\green
orange_x \\yellow
orange_y \\orange"""

# print sorted list
fruits = fruity.split('\n')
fruits.sort()
print '\n'.join(fruits)
print ''

# count occurrences of each fruit name (the part before the underscore)
counter = {}
for fruit in fruits:
    sort_of, apendix = fruit.split('_')
    counter[sort_of] = counter.get(sort_of, 0) + 1

# sort by occurrences, highest first
decorated = [(counter[key], key) for key in counter.keys()]
decorated.sort()
decorated.reverse()

# print result ("total" rather than "sum", to avoid hiding the builtin)
total = 0
for count, sort_of in decorated:
    print sort_of, 'occurs', count
    total += count

print ''
print 'Total occurances is', total


regards Max M
 
Behrang Dadsetan

stuartc said:
Hi:

Here's the text file:

banana_c \\yellow
apple_a \\green
orange_b \\yellow
banana_d \\green
orange_a \\orange
apple_w \\yellow
banana_e \\green
orange_x \\yellow
orange_y \\orange

I would like two output files:

1) Sorted like this, by the fruit name (the name before the underscore)

2) Then summarized like this, ordered with the highest occurrences
first:

Here is some mostly tested code ;)

import re

file = open("textfile.txt")  # your file name instead of textfile.txt
alllines = list(file.readlines())
file.close()

alllines.sort()

# the fruit name is the run of lowercase letters at the start of the line
fruitre = re.compile('^[a-z]+')
fruits = {}
for line in alllines:
    fruitresult = fruitre.search(line)
    print line,  # trailing comma: the line already ends with a newline
    if fruitresult:
        fruit = fruitresult.group(0)
        fruits.setdefault(fruit, 0)
        fruits[fruit] += 1

totalamount = 0
for fruit, amount in fruits.items():
    print fruit, " occurs ", amount
    totalamount += amount

print "Total amount of fruits ", totalamount

Regards, Ben.
PS: It looks a little unoptimized to me but it works. Hopefully others
will reply to you as well so I can learn how to make the above better.
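
One thing the loop above does not do is print the summary with the highest counts first, which the original question asked for. Here is a minimal sketch of one way to get that ordering, assuming a newer Python (3.x) than the one used in this thread. `collections.Counter` and its `most_common` method are in the standard library; the name `count_fruits` is made up for illustration.

```python
import re
from collections import Counter

# Same idea as the regex above: the fruit name is the leading run of
# lowercase letters on the line.
FRUIT_RE = re.compile(r'^[a-z]+')

def count_fruits(lines):
    """Count lines per fruit name; lines with no leading name are skipped."""
    counts = Counter()
    for line in lines:
        match = FRUIT_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

lines = [
    r"banana_c \\yellow", r"apple_a \\green", r"orange_b \\yellow",
    r"banana_d \\green", r"orange_a \\orange", r"apple_w \\yellow",
    r"banana_e \\green", r"orange_x \\yellow", r"orange_y \\orange",
]

counts = count_fruits(lines)
# most_common() yields (name, count) pairs, highest count first.
for fruit, amount in counts.most_common():
    print(fruit, "occurs", amount)
print("Total occurrences is", sum(counts.values()))
```

On Pythons without `Counter`, the decorate/sort/reverse approach in Max M's reply gives the same ordering.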
 
stuartc

Hi Bengt:

Thank you. Your code worked perfectly based on the text file I
provided.

Unfortunately for me, my real text file has one slight variation that
I did not account for: the fruit name does not always have
an "_" after it. For example, apple below does not have an "_"
attached to it.

banana_c \\yellow
apple \\green
orange_b \\yellow


This variation in my text file caused a problem with the program.
Here's the error.

Traceback (most recent call last):
  File "G:/Python22/Sort_Fruit.py", line 47, in ?
    for fruit, dummyvar in fruitlist: fruitfreq[fruit] = fruitfreq.get(fruit, 0)+1
ValueError: unpack list of wrong size

I tried to debug and fix this variation, but I wasn't able to. I did
notice that your split, splits each line in the file into two fields,
as long as there's an "_" with a fruit name. If the fruit name does
not have an "_", then the split does not occur. I think this is
related to the problem, but I couldn't figure out how to fix it.

Any help will be greatly appreciated. Thanks.

- Stuart
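
The traceback points at the unpacking rather than the split itself: `line.split('_', 1)` returns a one-element list when the line has no "_", so `for fruit, dummyvar in fruitlist` has nothing to assign to `dummyvar`. A minimal sketch of one way around it, assuming Python 3 and assuming the fruit name is always the first word on the line; `fruit_name` is a hypothetical helper, not part of the script being discussed.

```python
def fruit_name(line):
    """Return the fruit name: the first word, cut at the first "_" if any."""
    first_word = line.split()[0]        # e.g. "banana_c" or "apple"
    return first_word.split('_', 1)[0]  # index 0 exists with or without "_"

lines = [r"banana_c \\yellow", r"apple \\green", r"orange_b \\yellow"]

fruitfreq = {}
for line in lines:
    if line.strip():  # skip blank lines
        name = fruit_name(line)
        fruitfreq[name] = fruitfreq.get(name, 0) + 1

print(fruitfreq)
```

Taking index 0 of the split result, instead of unpacking it into two names, is what makes the "_" optional. The regular expression in Behrang's reply sidesteps the problem in a similar way, since it only looks at the leading letters.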




===< stuartc.py >========================================================
import StringIO
textf = StringIO.StringIO(r"""
banana_c \\yellow
apple_a \\green
orange_b \\yellow
banana_d \\green
orange_a \\orange
apple_w \\yellow
banana_e \\green
orange_x \\yellow
orange_y \\orange
""")

# I would like two output files:
# (actually two files ?? Ok)

# 1) Sorted like this, by the fruit name (the name before the dash)

fruitlist = [line.split('_',1) for line in textf if line.strip()]
fruitlist.sort()

# apple_a \\green
# apple_w \\yellow
# banana_c \\yellow
# banana_d \\green
# banana_e \\green
# orange_a \\orange
# orange_b \\yellow
# orange_x \\yellow
# orange_y \\orange

outfile_1 = StringIO.StringIO()
outfile_1.write(''.join(['_'.join(pair) for pair in fruitlist]))

# 2) Then summarized like this, ordered with the highest occurances
# first:

# orange occurs 4
# banana occurs 3
# apple occurs 2

outfile_2 = StringIO.StringIO()
fruitfreq = {}
for fruit, dummyvar in fruitlist: fruitfreq[fruit] = fruitfreq.get(fruit, 0)+1
fruitfreqlist = [(occ,name) for name,occ in fruitfreq.items()]
fruitfreqlist.sort()
fruitfreqlist.reverse()
outfile_2.write('\n'.join(['%s occurs %s'%(name,occ) for occ,name in fruitfreqlist]+['']))

# Total occurances is 9
print >> outfile_2,"Total occurances [sic] is [sic] %s" % reduce(int.__add__, fruitfreq.values())

## show results
print '\nFile 1:\n------------\n%s------------' % outfile_1.getvalue()
print '\nFile 2:\n------------\n%s------------' % outfile_2.getvalue()
=========================================================================
executed:

[15:52] C:\pywk\clp>stuartc.py

File 1:
------------
apple_a \\green
apple_w \\yellow
banana_c \\yellow
banana_d \\green
banana_e \\green
orange_a \\orange
orange_b \\yellow
orange_x \\yellow
orange_y \\orange
------------

File 2:
------------
orange occurs 4
banana occurs 3
apple occurs 2
Total occurances [sic] is [sic] 9
------------

Is that what you wanted?

Regards,
Bengt Richter
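
The script above uses in-memory StringIO objects as stand-ins for the two output files. For anyone who wants real files, here is a rough sketch under a few assumptions: Python 3, the fruit name is the first word cut at the first "_" (if there is one), and `write_reports` plus both file names are invented for the example.

```python
import os
import tempfile
from collections import Counter

def write_reports(lines, sorted_path, summary_path):
    """Write the sorted lines and the per-fruit summary to two files."""
    rows = sorted(line.strip() for line in lines if line.strip())
    # Fruit name: first word, cut at the first "_" if there is one.
    counts = Counter(row.split()[0].split('_', 1)[0] for row in rows)

    with open(sorted_path, 'w') as out:
        out.write('\n'.join(rows) + '\n')

    with open(summary_path, 'w') as out:
        for name, occ in counts.most_common():  # highest count first
            out.write('%s occurs %s\n' % (name, occ))
        out.write('\nTotal occurrences is %s\n' % sum(counts.values()))

# Demo in a temporary directory so nothing is left behind.
sample = [r"banana_c \\yellow", r"apple_a \\green",
          r"orange_b \\yellow", r"orange_a \\orange"]
with tempfile.TemporaryDirectory() as tmp:
    sorted_path = os.path.join(tmp, 'sorted.txt')    # hypothetical name
    summary_path = os.path.join(tmp, 'summary.txt')  # hypothetical name
    write_reports(sample, sorted_path, summary_path)
    with open(sorted_path) as f:
        sorted_text = f.read()
    with open(summary_path) as f:
        summary_text = f.read()

print(sorted_text)
print(summary_text)
```

In real use you would pass the lines read from your input file and two paths of your own choosing.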
 
