Sorting Countries by Region

P

patrick.waldo

Hi all,

I'm analyzing some data that has a lot of country data. What I need
to do is sort through this data and output it into an excel doc with
summary information. The countries, though, need to be sorted by
region, but the way I thought I could do it isn't quite working out.
So far I can only successfully get the data alphabetically.

Any ideas?

import xlrd
import pyExcelerator

def get_countries_list(list):
countries_list=[]
for country in countries:
if country not in countries_list:
countries_list.append(country)

EU = ["Austria","Belgium", "Cyprus","Czech Republic",
"Denmark","Estonia", "Finland"]
NA = ["Canada", "United States"]
AP = ["Australia", "China", "Hong Kong", "India", "Indonesia",
"Japan"]
Regions_tot = {'European Union':EU, 'North America':NA, 'Asia
Pacific':AP,}

path_file = "c:\\1\country_data.xls"
book = xlrd.open_workbook(path_file)
Counts = book.sheet_by_index(1)
countries= Counts.col_values(0,start_rowx=1, end_rowx=None)

get_countries_list(countries)

wb=pyExcelerator.Workbook()
matrix = wb.add_sheet("matrix")

n=1
for country in unique_countries:
matrix.write(n,1, country)
n = n+1

wb.save('c:\\1\\matrix.xls')
 
S

Sergio Correia

Just a few notes:

1) get_countries_list

What is the purpose of that function? Besides a few errors (an
argument named list, no value returned), it seems you just want to
remove duplicates from a list called countries. You can do that
transforming the list to a 'set'.

new_countries = list( set(countries) )

2) I would suggest using countries.sort(...) or sorted(countries,...),
specifying cmp or key options too sort by region instead.

3) Instead of doing this:

for country in unique_countries:
matrix.write(n,1, country)
n = n+1

Do something like

for i, cou in enumerate(unique_countries):
matrix.write(1+i, 1, cou)

(so you dont need to increment the 'i' variable manually)
 
M

martyw

Hi all,

I'm analyzing some data that has a lot of country data. What I need
to do is sort through this data and output it into an excel doc with
summary information. The countries, though, need to be sorted by
region, but the way I thought I could do it isn't quite working out.
So far I can only successfully get the data alphabetically.

Any ideas?

import xlrd
import pyExcelerator

def get_countries_list(list):
it isn't a good idea to use a built-in object as a variable name
countries_list=[]
for country in countries:ii
if country not in countries_list:
countries_list.append(country)

EU = ["Austria","Belgium", "Cyprus","Czech Republic",
"Denmark","Estonia", "Finland"]
NA = ["Canada", "United States"]
AP = ["Australia", "China", "Hong Kong", "India", "Indonesia",
"Japan"]
Regions_tot = {'European Union':EU, 'North America':NA, 'Asia
Pacific':AP,}
i would create a class to capture country information, e.g.
class country(object):
def __init__(self, name, size = 0, population = 0):
self.name = name
self.size = size
self.poplation = population

def __cmp__(self, other):
if self.name < other.name:
return -1
elif self.name > other.name:
return 1
else:
return 0

then you can set up the world as

world = {'NA': [country("United States"), country("Canada")], \
'Europe': [country("Belgium"), country("Austria")]}


now you can sort and print it easy

for region in world:
print region
lands = world[region]
lands.sort()
for land in lands:
print land.name

the sort works because the country objects have a method __cmp__
path_file = "c:\\1\country_data.xls"
book = xlrd.open_workbook(path_file)
Counts = book.sheet_by_index(1)
countries= Counts.col_values(0,start_rowx=1, end_rowx=None)

get_countries_list(countries)

wb=pyExcelerator.Workbook()
matrix = wb.add_sheet("matrix")

n=1
for country in unique_countries:
matrix.write(n,1, country)
n = n+1

wb.save('c:\\1\\matrix.xls')

i'm not familiar with the excel modules so i can't help you with that
 
A

Alan

i would create a class to capture country information, e.g.
<snip>

What is the advantage of this:
def __cmp__(self, other):
if self.name < other.name:
return -1
elif self.name > other.name:
return 1
else:
return 0

over
def __cmp__(self,other):
return cmp(self.name,other.name)

?
 
P

patrick.waldo

Great, this is very helpful. I'm new to Python, so hence the
inefficient or nonsensical code!
2) I would suggest using countries.sort(...) or sorted(countries,...),
specifying cmp or key options too sort by region instead.

I don't understand how to do this. The countries.sort() lists
alphabetically and I tried to do a lambda x,y: cmp() type function,
but it doesn't sort correctly. Help with that?

For martyw's example, I don't need to get any sort of population
info. I'm actually getting the number of various types of documents.
So the entry is like this:

Argentina Food and Consumer Products Food Additives Color
Additives 1
Argentina Food and Consumer Products Food Additives Flavors 1
Argentina Food and Consumer Products Food Additives
General 6
Argentina Food and Consumer Products Food Additives labeling 1
Argentina Food and Consumer Products Food Additives Prohibited
Additives 1
Argentina Food and Consumer Products Food Contact Cellulose 1
Argentina Food and Consumer Products Food Contact Food
Packaging 1
Argentina Food and Consumer Products Food Contact Plastics 4
Argentina Food and Consumer Products Food Contact
Waxes 1
Belize
etc...

So I'll need to add up all the entries for Food Additives and Food
contacts, the other info like Color Additives isn't important.

So I will have an output like this
Food Additives Food Contact
Argentina 10 7
Belize
etc...

Thanks so much for the help!
 
S

Sergio Correia

About the sort:

Check this (also on http://pastebin.com/f12b5b6ca )

def make_regions():

# Values you provided
EU = ["Austria","Belgium", "Cyprus","Czech Republic",
"Denmark","Estonia", "Finland"]
NA = ["Canada", "United States"]
AP = ["Australia", "China", "Hong Kong", "India", "Indonesia",
"Japan"]
regions = {'European Union':EU, 'North America':NA, 'Asia Pacific':AP}

ans = {}
for reg_name, reg in regions.items():
for cou in reg:
ans[cou] = reg_name
return ans

def cmp_region(cou1, cou2):
ans = cmp(regions[cou1], regions[cou2])
if ans: # If the region is the same, sort by country
return cmp(cou1, cou2)
else:
return ans

regions = make_regions()
some_countries = ['Austria', 'Canada', 'China', 'India']

print 'Old:', some_countries
some_countries.sort(cmp_region)
print 'New:', some_countries


Why that code?
Because the first thing I want is a dictionary where the key is the
name of the country and the value is the region. Then, I just make a
quick function that compares considering the region and country.
Finally, I sort.

Btw, the code is just a quick hack, as it can be improved -a lot-.


About the rest of your code:
- martyw's example is much more useful than you think. Why? because
you can just iterate across your document, adding the values you get
to the adequate object property. That is, instead of using size or
pop, use the variables you are interested in.

Best, and good luck with python,
Sergio
 
P

patrick.waldo

This is how I solved it last night in my inefficient sort of way and
after re-reading some of my Python books on dictionaries. So far this
gets the job done. However, I'd like to test if there are any
countries in the excel input that are not represented, ie the input is
all the information I have and the dictionary functions as the
information I expect. What I did worked yesterday, but doesn't work
anymore more...see comment

Otherwise I tried doing this:

for i, country in countries_list:
if country in REGIONS_COUNTRIES['European Union']:
matrix.write(i+2, 1, country)
but I got "ValueError: too many values to unpack"

Again, this has been a great help. Any ideas of how I can make this a
bit more efficient, as I'm dealing with 5 regions and numerous
countries, would be greatly appreciated. Here's the code:


#keeping all the countries short
REGIONS_COUNTRIES = {'European Union':["Austria","Belgium"," "France",
"Germany", "Greece"],\
'North America':["Canada", "United States"]}

path_file = "c:\\1\\build\\data\\matrix2\\Update_oct07a.xls"

book = xlrd.open_workbook(path_file)
Counts = book.sheet_by_index(1)
wb=pyExcelerator.Workbook()
matrix = wb.add_sheet("matrix")

countries = Counts.col_values(0,start_rowx=1, end_rowx=None)
countries_list = list(set(countries))
countries_list.sort()

#This seems to not work today and I don't know why
#for country in countries_list:
# if country not in REGIONS_COUNTRIES['European Union'] or not in
REGIONS_COUNTRIES['North America']:
# print "%s is not in the expected list", country

#This sorts well
n=2
for country in countries_list:
if country in REGIONS_COUNTRIES['European Union']:
matrix.write(n, 1, country)
n=n+1
for country in countries_list:
if country in REGIONS_COUNTRIES['North America']:
matrix.write(n, 1, country)
n=n+1

wb.save('c:\\1\\matrix.xls')



About the sort:

Check this (also onhttp://pastebin.com/f12b5b6ca)

def make_regions():

# Values you provided
EU = ["Austria","Belgium", "Cyprus","Czech Republic",
"Denmark","Estonia", "Finland"]
NA = ["Canada", "United States"]
AP = ["Australia", "China", "Hong Kong", "India", "Indonesia",
"Japan"]
regions = {'European Union':EU, 'North America':NA, 'Asia Pacific':AP}

ans = {}
for reg_name, reg in regions.items():
for cou in reg:
ans[cou] = reg_name
return ans

def cmp_region(cou1, cou2):
ans = cmp(regions[cou1], regions[cou2])
if ans: # If the region is the same, sort by country
return cmp(cou1, cou2)
else:
return ans

regions = make_regions()
some_countries = ['Austria', 'Canada', 'China', 'India']

print 'Old:', some_countries
some_countries.sort(cmp_region)
print 'New:', some_countries

Why that code?
Because the first thing I want is a dictionary where the key is the
name of the country and the value is the region. Then, I just make a
quick function that compares considering the region and country.
Finally, I sort.

Btw, the code is just a quick hack, as it can be improved -a lot-.

About the rest of your code:
- martyw's example is much more useful than you think. Why? because
you can just iterate across your document, adding the values you get
to the adequate object property. That is, instead of using size or
pop, use the variables you are interested in.

Best, and good luck with python,
Sergio

Great, this is very helpful. I'm new to Python, so hence the
inefficient or nonsensical code!
I don't understand how to do this. The countries.sort() lists
alphabetically and I tried to do a lambda x,y: cmp() type function,
but it doesn't sort correctly. Help with that?
For martyw's example, I don't need to get any sort of population
info. I'm actually getting the number of various types of documents.
So the entry is like this:
Argentina Food and Consumer Products Food Additives Color
Additives 1
Argentina Food and Consumer Products Food Additives Flavors 1
Argentina Food and Consumer Products Food Additives
General 6
Argentina Food and Consumer Products Food Additives labeling 1
Argentina Food and Consumer Products Food Additives Prohibited
Additives 1
Argentina Food and Consumer Products Food Contact Cellulose 1
Argentina Food and Consumer Products Food Contact Food
Packaging 1
Argentina Food and Consumer Products Food Contact Plastics 4
Argentina Food and Consumer Products Food Contact
Waxes 1
Belize
etc...
So I'll need to add up all the entries for Food Additives and Food
contacts, the other info like Color Additives isn't important.
So I will have an output like this
Food Additives Food Contact
Argentina 10 7
Belize
etc...
Thanks so much for the help!
--
 
G

Gabriel Genellina

for i, country in countries_list:
if country in REGIONS_COUNTRIES['European Union']:
matrix.write(i+2, 1, country)
but I got "ValueError: too many values to unpack"

Remove the i, and try again...
 
J

J. Clifford Dyer

#This seems to not work today and I don't know why
#for country in countries_list:
# if country not in REGIONS_COUNTRIES['European Union'] or not in
REGIONS_COUNTRIES['North America']:
# print "%s is not in the expected list", country

This snippet fails because you want 'and' rather than 'or'. In this
one, you test to see that country is not in EU, and if you have, say
'France' as your country, the first half evaluates false, so the or
tells it to try again with North America. Lo and behold, france is not
in North America, so the second half returns true. False or True
returns True, so your code prints "France is not in the expected list."

Cheers,
Cliff
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,071
Latest member
MetabolicSolutionsKeto

Latest Threads

Top