Sort by domain name?


js

Hi list,

I have a list of URLs and I want to sort that list by domain name.

Here, the domain name doesn't include the subdomain; that is,
leading parts such as 'www', 'mail', 'news' and 'en' should be excluded.

For example, if the list was the following
------------------------------------------------------------
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com
------------------------------------------------------------

the sort's output would be
------------------------------------------------------------
http://google.com
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://mail.yahoo.com
------------------------------------------------------------

As you can see above, I don't want to


Thanks in advance.
 

Paul Rubin

js said:
Here, domain name doesn't contain subdomain,
or should I say, domain's part of 'www', mail, news and en should be
excluded.

It's a little more complicated, you have to treat co.uk about
the same way as .com, and similarly for some other countries
but not all. For example, subdomain.companyname.de versus
subdomain.companyname.com.au or subdomain.companyname.co.uk.
You end up needing a table or special code to say
how to treat various countries.
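
A minimal sketch of the table idea, in Python 3. The suffix set is a tiny hand-picked sample rather than a complete list, and the names `TWO_LABEL_SUFFIXES` and `registered_domain` are made up for illustration:

```python
# Hypothetical, deliberately incomplete table of two-label suffixes
# that should be treated the same way as ".com".
TWO_LABEL_SUFFIXES = {"co.uk", "com.au", "co.nz", "co.jp", "com.br", "co.za"}

def registered_domain(hostname):
    """Return the company-level domain, e.g. 'yahoo.co.uk' for 'mail.yahoo.co.uk'."""
    labels = hostname.lower().split(".")
    # Keep three labels when the last two form a known suffix, otherwise two.
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])

print(registered_domain("subdomain.companyname.de"))      # companyname.de
print(registered_domain("subdomain.companyname.com.au"))  # companyname.com.au
print(registered_domain("subdomain.companyname.co.uk"))   # companyname.co.uk
```

Any country whose rules are missing from the table silently falls back to the two-label case, so the table has to grow with the data.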
 

Tim Chase

Here, domain name doesn't contain subdomain, or should I
It's a little more complicated, you have to treat co.uk about
the same way as .com, and similarly for some other countries
but not all. For example, subdomain.companyname.de versus
subdomain.companyname.com.au or subdomain.companyname.co.uk.
You end up needing a table or special code to say how to treat
various countries.

In addition, you get very different results even on just "base"
domain-name, such as "whitehouse" based on whether you use the
".gov" or ".com" variant of the TLD. Thus, I'm not sure there's
any way to discern this example from the "yahoo.com" vs.
"yahoo.co.uk" variant without doing a boatload of WHOIS queries,
which in turn might be misleading anyways.

A first-pass solution might look something like:

##############################################################
>>> sites
['http://mail.google.com', 'http://reader.google.com',
'http://mail.yahoo.co.uk', 'http://google.com',
'http://mail.yahoo.com']
>>> # slice off the scheme; lstrip('http://') strips a set of
>>> # characters rather than a prefix
>>> sitebits = [site.lower()[len('http://'):].split('.') for site in sites]
>>> for site in sitebits: site.reverse()
...
>>> sorted(sitebits)
[['com', 'google'], ['com', 'google', 'mail'], ['com', 'google',
'reader'], ['com', 'yahoo', 'mail'], ['uk', 'co', 'yahoo', 'mail']]
>>> results = ['http://' + ('.'.join(reversed(site))) for site in sorted(sitebits)]
>>> results
['http://google.com', 'http://mail.google.com',
'http://reader.google.com', 'http://mail.yahoo.com',
'http://mail.yahoo.co.uk']
##############################################################

which can be wrapped up like this:

##############################################################
>>> def sort_sites(sites):  # name assumed; the original def line was lost
...     # slice off the scheme; lstrip() would strip characters, not a prefix
...     sitebits = [site.lower()[len('http://'):].split('.') for site in sites]
...     for site in sitebits: site.reverse()
...     return ['http://' + ('.'.join(reversed(site))) for site in sorted(sitebits)]
...
>>> sort_sites(sites)
['http://google.com', 'http://mail.google.com',
'http://reader.google.com', 'http://mail.yahoo.com',
'http://mail.yahoo.co.uk']
##############################################################

to give you a sorting function. It assumes http rather than
having mixed url-types, such as ftp or mailto. They're easy
enough to strip off as well, but putting them back on becomes a
little more exercise.

Just a few ideas,

-tkc
 

gene tani

Paul said:
It's a little more complicated, you have to treat co.uk about
the same way as .com, and similarly for some other countries
but not all. For example, subdomain.companyname.de versus
subdomain.companyname.com.au or subdomain.companyname.co.uk.
You end up needing a table or special code to say
how to treat various countries.

Plus, how do you order "https:", "ftp", URLs with "www.", "www2." ,
named anchors etc?

Gentle reminder: is this homework? And you can expect better responses
if you show you've bootstrapped yourself on the problem to some extent.
 

js

Thanks for your quick reply.
Yeah, it's a hard task and unfortunately even Google doesn't help me much.

All I want to do is sort a list of URLs by company name,
like oreilly, ask, skype, amazon, google and so on, to find out
how many companies' URLs the list contains.
 

bearophileHUGS

Tim Chase:
to give you a sorting function. It assumes http rather than
having mixed url-types, such as ftp or mailto. They're easy
enough to strip off as well, but putting them back on becomes a
little more exercise.

With a modern Python you don't need to do all that work, you can do:

sorted(urls, key=cleaner)

Where cleaner is a function that finds the important part of each
string you have to sort.
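
One possible `cleaner`, as a sketch only: it uses the host's labels in reverse order as the sort key, so nothing has to be stripped and glued back afterwards. It assumes every URL has the form scheme://host/..., and it does no special handling of suffixes such as co.uk:

```python
def cleaner(url):
    """Sort key: the host's labels reversed, e.g. ['com', 'google', 'mail']."""
    # Assumes a "scheme://host/..." shape; anything else yields an empty host.
    host = url.lower().partition("://")[2].partition("/")[0]
    return host.split(".")[::-1]

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

for url in sorted(urls, key=cleaner):
    print(url)
# http://google.com
# http://mail.google.com
# http://reader.google.com
# http://mail.yahoo.com
# http://mail.yahoo.co.uk
```

Because the key is computed from each URL and the original strings are what gets sorted, mixed schemes come back intact.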

Bye,
bearophile
 

bearophileHUGS

js:
All I want to do is to sort out a list of url by companyname,
like oreilly, ask, skype, amazon, google and so on, to find out
how many company's url the list contain.

Then if you can define a good enough list of such company names, you
can just do a search of such names inside each url.
Maybe you can use a string method, or an RE, or create a big string with
all the company names and perform a longest common subsequence search
using the stdlib function.
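
A rough sketch of the plain string-method variant. The name list and the helper `find_company` are invented for the example, and a simple substring test like this will misfire on short names (for instance "ask" inside "basket"):

```python
# Hypothetical list of known company names; a real one would be much longer.
COMPANY_NAMES = ["oreilly", "ask", "skype", "amazon", "google", "yahoo"]

def find_company(url, names=COMPANY_NAMES):
    """Return the longest known name that occurs in the URL, or None."""
    lowered = url.lower()
    hits = [name for name in names if name in lowered]
    # Prefer the longest hit so a short name does not shadow a longer one.
    return max(hits, key=len) if hits else None

print(find_company("http://mail.google.com"))   # google
print(find_company("http://mail.yahoo.co.uk"))  # yahoo
print(find_company("http://example.org"))       # None
```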

Bye,
bearophile
 

jay graves

gene said:
Plus, how do you order "https:", "ftp", URLs with "www.", "www2." ,
named anchors etc?

Now is a good time to point out the urlparse module in the standard
library. It will help the OP with all of this stuff.
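
For instance, a small sketch in Python 3, where the old urlparse module lives in `urllib.parse`. The helper name `normalized_host` and the rule for dropping "www"/"www2" labels are assumptions made for this example, not something the module provides:

```python
import re
from urllib.parse import urlsplit

def normalized_host(url):
    """Hostname without scheme, port, path or anchor, minus a leading www label."""
    # .hostname is lowercased and excludes any port; it is None for
    # URLs without a network location, such as mailto: addresses.
    host = urlsplit(url).hostname or ""
    # Drop "www." or "www<digits>." only when a full domain still follows.
    return re.sub(r"^www\d*\.(?=.+\.)", "", host)

print(normalized_host("https://www.google.com/search#top"))  # google.com
print(normalized_host("ftp://www2.example.org:21/pub"))      # example.org
print(normalized_host("http://mail.yahoo.co.uk"))            # mail.yahoo.co.uk
```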

just adding my 2 cents.

....
jay graves
 

Paul Rubin

js said:
All I want to do is to sort out a list of url by companyname,
like oreilly, ask, skype, amazon, google and so on, to find out
how many company's url the list contain.

Here's a function I used to use. It makes no attempt to be
exhaustive, but did a reasonable job on the domains I cared about at
the time:

import re

def host_domain(hostname):
    parts = hostname.split('.')
    if parts[-1] in ('au', 'uk', 'nz', 'za', 'jp', 'br'):
        # www.foobar.co.uk, etc
        host_len = 3
    elif len(parts) == 4 and re.match(r'^[\d.]+$', hostname):
        host_len = 4    # 2.3.4.5 numeric address
    else:
        host_len = 2
    d = '.'.join(parts[-host_len:])
    # print 'host_domain:', hostname, '=>', d
    return d
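
To get from a function like this to "how many URLs per company", one option is to count the results. The sketch below is Python 3; `count_by_domain` is a made-up helper, and `last_two` is only a simplified stand-in for host_domain so the example runs on its own:

```python
from collections import Counter
from urllib.parse import urlsplit

def count_by_domain(urls, domain_of):
    """Count URLs per domain; domain_of maps a hostname to its domain."""
    return Counter(domain_of(urlsplit(url).hostname or "") for url in urls)

def last_two(hostname):
    # Stand-in for host_domain: keeps the last two labels, no country rules.
    return ".".join(hostname.split(".")[-2:])

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://google.com",
    "http://mail.yahoo.com",
]
counts = count_by_domain(urls, last_two)
print(counts.most_common())  # [('google.com', 3), ('yahoo.com', 1)]
```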
 

Paul McGuire

js said:
Hi list,

I have a list of URL and I want to sort that list by the domain name.

Here, domain name doesn't contain subdomain,
or should I say, domain's part of 'www', mail, news and en should be
excluded.

For example, if the list was the following
------------------------------------------------------------
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com
------------------------------------------------------------

the sort's output would be
------------------------------------------------------------
http://google.com
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://mail.yahoo.com
------------------------------------------------------------

As you can see above, I don't want to


Thanks in advance.

How about sorting the strings as they are reversed?

urls = """\
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com""".split("\n")

sortedList = [ su[1] for su in sorted([ (u[::-1],u) for u in urls ]) ]

for url in sortedList:
    print(url)


Prints:
http://mail.yahoo.co.uk
http://mail.google.com
http://reader.google.com
http://google.com
http://mail.yahoo.com


Close to what you are looking for, might be good enough?

-- Paul
 

js

Gentle reminder: is this homework? And you can expect better responses
if you show you've bootstrapped yourself on the problem to some extent.

Sure thing.
First I tried to solve this by using the list of domains found at
http://www.neuhaus.com/domaincheck/domain_list.htm

I converted this to a list (in Python) and tried something like this:

for each url, look for a domain in domains such that url.endswith(domain)
if found:
    capture the part to the left of that domain suffix (TLD) and
    save the url in a dictionary keyed by the captured string

This seems to work, but I'm stuck because the solution doesn't seem very good.
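
The approach described above might look roughly like this in Python 3. The short `SUFFIXES` list is a hypothetical sample standing in for the full domain list, and `group_by_company` is a name chosen for the example:

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical sample; the real list of domain suffixes is far longer.
SUFFIXES = [".co.uk", ".com", ".de", ".uk"]

def group_by_company(urls, suffixes=SUFFIXES):
    """Map the label just left of a known suffix to the URLs that share it."""
    groups = defaultdict(list)
    # Try longer suffixes first so ".co.uk" wins over ".uk".
    ordered = sorted(suffixes, key=len, reverse=True)
    for url in urls:
        host = urlsplit(url).hostname or ""
        for suffix in ordered:
            if host.endswith(suffix):
                company = host[:-len(suffix)].split(".")[-1]
                groups[company].append(url)
                break
        # URLs whose host matches no suffix are skipped.
    return dict(groups)

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]
groups = group_by_company(urls)
print(sorted(groups))  # ['google', 'yahoo']
```

Here `len(groups)` gives the number of distinct companies, at least for hosts covered by the suffix list.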
 

js

js:

Then if you can define a good enough list of such company names, you
can just do a search of such names inside each url.
Maybe you can use string method, or a RE, or create a big string with
all the company names and perform a longest common subsequence search
using the stdlib function.

Well, I think the list is so large that it's impossible to
create such a good company list.
 
