Better way to sift parts of URL . . .

Ben Wilson

I am working on a script that splits a URL into a page and a url. The
examples below are the conditions I expect a user to pass to the
script. In all cases, "http://www.example.org/test/" is the URL, and
the page comprises parts that have upper case letters (note, 5 & 6 are
the same as earlier examples, sans the 'test').

1. http://www.example.org/test/Main/AnotherPage (page = Main/AnotherPage)
2. http://www.example.org/test/Main (page = Main + '/' + default_page)
3. http://www.example.org/test (page = default_group + '/' + default_page)
4. http://www.example.org/test/ (page = default_group + '/' + default_page)
5. http://www.example.org/ (page = default_group + '/' + default_page)
6. http://www.example.org/Main/AnotherPage (page = Main/AnotherPage)

Right now, I'm doing a simple split off condition 1:

# s holds the incoming URL ("in" is a reserved word, so it can't be a name)
page = '.'.join(s.split('/')[-2:])
url = '/'.join(s.split('/')[:-2]) + '/'

Before I start winding my way down a complex path, I wanted to see if
anybody had an elegant approach to this problem.

Thanks in advance.
Ben
 
Ben Wilson

Here is what I came up with:

def siftUrl(s):
    # Drop the scheme ("http:") and split the remainder on '/'.
    s = s.split('//')[1]
    bits = s.split('/')

    # A trailing slash leaves an empty element; discard it.
    if '' in bits: bits.remove('')
    if len(bits) > 1:
        group = bits[-2]
        page = bits[-1]
        # strip() returns a new string, so the result has to be assigned.
        group = group.strip('/')
        page = page.strip('/')
    else:
        group = 'test'
        page = 'test'

    if group == group.capitalize():
        page = '/'.join([group,page])
        url = '/'.join(s.split('/')[:-2]) + '/'
    elif page == page.capitalize():
        page = '/'.join(['Main',page])
        url = '/'.join(s.split('/')[:-1]) + '/'
    else:
        page = '/'.join(['Main','Main'])
        url = s

    url = 'http://' + url
    return url, page
 
skip

Ben> I am working on a script that splits a URL into a page and a
Ben> url.

I couldn't tell quite what you mean to accomplish from your example. (In
particular, I don't know what you mean by "default_group", as it's never
defined, and I don't know why the desired output of examples 1 and 6 is the
same, since the URLs are clearly different.) You don't mention having tried
the urlparse module, so I thought I should ask: have you tried using
urlparse?

Skip
 
Ben Wilson

Sorry.

I'm writing a Python script that retrieves the source of a wiki page,
edits it, and re-posts the changed content. The wiki organizes pages
into groups (e.g. ThisGroup/ThisPage). The path segments that are
camel cased (or otherwise start with a capital letter) are the group
and page names for a given page. When a URL is passed that is
incomplete (i.e., has the base URL and the group, or only the base
URL), the wiki falls back to defaults (e.g. a base URL and group return
the default page for that group, and a bare URL returns the default
page of the default group).
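
To make the fallback rule concrete, here is a minimal sketch of it on its own, separate from any URL parsing. The names resolve, DEFAULT_GROUP and DEFAULT_PAGE are hypothetical, and 'Main' as the default for both is an assumption taken from the code posted earlier in this thread, not something the wiki itself confirms.

```python
# Assumed defaults; the code earlier in the thread uses 'Main' for both.
DEFAULT_GROUP = "Main"
DEFAULT_PAGE = "Main"

def resolve(names):
    """Map the capitalized path segments (zero, one or two of them,
    in URL order) to a 'Group/Page' string, filling in defaults."""
    if len(names) >= 2:
        # Both given: the last two segments are group and page.
        group, page = names[-2], names[-1]
    elif len(names) == 1:
        # Only a group given: use that group's default page.
        group, page = names[0], DEFAULT_PAGE
    else:
        # Bare base URL: default page of the default group.
        group, page = DEFAULT_GROUP, DEFAULT_PAGE
    return group + "/" + page

print(resolve(["ThisGroup", "ThisPage"]))
print(resolve(["ThisGroup"]))
print(resolve([]))
```

Keeping this step apart from the parsing means the defaults can be changed in one place.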

I'm playing with urlparse now. Looks like I can do the same thing in a
lot fewer steps. I'll post results.

Ben
 
Paul McGuire

Standard Python includes urlparse. Possible help?

-- Paul



import urlparse

urls = [
    "http://www.example.org/test/Main/AnotherPage",  # (page = Main/AnotherPage)
    "http://www.example.org/test/Main",  # (page = Main + '/' + default_page)
    "http://www.example.org/test",  # (page = default_group + '/' + default_page)
    "http://www.example.org/test/",  # (page = default_group + '/' + default_page)
    "http://www.example.org/",  # (page = default_group + '/' + default_page)
    "http://www.example.org",
    "http://www.example.org/Main/AnotherPage",
    ]

for u in urls:
    print u
    parts = urlparse.urlparse(u)
    print parts
    scheme,netloc,path,params,query,frag = parts
    # The path starts with '/', so the first split element is always empty.
    print path.split("/")[1:]
    print

prints:
http://www.example.org/test/Main/AnotherPage
('http', 'www.example.org', '/test/Main/AnotherPage', '', '', '')
['test', 'Main', 'AnotherPage']

http://www.example.org/test/Main
('http', 'www.example.org', '/test/Main', '', '', '')
['test', 'Main']

http://www.example.org/test
('http', 'www.example.org', '/test', '', '', '')
['test']

http://www.example.org/test/
('http', 'www.example.org', '/test/', '', '', '')
['test', '']

http://www.example.org/
('http', 'www.example.org', '/', '', '', '')
['']

http://www.example.org
('http', 'www.example.org', '', '', '', '')
[]

http://www.example.org/Main/AnotherPage
('http', 'www.example.org', '/Main/AnotherPage', '', '', '')
['Main', 'AnotherPage']
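
The output above shows that a trailing slash leaves an empty string in the split path. One possible way to normalize that is to filter out empty pieces. This is only a sketch, and it assumes Python 3, where the urlparse module is available as urllib.parse; path_segments is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def path_segments(url):
    """Return the non-empty '/'-separated pieces of the URL's path."""
    path = urlsplit(url).path
    # Filtering on truthiness drops the empty strings produced by the
    # leading slash, a trailing slash, or doubled slashes.
    return [seg for seg in path.split("/") if seg]

for u in ("http://www.example.org/test/",
          "http://www.example.org/",
          "http://www.example.org"):
    print(u, path_segments(u))
```

With this, "/test" and "/test/" give the same list, so the later group/page logic would not need a special case for the trailing slash.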
 
Ben Wilson

This is what I ended up with. Slightly different approach:

import urlparse

def sUrl(s):
    page = group = ''
    bits = urlparse.urlsplit(s)
    # bits[0] is the scheme without the colon, so join with '://'.
    url = '://'.join([bits[0],bits[1]]) + '/'
    query = bits[2].split('/')
    if '' in query: query.remove('')
    if len(query) > 1: page = query.pop()
    if len(query) > 0 and query[-1] == query[-1].capitalize(): group = query.pop()
    # Whatever is left belongs to the base URL.
    if len(query): url += '/'.join(query) + '/'
    if page == '': page = 'Main'
    if group == '': group = 'Main'
    page = '.'.join([group,page])
    print " URL: (%s) PAGE: (%s)" % (url, page)


urls = [
    "http://www.example.org/test/Main/AnotherPage",  # (page = Main/AnotherPage)
    "http://www.example.org/test/Main",  # (page = Main + '/' + default_page)
    "http://www.example.org/test",  # (page = default_group + '/' + default_page)
    "http://www.example.org/test/",  # (page = default_group + '/' + default_page)
    "http://www.example.org/",  # (page = default_group + '/' + default_page)
    "http://www.example.org/Main/AnotherPage",
    ]

for u in urls:
    print "Testing:",u
    sUrl(u)
 
Ben Wilson

In practice, I had to change this:
    if len(query) > 0 and query[-1] == query[-1].capitalize(): group = query.pop()

to this:
    if len(query) > 0 and query[-1][0] == query[-1].capitalize()[0]: group = query.pop()

This is because I only wanted to test the case of the first letter of
the string.
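
For anyone following along, here is a small sketch of why the whole-string comparison fails on camel-cased names, together with one possible first-letter test. It assumes Python 3 for the print calls, and starts_upper is a hypothetical helper name. It differs from the comparison above in two ways: slicing with [:1] does not raise IndexError on an empty string the way [0] does, and isupper() is false for a leading digit.

```python
def starts_upper(name):
    """True if the first character of name is an upper-case letter.

    name[:1] is '' for an empty string, and ''.isupper() is False,
    so empty path segments are rejected instead of raising an error.
    """
    return name[:1].isupper()

# capitalize() lower-cases everything after the first character,
# so a camel-cased name never equals its capitalized form.
print("AnotherPage".capitalize())
print("AnotherPage" == "AnotherPage".capitalize())

for name in ("AnotherPage", "Main", "test", ""):
    print(repr(name), starts_upper(name))
```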
 
