Better way to sift parts of URL . . .

Ben Wilson · Apr 18, 2006

I am working on a script that splits a URL into a page and a url. The
examples below are the conditions I expect a user to pass to the
script. In all cases, "http://www.example.org/test/" is the URL, and
the page comprises parts that have upper case letters (note, 5 & 6 are
the same as earlier examples, sans the 'test').

1. http://www.example.org/test/Main/AnotherPage (page = Main/AnotherPage)
2. http://www.example.org/test/Main (page = Main + '/' + default_page)
3. http://www.example.org/test (page = default_group + '/' + default_page)
4. http://www.example.org/test/ (page = default_group + '/' + default_page)
5. http://www.example.org/ (page = default_group + '/' + default_page)
6. http://www.example.org/Main/AnotherPage (page = Main/AnotherPage)

Right now, I'm doing a simple split that only handles condition 1:

# s holds the URL passed in by the user
page = '.'.join(s.split('/')[-2:])
url = '/'.join(s.split('/')[:-2]) + '/'
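
To illustrate why this split only covers condition 1, here is a small sketch that wraps the same two expressions in a function (the name simple_split and the parameter s are made up for this illustration) and applies them to conditions 1 and 3:

```python
def simple_split(s):
    # Same two expressions as in the post, with the input URL held in s.
    page = '.'.join(s.split('/')[-2:])
    url = '/'.join(s.split('/')[:-2]) + '/'
    return url, page

# Condition 1: the last two segments really are group and page.
print(simple_split("http://www.example.org/test/Main/AnotherPage"))
# ('http://www.example.org/test/', 'Main.AnotherPage')

# Condition 3: the host name and 'test' are mistaken for group and page.
print(simple_split("http://www.example.org/test"))
# ('http://', 'www.example.org.test')
```

Any URL with fewer than two wiki segments loses part of its base URL this way, which is the case the rest of the thread tries to handle.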

Before I start winding my way down a complex path, I wanted to see if
anybody had an elegant approach to this problem.

Thanks in advance.
Ben

Ben Wilson · Apr 18, 2006

Here is what I came up with:

def siftUrl(s):
    # Drop the scheme, then split the remainder on '/'.
    s = s.split('//')[1]
    bits = s.split('/')

    if '' in bits: bits.remove('')
    if len(bits) > 1:
        group = bits[-2]
        page = bits[-1]
        # strip() returns a new string, so the result has to be assigned.
        group = group.strip('/')
        page = page.strip('/')
    else:
        group = 'test'
        page = 'test'

    if group == group.capitalize():
        page = '/'.join([group,page])
        url = '/'.join(s.split('/')[:-2]) + '/'
    elif page == page.capitalize():
        page = '/'.join(['Main',page])
        url = '/'.join(s.split('/')[:-1]) + '/'
    else:
        page = '/'.join(['Main','Main'])
        url = s

    url = 'http://' + url
    return url, page

skip · Apr 18, 2006

Ben> I am working on a script that splits a URL into a page and a
Ben> url.

I couldn't tell quite what you mean to accomplish from your example. (In
particular, I don't know what you mean by "default_group", as it's never
defined, and I don't know why the desired output of examples 1 and 6 is the
same, since the URLs are clearly different.) You don't mention having tried
the urlparse module, so I thought I should ask: have you tried using
urlparse?

Skip

Ben Wilson · Apr 18, 2006

Sorry.

I'm writing a Python script that retrieves the source of a wiki page,
edits it, and re-posts the changed content. The wiki organizes pages
into groups (e.g. ThisGroup/ThisPage). The path sections that are
camel-cased (or otherwise start with a capital letter) are the group
and page names. When an incomplete URL is passed (i.e., one with only
the base URL and the group, or only the base URL), the wiki falls back
to defaults: a base URL plus a group returns the default page for that
group, and a bare base URL returns the default page of the default
group.
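
The fallback rule described above could be sketched roughly as follows. The helper name full_page_name is hypothetical, and 'Main' as both the default group and the default page is an assumption borrowed from the code earlier in the thread, not something the wiki is confirmed to use:

```python
DEFAULT_GROUP = "Main"  # assumed default group name
DEFAULT_PAGE = "Main"   # assumed default page name

def full_page_name(group=None, page=None):
    """Fill in whichever of group/page the URL left out."""
    return "%s/%s" % (group or DEFAULT_GROUP, page or DEFAULT_PAGE)

print(full_page_name("ThisGroup", "ThisPage"))  # ThisGroup/ThisPage
print(full_page_name("ThisGroup"))              # ThisGroup/Main
print(full_page_name())                         # Main/Main
```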

I'm playing with urlparse now. Looks like I can do the same thing in a
lot fewer steps. I'll post results.

Ben

Paul McGuire · Apr 18, 2006

Standard Python includes urlparse. Possible help?

-- Paul

import urlparse

urls = [
    "http://www.example.org/test/Main/AnotherPage", # (page = Main/AnotherPage)
    "http://www.example.org/test/Main", # (page = Main + '/' + default_page)
    "http://www.example.org/test", # (page = default_group + '/' + default_page)
    "http://www.example.org/test/", # (page = default_group + '/' + default_page)
    "http://www.example.org/", # (page = default_group + '/' + default_page)
    "http://www.example.org",
    "http://www.example.org/Main/AnotherPage",
    ]

for u in urls:
    print u
    parts = urlparse.urlparse(u)
    print parts
    scheme,netloc,path,params,query,frag = parts
    # Drop the empty string that precedes the leading '/'.
    print path.split("/")[1:]
    print

prints:
http://www.example.org/test/Main/AnotherPage
('http', 'www.example.org', '/test/Main/AnotherPage', '', '', '')
['test', 'Main', 'AnotherPage']

http://www.example.org/test/Main
('http', 'www.example.org', '/test/Main', '', '', '')
['test', 'Main']

http://www.example.org/test
('http', 'www.example.org', '/test', '', '', '')
['test']

http://www.example.org/test/
('http', 'www.example.org', '/test/', '', '', '')
['test', '']

http://www.example.org/
('http', 'www.example.org', '/', '', '', '')
['']

http://www.example.org
('http', 'www.example.org', '', '', '', '')
[]

http://www.example.org/Main/AnotherPage
('http', 'www.example.org', '/Main/AnotherPage', '', '', '')
['Main', 'AnotherPage']
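
Building on the path segments printed above, the split into base URL, group, and page might be sketched like this. It is only an illustration under a few assumptions: it uses Python 3's urllib.parse (the successor to the urlparse module), the function name split_wiki_url is made up, and "starts with an upper-case letter" is used as the test for a wiki name, which is a guess at the rule rather than the wiki's documented behaviour. Missing parts come back as None so the caller can apply whatever defaults the wiki uses.

```python
from urllib.parse import urlsplit

def split_wiki_url(url):
    """Return (base_url, group, page); group/page are None when absent."""
    parts = urlsplit(url)
    # Ignore empty segments produced by leading or trailing slashes.
    segments = [seg for seg in parts.path.split("/") if seg]

    # Trailing segments that start with an upper-case letter are treated
    # as wiki names; at most two (group, page) are consumed.
    names = []
    while segments and len(names) < 2 and segments[-1][:1].isupper():
        names.insert(0, segments.pop())

    group = names[0] if names else None
    page = names[1] if len(names) > 1 else None

    # Whatever path segments remain belong to the base URL.
    base = "%s://%s/" % (parts.scheme, parts.netloc)
    if segments:
        base += "/".join(segments) + "/"
    return base, group, page

print(split_wiki_url("http://www.example.org/test/Main/AnotherPage"))
# ('http://www.example.org/test/', 'Main', 'AnotherPage')
print(split_wiki_url("http://www.example.org/test/"))
# ('http://www.example.org/test/', None, None)
```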

Ben Wilson · Apr 19, 2006

This is what I ended up with. Slightly different approach:

import urlparse

def sUrl(s):
    page = group = ''
    bits = urlparse.urlsplit(s)
    # bits[0] is the scheme without the colon, so join with '://'.
    url = '://'.join([bits[0],bits[1]]) + '/'
    # bits[2] is the path component.
    query = bits[2].split('/')
    if '' in query: query.remove('')
    if len(query) > 1: page = query.pop()
    if len(query) > 0 and query[-1] == query[-1].capitalize():
        group = query.pop()
    if len(query): url += '/'.join(query) + '/'
    if page == '': page = 'Main'
    if group == '': group = 'Main'
    page = '.'.join([group,page])
    print " URL: (%s) PAGE: (%s)" % (url, page)

urls = [
    "http://www.example.org/test/Main/AnotherPage", # (page = Main/AnotherPage)
    "http://www.example.org/test/Main", # (page = Main + '/' + default_page)
    "http://www.example.org/test", # (page = default_group + '/' + default_page)
    "http://www.example.org/test/", # (page = default_group + '/' + default_page)
    "http://www.example.org/", # (page = default_group + '/' + default_page)
    "http://www.example.org/Main/AnotherPage",
    ]

for u in urls:
    print "Testing:",u
    sUrl(u)

Ben Wilson · Apr 19, 2006

In practice, I had to change this:
    if len(query) > 0 and query[-1] == query[-1].capitalize():
        group = query.pop()

to this:
    if len(query) > 0 and query[-1][0] == query[-1].capitalize()[0]:
        group = query.pop()

This is because I only wanted to test the case of the first letter of
the string.
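
One thing to watch with the indexed form: query[-1][0] raises IndexError when the last segment is an empty string, which can happen for a bare "http://www.example.org/" once a single '' has been removed from the split path. A possible variant, shown here as a hypothetical helper named starts_capitalized, slices instead of indexing so that an empty segment is compared as '' rather than raising:

```python
def starts_capitalized(segment):
    # segment[:1] is '' for an empty string, whereas segment[0] would
    # raise IndexError; an empty segment therefore compares equal.
    return segment[:1] == segment.capitalize()[:1]

print(starts_capitalized("AnotherPage"))  # True
print(starts_capitalized("test"))         # False
print(starts_capitalized(""))             # True
```

Treating the empty segment as a match keeps the earlier behaviour of the whole-string comparison, where '' == ''.capitalize() also held, so the empty entry is still popped and replaced by the default group.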
