Curious to see alternate approach on a search/replace via regex

rh · Feb 6, 2013

I am curious to know if others would have done this differently. And if so
how so?

This converts a url to a more easily managed filename, stripping the
http protocol off.

This:

http://alongnameofasite1234567.com/q?sports=run&a=1&b=1

becomes this:

alongnameofasite1234567_com_q_sports_run_a_1_b_1

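import re

def u2f(u):
    # Strip the leading http:// or https://
    nx = re.compile(r'https?://(.+)$')
    u = nx.search(u).group(1)
    # Replace each run of punctuation with one underscore
    ux = re.compile(r'([-:./?&=]+)')
    return ux.sub('_', u)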

One alternative is to skip the compile step. There must also be a way to
do it all at once, i.e. remove the protocol and replace the chars in one pass.
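For what it's worth, here is one way the "all at once" idea could look. This is only a sketch: the name u2f_onepass and the named group "scheme" are made up for illustration. The pattern has two alternatives, and the replacement function returns an empty string for the scheme and an underscore for everything else.

```python
import re

# Hypothetical one-pass variant of u2f: the first alternative matches the
# leading scheme, the second matches runs of punctuation.
_FLATTEN = re.compile(r'^(?P<scheme>https?://)|[-:./?&=]+')

def u2f_onepass(u):
    # Drop the scheme; turn every other match into a single underscore.
    return _FLATTEN.sub(lambda m: '' if m.group('scheme') else '_', u)

print(u2f_onepass('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

One behavioural difference from the original: a string without an http(s) prefix is simply flattened here, whereas u2f raises AttributeError because nx.search returns None. As for the compile step, calling re.sub(pattern, repl, u) directly also works, since the re module keeps a cache of recently compiled patterns.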

Roy Smith · Feb 6, 2013

rh said:
I am curious to know if others would have done this differently. And if so
how so?

This converts a url to a more easily managed filename, stripping the
http protocol off.

I would have used the urlparse module.

http://docs.python.org/2/library/urlparse.html
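As a rough illustration of what the parser returns for the URL from the original post (the import shown is the Python 3 location; in Python 2 the function lives in the urlparse module linked above):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')

print(parts.scheme)  # http
print(parts.netloc)  # alongnameofasite1234567.com
print(parts.path)    # /q
print(parts.query)   # sports=run&a=1&b=1
```

The scheme is already split off, so no regex is needed to strip it. Note that the ? separator itself does not appear in any of the fields.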

Nick Mellor · Feb 7, 2013

Hi RH,

translate methods might be faster (and a little easier to read) for your use case. Just precompute and re-use the translation table punct_flatten.

Note that the translate method has changed somewhat for Python 3 due to the separation of text from bytes. This is a Python 3 version.

from urllib.parse import urlparse

flattened_chars = "./&=?"
punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
unflattened = parts.netloc + parts.path + parts.query
flattened = unflattened.translate(punct_flatten)
print(flattened)

Cheers,

Nick


rh · Feb 8, 2013

Nick Mellor said:
Hi RH,

translate methods might be faster (and a little easier to read) for
your use case. Just precompute and re-use the translation table
punct_flatten.

Note that the translate method has changed somewhat for Python 3 due
to the separation of text from bytes. This is a Python 3 version.

from urllib.parse import urlparse

flattened_chars = "./&=?"
punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
unflattened = parts.netloc + parts.path + parts.query
flattened = unflattened.translate(punct_flatten)
print(flattened)

I like the idea of using a library, but since I'm learning Python I wanted
to try out the regex stuff. I haven't looked, but I'd think that urllib might
(should?) have a builtin so that one wouldn't have to specify the
flattened_chars list. I'm sure there's a name for those chars but I don't know
it. Maybe just punctuation?
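On the "name for those chars" question: I'm not aware of a urllib helper that does this flattening, but the standard library does expose the ASCII punctuation characters as string.punctuation. A minimal sketch, where the table name punct_to_underscore is made up for illustration:

```python
import string

# Map every ASCII punctuation character to an underscore.
punct_to_underscore = str.maketrans({c: '_' for c in string.punctuation})

print('alongnameofasite1234567.com/q?sports=run'.translate(punct_to_underscore))
# alongnameofasite1234567_com_q_sports_run
```

This is broader than the hand-picked flattened_chars list: it also flattens characters such as %, + and ~, which may or may not be what you want in a filename.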

Also, my version converts the ? into _, but urllib sees that as the query
separator and removes it. Just pointing this out for completeness' sake.

This would mimic what I did:

unflattened = parts.netloc + parts.path + '_' + parts.query
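Put together as a function, that adjustment might look like the sketch below. The names url_to_filename, FLATTENED_CHARS and PUNCT_FLATTEN are hypothetical, and the character list is widened to match the regex character class from the first post. The check on parts.query is an added assumption so that a URL without a query string does not end in a stray underscore.

```python
from urllib.parse import urlparse

# Same characters as the regex class [-:./?&=], built once and reused.
FLATTENED_CHARS = "-:./?&="
PUNCT_FLATTEN = str.maketrans(FLATTENED_CHARS, '_' * len(FLATTENED_CHARS))

def url_to_filename(url):
    parts = urlparse(url)
    unflattened = parts.netloc + parts.path
    if parts.query:
        # Stand-in for the '?' that urlparse drops.
        unflattened += '_' + parts.query
    return unflattened.translate(PUNCT_FLATTEN)

print(url_to_filename('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

Unlike the regex, translate replaces character by character, so a run such as // becomes __ rather than a single underscore.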


Nick Mellor · Feb 8, 2013

Hi RH,

It's essential to know about regex, of course, but often there's a better, easier-to-read way to do things in Python.

One of Python's aims is clarity and ease of reading.

Regex is complex, potentially inefficient and hard to read (as well as sometimes being the only reasonable way to do things).
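As a small example of the non-regex route, stripping the protocol can be done with a plain string method. This is only a sketch, and strip_scheme is a made-up name:

```python
def strip_scheme(url):
    # partition splits on the first '://' and returns (before, sep, after);
    # sep is empty when the separator is not found.
    scheme, sep, rest = url.partition('://')
    return rest if sep else url

print(strip_scheme('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567.com/q?sports=run&a=1&b=1
```

Unlike the https?:// pattern, this strips any scheme, not just http and https, and it would mis-split a scheme-less string that happens to contain :// further along.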

Best,

Nick


