Curious to see alternate approach on a search/replace via regex


rh

I am curious to know whether others would have done this differently and,
if so, how.

This converts a URL to a more easily managed filename, stripping the
http protocol off.

This:

http://alongnameofasite1234567.com/q?sports=run&a=1&b=1

becomes this:

alongnameofasite1234567_com_q_sports_run_a_1_b_1


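import re

def u2f(u):
    # Capture everything after the http:// or https:// prefix.
    nx = re.compile(r'https?://(.+)$')
    u = nx.search(u).group(1)
    # Replace each run of URL punctuation with a single underscore.
    ux = re.compile(r'([-:./?&=]+)')
    return ux.sub('_', u)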

One alternative is to skip the compile step. There must also be a way to
do it all at once, i.e. remove the protocol and replace the chars in one pass.
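
One way the "all at once" idea could be sketched is a single pattern with two alternatives and a replacement function that decides what each match becomes. The names `URL_PARTS` and `u2f_single_pass` are made up for this illustration. Note one behavioural difference from `u2f`: this version does not raise an error when the protocol is missing, it simply flattens whatever it is given.

```python
import re

# Hypothetical names for illustration; not part of the original post.
# Alternative 1 (group 1) matches the protocol at the start of the string.
# Alternative 2 matches a run of the same punctuation characters u2f uses.
URL_PARTS = re.compile(r'^(https?://)|[-:./?&=]+')

def u2f_single_pass(url):
    # Drop the protocol (group 1 matched); turn any other match into '_'.
    return URL_PARTS.sub(lambda m: '' if m.group(1) else '_', url)

print(u2f_single_pass('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
```

As for skipping the compile step: the module-level `re.sub(pattern, repl, string)` accepts the pattern as a string, and the `re` module keeps a cache of recently compiled patterns, so compiling by hand is mostly a matter of taste for a function like this.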
 

Nick Mellor

Hi RH,

The translate method might be faster (and a little easier to read) for your use case. Just precompute and re-use the translation table punct_flatten.

Note that the translate method has changed somewhat for Python 3 due to the separation of text from bytes. This is a Python 3 version.

from urllib.parse import urlparse

flattened_chars = "./&=?"
punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
unflattened = parts.netloc + parts.path + parts.query
flattened = unflattened.translate(punct_flatten)
print(flattened)

Cheers,

Nick
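
The "might be faster" claim above is easy to check on your own machine. The following is only a rough sketch: the helper names `flatten_translate` and `flatten_regex` are invented here, the iteration count is arbitrary, and the timings will vary between systems and Python versions, so no result is claimed.

```python
import re
import timeit
from urllib.parse import urlparse

URL = 'http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'

# Both lookup structures are built once and reused, as suggested above.
FLATTENED_CHARS = './&=?'
PUNCT_FLATTEN = str.maketrans(FLATTENED_CHARS, '_' * len(FLATTENED_CHARS))
PUNCT_RE = re.compile(r'[./&=?]')

def flatten_translate(text):
    # Hypothetical helper: character-by-character table lookup.
    return text.translate(PUNCT_FLATTEN)

def flatten_regex(text):
    # Hypothetical helper: the same replacement done with a regex.
    return PUNCT_RE.sub('_', text)

parts = urlparse(URL)
unflattened = parts.netloc + parts.path + parts.query

# The two approaches should agree before their speed is compared.
assert flatten_translate(unflattened) == flatten_regex(unflattened)

t_translate = timeit.timeit(lambda: flatten_translate(unflattened), number=20000)
t_regex = timeit.timeit(lambda: flatten_regex(unflattened), number=20000)
print(f'translate: {t_translate:.4f}s  regex: {t_regex:.4f}s')
```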
 

rh

I like the idea of using a library, but since I'm learning Python I wanted
to try out the regex stuff. I haven't looked, but I'd think that urllib might
(should?) have a builtin so that one wouldn't have to specify the
flattened_chars list. I'm sure there's a name for those chars but I don't know
it. Maybe just punctuation?

Also, my version converts the ? into _, but urllib sees that as the query
separator and removes it. Just pointing this out for completeness' sake.

This would mimic what I did:

    unflattened = parts.netloc + parts.path + '_' + parts.query
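
On the "name for those chars" question: I am not aware of a urllib helper for this, but the standard library's `string.punctuation` constant lists the ASCII punctuation characters and can feed `str.maketrans` directly. The sketch below combines that with the `'_' + parts.query` fix. The names `PUNCT_TO_UNDERSCORE` and `url_to_filename` are made up for illustration, and mapping all punctuation is a broader choice than the original five-character list.

```python
import string
from urllib.parse import urlparse

# Hypothetical names for illustration. Every ASCII punctuation character
# maps to '_', so no hand-written character list is needed.
PUNCT_TO_UNDERSCORE = str.maketrans(string.punctuation,
                                    '_' * len(string.punctuation))

def url_to_filename(url):
    parts = urlparse(url)
    pieces = [parts.netloc, parts.path]
    if parts.query:
        # urlparse drops the '?', so put a separator back by hand.
        pieces.append('_' + parts.query)
    return ''.join(pieces).translate(PUNCT_TO_UNDERSCORE)

print(url_to_filename('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
```

One difference from the regex version: translate works one character at a time, so a run such as `//` becomes `__`, whereas the `+` in the regex collapses a run into a single underscore.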
 

Nick Mellor

Hi RH,

It's essential to know about regex, of course, but often there's a better, easier-to-read way to do things in Python.

One of Python's aims is clarity and ease of reading.

Regex is complex, potentially inefficient, and hard to read (though sometimes it is the only reasonable way to do things).

Best,

Nick

 
 
