Need direction on mass find/replacement in HTML files

KevinUT

Hello Folks:

I want to globally change the following:
<a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

into: <a href="pages/contacts.htm"><font color="#269BD5">

You'll notice that the part to match is http://www.mysite.org/?page= but
I also need to append ".htm" to "contacts" so that it becomes
"contacts.htm". This part of the URL is variable, so how can I use
Python and/or a regular expression to replace the matched prefix
and also append ".htm" to the variable part?

Here are a few dummy URLs for example so you can see the pattern and
the variable too.

<a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">

change to: <a href="pages/newsletter.htm"><font color="#269BD5">

<a href="http://www.mysite.org/?page=faq">

change to: <a href="pages/faq.htm">

So, again, the script needs to strip the absolute URL prefix from each
link and replace the PHP "?page=" part with "pages/", followed by the
variable page name (e.g. contacts) plus ".htm".

Is there a combination of Python code and/or regex that can do this?
Any help would be greatly appreciated!

Kevin
 
Peter Otten


Don't know if the following will in practice be more reliable than a simple
regex, but here goes:

    import sys
    import urlparse
    from BeautifulSoup import BeautifulSoup as BS

    if __name__ == "__main__":
        html = open(sys.argv[1]).read()
        bs = BS(html)
        # Visit only anchors that actually have an href attribute.
        for a in bs("a", href=True):
            url = urlparse.urlparse(a["href"])
            if url.netloc == "www.mysite.org":
                qs = urlparse.parse_qs(url.query)
                # Skip links on the site that carry no ?page= parameter.
                if u"page" in qs:
                    a["href"] = "pages/" + qs[u"page"][0] + ".htm"
        print
        print bs
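
The script above is Python 2 with BeautifulSoup 3. For anyone on Python 3, here is a rough sketch of the same URL-parsing idea using only the standard library. Note the swap: instead of BeautifulSoup it locates href attributes with a simple regex, which assumes every href is written with double quotes, as in the examples in this thread. The names rewrite_href and rewrite_links are made up for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: hrefs are always double-quoted; group 1 is the URL itself.
HREF_RE = re.compile(r'href="([^"]*)"')


def rewrite_href(match):
    """Return the rewritten href attribute, or the original text if it is not ours."""
    url = urlparse(match.group(1))
    pages = parse_qs(url.query).get("page")
    if url.netloc == "www.mysite.org" and pages:
        return 'href="pages/%s.htm"' % pages[0]
    # Leave every other link exactly as it was.
    return match.group(0)


def rewrite_links(html):
    """Rewrite all matching links in a string of HTML."""
    return HREF_RE.sub(rewrite_href, html)
```

Calling rewrite_links() on the contents of a file returns the modified HTML as a string; links to other hosts, or links without a page parameter, pass through unchanged.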

Peter
 
Tim Chase

I want to globally change the following:
<a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

into: <a href="pages/contacts.htm"><font color="#269BD5">

Normally I'd just do this with sed on a *nix-like OS:

find . -iname '*.html' -exec sed -i.BAK \
  's@href="http://www.mysite.org/?page=\([^"]*\)@href="pages/\1.htm@g' \
  {} \;

This finds all the HTML files (*.html) under the current
directory ('.') calling sed on each one. Sed then does the
substitution you describe, changing

href="http://www.mysite.org/?page=<whatever>

into

href="pages/<whatever>.htm

With "-i.BAK", sed keeps a copy of each original file under a .BAK
suffix and overwrites the original with the modified results. If you
don't want the backups, use a plain "-i" instead of "-i.BAK" (GNU sed);
dropping the option entirely makes sed print to stdout and leave the
file untouched. Alternatively, assuming you don't have any pre-existing
.BAK files, after you've vetted the results you can delete them all with

find . -name '*.BAK' -exec rm {} \;
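
For comparison, the same substitution can be sketched with Python's re module. This is only an illustration of the pattern, not a replacement for the one-liner; the name fix_links is invented here. One difference to watch: in sed's basic regex syntax a bare "?" is a literal character, whereas in Python it is a quantifier, so both "?" and "." are escaped below.

```python
import re

# Same shape as the sed expression: capture everything after ?page=
# up to (but not including) the closing double quote.
PATTERN = re.compile(r'href="http://www\.mysite\.org/\?page=([^"]*)')
REPLACEMENT = r'href="pages/\1.htm'


def fix_links(text):
    """Replace every absolute ?page= link in text with a pages/<name>.htm link."""
    return PATTERN.sub(REPLACEMENT, text)
```

The closing quote is never part of the match, so it is left in place right after the appended ".htm".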

Yes, one could hack up something in Python, perhaps adding some
real HTML-parsing brains to it, but for the most part, that
one-liner should do what you need. Unless you're stuck on Win32
with no Cygwin-like toolkit.
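
If you are in that boat, the find/backup part is easy enough to approximate in Python. Below is a minimal sketch, not a drop-in equivalent: the function name rewrite_tree and its parameters are made up for illustration, it assumes the files are UTF-8, and unlike sed it only writes a .BAK copy for files that actually change.

```python
import shutil
from pathlib import Path


def rewrite_tree(root, transform, backup_suffix=".BAK"):
    """Apply transform(text) to every *.html file under root, keeping backups.

    Returns the list of paths that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        # Case-insensitive extension match, like find's -iname '*.html'.
        if not path.is_file() or path.suffix.lower() != ".html":
            continue
        original = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        updated = transform(original)
        if updated != original:
            # Keep the untouched original next to the file, e.g. index.html.BAK.
            shutil.copy2(path, path.with_name(path.name + backup_suffix))
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Here transform is any function that maps the old file text to the new text, such as a regex substitution for the ?page= links. Something like rewrite_tree(".", transform) would then process the current directory, and the returned list shows which files to vet before deleting the backups.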

-tkc
 
