Need direction on mass find/replacement in HTML files

Discussion in 'Python' started by KevinUT, Apr 30, 2010.

  1. KevinUT

    KevinUT Guest

    Hello Folks:

    I want to globally change the following:
    <a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

    into: <a href="pages/contacts.htm"><font color="#269BD5">

    You'll notice that the part to match is http://www.mysite.org/?page= but
    I also need to add ".htm" to the end of "contacts" so it becomes
    "contacts.htm". This part of the URL is variable, so how can I use
    Python and/or a regular expression to replace the match above and also
    add ".htm" to the end of that variable part?

    Here are a few dummy URLs for example so you can see the pattern and
    the variable too.

    <a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">

    change to: <a href="pages/newsletter.htm"><font color="#269BD5">

    <a href="http://www.mysite.org/?page=faq">

    change to: <a href="pages/faq.htm">

    So, again, the script needs to strip the absolute URL prefix, including
    the PHP "?page=", from every link, leaving just the variable page name
    (e.g. contacts) prefixed with "pages/" and followed by ".htm".

    Is there a combination of Python code and/or regex that can do this?
    Any help would be greatly appreciated!

    Kevin
     
    KevinUT, Apr 30, 2010
    #1

  2. Peter Otten

    Peter Otten Guest

    KevinUT wrote:

    > Hello Folks:
    >
    > I want to globally change the following: <a href="http://
    > www.mysite.org/?page=contacts"><font color="#269BD5">
    >
    > into: <a href="pages/contacts.htm"><font color="#269BD5">
    >
    > You'll notice that the match would be http://www.mysite.org/?page= but
    > I also need to add a ".htm" to the end of "contacts" so it becomes
    > "contacts.htm" This part of the URL is variable, so how can I use a
    > combination of Python and/or a regular expression to replace the match
    > the above and also add a ".htm" to the end of that variable part?
    >
    > Here are a few dummy URLs for example so you can see the pattern and
    > the variable too.
    >
    > <a href="http://www.mysite.org/?page=newsletter"><font
    > color="#269BD5">
    >
    > change to: <a href="pages/newsletter.htm"><font color="#269BD5">
    >
    > <a href="http://www.mysite.org/?page=faq">
    >
    > change to: <a href="pages/faq.htm">
    >
    > So, again the script needs to replace all the full absolute URL links
    > with nothing and replace the PHP "?page=" with just the variable page
    > name (i.e. contacts) plus the ".htm"
    >
    > Is there a combination of Python code and/or regex that can do this?
    > Any help would be greatly appreciated!


    Don't know if the following will in practice be more reliable than a simple
    regex, but here goes:

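    # Python 2 with BeautifulSoup 3 (note `import urlparse` and `print bs`).
    import sys
    import urlparse
    from BeautifulSoup import BeautifulSoup as BS

    if __name__ == "__main__":
        # Parse the HTML file named on the command line.
        html = open(sys.argv[1]).read()
        bs = BS(html)
        for a in bs("a"):
            href = a["href"]
            url = urlparse.urlparse(href)
            # Only rewrite links that point at the site itself.
            if url.netloc == "www.mysite.org":
                qs = urlparse.parse_qs(url.query)
                # ?page=contacts becomes pages/contacts.htm
                a["href"] = "pages/" + qs[u"page"][0] + ".htm"
        print
        print bs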

    Peter
     
    Peter Otten, Apr 30, 2010
    #2
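
The URL-parsing step in the script above can also be tried on its own with the Python 3 standard library. This is only a minimal sketch: the helper name `rewrite_href` and the `SITE` constant are illustrative and not part of the original post, and it leaves any link it does not recognise unchanged.

```python
from urllib.parse import urlparse, parse_qs

SITE = "www.mysite.org"  # host taken from the question


def rewrite_href(href):
    """Return 'pages/<name>.htm' for links to SITE that carry ?page=<name>."""
    url = urlparse(href)
    if url.netloc != SITE:
        return href  # external or relative link: leave untouched
    pages = parse_qs(url.query).get("page")
    if not pages:
        return href  # no page parameter: nothing to rewrite
    return "pages/" + pages[0] + ".htm"


if __name__ == "__main__":
    print(rewrite_href("http://www.mysite.org/?page=contacts"))
    print(rewrite_href("http://example.com/?page=faq"))
```

Parsing the query string rather than pattern-matching it means extra parameters or a different parameter order would not break the rewrite, which is one plausible reason to prefer this over a plain regex.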

  3. Tim Chase

    Tim Chase Guest

    On 04/30/2010 02:54 PM, KevinUT wrote:
    > I want to globally change the following:<a href="http://
    > www.mysite.org/?page=contacts"><font color="#269BD5">
    >
    > into:<a href="pages/contacts.htm"><font color="#269BD5">


    Normally I'd just do this with sed on a *nix-like OS:

    find . -iname '*.html' -exec sed -i.BAK \
      's@href="http://www\.mysite\.org/?page=\([^"]*\)@href="pages/\1.htm@g' \
      {} \;

    This finds all the HTML files (*.html) under the current
    directory ('.') calling sed on each one. Sed then does the
    substitution you describe, changing

    href="http://www.mysite.org/?page=<whatever>

    into

    href="pages/<whatever>.htm

    The "-i.BAK" option first copies each original file to a .BAK
    backup and then overwrites the original with the modified
    results. If you don't want the backups, GNU sed accepts a plain
    "-i" with no suffix. Alternatively, assuming you don't have any
    pre-existing .BAK files, after you've vetted the results you
    can delete them all with

    find . -name '*.BAK' -exec rm {} \;

    Yes, one could hack up something in Python, perhaps adding some
    real HTML-parsing brains to it, but for the most part, that
    one-liner should do what you need, unless you're stuck on Win32
    with no Cygwin-like toolkit.

    -tkc
     
    Tim Chase, Apr 30, 2010
    #3
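
For anyone without sed, the same find-and-substitute pass can be approximated in Python. The following is a rough sketch rather than a drop-in replacement: the function name `rewrite_tree` is made up for illustration, the files are assumed to be UTF-8, and unlike `sed -i.BAK` it only writes a .BAK copy for files that actually change.

```python
import os
import re
import shutil

# Same pattern as the sed expression, with the dots and '?' escaped.
LINK = re.compile(r'href="http://www\.mysite\.org/\?page=([^"]*)')


def rewrite_tree(root="."):
    """Rewrite links in every *.html file under root; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match, like find's -iname '*.html'.
            if not name.lower().endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # newline="" keeps the file's original line endings intact.
            with open(path, encoding="utf-8", newline="") as f:
                text = f.read()
            new_text = LINK.sub(r'href="pages/\1.htm', text)
            if new_text != text:
                shutil.copy2(path, path + ".BAK")  # keep the original
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(new_text)
                changed.append(path)
    return changed


if __name__ == "__main__":
    for path in rewrite_tree("."):
        print("rewrote", path)
```

As with the sed version, it is worth vetting the results before deleting the .BAK files.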
  4. Novocastrian_Nomad

    A single-line regex solution would be:

    re.sub(r'http://www\.mysite\.org/\?page=([^"]+)', r'pages/\1.htm', html)
     
    Novocastrian_Nomad, May 1, 2010
    #4
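
To see the one-line substitution in action, here is a small self-contained sketch using the sample links from the question. The wrapper `fix_links` and the `PATTERN` name are illustrative only; the dots and the question mark are escaped so they match literally.

```python
import re

# Dots and '?' are escaped so they match literally; the group captures the
# page name up to the closing quote.
PATTERN = re.compile(r'http://www\.mysite\.org/\?page=([^"]+)')


def fix_links(html):
    """Replace each absolute ?page=<name> URL with pages/<name>.htm."""
    return PATTERN.sub(r"pages/\1.htm", html)


if __name__ == "__main__":
    sample = (
        '<a href="http://www.mysite.org/?page=newsletter">'
        '<font color="#269BD5">\n'
        '<a href="http://www.mysite.org/?page=faq">'
    )
    print(fix_links(sample))
```

Because the pattern is not anchored to `href="`, it would also rewrite the same URL if it appeared in plain text or another attribute, which may or may not be what is wanted.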
