Replace text inside html tags?

Discussion in 'Perl Misc' started by squash@peoriadesignweb.com, Jan 30, 2005.

  1. Guest

    I want to be able to replace text inside html tags. I am using a regex to
    extract the text, but after I modify the text how can I re-assemble
    the html tag? Here is an example:

    <font size=1> HI </font>

    I need to replace HI with BYE and re-assemble the html tag like below:

    <font size=1> BYE </font>
    I checked perldoc -q html but could not find the answer there.

    Thx!
     
    , Jan 30, 2005
    #1

  2. wrote in news:1107118901.149776.208370@z14g2000cwz.googlegroups.com:

    > I want to be able to replace text inside html tags. I am using a regex to
    > extract the text, but after I modify the text how can I re-assemble
    > the html tag? Here is an example:
    >
    > <font size=1> HI </font>
    >
    > I need to replace HI with BYE and re-assemble the html tag like below:
    >
    > <font size=1> BYE </font>
    > I checked perldoc -q html but could not find the answer there.


    The answer to your question can be found in the answer to the FAQ.

    The most correct way (albeit not the fastest) is to use HTML::Parser
    from CPAN.

    ....

    Many folks attempt a simple-minded regular expression approach, like
    "s/<.*?>//g", but that fails in many cases because the tags may
    continue over line breaks, they may contain quoted angle-brackets,
    or HTML comments may be present. Plus, folks forget to convert
    entities--like "&lt;" for example.

    That is, you need to use an HTML parser to parse HTML.

    See CPAN for HTML parser modules.
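
    To make the FAQ's warning concrete, here is a minimal sketch of the
    quoted-angle-bracket failure. The input string is a made-up example,
    not something from the original question; it uses only core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: the alt attribute contains a quoted '>'.
my $html = '<img src="arrow.png" alt="a > b"> HI';

# The simple-minded tag stripper quoted from the FAQ.
(my $stripped = $html) =~ s/<.*?>//g;

# The match stops at the '>' inside the quotes, so the tail of the
# attribute leaks into what is supposed to be plain text.
print "[$stripped]\n";    # prints: [ b"> HI]
```

    The regex has no notion of attribute quoting, so it treats the first
    '>' it sees as the end of the tag.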

    I had never used HTML::TokeParser::Simple, so I gave that a shot:

    #! /usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    my $html = <<HTML;
    <font><!--
    <font> HI
    </font>
    -->
    HI
    </font>
    HTML

    my $p = HTML::TokeParser::Simple->new(string => $html);

    my $in_font_tag;

    while(my $token = $p->get_token) {
        if($token->is_start_tag('font')) {
            print $token->as_is;
            $in_font_tag = 1;
            next;
        }
        if($token->is_end_tag('font')) {
            print $token->as_is;
            $in_font_tag = 0;
            next;
        }
        if($in_font_tag and $token->is_text) {
            my $text = $token->as_is;
            $text =~ s/HI/BYE/g;
            print $text;
            next;
        }
        print $token->as_is;
    }

    __END__

    C:\Dload> h
    <font><!--
    <font> HI
    </font>
    -->
    BYE
    </font>

    Seems to work.

    Sinan.
     
    A. Sinan Unur, Jan 30, 2005
    #2

  3. wrote:
    > I want to be able to replace text inside html tags. I am using a regex to
    > extract the text, but after I modify the text how can I re-assemble
    > the html tag? Here is an example:
    >
    > <font size=1> HI </font>
    >
    > I need to replace HI with BYE and re-assemble the html tag like below:
    >
    > <font size=1> BYE </font>


    Depending on the complexity of the document, the s/// operator may be
    sufficient.

    > I checked perldoc -q html but could not find the answer there.


    Then you should have seen, for instance,

    perldoc -q "remove HTML"

    and other entries in perlfaq9 which warn against trying to parse HTML
    documents with regular expressions, and recommend the use of a suitable
    module for HTML parsing.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jan 30, 2005
    #3
  4. wrote:

    > I want to be able to replace text inside html tags. I am using a regex to
    > extract the text, but after I modify the text how can I re-assemble
    > the html tag? Here is an example:
    >
    > <font size=1> HI </font>
    >
    > I need to replace HI with BYE and re-assemble the html tag like below:
    >
    > <font size=1> BYE </font>


    Others have suggested using a parser module - and they're right. That should
    always be your first instinct when working with HTML. However, there are
    some scenarios where a regex is good enough, and faster to write than a
    parser-based solution. For example, if the task at hand is a very simple
    search-and-replace across a number of pages where you know a given pattern
    will match. Or you're fixing pages that are broken beyond a parser's
    ability to cope with them.

    With that in mind, have a look at "perldoc perlretut", paying special
    attention to the section titled "Extracting matches". You can use
    "backreferences" in your regex to use parts of the matched string in the
    replacement, like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

    $html =~ s%(<font size=1>)(.*?)(</font>)%$1 BYE $3%g;

    print $html, "\n";

    Aside from subexpressions and backreferences, another point of note is the
    "non-greedy" quantifier "*?". Without it - i.e. written as "*" - the second
    expression would be "greedy", meaning it would return the longest possible
    string that matches the expression it modifies. In the example above, that
    would mean replacing everything between the first '<font size=1>' and the
    *second* '</font>'. (Try it!)

    That's not what you want - you want the *shortest* string that matches the
    expression, not the longest. That's what the "non-greedy" quantifier gives
    you.
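
    For anyone who would rather see it than try it, here is a small sketch
    that runs both forms side by side on the same sample string. The
    variable names $lazy and $greedy are just labels chosen for this
    illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

# Non-greedy: each <font>...</font> pair is matched on its own.
(my $lazy = $html) =~ s%(<font size=1>).*?(</font>)%$1 BYE $2%g;

# Greedy: .* runs on to the last </font>, so both elements are
# swallowed by a single match.
(my $greedy = $html) =~ s%(<font size=1>).*(</font>)%$1 BYE $2%g;

print "non-greedy: $lazy\n";
# prints: non-greedy: <font size=1> BYE </font><font size=1> BYE </font>
print "greedy:     $greedy\n";
# prints: greedy:     <font size=1> BYE </font>
```

    The greedy version silently drops the second element, which is the
    kind of bug that is easy to miss on a quick test with one tag.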

    Just to restate it - regexes are generally *not* the best way to parse HTML,
    particularly arbitrary HTML that's fetched from a web site that's beyond
    your control. But using them *can* be useful if the task at hand is extremely
    limited, or if the HTML is broken beyond a parser's ability to handle it.

    References:

    perldoc perlretut
    perldoc perlre

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
     
    Sherm Pendley, Jan 30, 2005
    #4
  5. Bart Lateur Guest

    A. Sinan Unur wrote:

    >I had never used HTML::TokeParser::Simple, so I gave that a shot:


    >my $p = HTML::TokeParser::Simple->new(string => $html);
    >
    >my $in_font_tag;
    >
    >while(my $token = $p->get_token) {
    >    if($token->is_start_tag('font')) {
    >        print $token->as_is;
    >        $in_font_tag = 1;
    >        next;
    >    }
    >    if($token->is_end_tag('font')) {
    >        print $token->as_is;
    >        $in_font_tag = 0;
    >        next;
    >    }
    >    if($in_font_tag and $token->is_text) {
    >        my $text = $token->as_is;
    >        $text =~ s/HI/BYE/g;
    >        print $text;
    >        next;
    >    }
    >    print $token->as_is;
    >}


    I like to use ".." in code with this kind of functionality. This shows
    IMO an aspect where a tokeparser approach is vastly superior to raw
    usage of HTML::Parser.

    while(my $token = $p->get_token) {
        if($token->is_start_tag('font') .. $token->is_end_tag('font')) {
            if($token->is_text) {
                my $text = $token->as_is;
                $text =~ s/HI/BYE/g;
                print $text;
                next;
            }
        }
        print $token->as_is;
    }
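
    For readers who have not met ".." in scalar context: it acts as a
    flip-flop that remembers its state between evaluations. The sketch
    below shows the same idea without a parser, using a plain list of
    strings as a hypothetical stand-in for the token stream.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a token stream; a real script would pull
# tokens from the parser instead.
my @tokens = ('HI', '<font>', 'HI', '</font>', 'HI');

my @out;
for my $tok (@tokens) {
    my $copy = $tok;
    # In scalar context ".." turns true when the left test matches and
    # stays true up to and including the token where the right test matches.
    if ($tok eq '<font>' .. $tok eq '</font>') {
        # Only rewrite plain text, not the tags themselves.
        $copy =~ s/HI/BYE/g unless $copy =~ /^</;
    }
    push @out, $copy;
}

print join(' ', @out), "\n";    # prints: HI <font> BYE </font> HI
```

    One thing to keep in mind is that a flip-flop holds a single on/off
    state, so with nested font elements the range would end at the first
    closing tag.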


    --
    Bart.
     
    Bart Lateur, Jan 31, 2005
    #5
