regex multi-line match/replace issue

Discussion in 'Perl Misc' started by seven.reeds, Apr 24, 2006.

  1. seven.reeds

    seven.reeds Guest

    Hi,

    I'm running perl v5.8.7

    I have a series of files with HTML tags in them. I am NOT trying to
    strip the tags; I am, however, trying to list the "link phrases"
    associated with all of the "<a href=...>link phrase</a>" sequences in
    each file. I have a script that does what I want. I just need it to
    be improved a bit, and that's why I am here.

    the code so far is:

    use strict;
    select(STDIN);
    $|++;

    my $sep = $/;
    undef $/;
    my $text = <>;
    $/ = $sep;
    my $tmp = "";

    while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
    {
        $text = $';
        if ($text =~ /<\s*\/A\s*>/is)
        {
            $tmp = $`;
            #$tmp =~ s/^\s+/ /sg;
            #$tmp =~ s/\s+$/ /sg;
            #$tmp =~ s/\s+/ /sg;
            print STDOUT ">>>$tmp<<<\n";
            $text = $';
        }
    }

    So the "while" looks to see if there is a starting "<A" tag. If there
    is, then I reset the text to the portion following the
    initial match with "$text = $';". Next, I look for a closing "</a>" tag
    and stash the pre-match portion in "$tmp".

    Ignore the commented-out lines for a second. Then I print out $tmp
    and "increment" the file-string past the closing A tag.

    Again, this works. It is spitting out the text I expect. But now we
    come to the commented-out lines.

    I am trying to pretty up the text I find by stripping off
    leading/trailing whitespace and compressing internal whitespace.
    Except that bit isn't working.

    Any ideas?
    seven.reeds, Apr 24, 2006
    #1

  2. "seven.reeds" wrote:

    > I have a series of files with HTML tags in them. I am NOT trying to
    > strip the tags; I am, however, trying to list the "link phrases"
    > associated with all of the "<a href=...>link phrase</a>" sequences in
    > each file. I have a script that does what I want.


    You should use an HTML parser to parse HTML.

    > use strict;


    use warnings;

    > select(STDIN);
    > $|++;


    $| = 1;

    > my $sep = $/;
    > undef $/;
    > my $text = <>;
    > $/ = $sep;


    Aaargh!

    my $text = do { local $/; <> };

    Actually, I would just use File::Slurp;

    > my $tmp = "";
    >
    > while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
    > {
    > $text = $';
    > if ($text =~ /<\s*\/A\s*>/is)
    > {
    > $tmp = $`;
    > #$tmp =~ s/^\s+/ /sg;
    > #$tmp =~ s/\s+$/ /sg;
    > #$tmp =~ s/\s+/ /sg;
    > print STDOUT ">>>$tmp<<<\n";
    > $text = $';
    > }
    > }


    ....

    > I am trying to pretty up the text I find by stripping off
    > leading/trailing whitespace and compressing internal whitespace.
    > Except that bit isn't working.


    As I said, use an HTML parser to parse HTML.

    Anyway, no need to reinvent the wheel. You can adapt:

    http://search.cpan.org/src/GAAS/HTML-Parser-3.51/eg/hanchors

    Sinan
    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Apr 24, 2006
    #2

  3. seven.reeds

    seven.reeds Guest

    Thanks

    The anchors script is largely what i am looking for.

    all the best
    seven.reeds, Apr 24, 2006
    #3
  4. seven.reeds <> wrote:

    > I have a series of files with HTML tags in them. I am NOT trying to
    > strip the tags



    Nonetheless, the primary point in the "How do I remove HTML from a string?"
    FAQ answer is: don't use regular expressions for this.


    > I am however trying to list the "link phrases"
    > associated with all of the "<a href=...>link phrase</a>" sequences in
    > each file.



    I would recommend using a module that already does that for you, such as:

    http://search.cpan.org/~bdfoy/HTML-SimpleLinkExtor-1.12/SimpleLinkExtor.pm


    > I have a script that does what I want.



    I think it only "appears" to do what you want.

    You just haven't tried it with a test case that trips it up yet.


    > I just need it to
    > be improved a bit



    It is a dirty hack.

    If proper operation is of importance, then it needs to be thrown
    away and replaced with something more robust.


    > and that's why I am here.



    OK. So let's patch it up anyway, just as a "learning exercise".


    > my $sep = $/;
    > undef $/;
    > my $text = <>;
    > $/ = $sep;



    Let Perl do the save-and-restore for you. This does the same thing:

    my $text;
    {   local $/;   # a naked block creates a scope
        $text = <>;
    }
    # $/ has been restored to its previous value here


    Or, probably even better:

    my $text = do { local $/; <> };
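
    For anyone who wants to try the idiom without a file on disk, here is a
    minimal sketch. It assumes an in-memory filehandle in place of the
    original <>, and the sample data and variable names are made up for
    illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the contents of a file.
my $data = "first line\nsecond line\n";

# Open a read handle on the in-memory string (a core Perl feature).
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

# local $/ leaves $/ undefined only inside the do block, so the single
# readline call returns everything up to end of file.
my $text = do { local $/; <$fh> };
close $fh;

# Outside the block $/ is back to its default, a newline.
my $restored = defined $/ && $/ eq "\n";

print length($text), " characters read\n";
print "separator restored\n" if $restored;
```

    The do block returns the value of its last expression, which is why the
    whole slurp fits in one statement.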


    > my $tmp = "";
    >
    > while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)

                        ^^^

    HTML does not allow whitespace between the "<" and the tag name, so
    your pattern should not allow it there either.

    The m//s modifier changes the meaning of dot, it is useless when
    your pattern contains no dot.
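
    A small sketch of that point, using a made-up two-line string: only the
    dot cares about /s, while \s and a negated character class match a
    newline either way.

```perl
use strict;
use warnings;

my $str = "a\nb";   # hypothetical sample with an embedded newline

my $dot_plain = $str =~ /a.b/     ? 1 : 0;   # dot refuses to match the newline
my $dot_s     = $str =~ /a.b/s    ? 1 : 0;   # with /s, dot matches the newline too
my $ws_plain  = $str =~ /a\s+b/   ? 1 : 0;   # \s matches a newline without /s
my $neg_plain = $str =~ /a[^>]+b/ ? 1 : 0;   # so does a negated character class

print "dot without /s: $dot_plain\n";
print "dot with /s:    $dot_s\n";
print "\\s+ without /s: $ws_plain\n";
print "[^>]+ without /s: $neg_plain\n";
```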


    > {
    > $text = $';
    > if ($text =~ /<\s*\/A\s*>/is)



    No unallowed spaces, no "s" modifier, as above.

    If you choose an alternate delimiter for your m//, then you
    won't have to backslash slashes:

    if ($text =~ m#</A\s*>#i)


    > {
    > $tmp = $`;
    > #$tmp =~ s/^\s+/ /sg;
    > #$tmp =~ s/\s+$/ /sg;
    > #$tmp =~ s/\s+/ /sg;
    > print STDOUT ">>>$tmp<<<\n";
    > $text = $';
    > }
    > }



    Try your code with these:

    <a name="perl" href="http://www.perl.org">Perl Mongers</a>

    <a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>

    <!--
    <a href="not_a_link.com">Don't report me as a link!</a>
    -->
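
    To see what happens, here is a sketch that wraps the loop from the
    original post in a function and feeds it the three samples. The function
    name and the array names are hypothetical; the two patterns are the ones
    posted, minus the /s modifier, which has no effect on them.

```perl
use strict;
use warnings;

# Mirrors the loop from the original post, but collects the phrases
# in a list instead of printing them.
sub link_phrases_regex {
    my ($text) = @_;
    my @found;
    while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/i) {
        $text = $';
        if ($text =~ /<\s*\/A\s*>/i) {
            push @found, $`;
            $text = $';
        }
    }
    return @found;
}

my @name_first = link_phrases_regex('<a name="perl" href="http://www.perl.org">Perl Mongers</a>');
my @gt_in_attr = link_phrases_regex('<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>');
my @commented  = link_phrases_regex(qq{<!--\n<a href="not_a_link.com">Don't report me as a link!</a>\n-->});

print "name first: ", scalar(@name_first), " phrase(s) found\n";
print "'>' inside an attribute: $gt_in_attr[0]\n";
print "commented out: $commented[0]\n";
```

    The first link is missed because HREF is not the first attribute, the
    second phrase picks up part of the name attribute, and the commented-out
    link is reported as if it were live.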


    > any ideas?



    Start over (with a module).
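
    As one possible starting point, here is a sketch using HTML::TokeParser,
    which ships with the HTML-Parser distribution on CPAN (so it has to be
    installed; it is not a core module). This is not the hanchors script
    mentioned earlier in the thread, just a short illustration with a made-up
    sample document.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution on CPAN

# Hypothetical sample input: one real link and one inside a comment.
my $html = <<'HTML';
<a name="perl" href="http://www.perl.org">Perl
   Mongers</a>
<!--
<a href="not_a_link.com">Don't report me as a link!</a>
-->
HTML

# Passing a reference to a scalar makes the parser read from the string.
my $parser = HTML::TokeParser->new(\$html)
    or die "cannot create parser";

my @phrases;
while (my $tag = $parser->get_tag('a')) {
    next unless defined $tag->[1]{href};             # skip anchors without href
    push @phrases, $parser->get_trimmed_text('/a');  # text up to the closing tag
}

print ">>>$_<<<\n" for @phrases;
```

    get_trimmed_text strips leading and trailing whitespace and collapses
    internal runs to a single space, which covers the "pretty up" step as
    well.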


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Apr 24, 2006
    #4
  5. DJ Stunks

    DJ Stunks Guest

    Tad McClellan wrote:
    > seven.reeds <> wrote:
    >
    > > I am however trying to list the "link phrases"
    > > associated with all of the "<a href=...>link phrase</a>" sequences in
    > > each file.

    >
    >
    > I would recommend using a module that already does that for you, such as:
    >
    > http://search.cpan.org/~bdfoy/HTML-SimpleLinkExtor-1.12/SimpleLinkExtor.pm


    I don't believe HTML::LinkExtor (upon which HTML::SimpleLinkExtor is
    built) extracts the link text, only the link itself.

    -jp
    DJ Stunks, Apr 25, 2006
    #5
  6. Lukas Mai

    Lukas Mai Guest

    seven.reeds <> schrob:
    >
    > the code so far is:
    >
    > use strict;
    > select(STDIN);


    The other posters seem to have missed this.
    select() changes the current _output_ filehandle. I have no idea what
    you're trying to achieve by selecting STDIN.

    > $|++;


    $| changes the behavior of print. This line has no effect as you don't
    print to STDIN.
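
    For comparison, a sketch of the usual idiom when the goal is to unbuffer
    a particular output handle. STDERR is used here only as an example
    target, and $previous is a made-up variable name.

```perl
use strict;
use warnings;

# select() returns the name of the previously selected output handle.
my $previous = select(STDERR);   # STDERR is now the default handle for print
$| = 1;                          # autoflush applies to the selected handle, STDERR
select($previous);               # put the default output handle back

print "default output handle was $previous\n";
```

    If the aim is only to unbuffer STDOUT, a plain "$| = 1;" with no select()
    call is enough, since STDOUT is selected by default.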

    > my $sep = $/;
    > undef $/;
    > my $text = <>;
    > $/ = $sep;


    Eww, use File::Slurp or local $/ here.

    > my $tmp = "";
    >
    > while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)

                                               ^
    This /s has no effect. Why did you put it there?

    > {
    > $text = $';
    > if ($text =~ /<\s*\/A\s*>/is)

                                 ^
    This /s has no effect. Why did you put it there?
    > {

    [snip]

    MJD's Good Advice #11924 comes to mind.

    Lukas
    --
    fflush(stdin) is wrong, too.
    Lukas Mai, Apr 25, 2006
    #6
  7. Anno Siegel

    Anno Siegel Guest

    Tad McClellan <> wrote in comp.lang.perl.misc:
    > seven.reeds <> wrote:


    [good advice snipped]

    > > {
    > > $tmp = $`;
    > > #$tmp =~ s/^\s+/ /sg;
    > > #$tmp =~ s/\s+$/ /sg;
    > > #$tmp =~ s/\s+/ /sg;
    > > print STDOUT ">>>$tmp<<<\n";
    > > $text = $';
    > > }
    > > }


    Apart from everything else, uncommenting the commented substitutions will
    change what $' contains at the end of the block. "$text = $'" should
    come before any additional matches. Also, the commented s/// do not
    strip leading and trailing white space but reduce them to a single blank.
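
    Both points can be sketched as follows, assuming a made-up fragment of
    input and hypothetical variable names: copy $` and $' first, then
    substitute with an empty replacement at the ends and a single blank in
    the middle.

```perl
use strict;
use warnings;

# Hypothetical input: a link phrase followed by the closing tag and more text.
my $text = "  Perl\n   Mongers \n</a> rest of the file";

my ($phrase, $rest);
if ($text =~ m#</A\s*>#i) {
    ($phrase, $rest) = ($`, $');   # copy both before any further match runs
    $phrase =~ s/^\s+//;           # strip leading whitespace (empty replacement)
    $phrase =~ s/\s+$//;           # strip trailing whitespace
    $phrase =~ s/\s+/ /g;          # compress each internal run to one blank
    $text = $rest;                 # safe: $rest was saved before the s/// calls
}

print ">>>$phrase<<<\n";
```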

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
    Anno Siegel, Apr 25, 2006
    #7