question on processing HTML with a regex

Discussion in 'Perl Misc' started by cuneyt, Dec 13, 2006.

  1. cuneyt

    cuneyt Guest

    Hi,

    I would like to process an HTML file in the form

    <tr>
    row1
    </tr>
    <tr>
    row2
    </tr>

    The snippet I wrote is

    while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
    my $r = $1;
    print "found: $r\n";
    }

    But when I run this I get
    found:
    row1
    </tr>
    <tr>
    row2

    How can I modify the regex so that it is not so greedy and pulls one
    <tr></tr> pair at a time?

    Thanks a lot

    Cuneyt
    cuneyt, Dec 13, 2006
    #1

  2. cuneyt

    Guest

    "cuneyt" <> wrote:

    > How can I modify the regex so that it is not so greedy and pulls one
    > <tr></tr> pair at a time?


    perldoc perlre, search for "greedy".

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Dec 13, 2006
    #2

  3. cuneyt

    Paul Lalli Guest

    cuneyt wrote:
    > Hi,
    >
    > I would like to process an HTML file in the form
    >
    > <tr>
    > row1
    > </tr>
    > <tr>
    > row2
    > </tr>
    >
    > The snippet I wrote is
    >
    > while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
    > my $r = $1;
    > print "found: $r\n";
    > }


    You should not, in general, be processing HTML with regular
    expressions. You should be using an HTML parser. There are many
    available on CPAN. I recommend HTML::TokeParser.
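
    A minimal sketch of what that could look like with HTML::TokeParser,
    assuming input in the form shown in the question. The sample string and
    the names $html_source, $parser and @rows are illustrative, not from the
    original post:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input in the form shown in the question (made up for illustration).
my $html_source = "<tr>\nrow1\n</tr>\n<tr>\nrow2\n</tr>\n";

# A reference to a scalar is parsed as the document text itself;
# a plain scalar would be treated as a file name.
my $parser = HTML::TokeParser->new( \$html_source )
    or die "Cannot create parser";

my @rows;
# get_tag('tr') skips ahead to the next <tr> start tag, or returns undef
# once the document is exhausted.
while ( $parser->get_tag('tr') ) {
    # Collect the text up to the closing </tr>, with whitespace trimmed.
    my $row_text = $parser->get_trimmed_text('/tr');
    push @rows, $row_text;
    print "found: $row_text\n";
}
```

    Because the parser walks the token stream, greediness never comes into
    it: each pass of the loop handles exactly one row.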

    > But when I run this I get
    > found:
    > row1
    > </tr>
    > <tr>
    > row2
    >
    > How can I modify the regex so that it is not so greedy and pulls one
    > <tr></tr> pair at a time?


    You need to read a decent tutorial on regular expressions, as this is a
    very basic question. The answer is that you need a ? after the *, but
    if you didn't already know that, you *really* need to read:
    perldoc perlretut
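
    For reference, a small sketch of the non-greedy version of the loop from
    the question. The sample string and the names $html_source and @rows are
    assumptions made for illustration:

```perl
use strict;
use warnings;

# Sample input in the form shown in the question (made up for illustration).
my $html_source = "<tr>\nrow1\n</tr>\n<tr>\nrow2\n</tr>\n";

my @rows;
# .*? is the non-greedy form of .* and stops at the first </tr> it can reach.
# /s lets . match newlines; /g resumes after the previous match on each pass.
while ( $html_source =~ m{<tr>(.*?)</tr>}sg ) {
    my $row = $1;
    push @rows, $row;
    print "found: $row\n";
}
```

    Each captured row still carries the newlines that surround the text
    inside the <tr> element.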

    Hope this helps,
    Paul Lalli
    Paul Lalli, Dec 13, 2006
    #3
  4. cuneyt

    John Bokma Guest

    "cuneyt" <> wrote:

    > while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {


    I recommend not using CamelCase in Perl; separate words with
    underscores instead, e.g. $html_source.

    > my $r = $1;


    What is $r? You might be asking yourself that question after a month or
    so.

    > print "found: $r\n";
    > }


    As for using a regexp, it's way too fragile in too many cases. Have a look at
    HTML::TreeBuilder:

    use strict;
    use warnings;

    use HTML::TreeBuilder;

    # $html_source is assumed to already hold the HTML text
    my $tree = HTML::TreeBuilder->new_from_content( $html_source );
    my @tr_elements = $tree->look_down( _tag => 'tr' );
    for my $tr_element ( @tr_elements ) {
        # ...
    }


    http://johnbokma.com/perl/

    has several HTML::TreeBuilder examples.
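
    As a rough sketch of one thing the loop body might do, the following
    pulls the text out of each row. The sample table (wrapped in <table>
    and <td> tags, unlike the bare rows in the question) and the name
    @row_texts are assumptions for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input; a complete table is assumed here.
my $html_source =
    '<table><tr><td>row1</td></tr><tr><td>row2</td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html_source);

my @row_texts;
for my $tr_element ( $tree->look_down( _tag => 'tr' ) ) {
    # as_trimmed_text returns the element's text content with
    # leading and trailing whitespace removed.
    my $text = $tr_element->as_trimmed_text;
    push @row_texts, $text;
    print "found: $text\n";
}

# Release the tree explicitly once it is no longer needed.
$tree->delete;
```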

    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
    John Bokma, Dec 13, 2006
    #4
  5. cuneyt

    cuneyt Guest

    I started to use HTML::TokeParser and what a relief! It makes parsing a
    breeze even for a regex newbie like me.

    Thanks for all the comments

    Cuneyt

    On Dec 13, 3:11 pm, "Paul Lalli" <> wrote:
    > cuneyt wrote:
    > > Hi,
    > >
    > > I would like to process an HTML file in the form
    > >
    > > <tr>
    > > row1
    > > </tr>
    > > <tr>
    > > row2
    > > </tr>
    > >
    > > The snippet I wrote is
    > >
    > > while( $HtmlSource =~ m{<tr>(.*)</tr>}sg ) {
    > > my $r = $1;
    > > print "found: $r\n";
    > > }
    >
    > You should not, in general, be processing HTML with regular
    > expressions. You should be using an HTML parser. There are many
    > available on CPAN. I recommend HTML::TokeParser.
    >
    > > But when I run this I get
    > > found:
    > > row1
    > > </tr>
    > > <tr>
    > > row2
    > >
    > > How can I modify the regex so that it is not so greedy and pulls one
    > > <tr></tr> pair at a time?
    >
    > You need to read a decent tutorial on regular expressions, as this is a
    > very basic question. The answer is that you need a ? after the *, but
    > if you didn't already know that, you *really* need to read:
    > perldoc perlretut
    >
    > Hope this helps,
    > Paul Lalli
    cuneyt, Dec 13, 2006
    #5
  6. Paul Lalli <> wrote:
    > cuneyt wrote:


    >> I would like to process an HTML file in the form
    >>
    >> <tr>
    >> row1
    >> </tr>
    >> <tr>
    >> row2
    >> </tr>


    > You should not, in general, be processing HTML with regular
    > expressions. You should be using an HTML parser. There are many
    > available on CPAN. I recommend HTML::TokeParser.



    And if you only need the data that is in tables,
    then HTML::TableExtract is Very Nice.
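
    A short sketch of how that might look, assuming a complete table with
    <td> cells. The sample string and the names $extractor and @first_cells
    are illustrative only:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Illustrative input; HTML::TableExtract works on real table markup,
# so the rows are wrapped in <table> and <td> tags here.
my $html_source =
    '<table><tr><td>row1</td></tr><tr><td>row2</td></tr></table>';

# With no constraints given, every table in the document is extracted.
my $extractor = HTML::TableExtract->new;
$extractor->parse($html_source);
$extractor->eof;

my @first_cells;
for my $table ( $extractor->tables ) {
    # Each row is an array reference holding the row's cell contents.
    for my $row ( $table->rows ) {
        my $cell = $row->[0];
        push @first_cells, $cell;
        print "found: $cell\n";
    }
}
```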


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 14, 2006
    #6
  7. [OT] Stomach hurts from ten continuous minutes of laughing (Was Re:question on processing HTML with a regex)

    On 12/16/2006 12:54 AM, wrote:
    > On Sat, 16 Dec 2006 02:56:26 GMT, wrote:
    >
    > And the funny thing is my friends the regulars here call this
    > CRAP code !!!
    >
    > HAHAHAHAAAAAAAAA
    >
    > God is here my friend.....
    >
    > robic0
    >


    Now my stomach hurts from ten continuous minutes of uncontrollable
    laughter.

    Robic0, you actually /believe/ your program is good code.

    I can't ... I can't ... ROTFLOL ....

    I can't imagine what's going on in the brain of someone who would think
    that.

    If I've gotten a hernia I'm sending you the bill. Don't worry, however;
    I won't be able to see your response to it. Buh bye.


    --

    http://home.earthlink.net/~mumia.w.18.spam/
    Mumia W. (on aioe), Dec 16, 2006
    #7