Pulling out data between <TD> tags using regular expressions

Discussion in 'Perl Misc' started by tdmailbox@yahoo.com, May 26, 2005.

  1. Guest

    If I had this tag and wanted to return 123 how would I do it? I have
    tried countless methods but cannot get only the 123 without the
    <TD> tags

    <TD class=tblform3 id=L_listing width=23>123</TD>

    After 3 hours I am giving up and asking the experts.
    , May 26, 2005
    #1

  2. wrote:
    > If I had this tag and wanted to return 123 how would I do it? I have
    > tried countless methods but cannot get only the 123 without the
    > <TD> tags
    >
    > <TD class=tblform3 id=L_listing width=23>123</TD>
    >
    > After 3 hours I am giving up and asking the experts.


    Did you study the applicable docs during those 3 hours?

    perldoc perlrequick
    perldoc perlretut
    perldoc perlre
    perldoc -f m
    perldoc perlop

    Or did you read this FAQ entry

    perldoc -q "remove HTML"

    which lets you know that you'd better think twice before attempting to
    use regexes for this task?

    If you have studied those documents, please post the code you have and
    somebody may be able to help you fix it.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 26, 2005
    #2

  3. writes:
    > If I had this tag and wanted to return 123 how would I do it? I have
    > tried countless methods but cannot get only the 123 without the
    > <TD> tags
    >
    > <TD class=tblform3 id=L_listing width=23>123</TD>
    >
    > After 3 hours I am giving up and asking the experts.


    If you'd asked your computer, you'd have had the answer much faster:

    perldoc -q HTML

    And the first returned result is:

    "How do I remove HTML from a string?"

    Which is exactly what you need. If you get in the habit of searching
    your local documentation first, then you'll get better answers faster,
    as you won't have to wait for an answer here, and also the people who
    can give you the best answers to your questions are tired of answering
    them all the time, which is why they wrote the FAQ in the first place!
    So if you ask FAQs here, then you will by definition only get the
    less-experienced people answering your questions, as a rule.

    But I'm feeling generous, and I'd been meaning to poke at
    HTML::Parser for a while anyhow. So I whipped up this little example:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::Parser ();

    sub start_handler
    {
        return if shift ne "td";
        my $self = shift;
        $self->handler(text => sub { print shift }, "dtext");
        $self->handler(end => sub { shift->eof if shift eq "td"; },
                       "tagname,self");
    }

    my $p = HTML::Parser->new(api_version => 3);
    $p->handler( start => \&start_handler, "tagname, self" );
    $p->parse( <<EODATA );
    <TD class=tblform3 id=L_listing width=23>123</TD>
    EODATA
    print "\n";
    __END__

    For future reference, if you have a problem, you're going to get the
    best results here if you can create an example of it that looks
    something like that-- short (I went to 21 lines, and that's about as
    big as I try to let them get), complete, and clearly state what is
    happening, and how that differs from what you wanted to happen.

    Also, note that the above example stops parsing after the first </TD>;
    if you are going to parse text containing multiple TD elements, you'll
    want to read the HTML::Parser documentation to find out better ways of
    doing that.
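
    One possible way to handle several TD elements is sketched below. It
    is not from the original post: the two-cell sample row, the @cells
    array and the $in_td flag are assumptions made for illustration. It
    keeps parsing to the end of the input and collects the text of every
    cell instead of stopping at the first </TD>.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser ();

    # Hypothetical input: the row from the question plus a second cell.
    my $html = '<TR><TD class=tblform3 id=L_listing width=23>123</TD>'
             . '<TD>456</TD></TR>';

    my @cells;        # one entry per TD element, in document order
    my $in_td = 0;    # true while the parser is inside a TD

    my $p = HTML::Parser->new(
        api_version => 3,
        # Tag names arrive lowercased, so 'td' also matches <TD>.
        start_h => [ sub { if ( $_[0] eq 'td' ) { $in_td = 1; push @cells, '' } },
                     'tagname' ],
        end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
        # Text can arrive in several chunks, so append rather than assign.
        text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
    );

    $p->parse($html);
    $p->eof;

    print "$_\n" for @cells;
    ```

    With the sample row this should print 123 and 456 on separate lines.
    Nested tables would need a depth counter instead of a simple flag.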

    -=Eric
    --
    Come to think of it, there are already a million monkeys on a million
    typewriters, and Usenet is NOTHING like Shakespeare.
    -- Blair Houghton.
    Eric Schwartz, May 26, 2005
    #3
  4. Guest

    ($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);
    , May 27, 2005
    #4
  5. Eric Schwartz wrote:
    >
    > use HTML::Parser ();
    >
    > sub start_handler
    > {
    >     return if shift ne "td";
    >     my $self = shift;
    >     $self->handler(text => sub { print shift }, "dtext");
    >     $self->handler(end => sub { shift->eof if shift eq "td"; },
    >                    "tagname,self");
    > }
    >
    > my $p = HTML::Parser->new(api_version => 3);
    > $p->handler( start => \&start_handler, "tagname, self" );
    > $p->parse( <<EODATA );
    > <TD class=tblform3 id=L_listing width=23>123</TD>
    > EODATA
    > print "\n";


    And this is a "simple-minded" way:

    print '<TD class=tblform3 id=L_listing width=23>123</TD>'
    =~ m{<td.*?>([^<]+)</td>}is, "\n";

    If I was to parse a whole HTML page, possibly with nested elements, and
    whose design I don't control, I wouldn't dream of using regular
    expressions. If, OTOH, the task actually is as simple as the literal
    question asked by the OP, I wouldn't dream of using a parsing module.

    Which way is most suitable depends reasonably on the complexity of the
    task together with how much you know about regular expressions.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 27, 2005
    #5
  6. writes:
    > ($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);


    Hrm.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $bunch_of_html = <<EOHTML;
    <td><img src='closetd.jpg' alt='image of </td>' /></td>
    EOHTML
    my ($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);
    print "result: [$result]\n";
    __END__

    gives:

    result: [<img src='closetd.jpg' alt='image of ]

    Parsing HTML with a regex is, ultimately, an exercise in futility.
    You can do it for one small subset, but as soon as you change it even
    a small amount, your solution can easily break. And then you tweak.
    And then it breaks again. It's easier to spend a little effort
    up-front with HTML::Parser or the like, than to constantly be fixing
    regex-based hacks.

    -=Eric
    --
    Come to think of it, there are already a million monkeys on a million
    typewriters, and Usenet is NOTHING like Shakespeare.
    -- Blair Houghton.
    Eric Schwartz, May 27, 2005
    #6
  7. John Bokma Guest

    wrote:

    > If I had this tag and wanted to return 123 how would I do it? I have
    > tried countless methods but cannot get only the 123 without the
    > <TD> tags
    >
    > <TD class=tblform3 id=L_listing width=23>123</TD>
    >
    > After 3 hours I am giving up and asking the experts.


    use strict;
    use warnings;

    use HTML::TreeBuilder;

    :
    :

    my $root = HTML::TreeBuilder->new_from_content( $content );

    my $td = $root->look_down(
        _tag => 'td', class => 'tblform3', id => 'L_listing'
    );

    defined $td or die "TD not found";

    print $td->as_text, "\n";


    (untested, assumes $content contains the HTML)

    see also:

    http://johnbokma.com/perl/phpbb-remote-backup.html
    http://johnbokma.com/perl/froogle-script.html

    --
    John
    Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, May 27, 2005
    #7
  8. Gunnar Hjalmarsson writes:
    > And this is a "simple-minded" way:
    >
    > print '<TD class=tblform3 id=L_listing width=23>123</TD>'
    > =~ m{<td.*?>([^<]+)</td>}is, "\n";


    Which, as you knew, fails if the <TD> has comments in it:

    $ perl -e 'print "<TD class=tblform3 id=L_listing width=23>\n123<!-- this is the item ID from the database -->\n</td>" =~ m{<td.*?>([^<]+)</td>}is, "\n";'

    $

    If there is content on both sides of the comment, only the
    post-comment parts get printed, but if the content is after the
    comment, it will do what it's supposed to. This is the sort of thing
    that causes me to lose sleep and pull out my hair before its time. I
    know you knew that, I'm just pointing out to the OP how fragile a
    regex-based solution can be. It may work now, in one place, but
    there's all sorts of things that could cause it to fail later, some of
    which can be very subtle.
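
    The three situations described above can be reproduced with a short
    script. This is only a sketch: the one-line sample strings are
    simplified stand-ins invented for illustration (no newlines, unlike
    the one-liner above), and %got is just a place to keep the results.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # The "simple-minded" pattern quoted above.
    my $re = qr{<td.*?>([^<]+)</td>}is;

    # Hypothetical inputs: a comment after, between, and before the text.
    my @cases = (
        [ 'text before comment' => '<td>123<!-- id --></td>' ],
        [ 'text on both sides'  => '<td>abc<!-- id -->def</td>' ],
        [ 'text after comment'  => '<td><!-- id -->123</td>' ],
    );

    my %got;
    for my $case (@cases) {
        my ( $label, $html ) = @$case;
        my ($text) = $html =~ $re;    # list context: $text is the capture
        $got{$label} = $text;
        printf "%-20s => %s\n", $label,
            defined $text ? "[$text]" : 'no match';
    }
    ```

    The first case does not match at all, the second captures only "def",
    and the third captures "123" as intended.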

    > Which way is most suitable depends reasonably on the complexity of the
    > task together with how much you know about regular expressions.


    Also the likelihood of your input changing-- a regex solution might be
    right at first, but can easily fail later-- as well as the intended
    scope of use. Subroutines have a way around here of quickly migrating
    out into general-use modules, where they are used by people in very
    different contexts from where they originated. What works for one
    particular task is likely to need serious changes if used for others.

    -=Eric
    --
    Come to think of it, there are already a million monkeys on a million
    typewriters, and Usenet is NOTHING like Shakespeare.
    -- Blair Houghton.
    Eric Schwartz, May 27, 2005
    #8
  9. Guest

    That's true if you are writing a web crawler, but most of the time the
    purpose for doing this is to strip spreadsheet-style data from a
    website you don't control and insert it into your own database. In that
    case the formatting of the target HTML is likely to be fairly
    consistent, and it's quicker for me to write that regex
    than to install and learn how to use HTML::Parser. Add that to the fact
    that your case example is silly.
    , May 27, 2005
    #9
  10. Eric Schwartz wrote:
    > I'm just pointing out to the OP how fragile a
    > regex-based solution can be.


    Agreed. You need to know that no comments will be inserted that way, and
    that there are no attributes containing '>' characters, etc., etc.
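
    To illustrate the '>' case, here is a small sketch. The title
    attribute and the variable names are made up for the example, and the
    second pattern is only one possible tightening, not a substitute for a
    parser.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical cell whose attribute value contains a '>' character.
    my $html = '<td title="price > 100">123</td>';

    # The simple pattern ends the tag at the first '>', inside the attribute.
    my ($naive) = $html =~ m{<td.*?>([^<]+)</td>}is;
    print "naive:   [$naive]\n";

    # Skipping over quoted attribute values copes with this particular
    # input, but it still knows nothing about comments or nested markup.
    my ($quoted) = $html =~ m{<td(?:[^>"']|"[^"]*"|'[^']*')*>([^<]+)</td>}is;
    print "quoted:  [$quoted]\n";
    ```

    The first capture starts in the middle of the attribute value; the
    second yields 123.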

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 27, 2005
    #10
  11. writes:
    > That's true if you are writing a web crawler, but most of the time the
    > purpose for doing this is to strip spreadsheet-style data from a
    > website you don't control and insert it into your own database. In that
    > case the formatting of the target HTML is likely to be fairly
    > consistent, and it's quicker for me to write that regex
    > than to install and learn how to use HTML::Parser. Add that to the fact
    > that your case example is silly.


    Please quote the messages you're replying to, at least enough so that
    we can tell what you're replying to. Guessing that you're replying to
    my reply to you, the fact that you don't control the HTML is exactly
    why you need something like HTML::Parser-- if you control the HTML,
    you can force it to always be produced so your regex can parse it. If
    you don't, though, the producer of that HTML can do all kinds of
    things to break your regex. Inserting comments in the middle of table
    data is only one of the most obvious ways a regex can break; see my
    reply to Gunnar's regex solution for more detail.

    -=Eric
    --
    Come to think of it, there are already a million monkeys on a million
    typewriters, and Usenet is NOTHING like Shakespeare.
    -- Blair Houghton.
    Eric Schwartz, May 27, 2005
    #11
  12. Guest

    <TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>

    That works.. however it returns the whole <TD> tag.. I just want the
    value inside the tag. That is my core issue that I can't find the
    solution to. I can find plenty of expressions that will find the right
    <TD> tag but not one that will just give me the data between the tags
    , May 27, 2005
    #12
  13. Guest

    "the fact that you don't control the HTML is exactly why you need
    something like HTML::Parser"

    You don't really know that the target html is as dirty as you assume.
    Unless the poster says he or she is writing a long-term use and robust
    data miner I'm assuming it's a one-off script where the html and data
    in question is uniform because this is most often the case.
    , May 27, 2005
    #13
  14. Scott Bryce Guest

    wrote:

    > You don't really know that the target html is as dirty as you assume.


    Maybe not dirty. Maybe just subject to change. I have been bit by that
    even using an HTML parser.

    > Unless the poster says he or she is writing a long-term use and robust
    > data miner I'm assuming it's a one-off script


    We don't know that, so the discussion has merit.
    Scott Bryce, May 27, 2005
    #14
  15. wrote:
    > <TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>
    >
    > That works.. however it returns the whole <TD> tag..


    No, it doesn't. It doesn't return anything.

    Have you read any of the replies in this thread??

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 27, 2005
    #15
  16. Guest

    Since L_listing is what makes the tag unique I took your code and
    modified it to
    <TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>

    and I get the right tag..

    However the issue is that I only want to return the data between the
    tag. The expression above includes the tag.
    <TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>


    Thanks in advance for any help on that.
    , May 27, 2005
    #16
  17. [ Please provide some context when replying!! Most people are not
    reading this group via Google Groups. ]

    wrote:
    > Gunnar Hjalmarsson wrote:
    >> And this is a "simple-minded" way:
    >>
    >> print '<TD class=tblform3 id=L_listing width=23>123</TD>'
    >> =~ m{<td.*?>([^<]+)</td>}is, "\n";

    >
    > Since L_listing is what makes the tag unique I took your code and
    > modified it to
    > <TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>
    >
    > and I get the right tag..
    >
    > However the issue is that I only want to return the data between the
    > tag. The expression above includes the tag.
    > <TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>


    Don't try to just explain in English what you are doing, but post a
    short but complete program that demonstrates the problem you are having.

    Also, have you read the description of the m// operator in "perldoc perlop"?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 27, 2005
    #17
  18. Guest

    You must be accessing the result of the match wrong because the match
    that is found between the ( ) will not include the entire td tag but
    it's possible that some other variable does. Try printing $1 after the
    match is supposed to occur and see if it prints the value you want to
    parse out.
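
    A minimal sketch of that check, using the sample row and the pattern
    posted earlier in the thread (the variable names $html and $price are
    assumptions for illustration):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Single quotes keep the dollar sign in the sample data literal.
    my $html = '<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>';

    # List context: the match returns the captured group, not the whole tag.
    my ($price) = $html =~ m{<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>};
    print "captured: $price\n";

    # Scalar context: the match returns true or false, and the capture is in $1.
    if ( $html =~ m{<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>} ) {
        print "via \$1:   $1\n";
        print "via \$&:   $&\n";    # the entire matched text, tag included
    }
    ```

    Both $price and $1 hold only $799,000. The whole tag shows up only in
    $&, which may be what is being printed by mistake.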
    , May 27, 2005
    #18
  19. Paul Guest

    wrote:
    > <TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>
    >
    > That works.. however it returns the whole <TD> tag.. I just want the
    > value inside the tag. That is my core issue that I can't find the
    > solution to. I can find plenty of expressions that will find the right
    > <TD> tag but not one that will just give me the data between the tags
    >

    Read up on HTML::TableExtract.

    Getting this sort of data using regex or similar is tricky and the page
    definition may change ( will change ).

    If the tables are not well structured you may have to search by depth
    and count to get the right table. You will have to come to grips with
    the structure of the data you are dealing with - the tables and the form.

    Start here
    "http://search.cpan.org/~msisk/HTML-TableExtract-1.08/lib/HTML/TableExtract.pm"

    Happy reading.
    Paul, May 27, 2005
    #19
  20. Joe Smith Guest

    wrote:
    > Since L_listing is what makes the tag unique I took your code and
    > modified it to
    > <TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>
    >
    > and I get the right tag..
    >
    > However the issue is that I only want to return the data between the
    > tag. The expression above includes the tag.


    No, it doesn't. You must not be using the regex in the proper manner.

    Hint: /(.*)/; m/(.*)/; m%(.*)%; m{(.*)}; m[(.*)]; m<(.*)>;

    -Joe
    Joe Smith, May 27, 2005
    #20