regular expressions problem

Discussion in 'Perl Misc' started by Shailesh Humbad, Dec 9, 2004.

  1. I want to parse the values from the second-to-last row in an html
    table.

    ....
    <tr class="odd">
    <td style="text-align: right;" nowrap="nowrap">99</td>
    <td style="text-align: right;" nowrap="nowrap">111</td>
    <td style="text-align: right;" nowrap="nowrap">52255</td>
    <td style="text-align: right;" nowrap="nowrap">333</td>
    <td style="text-align: right;" nowrap="nowrap">2323</td>
    </tr>
    <tr class="totals">
    ....

    I can identify the last row by the "totals" class. So I want the regex
    to work backward from there and get the values in each of the cells of
    the previous row. It should ignore all prior content and whitespace
    between tags. Can anyone help? Here is what I have so far:
    /([\s\S]*?)<tr class\=\"totals/
    Shailesh Humbad, Dec 9, 2004
    #1

  2. Keith Keller

    On 2004-12-09, Shailesh Humbad wrote:
    > I want to parse the values from the second-to-last row in an html
    > table.


    Have you looked at the various HTML parsers available on CPAN? Doing
    this with a regex is bound to cause problems. (I'm partial to
    HTML::TreeBuilder, myself, but I'm sure that others can make additional
    suggestions.)

    --keith

    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom
    Keith Keller, Dec 9, 2004
    #2

  3. Keith Keller wrote:
    > On 2004-12-09, Shailesh Humbad wrote:
    > > I want to parse the values from the second-to-last row in an html
    > > table.
    >
    > Have you looked at the various HTML parsers available on CPAN? Doing
    > this with a regex is bound to cause problems. (I'm partial to
    > HTML::TreeBuilder, myself, but I'm sure that others can make additional
    > suggestions.)
    >
    > --keith
    >
    > --
    > -francisco.ca.us
    > (try just my userid to email me)
    > AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom


    Trouble is, I am using regular expressions in a VBScript file, so I
    don't have any Perl support... Even then, the page is probably not
    valid HTML. I could use multiple regular expressions in steps. At
    least, is there a way to match from "<tr class=\"totals" to the
    immediately previous "<tr"? From there I could figure it out. Maybe
    I'll try searching within a reversed copy of the string.
    Shailesh Humbad, Dec 9, 2004
    #3
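One way to get from the "totals" row back to the immediately previous "<tr" without reversing the string is to put a greedy `.*` in front of the pattern and let backtracking do the walking. What follows is a minimal Perl sketch, not the poster's VBScript. The sample markup and variable names are assumptions made for illustration, and it only holds for markup as regular as the fragment quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the fragment in the question.
my $html = <<'HTML';
<tr class="even">
<td>433</td>
</tr>
<tr class="odd">
<td>99</td>
<td>111</td>
</tr>
<tr class="totals">
<td>122</td>
</tr>
HTML

# The leading greedy .* grabs as much as it can and then backtracks, so the
# captured row is the last complete <tr>...</tr> directly before "totals".
my ($row) = $html =~ m{ .* ( <tr\b .*? </tr> ) \s* <tr \s+ class="totals" }xs;

# Pull the cell contents out of that one row (empty list if nothing matched).
my @cells = defined($row) ? $row =~ m{ <td\b [^>]* > (.*?) </td> }gxs : ();

print join(',', @cells), "\n";
```

The same idea should carry over to engines without the /s and /x modifiers by writing `[\s\S]*` in place of `.*` and removing the whitespace from the pattern, though that has not been checked against VBScript here.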
  4. Shailesh Humbad wrote:
    > I want to parse the values from the second-to-last row in an html
    > table.
    >
    > ...
    > <tr class="odd">
    > <td style="text-align: right;" nowrap="nowrap">99</td>
    > <td style="text-align: right;" nowrap="nowrap">111</td>
    > <td style="text-align: right;" nowrap="nowrap">52255</td>
    > <td style="text-align: right;" nowrap="nowrap">333</td>
    > <td style="text-align: right;" nowrap="nowrap">2323</td>
    > </tr>
    > <tr class="totals">


    As has been mentioned here _very_ frequently, parsing HTML correctly using
    REs is insane. It hasn't even been proven whether the extended REs in Perl
    are powerful enough to do it (normal REs are definitely not sufficient!),
    let alone that a usable RE for it can be found.

    Use an HTML parser to parse HTML. There are several on CPAN.
    And please read the FAQ before asking frequently asked questions (perldoc -q
    "remove HTML").

    jue
    Jürgen Exner, Dec 9, 2004
    #4
  5. Shailesh Humbad wrote:

    > Trouble is, I am using regular expressions in a VBScript file, so I
    > don't have any Perl support...


    The VBScript group is down the hall on your left. Don't let the door hit
    you on the way out.

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
    Sherm Pendley, Dec 9, 2004
    #5
  6. Shailesh Humbad wrote:

    > I want to parse the values from the second-to-last row in an html
    > table.



    use HTML::TableExtract;


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Dec 9, 2004
    #6
  7. Scott Bryce

    Sherm Pendley wrote:

    > The VBScript group is down the hall on your left. Don't let the door hit
    > you on the way out.


    Which, when translated, means...

    Regular expressions in VBScript are different than regular expressions
    in Perl. Any help we give you may not carry over into VBScript. Asking
    in a Perl newsgroup about programming in VBScript is a waste of our time
    and yours.
    Scott Bryce, Dec 9, 2004
    #7
  8. Bill Karwin

    Shailesh Humbad wrote:
    > Trouble is, I am using regular expressions in a VBScript file, so I
    > don't have any Perl support... Even then, the page is probably not
    > valid HTML.


    There are XML & HTML parsers for Microsoft languages. You'll be much
    more successful using something like that than trying to create a custom
    regular expression. These types of problems tend to mutate, and very
    quickly any regular expression(s) you create will not be appropriate for
    the task. Better to use the right tool for the job.

    Here's an introduction to the Microsoft XML parser, which supports
    several languages including VBScript and Perl (see? on topic! ;-)

    http://www.w3schools.com/dom/dom_parser.asp

    Regards,
    Bill K.
    Bill Karwin, Dec 9, 2004
    #8
  9. Ask for regex help in a VBScript forum? C'mon. Besides, my OP didn't
    mention VBScript, but sought a regex solution. Anyway, I solved it on
    my own, and I present it here in Perl for those pedants who would
    rather complain about formalities than help someone.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $TestString = qq{
    <td style="text-align: right;" nowrap="nowrap">433</td>
    </tr>
    <tr class="odd">
    <td style="text-align: right;" nowrap="nowrap">99</td>
    <td style="text-align: right;" nowrap="nowrap">111</td>
    <td style="text-align: right;" nowrap="nowrap">52255</td>
    <td style="text-align: right;" nowrap="nowrap">333</td>
    <td style="text-align: right;" nowrap="nowrap">2323</td>
    </tr>
    <tr class="totals">
    <td style="text-align: right;" nowrap="nowrap">122</td>
    };

    # get the second-to-last row
    $TestString = reverse($TestString);
    $TestString =~ m/slatot\"=ssalc rt<\s*>rt\/<([\s\S]*?)>rt\/</gi;
    my $LastRow = reverse($1);
    print $LastRow."\n";

    # Get the columns in the second-to-last row
    $LastRow =~ m/\s*<tr[\s\S]*?<td[\s\S]*?>([\s\S]*?)<\/td>
    \s*<td[\s\S]*?>([\s\S]*?)<\/td>
    \s*<td[\s\S]*?>([\s\S]*?)<\/td>/gix;
    print $1."\n";
    print $2."\n";
    print $3."\n";
    # etc.
    Shailesh Humbad, Dec 10, 2004
    #9
  10. Uri Guttman

    >>>>> "SH" == Shailesh Humbad writes:

    SH> Ask for regex help in a VBScript forum? C'mon. Besides, my OP didn't
    SH> mention VBScript, but sought a regex solution. Anyway, I solved it on
    SH> my own, and I present it here in Perl for those pedants who would
    SH> rather complain about formalities than help someone.

    and i bet your regex solution isn't even compatible with vbscript's.

    SH> # get the second-to-last row
    SH> $TestString = reverse($TestString);
    SH> $TestString =~ m/slatot\"=ssalc rt<\s*>rt\/<([\s\S]*?)>rt\/</gi;
    ^^^^^^

    why do that? slow and for sure that is a perlish feature.
    why escape the "? it is not special in a regex.

    SH> $LastRow = reverse($1);
    SH> print $LastRow."\n";

    SH> # Get the columns in the second-to-last row
    SH> $LastRow =~ m/\s*<tr[\s\S]*?<td[\s\S]*?>([\s\S]*?)<\/td>
    SH> \s*<td[\s\S]*?>([\s\S]*?)<\/td>
    SH> \s*<td[\s\S]*?>([\s\S]*?)<\/td>/gix;

    and that is impossible to read. choose an alternate delimiter. use /x
    properly by breaking it up more and adding comments.
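
As a rough illustration of that advice, the column match could be written with brace delimiters and a commented /x layout. This is a sketch only. The sample row and variable names are assumptions, and it swaps the three repeated groups for a single pattern applied with /g in list context, so it returns however many cells the row contains.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical row, shaped like the second-to-last row in the thread.
my $last_row = <<'HTML';
<tr class="odd">
<td style="text-align: right;" nowrap="nowrap">99</td>
<td style="text-align: right;" nowrap="nowrap">111</td>
<td style="text-align: right;" nowrap="nowrap">52255</td>
</tr>
HTML

# In list context, /g returns the captured text of every match.
my @cells = $last_row =~ m{
    <td \b [^>]* >   # opening tag, with whatever attributes it carries
    \s* (.*?) \s*    # cell contents, surrounding whitespace trimmed
    </td>            # closing tag
}gisx;

print "$_\n" for @cells;
```

With m{...} as the delimiter, the slashes in the closing tags no longer need escaping.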

    so as a pedant, i say your solution is poor and not as useful as you
    claim it is. /x will almost surely be another perlish thing that other
    regexes don't have.

    so try again. see if you can keep up the high quality of your work while
    answering all the posts that are off topic. why don't you help the
    electrician track down his stalker too?

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Dec 10, 2004
    #10
  11. That code is a contrived and translated version of my actual
    (VBScript--actually windows scripting) code solely to show the solution
    here in the ng, so that it might help someone in the future. Last I
    checked, there is no newsgroup for regular expressions, so I thought
    this would be the closest thing.

    My question should really have been this. Is there a way, in Perl
    regular expressions, to search backward in a string after searching
    forward to a particular anchor point? In words, the algorithm would
    be:

    1. Search forward until you match 'b'.
    2. Then search backward until you match 'a'.
    3. Give me the contents between 'a' and 'b'.
    ('a' and 'b' are some pattern)

    So is there a regex way to do this?
    Shailesh Humbad, Dec 10, 2004
    #11
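The three steps above can be sketched without a regex at all, using Perl's built-in index and rindex. This is a minimal sketch under stated assumptions: `between_backward` is a hypothetical helper name, and both markers are treated as literal strings rather than patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text between the first $end marker and the
# closest $start marker before it, or undef if either marker is missing.
sub between_backward {
    my ($text, $start, $end) = @_;

    # 1. Search forward until the first occurrence of the end marker ('b').
    my $end_pos = index($text, $end);
    return if $end_pos < 0;

    # 2. Search backward from there for the start marker ('a'). The limit
    #    keeps the start marker from overlapping the end marker.
    my $limit = $end_pos - length($start);
    return if $limit < 0;
    my $start_pos = rindex($text, $start, $limit);
    return if $start_pos < 0;

    # 3. Return whatever lies between the two markers.
    my $from = $start_pos + length($start);
    return substr($text, $from, $end_pos - $from);
}

my $html  = '<tr>one</tr><tr>two</tr><tr class="totals">';
my $found = between_backward($html, '<tr>', '<tr class="totals"');
print defined($found) ? $found : '(no match)', "\n";
```

If the markers have to be patterns, a regex with a greedy prefix, or a negative lookahead that refuses to cross another 'a', is the usual substitute. The literal version is easier to reason about when the markers are fixed strings.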
  12. Anno Siegel

    Shailesh Humbad wrote in comp.lang.perl.misc:
    > That code is a contrived and translated version of my actual
    > (VBScript--actually windows scripting) code solely to show the solution
    > here in the ng, so that it might help someone in the future. Last I
    > checked, there is no newsgroup for regular expressions, so I thought
    > this would be the closest thing.
    >
    > My question should really have been this. Is there a way, in Perl
    > regular expressions, to search backward in a string after searching
    > forward to a particular anchor point? In words, the algorithm would
    > be:
    >
    > 1. Search forward until you match 'b'.
    > 2. Then search backward until you match 'a'.
    > 3. Give me the contents between 'a' and 'b'.
    > ('a' and 'b' are some pattern)
    >
    > So is there a regex way to do this?


    Why bother to ask this in a newsgroup full of "pedants who would rather
    complain about formalities than help someone"?

    Anno
    Anno Siegel, Dec 10, 2004
    #12