Re: Regular expression to find <tr> tags in 2nd level HTML tables

Discussion in 'Perl' started by Shannon Jacobs, Jan 9, 2004.

  1. Brian Genisio wrote:
    > Shannon Jacobs wrote:
    >

    <snip>
    > Take a look at the TidyLib. It is a C library that will parse HTML for
    > you, in DOM-Like nodes, which you can traverse like a tree. It was
    > originally developed via the W3C, but it is available via SourceForge

    <snip>
    > Using a RegExp will break as soon as the HTML format changes, but a
    > smart tree traversal will likely be more robust.
    >
    > If you go the TidyLib method, you can manipulate the data quickly, and
    > easily develop your palm database via C routines.


    From your description, this doesn't really sound like an approach I
    want to take. It's not a matter of simple access, but pruning
    manipulation. If I really wanted to follow this approach, the most
    bankable-for-use-in-the-real-office approach would be the Excel macro
    programming approach I mentioned. However, anytime anyone mentions
    Microsoft or Visual <anything> I feel like I want to hold up a silver
    cross and scream "Return to Hades, you evil demons!"

    However, due to your hint and another source, I thought to explore the
    DOM tree to get a better understanding of the problem. Mozilla has a
    DOM explorer that was quite good for this, and I can clarify the
    problem now. Here is a reduction of the situation:

    <table>
      <tr>
      <tr>
        <table>
          <tr>
          <tr>
          ....
          <tr>
      <tr>
      <tr>
      <tr>
      <tr>
      <tr>
      ...
      <tr>

    In the outermost table, there is some useful data worth saving in the
    first <tr> row. In the 2nd level table, there is some useful data,
    mostly numbers, in each of those <tr> rows. Returning to the outer
    table, the 7th <tr> row also contains some information that would be
    worth saving. That's the legend I mentioned in the earlier post, but
    which I still feel would be too difficult to parse in a robust way.
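
    For comparison, the tree-style approach suggested earlier in the thread
    could look roughly like the sketch below. It uses Python's standard
    html.parser instead of TidyLib, and the class name SecondLevelRows and the
    sample markup are made up for illustration. The idea is only to count open
    <table> tags and keep the text of rows seen at depth 2.

```python
from html.parser import HTMLParser


class SecondLevelRows(HTMLParser):
    """Collect the text of every <tr> that sits in a table nested one level deep."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <table> elements currently open
        self.rows = []        # one list of text fragments per second-level row
        self.in_row = False   # True while inside a second-level row

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.in_row = False
        elif tag == "tr":
            # A new <tr> implicitly ends the previous one, so unclosed rows are fine.
            self.in_row = self.depth == 2
            if self.in_row:
                self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
            self.in_row = False
        elif tag == "tr":
            self.in_row = False

    def handle_data(self, data):
        if self.in_row and data.strip():
            self.rows[-1].append(data.strip())


# Hypothetical input shaped like the reduction above (unclosed <tr> and <td>).
SAMPLE = """
<table>
<tr><td>Header
<tr><td>
  <table>
  <tr><td>1<td>2
  <tr><td>3<td>4
  </table>
<tr><td>Legend
</table>
"""

parser = SecondLevelRows()
parser.feed(SAMPLE)
parser.close()
rows = [" ".join(cells) for cells in parser.rows]
print(rows)  # ['1 2', '3 4']
```

    Each entry in rows corresponds to one second-level <tr>, so the
    significant line breaks survive as list boundaries.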

    The rest of it is basically dross, and my current regexes toss it away
    quite nicely. The main problem is that the line breaks associated with
    those second level <tr> tags are useful and significant, and I want to
    keep them.

    There seem to be two regex-based approaches that are possible. One is
    to use one regex to mark them in a way that prevents them from being
    tossed, and then restore them at the end after the other line
    breaks have been removed, basically with the reverse regex. I'm
    already doing that with some other information that needs to be
    preserved.
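
    A minimal sketch of this mark-and-restore idea, written in Python rather
    than Perl or JavaScript. It assumes the wanted line breaks sit directly in
    front of the <tr> tags, that line breaks are plain "\n", and that the NUL
    character never occurs in the input. The function name and sample are
    hypothetical, and the depth counter is one way to decide which breaks to
    mark, not necessarily the one meant above.

```python
import re

MARK = "\x00"  # sentinel assumed never to occur in the input

# An optional line break followed by an opening or closing <table> or <tr> tag.
TAG = re.compile(r"\n?<(/?)(table|tr)\b[^>]*>", re.IGNORECASE)


def keep_second_level_breaks(html):
    depth = 0  # number of <table> elements currently open

    def mark(match):
        nonlocal depth
        closing, name = match.group(1), match.group(2).lower()
        text = match.group(0)
        if name == "table":
            depth += -1 if closing else 1
        elif not closing and depth == 2 and text.startswith("\n"):
            # Swap the break in front of a second-level <tr> for the sentinel.
            return MARK + text[1:]
        return text

    marked = TAG.sub(mark, html)      # step 1: protect the wanted breaks
    flat = marked.replace("\n", "")   # step 2: toss every other line break
    return flat.replace(MARK, "\n")   # step 3: restore the protected ones


sample = (
    "<table>\n<tr><td>Head\n<tr><td>\n<table>\n<tr><td>1\n<tr><td>2\n"
    "</table>\n<tr><td>Legend\n</table>\n"
)
print(keep_second_level_breaks(sample))
```

    With this sample, only the two breaks in front of the inner rows remain.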

    The other approach would be to just save the immediately preceding
    line breaks while tossing all the others. I think I favor this
    approach because it strikes me as most elegant and in keeping with the
    spirit of the great regex of the heading of 137 degrees. ;-) A related
    approach to this one would be to toss all the line breaks at the
    beginning, and then insert the correct ones before throwing the other
    dross.
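
    The "toss everything, then re-insert" variant can be sketched with a single
    guarded pattern, again in Python. It rests on a loud assumption: the
    second-level tables are the innermost ones (no third level) and every table
    that contains no other table is one of them. The names below are
    illustrative only.

```python
import re

# Matches a <table>...</table> span that contains no other <table> tag,
# i.e. an innermost table. The (?:(?!<table\b).)*? part is the guard.
INNERMOST_TABLE = re.compile(
    r"<table\b[^>]*>(?:(?!<table\b).)*?</table>",
    re.IGNORECASE | re.DOTALL,
)
ROW_TAG = re.compile(r"<tr\b", re.IGNORECASE)


def break_before_inner_rows(html):
    flat = html.replace("\n", "")  # toss every line break first

    def add_breaks(table):
        # Put one break back in front of each <tr> inside the matched table.
        return ROW_TAG.sub(lambda tr: "\n" + tr.group(0), table.group(0))

    return INNERMOST_TABLE.sub(add_breaks, flat)


sample = (
    "<table>\n<tr><td>Head\n<tr><td>\n<table>\n<tr><td>1\n<tr><td>2\n"
    "</table>\n<tr><td>Legend\n</table>\n"
)
print(break_before_inner_rows(sample))
```

    If the inner table ever gains a nested table of its own, the guard stops
    matching it, which is exactly the kind of fragility raised elsewhere in
    the thread.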

    I actually found a rather similar recent thread in the comp.lang.perl
    newsgroup, so I've cross-posted to that newsgroup, too. That involved
    using

    s/<[^>]*>//g;

    to remove all of the HTML tags, but I need to be more selective.
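
    To make "more selective" concrete, here is a rough Python counterpart of
    that substitution next to a variant that spares the table and row tags via
    a negative lookahead. The sample string is invented, and which tags to
    spare is an assumption about what is needed.

```python
import re

# Counterpart of the Perl s/<[^>]*>//g: remove anything that looks like a tag.
STRIP_ALL = re.compile(r"<[^>]*>")
# Same idea, but the lookahead spares <table>, </table>, <tr> and </tr>.
STRIP_MOST = re.compile(r"<(?!/?(?:table|tr)\b)[^>]*>", re.IGNORECASE)

sample = '<tr><td class="n"><b>42</b></td></tr>'
all_gone = STRIP_ALL.sub("", sample)
rows_kept = STRIP_MOST.sub("", sample)

print(all_gone)   # 42
print(rows_kept)  # <tr>42</tr>
```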

    I also wanted to include a response to the other reply, snide though
    it was.

    His first snide question was "Why?", in response to my preference for
    a regex-based solution. I've already mostly answered that question,
    but I'll add that I think regex-based solutions can be quite elegant,
    and apparently I sometimes like having my head bent through the regex
    dimension.

    He then recommended using a HTML parsing module and suggested asking
    in a JavaScript newsgroup. In the original post I had already
    explained why I wanted this direct approach, and I had already asked
    in the JavaScript newsgroup with the original cross-post. I suspect
    him of being a wannabe Perler, since real Perl people tend to be very
    observant of all details. The regex experts even more so. However, I
    just wanted to note that his attitude is one of the main reasons I
    quit working in Perl. IMNSHO, it's rather too common among Perl users,
    and I'd hate to wind up like that.
    Shannon Jacobs, Jan 9, 2004
    #1

  2. "Alan J. Flavell" wrote:
    > On Fri, 9 Jan 2004, Shannon Jacobs wrote:
    >
    > > By the way, I've relinked the Perl group which is accessible from this
    > > particular server. In spite of the attitude thing, I still think the best
    > > regex people are Perl-centric.

    >
    > And they will presumably tell you, as I've seen them doing many times
    > before, that regexes are not the way to parse HTML. Then what? Will
    > you be griping about "attitude" again, or deferring to their
    > expertise?


    Yeah, I think I will be griping. You certainly haven't exhibited any
    "expertise" to defer to. This time your "attitude" reminds me of the
    religious zealots. I still seek truth and beauty and all that jazz,
    but when I was much younger I thought the zealots might know something
    about them--after all, they were SO certain of their "expertise".

    I certainly have managed to understand that you say that a regex
    replacement of the <tr> tags in the second level <table> is not a
    perfect solution. I also believe:

    1. It will work well enough for my narrow purpose,
    2. A regex may be elegant, and
    3. I will also learn something from studying it.

    I think an actual expert could craft the kernel regex in the same time
    required to write your four-line negativistic reply--and that expert
    would actually understand its limitations, too. If the expert was
    feeling really helpful (though I have no reason to expect such
    helpfulness except for fading memories of when usenet was a much more
    friendly and helpful place), he or she would provide a regex solution
    and share additional wisdom, such as the comparable solution written
    with a better approach, or a concrete example of the most obvious
    problem with the regex.

    Time for a hats trick:

    Putting on my mathematician's hat, I like elegance and love learning
    about new ways to solve problems. And I still miss working in APL.

    Putting on my engineer's hat, Excel is a practical and available tool
    and regular expressions are just a waste of time. Don't waste time on
    elegance. Mea culpa.

    Putting on my technical historian's hat, regular expressions and Perl
    are elitist technologies and are fading into insignificance. Just an
    observation.


    --
    Did you know that is sometimes a black hole?
    That's right, resident Dubya does NOT care what you emailing peasants
    think.
    Shannon Jacobs, Jan 11, 2004
    #2

  3. Shannon Jacobs wrote:
    > [...] or a concrete example of the most obvious
    > problem with the regex.


    Which parts of the negative examples in FAQ "How do I remove HTML from a
    string?" do you have a problem with when trying to adapt them to your concrete
    "<tr>" problem?

    jue
    Jürgen Exner, Jan 11, 2004
    #3
  4. "Jürgen Exner" <> wrote in message news:<mW2Mb.4024$>...
    > Shannon Jacobs wrote:
    > > [...] or a concrete example of the most obvious
    > > problem with the regex.

    >
    > Which parts of the negative examples in FAQ "How do I remove HTML from a
    > string?" do you have a problem with when trying to adapt them to your concrete
    > "<tr>" problem?
    >
    > jue


    Thank you for the reference to
    http://www.perldoc.com/perl5.6/pod/perlfaq9.html. Unfortunately, the
    category of structural problem that I encountered is not covered
    there, and my source HTML does not include any of the problems covered
    in the "tricky cases". If the FAQ included any examples of the use of
    HTML::FormatText, or a more concrete reference, it might have been
    more helpful.

    As it stands, I've decided to return to Excel. Ugly and inelegant (and
    typical of Microsoft), but useful and adequate.

    With regards to the other recent comments in this thread, I will note:

    1. Just because a particular NNTP server does not carry a particular
    newsgroup, that does not mean that the newsgroup in question does not
    exist.

    2. With regards to the unhelpful advice to stop using Perl, I already
    have (except for infrequent maintenance work on a few CGI/Perl systems
    I wrote some years ago). As noted several times earlier, I am
    currently working from a JavaScript perspective, but sought out Perl
    people because of the compatibility of the regex implementations and
    because of old memories of their expertise (though not found this time
    around).

    3. I used the term "elitist" in the sense of high technical expertise.
    Perhaps I should have tried the XSL community. Recently all I have
    seen around Perl are the laziness, impatience, and hubris, but without
    the justification of results.
    Shannon Jacobs, Jan 11, 2004
    #4
  5. Shannon Jacobs wrote:
    > "Jürgen Exner" <> wrote in message
    > news:<mW2Mb.4024$>...
    >> Shannon Jacobs wrote:
    >>> [...] or a concrete example of the most obvious
    >>> problem with the regex.

    >>
    >> Which parts of the negative examples in FAQ "How do I remove HTML
    >> from a string?" do you have a problem with when trying to adapt them
    >> to your concrete "<tr>" problem?
    >>
    >> jue

    >
    > Thank you for the reference to
    > http://www.perldoc.com/perl5.6/pod/perlfaq9.html. Unfortunately, the
    > category of structural problem that I encountered is not covered
    > there, and my source HTML does not include any of the problems covered
    > in the "tricky cases".


    Well, ok. Your call. But please keep in mind that, first of all, these are
    just a few examples for illustration. There are more ways to break RE-based
    parser code.
    And second, unless you own and control the source HTML code (which may or may
    not be the case, I don't know), this source code can change at any moment
    without notice.
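
    One concrete breakage, as a small Python sketch. The input is a made-up
    image tag with a ">" inside an attribute value, in the spirit of the FAQ's
    tricky cases. The naive tag pattern stops at the first ">" and leaves
    debris behind.

```python
import re

# The naive "anything between < and >" pattern discussed in this thread.
STRIP_ALL = re.compile(r"<[^>]*>")

tricky = '<img src="foo.gif" alt="A > B">caption'
stripped = STRIP_ALL.sub("", tricky)

print(repr(stripped))  # ' B">caption'
```

    The tag is cut at the ">" inside the alt text, so part of the attribute
    leaks into the output.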

    > If the FAQ included any examples of the use of
    > HTML::FormatText, or a more concrete reference, it might have been
    > more helpful.


    That would be a poor use of the FAQ, because instructions and examples are
    included in the standard documentation for each module already.

    jue
    Jürgen Exner, Jan 11, 2004
    #5
  6. Shannon Jacobs wrote:


    > 1. Just because a particular NNTP server does not carry a particular
    > newsgroup, that does not mean that the newsgroup in question does not
    > exist.



    Just because a particular newsgroup _is_ listed on a
    server does not mean that the newsgroup actually exists.
    That server may be wrong.

    comp.lang.perl was rmgroup'd many years ago; servers that still
    list it as a valid newsgroup look like they've been neglected
    for many years.


    > 2. With regards to the unhelpful advice to stop using Perl, I already
    > have



    Thank you.

    We will miss your valuable contributions to the community.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jan 11, 2004
    #6
  7. I'm so sorry to hear that the Google Groups system has been "neglected
    for many years", as you put it so thoughtfully. It really is
    unfortunate that so many people regard Google as a useful information
    resource, isn't it?

    Incidentally, when I finally had a bit of free time this morning, I
    rethought the technical problem and did come up with a trivial
    regex-based solution. It did exactly what I required on the first
    attempt, confirming that the technical problem was pretty much as
    trivial as I had thought it was. I guess it's just too bad that none
    of you "experts" and "community contributors" were able to help.

    However, this does lead to a new question:

    Why did the newsgroups fail to produce the technically trivial answer?

    While I can be abrasive or even rude when provoked, there is nothing
    like that in my original query. I asked a simple technical question,
    and wound up being dragged into a religious war about proper ways to
    handle HTML. Not very useful.

    If the religious issue of HTML was the problem, my advice to other
    people seeking similar help is to avoid mentioning HTML. Try
    describing your problem as structured database output, and maybe
    you'll have better "luck" than I had.

    I still regard regular expressions as useful and worthy of further
    study. I cannot say the same thing about most of the people who
    responded so religiously to my trivial question.

    Oh yeah, I suppose I should give a hint about the solution, even
    though it's a bit embarrassing. (I don't mind much as long as I can
    feel I learned something along the way.) Returning to the problem
    fresh and without the "box" around my thoughts, I looked at the data
    files again and asked myself whether there was some other unique
    string associated with the data that was associated with the second
    level <tr> tags. I picked one of the likely candidates, and sure
    enough, it worked. I still think there is a more clever way to do it
    considering the logical structure of the HTML tags and the powerful
    features of regular expressions, and I'd have been quite glad to learn
    something new about those features. That would have been more
    instructional than just solving the original rather trivial problem.
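
    The actual marker string was never posted, so the following Python sketch
    is purely hypothetical: it assumes every second-level row, and nothing
    else, happens to be written as <tr class="data">. Under that assumption
    the whole job reduces to flattening the text and putting a break back in
    front of each marker.

```python
import re

# Hypothetical marker: the real string used in the data files is unknown.
DATA_ROW = re.compile(r'<tr\s+class="data"', re.IGNORECASE)


def break_before_data_rows(html):
    flat = html.replace("\n", "")  # toss every line break
    # Re-insert one break in front of each row that carries the marker.
    return DATA_ROW.sub(lambda m: "\n" + m.group(0), flat)


sample = (
    '<table>\n<tr>\n<td>Head\n<tr><td><table>\n'
    '<tr class="data"><td>1\n<tr class="data"><td>2\n</table>\n</table>'
)
result = break_before_data_rows(sample)
print(result)
```

    This depends on facts about the data rather than on the nesting of the
    tags, so it says nothing about table depth as such.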

    (By the way, the Excel-based solution was just TOO ugly to bear.)

    Tad McClellan wrote:
    > Shannon Jacobs wrote:
    >
    >
    > > 1. Just because a particular NNTP server does not carry a particular
    > > newsgroup, that does not mean that the newsgroup in question does not
    > > exist.

    >
    >
    > Just because a particular newsgroup _is_ listed on a
    > server does not mean that the newsgroup actually exists.
    > That server may be wrong.
    >
    > comp.lang.perl was rmgroup'd many years ago; servers that still
    > list it as a valid newsgroup look like they've been neglected
    > for many years.
    >
    >
    > > 2. With regards to the unhelpful advice to stop using Perl, I already
    > > have

    >
    >
    > Thank you.
    >
    > We will miss your valuable contributions to the community.
    Shannon Jacobs, Jan 23, 2004
    #7
  8. "Shannon Jacobs" wrote:
    > I'm so sorry to hear that the Google Groups system has been "neglected
    > for many years", as you put it so thoughtfully. It really is
    > unfortunate that so many people regard Google as a useful information
    > resource, isn't it?
    >


    Well, if Google still archives the messages then it must be a group. Someone
    should re-revise this horribly outdated FAQ:

    http://www.perldoc.com/perl5.8.0/po...groups-on-Usenet---Where-do-I-post-questions-

    >
    > Why did the newsgroups fail to produce the technically trivial answer?
    >


    Because the point of this newsgroup is NOT to produce technically trivial
    answers, because technically trivial answers are useless. So what if you
    found some way you think might work for you? What good would posting some
    bad advice that's bound to fail but that might do the job for you do for
    someone searching on the same topic? HTML-parsing questions come up every
    few days. Do you think people here want to sit and answer them with
    technically trivial answers over and over again? Do you think they want to
    be responding to questions along the lines of "Duh, how come this trivial
    answer didn't work for me?"?

    Get a life. You got flamed for asking a stupid question. If you had any
    knowledge of markup languages you wouldn't have even asked it. And if you
    don't like being told you're dumb, don't post to usenet.

    Matt
    Matt Garrish, Jan 23, 2004
    #8
  9. Shannon Jacobs wrote:
    > I'm so sorry to hear that the Google Groups system has been "neglected
    > for many years", as you put it so thoughtfully. It really is
    > unfortunate that so many people regard Google as a useful information
    > resource, isn't it?


    Google Groups is an archive, and, as such, obviously does not delete
    obsolete groups.

    > Incidentally, when I finally had a bit of free time this morning, I
    > rethought the technical problem and did come up with a trivial
    > regex-based solution.


    No you didn't, because it's impossible. Either you misstated your
    requirement, your "solution" does not work, or it is not "regex-based".

    > Oh yeah, I suppose I should give a hint about the solution, even
    > though it's a bit embarrassing. (I don't mind much as long as I can
    > feel I learned something along the way.) Returning to the problem
    > fresh and without the "box" around my thoughts, I looked at the data
    > files again and asked myself whether there was some other unique
    > string associated with the data that was associated with the second
    > level <tr> tags. I picked one of the likely candidates, and sure
    > enough, it worked.


    In other words, you came up with an ad-hoc solution that does not
    involve the use of regexes for parsing (which regexes cannot do), and
    which no-one here could possibly have thought of, since it involves
    facts that you never mentioned.

    That's a cute job of drawing your target around the bullet holes, but
    you can't really expect adults to be impressed by that, can you?

    --
    John W. Kennedy
    "But now is a new thing which is very old--
    that the rich make themselves richer and not poorer,
    which is the true Gospel, for the poor's sake."
    -- Charles Williams. "Judgement at Chelmsford"
    John W. Kennedy, Jan 24, 2004
    #9