Regular expression to find <tr> tags in 2nd level HTML tables


Shannon Jacobs

Trying to solve this with a regex approach rather than the programmatic
approach of counting up and down the levels. I have a fairly complicated
HTML page that I want to simplify. I've been able to mung most of it using
several regular expressions, but I've become stuck at this point. I can't
figure out how to grab only the <tr> tags that are associated with tables
that are two levels deep. I feel like I got close, but it seems that
something about the line breaks between the various <table ...> and </table>
tags is still messing me up.
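
For concreteness, a rough sketch of the kind of pattern in question (in
JavaScript, assuming the second-level tables contain no further nesting,
and with "html" standing in for the page source as one string):

// Grab the <tr> tags of second-level tables. [\s\S] lets the pattern
// cross the line breaks that trip up a single-line approach.
function innerTableRows(html) {
    var rows = [];
    // An "inner" table is a <table>...</table> block with no <table> inside it.
    var innerTable = /<table[^>]*>(?:(?!<table)[\s\S])*?<\/table>/gi;
    var match;
    while ((match = innerTable.exec(html)) !== null) {
        rows = rows.concat(match[0].match(/<tr[^>]*>/gi) || []);
    }
    return rows;
}

The negative lookahead is what keeps the pattern from swallowing the
outer table along with the inner ones.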

Not sure if the details will help, but I'm actually working in JavaScript
(for convenience). The data is actually train schedules generated from a
database, but I don't have control over the presentation, and that system
lays the timetables out in an almost ridiculous way, with about 4 levels of
table, mostly for trivial effects. I think the ultimate solution is beyond
my capabilities, but in theory, it might be possible to recognize and match
certain key characters in the legend which appears at the bottom and
preserve those key characters during the transformation steps, while still
stripping out the extraneous HTML junk. But there is still a kicker... The
target is a Palm OS device.

For your further amusement, I'll confess that the approach I've been using
to date involves massaging the tables with a spreadsheet. It's actually not
that tedious, but I'm always thinking of easier ways to handle these things.
Macro programming is actually another alternative if the regex approach is
too cumbersome. However, so far the first stages of this approach seemed
pretty smooth...
 

Ben Morrow

Shannon Jacobs said:
Trying to solve this with a regex approach rather than the programmatic
approach of counting up and down the levels.
Why?

I can't
figure out how to grab only the <tr> tags that are associated with tables
that are two levels deep.

Use one of the HTML parsing modules. Regexen are good for many things;
parsing HTML is not one of them. (Not on their own, at any rate.)
Not sure if the details will help, but I'm actually working in JavaScript

Oh, right. Ask in a JS group.

Ben
 

Brian Genisio

Shannon said:
Trying to solve this with a regex approach rather than the programmatic
approach of counting up and down the levels. I have a fairly complicated
HTML page that I want to simplify. I've been able to mung most of it using
several regular expressions, but I've become stuck at this point. I can't
figure out how to grab only the <tr> tags that are associated with tables
that are two levels deep. I feel like I got close, but it seems that
something about the line breaks between the various <table ...> and </table>
tags is still messing me up.

Not sure if the details will help, but I'm actually working in JavaScript
(for convenience). The data is actually train schedules generated from a
database, but I don't have control over the presentation, and that system
lays the timetables out in an almost ridiculous way, with about 4 levels of
table, mostly for trivial effects. I think the ultimate solution is beyond
my capabilities, but in theory, it might be possible to recognize and match
certain key characters in the legend which appears at the bottom and
preserve those key characters during the transformation steps, while still
stripping out the extraneous HTML junk. But there is still a kicker... The
target is a Palm OS device.

For your further amusement, I'll confess that the approach I've been using
to date involves massaging the tables with a spreadsheet. It's actually not
that tedious, but I'm always thinking of easier ways to handle these things.
Macro programming is actually another alternative if the regex approach is
too cumbersome. However, so far the first stages of this approach seemed
pretty smooth...

Take a look at TidyLib. It is a C library that will parse HTML for you
into DOM-like nodes, which you can traverse like a tree. It was
originally developed under the W3C, but it is available via SourceForge
(www.sourceforge.net) now. It has a great license that allows you to
use it for virtually any purpose, for free.

If you still choose to use JavaScript, have a browser bring the page up
and use JavaScript to traverse the DOM tree, much the same way as I
mentioned: find the table node you are looking for, and walk the tree.

Using a RegExp will break as soon as the HTML format changes, but a
smart tree traversal will likely be more robust.

If you go the TidyLib route, you can manipulate the data quickly and
easily develop your Palm database via C routines.

Brian
 

Shannon Jacobs

Brian Genisio said:
Shannon Jacobs wrote:
Take a look at TidyLib. It is a C library that will parse HTML for you
into DOM-like nodes, which you can traverse like a tree. It was
originally developed under the W3C, but it is available via SourceForge
Using a RegExp will break as soon as the HTML format changes, but a
smart tree traversal will likely be more robust.

If you go the TidyLib route, you can manipulate the data quickly and
easily develop your Palm database via C routines.

From your description, this doesn't really sound like an approach I
want to take. It's not a matter of simple access, but of pruning and
manipulation. If I really wanted to follow this approach, the most
bankable-for-use-in-the-real-office approach would be the Excel macro
programming approach I mentioned. However, anytime anyone mentions
Microsoft or Visual <anything> I feel like I want to hold up a silver
cross and scream "Return to Hades, you evil demons!"

However, due to your hint and another source, I thought to explore the
DOM tree to get a better understanding of the problem. Mozilla has a
DOM explorer that was quite good for this, and I can clarify the
problem now. Here is a reduction of the situation:

<table>
<tr>
<tr>
<table>
<tr>
<tr>
....
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
...
<tr>

In the outermost table, there is some useful data worth saving in the
first <tr> row. In the 2nd level table, there is some useful data,
mostly numbers, in each of those <tr> rows. Returning to the outer
table, the 7th <tr> row also contains some information that would be
worth saving. That's the legend I mentioned in the earlier post, but
which I still feel would be too difficult to parse in a robust way.

The rest of it is basically dross, and my current regexes toss it away
quite nicely. The main problem is that the line breaks associated with
those second level <tr> tags are useful and significant, and I want to
keep them.

There seem to be two regex-based approaches that are possible. One is
to use one regex to mark them in a way that prevents them from being
tossed, and then restore them at the end after the other line breaks
have been removed, basically with the reverse regex. I'm already doing
that with some other information that needs to be preserved.

The other approach would be to just save the immediately preceding
line breaks while tossing all the others. I think I favor this
approach because it strikes me as most elegant and in keeping with the
spirit of the great regex of the heading of 137 degrees. ;-) A related
approach to this one would be to toss all the line breaks at the
beginning, and then insert the correct ones before throwing out the
other dross.
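
A very rough sketch of that first idea, purely for illustration
(assuming the second-level tables are the innermost ones, that "html"
holds the page source, and that the MARK string never occurs in the
data):

var MARK = "\u0001KEEPBREAK\u0001";
var cleaned = html
    // 1. Inside each innermost <table> block, protect the break before each <tr>.
    .replace(/<table[^>]*>(?:(?!<table)[\s\S])*?<\/table>/gi, function (block) {
        return block.replace(/\n(?=\s*<tr)/gi, MARK);
    })
    // 2. Toss every remaining line break.
    .replace(/\n/g, " ")
    // 3. Turn the protected markers back into real line breaks.
    .split(MARK).join("\n");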

I actually found a rather similar recent thread in the comp.lang.perl
newsgroup, so I've cross-posted to that newsgroup, too. That involved
using

s/<[^>]*>//g;

to remove all of the HTML tags, but I need to be more selective.
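
In JavaScript the same substitution, plus one guess at what "more
selective" might look like (keep the row tags, strip everything else),
would be something like this, with "text" standing in for the working
string:

// The Perl substitution above: strip every tag.
text = text.replace(/<[^>]*>/g, "");

// Illustration only: strip every tag except <tr> and </tr>, so the row
// boundaries survive the cleanup.
text = text.replace(/<(?!\/?tr\b)[^>]*>/g, "");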

I also wanted to include a response to the other reply, snide though
it was.

His first snide question was "Why?", in response to my preference for
a regex-based solution. I've already mostly answered that question,
but I'll add that I think regex-based solutions can be quite elegant,
and apparently I sometimes like having my head bent through the regex
dimension.

He then recommended using an HTML parsing module and suggested asking
in a JavaScript newsgroup. In the original post I had already
explained why I wanted this direct approach, and I had already asked
in the JavaScript newsgroup with the original cross-post. I suspect
him of being a wannabe Perler, since real Perl people tend to be very
observant of all details. The regex experts even more so. However, I
just wanted to note that his attitude is one of the main reasons I
quit working in Perl. IMNSHO, it's rather too common among Perl users,
and I'd hate to wind up like that.
 

Brian Genisio

Shannon said:
From your description, this doesn't really sound like an approach I
want to take. It's not a matter of simple access, but of pruning and
manipulation. If I really wanted to follow this approach, the most
bankable-for-use-in-the-real-office approach would be the Excel macro
programming approach I mentioned. However, anytime anyone mentions
Microsoft or Visual <anything> I feel like I want to hold up a silver
cross and scream "Return to Hades, you evil demons!"

However, due to your hint and another source, I thought to explore the
DOM tree to get a better understanding of the problem. Mozilla has a
DOM explorer that was quite good for this, and I can clarify the
problem now. Here is a reduction of the situation:

<table>
<tr>
<tr>
<table>
<tr>
<tr>
....
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
...
<tr>

In the outermost table, there is some useful data worth saving in the
first <tr> row. In the 2nd level table, there is some useful data,
mostly numbers, in each of those <tr> rows. Returning to the outer
table, the 7th <tr> row also contains some information that would be
worth saving. That's the legend I mentioned in the earlier post, but
which I still feel would be too difficult to parse in a robust way.

The rest of it is basically dross, and my current regexes toss it away
quite nicely. The main problem is that the line breaks associated with
those second level <tr> tags are useful and significant, and I want to
keep them.

There seem to be two regex-based approaches that are possible. One is
to use one regex to mark them in a way that prevents them from being
tossed, and then restore them at the end after the other line breaks
have been removed, basically with the reverse regex. I'm already doing
that with some other information that needs to be preserved.

Without seeing the actual code, it is difficult to tell, but I still
believe a DOM tree traversal in JavaScript would be the most useful,
since you can get the text nodes, as-is, with carriage returns and all.
Regexps are great for lexing tokens, but not as directly useful for
parsing them, which is what I understand you need.
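
For what it's worth, a sketch of that kind of traversal, based on the
simplified structure you quoted (one output line per second-level row,
which is exactly the line-break behaviour the regexes are fighting to
preserve):

var outer = document.getElementsByTagName("table")[0];
var inner = outer.getElementsByTagName("table");
var lines = [];
for (var i = 0; i < inner.length; i++) {
    for (var j = 0; j < inner[i].rows.length; j++) {
        var cells = inner[i].rows[j].cells;
        var parts = [];
        for (var k = 0; k < cells.length; k++) {
            parts.push(cells[k].textContent);  // cell text, markup already gone
        }
        lines.push(parts.join(" "));
    }
}
var plainText = lines.join("\n");  // one line per nested-table row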

There is an MS technology called HTA (HTML Applications), which would
let you load up the page in question in one frame and, in the other
frame, dissect the DOM tree and produce an output file to your liking.
I did something very similar to create a DOM Explorer for IE that (in
effect) works very similarly to the Mozilla one.

Is there any way to see all of the code? Is the website you are talking
about public?

Thanks,
Brian
 

Shannon Jacobs

Brian Genisio wrote:
Without seeing the actual code, it is difficult to tell, but I still
believe a DOM tree traversal in JavaScript would be the most useful,
since you can get the text nodes, as-is, with carriage returns and
all. Regexps are great for lexing tokens, but not as directly useful
for parsing them, which is what I understand you need.

There is an MS technology called HTA (HTML Applications), which would
let you load up the page in question in one frame and, in the other
frame, dissect the DOM tree and produce an output file to your liking.
I did something very similar to create a DOM Explorer for IE that (in
effect) works very similarly to the Mozilla one.

Is there any way to see all of the code? Is the website you are
talking about public?

Yes, it's a public site, but it's Japanese, so I don't know if you'll like
what you see... However, the stuff I'm trying to pull is basically the
numbers. Here is a URL for a typical example:

http://www.odakyu-group.co.jp/train/timetable/o_sagami-ono_u_w.html

The approach you describe sounds interesting, even though it mentions
Microsoft. However, my simple-minded approach has just been to create little
editing tools in JavaScript. This particular one is just more complicated
than my previous efforts.

By the way, I've relinked the Perl group which is accessible from this
particular server. In spite of the attitude thing, I still think the best
regex people are Perl-centric.
 

Alan J. Flavell

By the way, I've relinked the Perl group which is accessible from this
particular server. In spite of the attitude thing, I still think the best
regex people are Perl-centric.

And they will presumably tell you, as I've seen them doing many times
before, that regexes are not the way to parse HTML. Then what? Will
you be griping about "attitude" again, or deferring to their
expertise?
 

Shannon Jacobs

Alan J. Flavell said:
And they will presumably tell you, as I've seen them doing many times
before, that regexes are not the way to parse HTML. Then what? Will
you be griping about "attitude" again, or deferring to their
expertise?

Yeah, I think I will be griping. You certainly haven't exhibited any
"expertise" to defer to. This time your "attitude" reminds me of the
religious zealots. I still seek truth and beauty and all that jazz,
but when I was much younger I thought the zealots might know something
about them--after all, they were SO certain of their "expertise".

I certainly have managed to understand that you say that a regex
replacement of the <tr> tags in the second level <table> is not a
perfect solution. I also believe:

1. It will work well enough for my narrow purpose,
2. A regex may be elegant, and
3. I will also learn something from studying it.

I think an actual expert could craft the kernel regex in the same time
required to write your four-line negativistic reply--and that expert
would actually understand its limitations, too. If the expert was
feeling really helpful (though I have no reason to expect such
helpfulness except for fading memories of when usenet was a much more
friendly and helpful place), he or she would provide a regex solution
and share additional wisdom, such as the comparable solution written
with a better approach, or a concrete example of the most obvious
problem with the regex.

Time for a hat trick:

Putting on my mathematician's hat, I like elegance and love learning
about new ways to solve problems. And I still miss working in APL.

Putting on my engineer's hat, Excel is a practical and available tool
and regular expressions are just a waste of time. Don't waste time on
elegance. Mea culpa.

Putting on my technical historian's hat, regular expressions and Perl
are elitist technologies and are fading into insignificance. Just an
observation.
 

Jürgen Exner

Shannon said:
[...] or a concrete example of the most obvious
problem with the regex.

Which parts of the negative examples in the FAQ "How do I remove HTML
from a string?" do you have a problem with when trying to adapt them to
your concrete "<tr>" problem?

jue
 

Ben Morrow

Yeah, I think I will be griping. You certainly haven't exhibited any
"expertise" to defer to. This time your "attitude" reminds me of the
religious zealots.

Putting on my mathematician's hat, I like elegance and love learning
about new ways to solve problems. And I still miss working in APL.

Putting on *mine*, the problem of parsing HTML cannot be solved with a
single regex. A regex which almost-but-not-quite solves your problem
will certainly not be a thing of elegance.
Putting on my engineer's hat, Excel is a practical and available tool
and regular expressions are just a waste of time. Don't waste time on
elegance. Mea culpa.

Again, putting on mine, a practical way of solving your problem in
Perl is to use one of the HTML parsing modules. A practical way in
JavaScript is to use your browser's DOM. Don't waste time pursuing
paths which those who have trod them before tell you lead nowhere.
Putting on my technical historian's hat, regular expressions and Perl
are elitist technologies and are fading into insignificance. Just an
observation.

Of all the questionable uses of the word 'elitist' I have come across,
this must be one of the strangest... I wonder what on earth you think
you mean by it?

*PLONK*

Ben
 

Tad McClellan

[ Newsgroups trimmed to those that actually exist ]


Shannon Jacobs said:
regular expressions and Perl
are elitist technologies and are fading into insignificance.


Then stop using Perl.
 

Shannon Jacobs

Jürgen Exner said:
Shannon said:
[...] or a concrete example of the most obvious
problem with the regex.

Which parts of the negative examples in the FAQ "How do I remove HTML from a
string?" do you have a problem with when trying to adapt them to your concrete
"<tr>" problem?

jue

Thank you for the reference to
http://www.perldoc.com/perl5.6/pod/perlfaq9.html. Unfortunately, the
category of structural problem that I encountered is not covered
there, and my source HTML does not include any of the problems covered
in the "tricky cases". If the FAQ included any examples of the use of
HTML::FormatText, or a more concrete reference, it might have been
more helpful.

As it stands, I've decided to return to Excel. Ugly and inelegant (and
typical of Microsoft), but useful and adequate.

With regards to the other recent comments in this thread, I will note:

1. Just because a particular NNTP server does not carry a particular
newsgroup, that does not mean that the newsgroup in question does not
exist.

2. With regards to the unhelpful advice to stop using Perl, I already
have (except for infrequent maintenance work on a few CGI/Perl systems
I wrote some years ago). As noted several times earlier, I am
currently working from a JavaScript perspective, but sought out Perl
people because of the compatibility of the regex implementations and
because of old memories of their expertise (though not found this time
around).

3. I used the term "elitist" in the sense of high technical expertise.
Perhaps I should have tried the XSL community. Recently all I have
seen around Perl are the laziness, impatience, and hubris, but without
the justification of results.
 

Jürgen Exner

Shannon said:
Jürgen Exner said:
Shannon said:
[...] or a concrete example of the most obvious
problem with the regex.

Which parts of the negative examples in the FAQ "How do I remove HTML
from a string?" do you have a problem with when trying to adapt them
to your concrete "<tr>" problem?

jue

Thank you for the reference to
http://www.perldoc.com/perl5.6/pod/perlfaq9.html. Unfortunately, the
category of structural problem that I encountered is not covered
there, and my source HTML does not include any of the problems covered
in the "tricky cases".

Well, OK. Your call. But please keep in mind that, first of all, these
are just a few examples for illustration; there are more ways to break
RE-based parser code. And second, unless you own and control the source
HTML (which may or may not be the case, I don't know), that source can
change at any moment without notice.
If the FAQ included any examples of the use of
HTML::FormatText, or a more concrete reference, it might have been
more helpful.

That would be a poor use of the FAQ, because instructions and examples are
included in the standard documentation for each module already.

jue
 

Uri Guttman

SJ> 1. Just because a particular NNTP server does not carry a particular
SJ> newsgroup, that does not mean that the newsgroup in question does not
SJ> exist.

bzzztt!!! wrong. comp.lang.perl was removed many years ago. just because
some poorly administered news servers ignored that removal doesn't mean
it exists. non-existent group removed.

SJ> 2. With regards to the unhelpful advice to stop using Perl, I
SJ> already have (except for infrequent maintenance work on a few
SJ> CGI/Perl systems I wrote some years ago). As noted several times
SJ> earlier, I am currently working from a JavaScript perspective, but
SJ> sought out Perl people because of the compatibility of the regex
SJ> implementations and because of old memories of their expertise
SJ> (though not found this time around).

it is helpful advice for the perl community. people with your attitude
aren't helpful to the rest of the community and so it is best if they
don't code in perl. perl is too good for you.

SJ> 3. I used the term "elitist" in the sense of high technical
SJ> expertise. Perhaps I should have tried the XSL
SJ> community. Recently all I have seen around Perl are the laziness,
SJ> impatience, and hubris, but without the justification of results.

then you are blind as well as dumb. maybe you only have the sense of
smell left working?

now, go home and cry to mommy!

uri
 

Tad McClellan

1. Just because a particular NNTP server does not carry a particular
newsgroup, that does not mean that the newsgroup in question does not
exist.


Just because a particular newsgroup _is_ listed on a
server does not mean that the newsgroup actually exists.
That server may be wrong.

comp.lang.perl was rmgroup'd many years ago, servers that still
list it as a valid newsgroup look like they've been neglected
for many years.

2. With regards to the unhelpful advice to stop using Perl, I already
have


Thank you.

We will miss your valuable contributions to the community.
 

Shannon Jacobs

I'm so sorry to hear that the Google Groups system has been "neglected
for many years", as you put it so thoughtfully. It really is
unfortunate that so many people regard Google as a useful information
resource, isn't it?

Incidentally, when I finally had a bit of free time this morning, I
rethought the technical problem and did come up with a trivial
regex-based solution. It did exactly what I required on the first
attempt, confirming that the technical problem was pretty much as
trivial as I had thought it was. I guess it's just too bad that none
of you "experts" and "community contributors" were able to help.

However, this does lead to a new question:

Why did the newsgroups fail to produce the technically trivial answer?

While I can be abrasive or even rude when provoked, there is nothing
like that in my original query. I asked a simple technical question,
and wound up being dragged into a religious war about proper ways to
handle HTML. Not very useful.

If the religious issue of HTML was the problem, my advice to other
people seeking similar help is to avoid mentioning HTML. Try
describing your problem as structured database output, and maybe
you'll have better "luck" than I had.

I still regard regular expressions as useful and worthy of further
study. I cannot say the same thing about most of the people who
responded so religiously to my trivial question.

Oh yeah, I suppose I should give a hint about the solution, even
though it's a bit embarrassing. (I don't mind much as long as I can
feel I learned something along the way.) Returning to the problem
fresh and without the "box" around my thoughts, I looked at the data
files again and asked myself whether there was some other unique
string in the data that was associated with the second-level <tr>
tags. I picked one of the likely candidates, and sure
enough, it worked. I still think there is a more clever way to do it
considering the logical structure of the HTML tags and the powerful
features of regular expressions, and I'd have been quite glad to learn
something new about those features. That would have been more
instructional than just solving the original rather trivial problem.
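
Purely as an illustration of the shape of that trick (MARKER is a
made-up placeholder for whatever distinctive string the real rows
carry, and it assumes the rows are closed with </tr> tags):

var secondLevelRows =
    html.match(/<tr[^>]*>(?:(?!<\/tr>)[\s\S])*?MARKER(?:(?!<\/tr>)[\s\S])*?<\/tr>/gi) || [];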

(By the way, the Excel-based solution was just TOO ugly to bear.)
 

Matt Garrish

Shannon Jacobs said:
I'm so sorry to hear that the Google Groups system has been "neglected
for many years", as you put it so thoughtfully. It really is
unfortunate that so many people regard Google as a useful information
resource, isn't it?

Well, if Google still archives the messages then it must be a group. Someone
should re-revise this horribly outdated faq:

http://www.perldoc.com/perl5.8.0/po...groups-on-Usenet---Where-do-I-post-questions-
Why did the newsgroups fail to produce the technically trivial answer?

Because the point of this newsgroup is NOT to produce technically trivial
answers, because technically trivial answers are useless. So what if you
found some way you think might work for you? What good would posting some
bad advice that's bound to fail but that might do the job for you do for
someone searching on the same topic? Parsing html questions come up every
few days. Do you think people here want to sit and answer them with
technically trivial answers over and over again? Do you think they want to
be responding to questions along the lines of "Duh, how come this trivial
answer didn't work for me?"?

Get a life. You got flamed for asking a stupid question. If you had any
knowledge of markup languages you wouldn't have even asked it. And if you
don't like being told you're dumb, don't post to usenet.

Matt
 

John W. Kennedy

Shannon said:
I'm so sorry to hear that the Google Groups system has been "neglected
for many years", as you put it so thoughtfully. It really is
unfortunate that so many people regard Google as a useful information
resource, isn't it?

Google Groups is an archive, and, as such, obviously does not delete
obsolete groups.
Incidentally, when I finally had a bit of free time this morning, I
rethought the technical problem and did come up with a trivial
regex-based solution.

No you didn't, because it's impossible. Either you misstated your
requirement, your "solution" does not work, or it is not "regex-based".
Oh yeah, I suppose I should give a hint about the solution, even
though it's a bit embarrassing. (I don't mind much as long as I can
feel I learned something along the way.) Returning to the problem
fresh and without the "box" around my thoughts, I looked at the data
files again and asked myself whether there was some other unique
string in the data that was associated with the second-level <tr>
tags. I picked one of the likely candidates, and sure
enough, it worked.

In other words, you came up with an ad-hoc solution that does not
involve the use of regex's for parsing (which regex's cannot do), and
which no-one here could possibly have thought of, since it involves
facts that you never mentioned.

That's a cute job of drawing your target around the bullet holes, but
you can't really expect adults to be impressed by that, can you?

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
 
