Regex matching non-contiguous shreds of text

DM

I'm trying to design a regular expression to match the href attribute of <a>
tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

(On my console, the above is all one line. The URL part --
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at runtime in
the actual Perl script.)
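
When the URL fragment is only known at runtime, it is usually worth escaping it before splicing it into the pattern, so that characters such as "." are matched literally. In Perl that is the job of quotemeta or \Q...\E; the sketch below shows the same idea in Python with re.escape, purely as an illustration. The helper name href_pattern is hypothetical, and the surrounding pattern is the one from the command above.

```python
import re

def href_pattern(url_fragment):
    """Build the thread's pattern around a literal URL fragment (hypothetical helper)."""
    # re.escape backslash-escapes regex metacharacters such as '.',
    # so the fragment can only match itself.
    return re.compile(r'href=.*' + re.escape(url_fragment) + r'[^>]*>')

pattern = href_pattern("TEA-21_Side-by-Side.pdf")

# The literal file name matches.
print(bool(pattern.search('<a href="TEA-21_Side-by-Side.pdf">')))
# An arbitrary character in place of the dot does not.
print(bool(pattern.search('<a href="TEA-21_Side-by-SideXpdf">')))
```

Without the escaping step, an unescaped "." in the fragment would match any character, not just a dot.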

It almost works as expected. I set the color and -o options in order to clearly
show the highlighted match. In most cases it *does* match exactly what I want it to.

However, in a few cases what is matched is totally unexpected.

Here is some sample output:

================================================================================

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
/home/mtc_website/whats_happening/legislative_update/tea21_04-04.htm:43:href="TEA-21_Side-by-Side.pdf">
/home/mtc_website/whats_happening/legislative_update/tea21_06-04.htm:42:href="TEA-21_Side-by-Side.pdf">
<li> <a href="TEA-21_Side-by-Side.pdf">rong>
<ul>ng="5">.ca.gov</a> s tober LATIVE UPDATE" width="340" height="14" border="0" />

================================================================================

In the file "tea21_06-04.htm" it's going beyond what I intend to match and
scooping up a bunch more stuff. But it isn't even clear to me what it's matching
because the output shows discontinuous shreds of text from within the file.

Here is a sample of that file containing the unexpected match:

================================================================================

<td bgcolor="#CCFFFF"><strong>DOWNLOAD:</strong> <ul>
<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions in Surface Transportation Reauthorization Bills</a>
(PDF)
<p> </p>
</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
Projects</a> (Excel)<br />
</li>
</ul></td>
</tr>
</table>
<p><br />
<strong>TEA 21 Reauthorization Conference Committee Comes Closer to
Agreement on Bottom Line Number</strong><br />

================================================================================

Any help would be greatly appreciated.

Thanks,

dm
 
Jon Ericson

DM said:
I'm trying to design a regular expression to match the href attribute
of <a> tags. I'm testing it on the command line (on Redhat Linux
Enterprise Server) using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

(On my console, the above is all one line. The URL part --
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at
runtime in the actual Perl script.)

It almost works as expected. I set the color and -o options in order
to clearly show the highlighted match. In most cases it *does* match
exactly what I want it to.

However, in a few cases what is matched is totally unexpected.

If you were actually using Perl, this wouldn't be too difficult with
the HTML::Parser module. See perldoc -q html for some discussion
about the pitfalls of using a regex to parse HTML.

Jon
 
DM

Jon said:
I'm trying to design a regular expression to match the href attribute
of <a> tags. I'm testing it on the command line (on Redhat Linux
Enterprise Server) using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

[ ... ]
If you were actually using Perl, this wouldn't be too difficult with
the HTML::Parser module. See perldoc -q html for some discussion
about the pitfalls of using a regex to parse HTML.

Jon

Thanks for the reply. I don't see how the HTML::Parser module would help me in
the task I described in my original post.

I checked perldoc as you recommended, but the "pitfalls" mentioned don't seem to
apply to what I'm doing.

As I explained in my original post, I'm not trying to do some kind of general
HTML parsing operation, such as stripping out HTML tags. I'm trying to find this
string:

href="[SOME_URL_FRAGMENT].pdf">

My regex almost works, but is acting really weird in a few cases. I'm trying to
nail down the reason for that. Perhaps I have a misconception or
misunderstanding of regex syntax?
 
Paul Lalli

DM said:
I'm trying to design a regular expression to match the href attribute
of <a> tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

[^>]*

means match EVERYTHING it can. Here that is every character up to the
next > in the string, since the class itself cannot match a >.

You need to make it non-greedy.

[^>]*?

means to match only as much as necessary to make the pattern match
succeed.

Paul Lalli
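
The difference between a greedy and a non-greedy quantifier can be sketched in Python, whose re module treats * and *? the same way Perl-style regexes do. The sample line is made up for illustration and is not taken from the files in the thread.

```python
import re

# Hypothetical sample line with two tags that each end in '>'.
line = '<a href="a.pdf">first</a> <a href="b.pdf">second</a>'

# Greedy: '.*' grabs as much as possible, then backs off just far enough
# for the final '>' to match, so the match runs to the last '>' on the line.
greedy = re.search(r'<.*>', line).group(0)

# Non-greedy: '.*?' grows one character at a time and stops at the first '>'.
lazy = re.search(r'<.*?>', line).group(0)

print(greedy)
print(lazy)
```

The greedy match spans the whole sample line, while the non-greedy one ends at the first closing bracket.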
 
Paul Lalli

Paul Lalli said:
DM said:
I'm trying to design a regular expression to match the href
attribute of <a> tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

[^>]*

means match EVERYTHING it can. Here that is every character up to the
next > in the string, since the class itself cannot match a >.

You need to make it non-greedy.

[^>]*?

means to match only as much as necessary to make the pattern match
succeed.

Of course, this applies to the first * in your regexp as well.

href=.*?

Paul Lalli
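
Applied to the pattern from the thread, a small sketch with a made-up line containing several links shows what each variant captures. Python's re is used here as a stand-in for grep -P. The sample line and the third, tighter pattern (a quoted-attribute class in place of .*) are assumptions for illustration, not something proposed in the thread.

```python
import re

# Hypothetical line holding three links, two of them to the target file.
line = ('<a href="x.xls">X</a> '
        '<a href="TEA-21_Side-by-Side.pdf">PDF</a> '
        '<a href="TEA-21_Side-by-Side.pdf">again</a>')

# Original pattern: greedy '.*' after href=.
greedy = re.search(r'href=.*TEA-21_Side-by-Side\.pdf[^>]*>', line).group(0)

# Non-greedy '.*?': stops at the first occurrence of the file name,
# but the match still begins at the leftmost 'href=' on the line.
lazy = re.search(r'href=.*?TEA-21_Side-by-Side\.pdf[^>]*>', line).group(0)

# Tighter sketch: the attribute value may not contain '"' or '>',
# so the match cannot leak out of a single href attribute.
tight = re.search(r'href="[^">]*TEA-21_Side-by-Side\.pdf"[^>]*>', line).group(0)

print(greedy.count('href='))  # number of href attributes swallowed
print(lazy.count('href='))
print(tight)
```

Because [^>] can never consume a >, making that quantifier non-greedy does not change where the match ends; the .* after href= is the part where greediness matters.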
 
DM

Here's the command I'm using:
# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

[^>]*

means match EVERYTHING it can. Here that is every character up to the
next > in the string, since the class itself cannot match a >.

You need to make it non-greedy.

[^>]*?

means to match only as much as necessary to make the pattern match
succeed.


Of course, this applies to the first * in your regexp as well.

href=.*?

Paul Lalli

OK, thanks. That seems to help.

dm
 
Jon Ericson

DM said:
Jon said:
If you were actually using Perl, this wouldn't be too difficult with
the HTML::Parser module. See perldoc -q html for some discussion
about the pitfalls of using a regex to parse HTML.

Thanks for the reply. I don't see how the HTML::Parser module would
help me in the task I described in my original post.

I checked perldoc as you recommended, but the "pitfalls" mentioned
don't seem to apply to what I'm doing.

As I explained in my original post, I'm not trying to do some kind of
general HTML parsing operation, such as stripping out HTML tags. I'm
trying to find this string:

href="[SOME_URL_FRAGMENT].pdf">

My regex almost works, but is acting really weird in a few cases. I'm
trying to nail down the reason for that. Perhaps I have a
misconception or misunderstanding of regex syntax?

It looks like you got some help with the regex itself. I hope this
works for you.

If you're doing something quick and dirty, and you don't mind the
occasional mistake, there's nothing wrong with the regex approach.
But little scripts sometimes become mission-critical. If that
happens, the regex might not be a good idea.

Jon
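
For comparison, the parser-based route Jon mentions might look roughly like the sketch below. It uses Python's standard html.parser module as a stand-in for Perl's HTML::Parser, since both are event-driven parsers that call a handler for each start tag. The class name PdfHrefCollector is hypothetical, and the sample markup is shortened from the file excerpt in the original post.

```python
from html.parser import HTMLParser

class PdfHrefCollector(HTMLParser):
    """Collect href values of <a> tags that end with a given suffix (hypothetical helper)."""

    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(self.suffix):
                self.hrefs.append(value)

# Shortened from the sample file quoted in the thread.
sample = '''<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions</a> (PDF)</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550</a> (Excel)</li>'''

collector = PdfHrefCollector("TEA-21_Side-by-Side.pdf")
collector.feed(sample)
print(collector.hrefs)
```

The parser hands over each attribute value already separated from its tag, so line breaks and neighbouring tags cannot end up inside the result.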
 
