DM
I'm trying to design a regular expression to match the href attribute of <a>
tags. I'm testing it on the command line (on Red Hat Enterprise Linux Server)
using grep with the Perl regex option.
Here's the command I'm using:
# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
(On my console, the above is all one line. The URL part --
"TEA-21_Side-by-Side\.pdf" in this example -- would be determined at runtime in
the actual Perl script.)
It almost works as expected. I set the color and -o options in order to clearly
show the highlighted match. In most cases it *does* match exactly what I want it to.
However, in a few cases what is matched is totally unexpected.
Here is some sample output:
================================================================================
# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
/home/mtc_website/whats_happening/legislative_update/tea21_04-04.htm:43:href="TEA-21_Side-by-Side.pdf">
/home/mtc_website/whats_happening/legislative_update/tea21_06-04.htm:42:href="TEA-21_Side-by-Side.pdf">
<li> <a href="TEA-21_Side-by-Side.pdf">rong>
<ul>ng="5">.ca.gov</a> s tober LATIVE UPDATE" width="340" height="14" border="0" />
================================================================================
In the file "tea21_06-04.htm" it's going beyond what I intend to match and
scooping up a bunch more stuff. But it isn't even clear to me what it's matching
because the output shows discontinuous shreds of text from within the file.
Here is a sample of that file containing the unexpected match:
================================================================================
<td bgcolor="#CCFFFF"><strong>DOWNLOAD:</strong> <ul>
<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions in Surface Transportation Reauthorization Bills</a>
(PDF)
<p> </p>
</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
Projects</a> (Excel)<br />
</li>
</ul></td>
</tr>
</table>
<p><br />
<strong>TEA 21 Reauthorization Conference Committee Comes Closer to
Agreement on Bottom Line Number</strong><br />
================================================================================
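For comparison, here is a minimal sketch of one way this pattern can over-match: when a single line holds more than one link, the greedy `.*` runs from the first `href=` to the last occurrence of the filename. The one-line sample, the variable names (`sample`, `greedy`, `tight`), and the tightened pattern are all made up for illustration and are not taken from the site or the actual script.

```shell
#!/bin/sh
# Hypothetical sample: two links on ONE line, with the wanted link second.
sample='<li><a href="HR3550.xls">x</a> <a href="TEA-21_Side-by-Side.pdf">y</a></li>'

# Original pattern: .* is greedy, so the match starts at the first href=
# on the line and extends through the later link.
greedy=$(printf '%s\n' "$sample" |
  grep -oP 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>')

# Tightened pattern (an assumption, not the author's): [^"]* cannot cross
# the closing quote, so the match stays inside a single attribute value.
tight=$(printf '%s\n' "$sample" |
  grep -oP 'href="[^"]*TEA-21_Side-by-Side\.pdf"[^>]*>')

printf 'greedy: %s\n' "$greedy"
printf 'tight:  %s\n' "$tight"
```

Whether this accounts for the tea21_06-04.htm output is not established. In the excerpt above each link sits on its own line, so it would only apply if grep sees several of those lines as one, for example if the file uses bare carriage returns as line endings. That is an assumption and has not been checked here.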
Any help would be greatly appreciated.
Thanks,
dm