Regex matching non-contiguous shreds of text

DM · Oct 20, 2004

I'm trying to design a regular expression to match the href attribute of <a>
tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

(On my console, the above is all one line. The URL part --
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at runtime in
the actual Perl script.)
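Because the URL part arrives at runtime, it has to be escaped before it is spliced into the pattern, as the hand-written `\.` above already does for the dot. The following is only a sketch, written in Python as a stand-in for the Perl script (Perl's counterparts are quotemeta and \Q...\E); the variable names and the look-alike string are made up for illustration.

```python
import re

# Stand-in for the value that would arrive at runtime.
url_fragment = "TEA-21_Side-by-Side.pdf"

# Used as-is, the "." is a metacharacter that matches any character.
unescaped = re.compile(url_fragment)

# re.escape backslash-escapes metacharacters so the fragment matches literally.
escaped = re.compile(re.escape(url_fragment))

lookalike = "TEA-21_Side-by-Side_pdf"  # "_" where the "." should be

print(bool(unescaped.search(lookalike)))   # True: "." matched "_"
print(bool(escaped.search(lookalike)))     # False
print(bool(escaped.search(url_fragment)))  # True
```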

It almost works as expected. I set the color and -o options in order to clearly
show the highlighted match. In most cases it *does* match exactly what I want it to.

However, in a few cases what is matched is totally unexpected.

Here is some sample output:

================================================================================

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
/home/mtc_website/whats_happening/legislative_update/tea21_04-04.htm:43:href="TEA-21_Side-by-Side.pdf">
/home/mtc_website/whats_happening/legislative_update/tea21_06-04.htm:42:href="TEA-21_Side-by-Side.pdf">
<li> <a href="TEA-21_Side-by-Side.pdf">rong>
<ul>ng="5">.ca.gov</a> s tober LATIVE UPDATE" width="340" height="14" border="0" />

================================================================================

In the file "tea21_06-04.htm" it's going beyond what I intend to match and
scooping up a bunch more stuff. But it isn't even clear to me what it's matching
because the output shows discontinuous shreds of text from within the file.

Here is a sample of that file containing the unexpected match:

================================================================================

<td bgcolor="#CCFFFF">DOWNLOAD: <ul>
<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions in Surface Transportation Reauthorization Bills</a>
(PDF)
 
</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
Projects</a> (Excel) 
</li>
</ul></td>
</tr>
</table>
 
TEA 21 Reauthorization Conference Committee Comes Closer to
Agreement on Bottom Line Number 

================================================================================

Any help would be greatly appreciated.

Thanks,

dm

Jon Ericson · Oct 20, 2004

DM said:
I'm trying to design a regular expression to match the href attribute
of <a> tags. I'm testing it on the command line (on Redhat Linux
Enterprise Server) using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

(On my console, the above is all one line. The URL part --
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at
runtime in the actual Perl script.)

It almost works as expected. I set the color and -o options in order
to clearly show the highlighted match. In most cases it *does* match
exactly what I want it to.

However, in a few cases what is matched is totally unexpected.

If you were actually using perl, this wouldn't be too difficult with
the HTML::Parser module. See perldoc -q html for some discussion
about the pitfalls of using a regex to parse HTML.
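As a rough illustration of the parser-based approach, here is a sketch that uses Python's standard html.parser module as a stand-in for Perl's HTML::Parser (it is not that module's API). The class name HrefCollector and its suffix parameter are hypothetical; the sample markup is taken from the file excerpt in the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that end with a given suffix."""

    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(self.suffix):
                self.hrefs.append(value)


sample = '''<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions in Surface Transportation Reauthorization Bills</a>
(PDF)
</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
Projects</a> (Excel)
</li>'''

collector = HrefCollector("TEA-21_Side-by-Side.pdf")
collector.feed(sample)
collector.close()
print(collector.hrefs)  # ['TEA-21_Side-by-Side.pdf']
```

Since the parser hands over attribute values directly, line breaks or extra attributes inside the tag do not affect the result.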

Jon

DM · Oct 20, 2004

Jon said:
If you were actually using perl, this wouldn't be too difficult with
the HTML::Parser module. See perldoc -q html for some discussion
about the pitfalls of using a regex to parse HTML.

Thanks for the reply. I don't see how the HTML::Parser module would help me in
the task I described in my original post.

I checked perldoc as you recommended, but the "pitfalls" mentioned don't seem to
apply to what I'm doing.

As I explained in my original post, I'm not trying to do some kind of general
HTML parsing operation, such as stripping out HTML tags. I'm trying to find this
string:

href="[SOME_URL_FRAGMENT].pdf">

My regex almost works, but is acting really weird in a few cases. I'm trying to
nail down the reason for that. Perhaps I have a misconception or
misunderstanding of regex syntax?
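One way to pin down the target string href="[SOME_URL_FRAGMENT].pdf"> is to stop the match from leaving the quoted attribute value by using [^"]* in place of .* after the opening quote. The sketch below is in Python, on the assumption that its re module behaves like Perl for these constructs; the sample line is invented (in the real file the two links sit on separate lines).

```python
import re

# Invented sample line: an unrelated link precedes the one we want.
line = ('<a href="HR3550-High-Priority_Proj.xls">XLS</a> '
        '<a href="TEA-21_Side-by-Side.pdf">PDF</a>')

# Pattern from the original post: ".*" can run across the first link.
original = re.search(r'href=.*TEA-21_Side-by-Side\.pdf[^>]*>', line).group(0)

# Bounded variant: [^"]* cannot cross the closing quote of an attribute value.
bounded = re.search(r'href="[^"]*TEA-21_Side-by-Side\.pdf">', line).group(0)

print(original)
# href="HR3550-High-Priority_Proj.xls">XLS</a> <a href="TEA-21_Side-by-Side.pdf">
print(bounded)
# href="TEA-21_Side-by-Side.pdf">
```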

Paul Lalli · Oct 20, 2004

DM said:
I'm trying to design a regular expression to match the href attribute
of <a> tags. I'm testing it on the command line (on Redhat Linux
Enterprise Server) using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

[^>]*

means match EVERYTHING it can in the string. Here it can match
every character up to the next > in the string.

You need to make it non-greedy.

[^>]*?

means to match only as much as necessary to make the pattern match
succeed.

Paul Lalli

Paul Lalli · Oct 20, 2004

Paul Lalli said:

[^>]*

means match EVERYTHING it can in the string. Here it can match
every character up to the next > in the string.

You need to make it non-greedy.

[^>]*?

means to match only as much as necessary to make the pattern match
succeed.

Of course, this applies to the first * in your regexp as well.

href=.*?
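The effect of the two quantifier forms can be seen on a made-up line in which the file name appears twice. This is a sketch in Python rather than Perl or grep -P, on the assumption that its re module treats *, *? and [^>] the same way for these patterns; the sample line is invented for illustration.

```python
import re

# Invented sample line: the same file name occurs in two links.
line = ('<a href="TEA-21_Side-by-Side.pdf">PDF</a> '
        '<a href="old/TEA-21_Side-by-Side.pdf">old copy</a>')

# Greedy ".*" backtracks from the end of the line to the LAST file name.
greedy = re.search(r'href=.*TEA-21_Side-by-Side\.pdf[^>]*>', line).group(0)

# Lazy ".*?" expands from the left and stops at the FIRST file name.
lazy = re.search(r'href=.*?TEA-21_Side-by-Side\.pdf[^>]*>', line).group(0)

# Making the trailing [^>]* lazy as well changes nothing here, because
# [^>] can never match past a ">" in either form.
lazy_both = re.search(r'href=.*?TEA-21_Side-by-Side\.pdf[^>]*?>', line).group(0)

print(greedy)
# href="TEA-21_Side-by-Side.pdf">PDF</a> <a href="old/TEA-21_Side-by-Side.pdf">
print(lazy)
# href="TEA-21_Side-by-Side.pdf">
print(lazy == lazy_both)  # True
```

Note that a lazy quantifier still starts at the leftmost href= on the line, so it only trims how far the match runs to the right.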

Paul Lalli

DM · Oct 20, 2004

Paul Lalli said:
Of course, this applies to the first * in your regexp as well.

href=.*?

OK, thanks. That seems to help.

dm

Jon Ericson · Oct 20, 2004

DM said:
Thanks for the reply. I don't see how the HTML::Parser module would
help me in the task I described in my original post.

I checked perldoc as you recommended, but the "pitfalls" mentioned
don't seem to apply to what I'm doing.

As I explained in my original post, I'm not trying to do some kind of
general HTML parsing operation, such as stripping out HTML tags. I'm
trying to find this string:

href="[SOME_URL_FRAGMENT].pdf">

My regex almost works, but is acting really weird in a few cases. I'm
trying to nail down the reason for that. Perhaps I have a
misconception or misunderstanding of regex syntax?

It looks like you got some help with the regex itself. I hope this
works for you.

If you're doing something quick and dirty, and you don't mind the
occasional mistake, there's nothing wrong with the regex approach.
But little scripts sometimes become mission-critical. If that
happens, the regex might not be a good idea.

Jon
