Can someone please explain this regex?!

Geoff Cox · Dec 7, 2003

Hello,

this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is matched
as the code does not work for me. If I could learn from this I could
probably sort it out for myself ..

Thanks

Geoff

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {

Matt Garrish · Dec 7, 2003

Geoff Cox said:
Hello,

this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is matched
as the code does not work for me. If I could learn from this I could
probably sort it out for myself ..

To break it down piece by piece:

/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is

matches "head" (you have the /i switch on, so it will match any case)
followed by one or more whitespace characters, followed by "teacher",
followed by one or more characters up to an opening <td. You then have a
negated character class, so it will match all text up to the next closing >,
and then another negated character class will match and capture anything up
to the next opening <.

I imagine this might be where your problem is. None of your match patterns
allow for zero occurrences, which means that there has to be at least one
character between the <td and closing >. In other words, your pattern would
never match <td>, but only something like <td class="foo">.

Moving on, you then have two non-greedy matches (.+?). The first will match
anything up to "address" and the second will match anything up to the next
<td. The regex then repeats itself with the two negated classes: one looking
for the end of the <td> and the other capturing everything up to the next
opening <. And once again, your pattern will fail unless there is at least
one character between the <td and >.

(I removed the /x from your original posting because it just allows
whitespace and comments in your regex, which didn't help the readability of
it, in my opinion of course.)
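
As a rough illustration of the breakdown above, here is a sketch that puts the /x back and uses it for one comment per piece. The sample markup is the fragment posted in this thread; the variable names $html, $name and $address are only illustrative.

```perl
use strict;
use warnings;

# Sample markup copied from the thread, held in one multi-line string.
my $html = <<'HTML';
<TD align=left width="20%" colSpan=2>Head Teacher</TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2>Address</TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
HTML

my ($name, $address);
if ( $html =~ /
        Head\s+Teacher   # first label, any run of whitespace between the words
        .+?              # as little as possible, up to the next cell
        <TD[^>]+>        # an opening TD tag with at least one attribute character
        ([^<]+)          # capture 1: text up to the next opening angle bracket
        .+?              # skip ahead lazily
        Address          # second label
        .+?              # lazily up to the following cell
        <TD[^>]+>
        ([^<]+)          # capture 2: text up to the next opening angle bracket
    /isx ) {
    ($name, $address) = ($1, $2);
    print "Name: $name\nAddress: $address\n";
}
```

Note that the whole fragment has to be in one string for this to match, and that the captured address keeps its embedded newline.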

Matt

Geoff Cox · Dec 7, 2003

On Sun, 07 Dec 2003 18:02:07 GMT, Geoff Cox

I should have made things a bit clearer - so here is the whole code
and a sample of html which it is to work on .. can any one see why it
doesn't get the name and address info?!

Cheers

Geoff

My code is as follows but it does not work!

---------------------------
use strict;

print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;

open(IN, "$namehtml");
open(OUT, ">>$newhtml");

my $line = <IN>;

while (defined($line=<IN>)) {
# if ($line =~ /  (.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

}

close (IN);
close (OUT);

-----------------------------

which is working on for example

<TD align=left width="20%" colSpan=2>Head Teacher</TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2>Address</TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>

Cheers

Geoff

Bob Walton · Dec 7, 2003

Geoff said:
On Sun, 07 Dec 2003 18:02:07 GMT, Geoff Cox

I should have made things a bit clearer - so here is the whole code
and a sample of html which it is to work on .. can any one see why it
doesn't get the name and address info?!

Cheers

Geoff

My code is as follows but it does not work!

-------------------------------^^^^^^^^^^^^^
A much more specific description of what your code does/doesn't do is
called for in a newsgroup posting. Please state exactly what it does
that it shouldn't do, or what it doesn't do that it should do. "Doesn't
work" is next to meaningless -- we can't read your mind.

use warnings;

print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;

open(IN, "$namehtml");
open(OUT, ">>$newhtml");

my $line = <IN>;

Since you didn't modify $/, this will read only one line. I think
that's your fundamental problem. Try:

my $line;
{local $/;$line=<IN>} #slurp the input

and see if that works better.

while (defined($line=<IN>)) {

Here you are reading the rest of the lines of filehandle IN, but one at
a time. You will have skipped the first line (which was read above).
If you slurp the input, you should get rid of the while loop.
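
To make the difference concrete, here is a small sketch comparing a single-line read with a slurp. It reads from an in-memory filehandle so that it runs on its own; the two-line sample and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $html = "<TD>Head Teacher</TD>\n<TD width=\"80%\">Fred Green</TD>\n";

# Default: $/ is "\n", so <$fh> in scalar context returns one line.
open my $fh, '<', \$html or die "cannot open in-memory handle: $!";
my $first_line = <$fh>;
close $fh;

# Slurp: with $/ undefined inside the do block, <$fh> returns everything.
open $fh, '<', \$html or die "cannot open in-memory handle: $!";
my $whole = do { local $/; <$fh> };
close $fh;

# The label and the value sit on different lines of the input.
my $pattern = qr/Head\s+Teacher.+?<TD[^>]+>([^<]+)/is;
my $line_matches  = $first_line =~ $pattern ? 1 : 0;
my $slurp_matches = $whole      =~ $pattern ? 1 : 0;
print "one line: $line_matches, slurped: $slurp_matches\n";
```

The pattern can only succeed when both lines are in the same string, which is why the line-at-a-time loop never printed anything.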

# if ($line =~ /  (.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

}

close (IN);
close (OUT);

-----------------------------

which is working on for example

<TD align=left width="20%" colSpan=2>Head Teacher</TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2>Address</TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR> ....

Geoff

Yes: you read the first line of your file, and throw it away. That was
the line with Teacher etc in it. But even if you didn't do that, the
remainder of the lines are read one at a time, and no one line contains
enough stuff to match your pattern. Slurp it all, and your pattern
might match. Here is a slightly modified standalone copy/paste/execute
style copy of your program that looks like it might "work":

use strict;
use warnings;
#print ("name of html file?\n");
#my $namehtml = <STDIN>;

#print ("name of email list file?\n");
#my $newhtml = <STDIN>;

#open(IN, "$namehtml");
#open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /  (.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print ("Name: $1\nAddress: $2\n");
}

#}

#close (IN);
#close (OUT);

__END__
<TD align=left width="20%" colSpan=2>Head Teacher</TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2>Address</TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>

HTH.

Gunnar Hjalmarsson · Dec 7, 2003

Geoff said:
here is the whole code and a sample of html which it is to work on

And, as I suspected, the problem has nothing to do with the regex...
Read Bob's explanation carefully!

Gunnar Hjalmarsson · Dec 7, 2003

Matt said:
Geoff said:

this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is
matched as the code does not work for me. If I could learn from
this I could probably sort it out for myself ..

To break it down piece by piece:

/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is

I imagine this might be where your problem is. None of your match
patterns allow for zero occurrences, which means that there has to
be at least one character between the <td and closing >. In other
words, your pattern would never match <td>, but only something like
<td class="foo">.

Yeah, you are right, of course. Both the occurrences of

<TD[^>]+>

should better be

<TD[^>]*>

(But, as explained in other posts, that limitation was not the reason
why OP's code didn't "work".)
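
A quick sketch of the difference between the two quantifiers, using two made-up table cells (one with an attribute, one bare):

```perl
use strict;
use warnings;

# Two hypothetical cells: one with attributes, one bare.
my $with_attrs = '<TD class="foo">Fred Green</TD>';
my $bare       = '<TD>Fred Green</TD>';

# [^>]+ needs at least one character between "<TD" and ">".
my $plus_with = $with_attrs =~ /<TD[^>]+>([^<]+)/i ? 1 : 0;
my $plus_bare = $bare       =~ /<TD[^>]+>([^<]+)/i ? 1 : 0;

# [^>]* also accepts zero characters, so a bare <TD> matches too.
my $star_with = $with_attrs =~ /<TD[^>]*>([^<]+)/i ? 1 : 0;
my $star_bare = $bare       =~ /<TD[^>]*>([^<]+)/i ? 1 : 0;

print "plus: with=$plus_with bare=$plus_bare\n";
print "star: with=$star_with bare=$star_bare\n";
```

Only the + version rejects the bare cell; both versions accept the cell that has attributes.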

Geoff Cox · Dec 7, 2003

On Sun, 07 Dec 2003 19:53:03 GMT, Bob Walton

Bob,

many thanks for your thoughts - the following code gets the first set
of name/address data but stops at that point - 'afraid I haven't used
your form of slurp before and do not see how to move through the rest
of the file containing the name/address data?

Geoff

use strict;
use warnings;
print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;

open(DATA, "$namehtml");
open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /  (.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

#}

close (IN);
close (OUT);

Geoff Cox · Dec 7, 2003

And, as I suspected, the problem has nothing to do with the regex...
Read Bob's explanation carefully!

Gunnar

must be almost there - I have posted my version based on Bob's code
.... but it only gets the first name/address info - not clear how I
move through the rest of the file?

by the way - your code seems to work fine minus my suggestion re the
additional < ?!

Cheers

Geoff

Geoff Cox · Dec 7, 2003

To break it down piece by piece:

Matt,

many thanks - will read in a minute - but you might like to look at the
following code - this works OK except that it only gets the first set
of name/address data - I do not see at the moment how to move along
the slurped input to get the other sets of name/address info ..? any
ideas?! Cheers Geoff

use strict;
use warnings;
print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;

open(DATA, "$namehtml");
open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /  (.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

#}

close (DATA);
close (OUT);

Gunnar Hjalmarsson · Dec 7, 2003

Geoff said:
Bob,

many thanks for your thoughts - the following code gets the first
set of name/address data but stops at that point - 'afraid I
haven't used your form of slurp before and do not see how to move
through the rest of the file containing the name/address data?

Well, you haven't told us before that there are more than one
name/address pair.

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

Try to change that to

while ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

/isx ) {

and that to

/gisx ) {
-^

Geoff Cox · Dec 7, 2003

On Sun, 07 Dec 2003 19:53:03 GMT, Bob Walton

Bob,

many thanks for your thoughts - the following code gets the first set
of name/address data but stops at that point - 'afraid I haven't used
your form of slurp before and do not see how to move through the rest
of the file containing the name/address data?

Obvious really !! just need to use while instead of if and add the g
option ..
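
For anyone following along, a minimal sketch of that while plus /g loop. The input is a shortened version of the sample with a second, invented record added, and the @records array is just one way to collect the results.

```perl
use strict;
use warnings;

# Two records in the sample's layout; the second one is hypothetical.
my $html = <<'HTML';
<TD align=left>Head Teacher</TD>
<TD vAlign=top>Fred Green</TD></TR>
<TR><TD align=left>Address</TD>
<TD vAlign=top>Park Road, London N88 5XX</TD></TR>
<TR><TD align=left>Head Teacher</TD>
<TD vAlign=top>Jane Brown</TD></TR>
<TR><TD align=left>Address</TD>
<TD vAlign=top>Mill Lane, Leeds</TD></TR>
HTML

my @records;
# In scalar context, /g resumes after the previous match on each iteration.
while ( $html =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
                  .+?
                  Address.+?<TD[^>]+>([^<]+)
                 /gisx ) {
    push @records, { name => $1, address => $2 };
}

print "Name: $_->{name}\nAddress: $_->{address}\n" for @records;
```

With a plain if, the match always starts from the beginning of the string, so only the first record is ever found.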

Thanks everyone for all the help!

Cheers

Geoff

Geoff Cox · Dec 7, 2003

Well, you haven't told us before that there are more than one
name/address pair.

Gunnar,

sorry - I thought I had made it clear that the text I'd given was just
a sample of the file ... anyway - all's well that ends well!

Many thanks for all your help. I've learnt quite a bit tonight!

Cheers

Geoff

Tad McClellan · Dec 7, 2003

Geoff Cox said:
open(DATA, "$namehtml");
           ^         ^

perldoc -q vars

What's wrong with always quoting "$vars"?

The DATA filehandle is special, I leave it for its special uses,
choose some other name.

You should always, yes *always*, check the return value from open():

open(NAME, $namehtml) or die "could not open '$namehtml' $!";
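
Putting those three points together, a sketch of one possible shape for the file-reading part. It uses a lexical filehandle and the three-argument form of open, which go beyond what is suggested above; slurp_file is a hypothetical helper name, and a temporary file stands in for the real HTML input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file, dying with a useful message on failure.
sub slurp_file {
    my ($path) = @_;
    open my $in, '<', $path            # lexical handle, no quotes round $path
        or die "could not open '$path': $!";
    local $/;                          # slurp mode for this sub only
    my $content = <$in>;
    close $in;
    return $content;
}

# Demonstration with a temporary file standing in for the HTML input.
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
print {$tmp_fh} "<TD width=\"80%\">Fred Green</TD>\n";
close $tmp_fh or die "could not close '$tmp_name': $!";

my $content = slurp_file($tmp_name);
print $content;

# A missing file now fails loudly instead of silently reading nothing.
my $ok = eval { slurp_file("$tmp_name.missing"); 1 };
print "open failed as expected: $@" unless $ok;
```

One thing to watch if the name comes from <STDIN>: the three-argument open does not trim trailing whitespace the way the two-argument form does, so chomp the name first.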

[ snip 150 lines of full-quote. Please stop doing that. Soon. ]

Geoff Cox · Dec 8, 2003

^ ^

perldoc -q vars

What's wrong with always quoting "$vars"?

Tad, not sure what you mean above?

The DATA filehandle is special, I leave it for its special uses,
choose some other name.

OK will do.

You should always, yes *always*, check the return value from open():

open(NAME, $namehtml) or die "could not open '$namehtml' $!";

Thanks for the reminder.

[ snip 150 lines of full-quote. Please stop doing that. Soon. ]

ditto..

Geoff

Tintin · Dec 8, 2003

Geoff Cox said:
Tad, not sure what you mean above?

Have you read what it says? Is there something you don't understand in the
documentation?

Geoff Cox · Dec 8, 2003

I assume you mean simply that there is no need to have the quotes
round $vars if it is on its own?

Geoff

Tad McClellan · Dec 9, 2003

Geoff Cox said:
I assume you mean simply that there is no need to have the quotes
round $vars if it is on its own?

Yes.

Eric Schwartz · Dec 9, 2003

Geoff Cox said:
I assume you mean simply that there is no need to have the quotes
round $vars if it is on its own?

No, he means you should run 'perldoc -q vars' in a shell window, and
read the answer titled 'What's wrong with always quoting "$vars"?'.
It sounds like you should probably run 'perldoc perldoc' as well.
Perl comes with probably the best built-in documentation I've seen for
a programming language, but it's useless if you don't spend a bit of
time learning how to read it.
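
As a small illustration of the kind of problem that FAQ entry is about (this is a sketch, not a quote from the FAQ): double quotes force stringification, which is harmless for a filename but destroys a reference. The describe helper and the sample names are hypothetical.

```perl
use strict;
use warnings;

my @names = ('Fred Green', 'Jane Brown');   # illustrative data
my $ref   = \@names;

# Hypothetical helper: report what kind of thing it was handed.
sub describe {
    my ($arg) = @_;
    return ref($arg) eq 'ARRAY'
        ? 'array ref with ' . scalar(@$arg) . ' items'
        : 'plain string: not a reference';
}

my $unquoted = describe($ref);     # the reference arrives intact
my $quoted   = describe("$ref");   # quoting turned it into a string like ARRAY(0x...)

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```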

-=Eric
