Can someone please explain this regex?


Geoff Cox

Hello,

this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is matched
as the code does not work for me. If I could learn from this I could
probably sort it out for myself ..

Thanks

Geoff

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
 

Matt Garrish

Geoff Cox said:
Hello,

this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is matched
as the code does not work for me. If I could learn from this I could
probably sort it out for myself ..

To break it down piece by piece:

/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is

matches "head" (you have the /i switch on, so it will match any case)
followed by one or more whitespace characters, followed by "teacher",
followed by one or more characters up to an opening <td. You then have a
negated character class, so it will match all text up to the next closing >,
and then another negated character class will match and capture anything up
to the next opening <.

I imagine this might be where your problem is. None of your match patterns
allow for zero occurrences, which means that there has to be at least one
character between the <td and closing >. In other words, your pattern would
never match <td>, but only something like <td class="foo">.

Moving on, you then have two non-greedy matches (.+?). The first will match
anything up to "address" and the second will match anything up to the next
<td. The regex then repeats itself with the two negated classes: one looking
for the end of the <td> and the other capturing everything up to the next
opening <. And once again, your pattern will fail unless there is at least
one character between the <td and >.

(I removed the /x from your original posting because it just allows
whitespace and comments in your regex, which didn't help the readability of
it, in my opinion of course.)

Matt
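
For anyone who finds the /x layout easier to follow, here is a rough sketch of the same pattern spread out with comments. It assumes the HTML sample posted further down in this thread; the names $html, $name and $address are made up for the illustration. It uses m{...} delimiters so that a slash inside a comment cannot end the pattern early.

```perl
use strict;
use warnings;

# Sample rows copied from the HTML posted later in this thread.
my $html = <<'HTML';
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
HTML

my ( $name, $address );
if ( $html =~ m{
        Head\s+Teacher   # first label, any case because of the i flag
        .+?              # as little as possible, newlines included (s flag)
        <TD[^>]+>        # opening TD tag with at least one attribute character
        ([^<]+)          # capture 1: cell text up to the next "<"
        .+?
        Address          # second label
        .+?
        <TD[^>]+>
        ([^<]+)          # capture 2: the address text
    }isx ) {
    ( $name, $address ) = ( $1, $2 );
    print "Name: $name\nAddress: $address\n";
}
```

Note that this only matches when the whole block is in one string, so it says nothing yet about how the file is read.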
 

Geoff Cox

On Sun, 07 Dec 2003 18:02:07 GMT, Geoff Cox

I should have made things a bit clearer - so here is the whole code
and a sample of html which it is to work on .. can any one see why it
doesn't get the name and address info?!

Cheers

Geoff


My code is as follows but it does not work!

---------------------------
use strict;

print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(IN, "$namehtml");
open(OUT, ">>$newhtml");

my $line = <IN>;

while (defined($line=<IN>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

}

close (IN);
close (OUT);

-----------------------------

which is working on for example


<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>


Cheers

Geoff
 

Bob Walton

Geoff said:
On Sun, 07 Dec 2003 18:02:07 GMT, Geoff Cox

I should have made things a bit clearer - so here is the whole code
and a sample of html which it is to work on .. can any one see why it
doesn't get the name and address info?!

Cheers

Geoff


My code is as follows but it does not work!

-------------------------------^^^^^^^^^^^^^
A much more specific description of what your code does/doesn't do is
called for in a newsgroup posting. Please state exactly what it does
that it shouldn't do, or what it doesn't do that it should do. "Doesn't
work" is next to meaningless -- we can't read your mind.


use warnings;

print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(IN, "$namehtml");
open(OUT, ">>$newhtml");

my $line = <IN>;

Since you didn't modify $/, this will read only one line. I think
that's your fundamental problem. Try:

my $line;
{local $/;$line=<IN>} #slurp the input

and see if that works better.

while (defined($line=<IN>)) {

Here you are reading the rest of the lines of filehandle IN, but one at
a time. You will have skipped the first line (which was read above).
If you slurp the input, you should get rid of the while loop.

# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

}

close (IN);
close (OUT);

-----------------------------

which is working on for example


<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR> ....


Geoff

Yes: you read the first line of your file, and throw it away. That was
the line with Teacher etc in it. But even if you didn't do that, the
remainder of the lines are read one at a time, and no one line contains
enough stuff to match your pattern. Slurp it all, and your pattern
might match. Here is a slightly modified standalone copy/paste/execute
style copy of your program that looks like it might "work":

use strict;
use warnings;
#print ("name of html file?\n");
#my $namehtml = <STDIN>;

#print ("name of email list file?\n");
#my $newhtml = <STDIN>;


#open(IN, "$namehtml");
#open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print ("Name: $1\nAddress: $2\n");
}

#}

#close (IN);
#close (OUT);

__END__
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>

HTH.
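
To see the difference in isolation, the following sketch runs the same pattern once per line and once against the slurped text. It reads from an in-memory filehandle (available since Perl 5.8) so it can be pasted and run as is; $html, $re, $line_hits and $slurp_hit are names invented for this example, not part of the original program.

```perl
use strict;
use warnings;

# Same sample rows as in the post above.
my $html = <<'HTML';
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
HTML

my $re = qr/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is;

# Line at a time: no single line holds both labels, so nothing matches.
open my $fh, '<', \$html or die "cannot open in-memory handle: $!";
my $line_hits = 0;
while ( defined( my $line = <$fh> ) ) {
    $line_hits++ if $line =~ $re;
}
close $fh;

# Slurped: with $/ undefined the whole input arrives as one string.
open $fh, '<', \$html or die "cannot open in-memory handle: $!";
my $text = do { local $/; <$fh> };
close $fh;
my $slurp_hit = $text =~ $re ? 1 : 0;

print "line-by-line matches: $line_hits\n";
print "slurped match: $slurp_hit\n";
```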
 

Gunnar Hjalmarsson

Geoff said:
here is the whole code and a sample of html which it is to work on

And, as I suspected, the problem has nothing to do with the regex...
Read Bob's explanation carefully!
 

Gunnar Hjalmarsson

Matt said:
Geoff said:
this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is
matched as the code does not work for me. If I could learn from
this I could probably sort it out for myself ..

To break it down piece by piece:

/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is

I imagine this might be where your problem is. None of your match
patterns allow for zero occurrences, which means that there has to
be at least one character between the <td and closing >. In other
words, your pattern would never match <td>, but only something like
<td class="foo">.

Yeah, you are right, of course. Both the occurrences of

<TD[^>]+>

should better be

<TD[^>]*>

(But, as explained in other posts, that limitation was not the reason
why OP's code didn't "work".)
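
A minimal sketch of the difference between the two quantifiers, using two made-up tag strings rather than anything from the OP's file:

```perl
use strict;
use warnings;

# Hypothetical sample tags, not taken from the OP's file.
my @tags = ( '<td>', '<TD class="foo">' );

my ( @plus, @star );
for my $tag (@tags) {
    # [^>]+ needs at least one character between "<TD" and ">"
    push @plus, $tag if $tag =~ /<TD[^>]+>/i;
    # [^>]* also accepts a bare <td>
    push @star, $tag if $tag =~ /<TD[^>]*>/i;
}

print "plus matched: @plus\n";
print "star matched: @star\n";
```

Only the tag with attributes ends up in @plus, while both tags end up in @star.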
 

Geoff Cox

On Sun, 07 Dec 2003 19:53:03 GMT, Bob Walton

Bob,

many thanks for your thoughts - the following code gets the first set
of name/address data but stops at that point - 'afraid I haven't used
your form of slurp before and do not see how to move through the rest
of the file containing the name/address data?

Geoff

use strict;
use warnings;
print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(DATA, "$namehtml");
open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

#}

close (IN);
close (OUT);



 

Geoff Cox

And, as I suspected, the problem has nothing to do with the regex...
Read Bob's explanation carefully!

Gunnar

must be almost there - I have posted my version based on Bob's code
.... but it only gets the first name/address info - not clear how I
move through the rest of the file?

by the way - your code seems to work fine minus my suggestion re the
additional < ?!

Cheers

Geoff
 

Geoff Cox

To break it down piece by piece:

Matt,

many thanks - will read in a minute - but you might like to look at the
following code - this works OK except that it only gets the first set
of name/address data - I do not see at the moment how to move along
the slurped input to get the other sets of name/address info ..? Any
ideas?! Cheers Geoff

use strict;
use warnings;
print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(DATA, "$namehtml");
open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

#}

close (DATA);
close (OUT);



 

Gunnar Hjalmarsson

Geoff said:
Bob,

many thanks for your thoughts - the following code gets the first
set of name/address data but stops at that point - 'afraid I
haven't used your form of slurp before and do not see how to move
through the rest of the file containing the name/address data?

Well, you haven't told us before that there is more than one
name/address pair.

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

Try to change that to

while ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

/isx ) {

and that to

/gisx ) {
-^
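
As a rough sketch of what the while loop with the /g modifier does on slurped text: in scalar context each pass resumes where the previous match ended, so every name/address pair is visited in turn. The two-record sample below is an assumption (the second record is invented, and the attributes are shortened), as are the names $text and @records. It uses the [^>]* form suggested elsewhere in this thread.

```perl
use strict;
use warnings;

# Two records in the layout shown earlier; the second one is invented.
my $text = <<'HTML';
<TD align=left><B>Head Teacher</B></TD>
<TD vAlign=top>Fred Green</TD></TR>
<TR>
<TD align=left><B>Address</B></TD>
<TD vAlign=top>Park Road, Northgate,
London N88 5XX</TD></TR>
<TR>
<TD align=left><B>Head Teacher</B></TD>
<TD vAlign=top>Jane Brown</TD></TR>
<TR>
<TD align=left><B>Address</B></TD>
<TD vAlign=top>1 High Street, Anytown</TD></TR>
HTML

my @records;
while ( $text =~ /Head\s+Teacher.+?<TD[^>]*>([^<]+)
                  .+?
                  Address.+?<TD[^>]*>([^<]+)
                 /gisx ) {
    # $1 and $2 hold the captures of the current match only
    push @records, { name => $1, address => $2 };
}

printf "Name: %s\nAddress: %s\n", $_->{name}, $_->{address} for @records;
```

With a plain if and no /g, only the first pair would be found, which is the behaviour described above.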
 

Geoff Cox

On Sun, 07 Dec 2003 19:53:03 GMT, Bob Walton

Bob,

many thanks for your thoughts - the following code gets the first set
of name/address data but stops at that point - 'afraid I haven't used
your form of slurp before and do not see how to move through the rest
of the file containing the name/address data?

Obvious really !! just need to use while instead of if and add the g
option ..

Thanks everyone for all the help!

Cheers

Geoff


 

Geoff Cox

Well, you haven't told us before that there is more than one
name/address pair.

Gunnar,

sorry - I thought I had made it clear that the text I'd given was just
a sample of the file ... anyway - all's well that ends well!

Many thanks for all your help. I've learnt quite a bit tonight!

Cheers

Geoff
 

Tad McClellan

Geoff Cox said:
open(DATA, "$namehtml");
           ^         ^

perldoc -q vars

What's wrong with always quoting "$vars"?


The DATA filehandle is special, I leave it for its special uses,
choose some other name.


You should always, yes *always*, check the return value from open():

open(NAME, $namehtml) or die "could not open '$namehtml' $!";
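
Putting those three points together, a hedged sketch might look like the following. It uses lexical filehandles and three-argument open, which are a stylistic choice added here rather than something suggested in the post above. Because the three-argument form does not trim trailing whitespace from the filename the way the two-argument form does, the name is chomped first. The temporary file only stands in for whatever the user would type at the prompt, and $typed, $in and $missing are invented names.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the file the user would name at the prompt.
my ( $tmp_fh, $namehtml ) = tempfile();
print {$tmp_fh} "<TD vAlign=top>Fred Green</TD>\n";
close $tmp_fh or die "could not close '$namehtml': $!";

# A line read from STDIN still carries its newline; remove it.
my $typed = $namehtml . "\n";
chomp $typed;

# Own handle name instead of DATA, no quotes around the lone variable,
# and the return value of open() is checked.
open( my $in, '<', $typed ) or die "could not open '$typed': $!";
my $text = do { local $/; <$in> };    # slurp the file
close $in;

# A failed open returns false and sets $!, so it can be reported.
my $ok = open( my $missing, '<', "$typed.does-not-exist" );
print "open failed as expected: $!\n" unless $ok;

unlink $typed;
```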




[ snip 150 lines of full-quote. Please stop doing that. Soon. ]
 

Geoff Cox

^ ^

perldoc -q vars

What's wrong with always quoting "$vars"?

Tad, not sure what you mean above?

The DATA filehandle is special, I leave it for its special uses,
choose some other name.

OK will do.
You should always, yes *always*, check the return value from open():

open(NAME, $namehtml) or die "could not open '$namehtml' $!";

Thanks for the reminder.
[ snip 150 lines of full-quote. Please stop doing that. Soon. ]

ditto..

Geoff
 

Geoff Cox

I assume you mean simply that there is no need to have the quotes
round $vars if it is on its own?

Geoff
 

Eric Schwartz

Geoff Cox said:
I assume you mean simply that there is no need to have the quotes
round $vars if it is on its own?

No, he means you should run 'perldoc -q vars' in a shell window, and
read the answer titled 'What's wrong with always quoting "$vars"?'.
It sounds like you should probably run 'perldoc perldoc' as well.
Perl comes with probably the best built-in documentation I've seen for
a programming language, but it's useless if you don't spend a bit of
time learning how to read it.

-=Eric
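
As a small illustration of why the quotes are more than just unnecessary: double quotes force the value into a string, which is harmless for a filename but destroys a reference. The sketch below uses invented names ($list, $copy, $quoted) and is only meant to show that one effect.

```perl
use strict;
use warnings;

my $list = [ 'Fred Green', 'Park Road' ];

my $copy   = $list;      # still an array reference
my $quoted = "$list";    # now only a string that looks like ARRAY(0x...)

print "copy is a ",   ( ref($copy)   ? 'reference' : 'plain string' ), "\n";
print "quoted is a ", ( ref($quoted) ? 'reference' : 'plain string' ), "\n";
```

The quoted copy can no longer be dereferenced, so the data behind it is unreachable through $quoted.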
 
