Printing only a portion of a matched regex -- newbie question

DIAMOND Mark R.

My apologies to begin with. I am a relatively new and infrequent user of
Perl.

I have a series of html files with contact information for doctors. The
files have enormous amounts of other stuff in them including script, image
links and so on.
But the names all appear between a particular <span ...> tag and a </b> tag,
with words like "level7Name" or "level2Contact" (the quotes are in the
tag) marking the particular spans.
Line breaks don't seem to follow any particular pattern. The two structures
<span ... level.Name> .... nametoprint</b> and the equivalent for the
contact address are quite distinct, without any strange embedding of one in
the other.

What I'd like to do is print out the names, and the contact information, but
I've obviously gone wrong somewhere. I couldn't work out whether I should or
should not have a global at the end of the s///, but in either case, I still
have a problem. Any help would be very much appreciated.

$/ = ".\n";
$doctorlistfile = "c:\\tmp\\doctors.tmp";
open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
while(<>) {
s/<span +class=\"level[0-9]Name\"><b>([^<]*)<\/b>/ $1 /;
print DOCTORLISTFILE $1;
s/<span +class=\"level[0-9]Contact\"><b>([^<]*)<\/b>/ $1 /;
print DOCTORLISTFILE $1;
}
 
DIAMOND Mark R.

I should have added that I have searched the NG on Google Groups, but part
of the problem is that I'm not quite sure what I should be searching for.
Searching for "print only match OR matching" pointed me to solutions which
printed only *lines* with an appropriate match.

mark
 
DIAMOND Mark R.

Thanks, Brian. You are quite right. I just want to match, not change. And I
do want those newlines. But it only prints the first instance of a name. I
have made two slight changes. The first so that the print is conditional,
the second because I realised that the tag that marks the end of the name or
contact is not always the same, so I have checked for the beginning of the
tag only in the following.

$/ = ".\n";
$doctorlistfile = "c:\\tmp\\doctors.tmp";
open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
while(<>) {
print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Name"><b>([^<]*)</;
print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Contact"><b>([^<]*)</;
}

but as I say, only a single name (the first correct match) is extracted from
the file.

Another question to which I am unsure of the answer is whether the second
appearance of $1 is correct, or whether the indices of the $ increase
throughout the loop rather than just within each regex; i.e. is the first
match in the second regex actually called $2 ?

Cheers.
 
Joe Smith

DIAMOND said:
$/ = ".\n";
while(<>) {

If your file does not have any lines that end with a period, then
the entire file will be read in by <>, and the code inside the while{}
block will be executed only once. Try
print "$. = '$_'\n";
as a debugging aid.
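
A minimal sketch of that effect, assuming a made-up three-line sample in which no line ends with a period. The helper name count_records and the sample text are hypothetical; the point is only that the record separator decides how many times the read loop runs.

```perl
use strict;
use warnings;

# Hypothetical sample text: no line ends with a period followed by a newline.
my $text = "line one\nline two\nline three\n";

# Count how many records a read loop sees for a given record separator.
sub count_records {
    my ($separator, $data) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";
    my $count = 0;
    while (defined(my $record = <$fh>)) {
        $count++;
    }
    close($fh);
    return $count;
}

my $with_newline = count_records("\n",  $text);    # Perl's default separator
my $with_period  = count_records(".\n", $text);    # separator from the post

print "newline separator: $with_newline records\n";
print "period-newline separator: $with_period records\n";
```

With the separator from the post the whole sample arrives as a single record, so a loop body containing a plain "if m/.../" gets only one chance to match.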

DIAMOND said:
print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Name"><b>([^<]*)</;
print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Contact"><b>([^<]*)</;
Another question to which I am unsure of the answer is whether the second
appearance of $1 is correct

In each regex, $1 corresponds to the first set of capturing parentheses in
that regex. The presence of any other regex in the file does not change this.
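
A small sketch of this, assuming a made-up record with one name span and one contact span (the sample data and the variable names $record, $name and $contact are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical record containing one name span and one contact span.
my $record = '<span class="level7Name"><b>Dr A. Example</b>'
           . '<span class="level2Contact"><b>12 Sample St</b>';

my ($name, $contact);

# Each successful match sets $1 from its own first pair of parentheses.
if ($record =~ /<span +class="level[0-9]Name"><b>([^<]*)</) {
    $name = $1;
}
if ($record =~ /<span +class="level[0-9]Contact"><b>([^<]*)</) {
    $contact = $1;    # still $1 here, not $2
}

print "name: $name\n";
print "contact: $contact\n";
```

Copying $1 only inside the if matters, because a failed match leaves $1 holding whatever the last successful match put there.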
-Joe
 
gnari

DIAMOND Mark R. said:
Thanks, Brian. You are quite right. I just want to match, not change. And I
do want those newlines. But it only prints the first instance of a name. I
have made two slight changes. The first so that the print is conditional,
the second because I realised that the tag that marks the end of the name or
contact is not always the same, so I have checked for the beginning of the
tag only in the following.

$/ = ".\n";

this looks a bit tentative in light of your first post.
skip it

DIAMOND Mark R. said:
$doctorlistfile = "c:\\tmp\\doctors.tmp";
open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
while(<>) {
print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Name"><b>([^<]*)</;

you were almost there.
change the if to a while and add a /g:
print DOCTORLISTFILE "$1\n"
while m/<span +class="level[0-9]Name"><b>([^<]*)</g;

DIAMOND Mark R. said:
but as I say, only a single name (the first correct match) is extracted from
the file.

consistent with your $/, probably

DIAMOND Mark R. said:
Another question to which I am unsure of the answer is whether the second
appearance of $1 is correct, or whether the indices of the $ increase
throughout the loop rather than just within each regex; i.e. is the first
match in the second regex actually called $2 ?

each regex resets the $n variables
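
a rough sketch of the while-plus-/g idea suggested earlier in this reply, run against a made-up string instead of the real files (the sample HTML and the arrays @names and @contacts are hypothetical, and it assumes the whole file is already in one string):

```perl
use strict;
use warnings;

# Hypothetical sample standing in for one slurped HTML file.
my $html = <<'END_HTML';
<span class="level7Name"><b>Dr A. Example</b></span>
<span  class="level2Contact"><b>12 Sample St</b></span>
<span class="level3Name"><b>Dr B. Sample</b></span>
END_HTML

my (@names, @contacts);

# In scalar context, /g resumes after the previous match on each iteration,
# so the loop body runs once per occurrence rather than once per string.
while ($html =~ /<span +class="level[0-9]Name"><b>([^<]*)</g) {
    push @names, $1;
}

# The failed match that ended the first loop reset the search position,
# so this loop starts again from the beginning of the string.
while ($html =~ /<span +class="level[0-9]Contact"><b>([^<]*)</g) {
    push @contacts, $1;
}

print "$_\n" for @names, @contacts;
```

this collects all names first and then all contacts; keeping each name next to its contact would need a single pattern or a different loop.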

gnari
 
