Beginner: read $array with line breaks line by line


Marek Stepanek

Hello happy Perlers,


my aim is to transform a large address list in HTML into a LaTeX address list
for a series (?) letter (= the same letter with many different addresses).

You can find my HTML address list at the following address:

http://podiuminternational.org/addresses/competitions/competitionsfunds.htm

I read this file in without line breaks; that is, I read the whole file as
one line (this is how I understand it, I am a beginner) by setting $/ to
undef. First I read in the file and remove the HTML; the result is the
array @complete_address. The next step is to transform the entries of
@complete_address into a LaTeX file of the form:

\addrentry
{Lastname}
{Firstname}
{Address}
{Telephone}
{F1 } = m (male), w (female), u (hermaphrodite or unknown)
{F2 } = company
{F3 } = email
{F4 } = comment
{KEY} = key

I find these LaTeX entries far too short. If somebody has a Perl solution
for a series letter in LaTeX, I would really appreciate a hint! This part
is my question, and it is still in progress. The problem is: @complete_address
contains strings that each span several lines, which I would like to read line
by line. So I set $/ = "\n"; but this does not seem to work.

And a construct of

foreach my $addr (@competitions)
{ $/ = "\n";
while (<$addr>)
{
...
}
}

does not seem to be valid Perl.


Thank you for your patience


marek


Here is my script so far:


#! /usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

$/ = undef;

my (@competitions, @complete_address);

while (<>)
{
foreach my $entry (m"<dd>(.+?)</dd>"g)
{
push (@complete_address, $entry);
}
}

foreach my $e (@complete_address)
{
$e =~ s!<span\s+class="comp2">([^<]+)</span>!"Competition: " . $1 .
"\n\n"!ge;
$e =~ s!<br />!\n!g;
$e =~ s!<[^>]+>!!g;
push (@competitions, $e);
}

my $out_file1 = 'letter_comp_addr_01.adr';
open OUT1, "> $out_file1" or die "Cannot create your out_file: $!";
my $out_file2 = 'letter_comp_addr_02.adr';
open OUT2, ">> $out_file2" or die "Cannot create your out_file: $!";

my ($competition, $email, $first_name, $last_name, $gender);
foreach my $addr (@competitions)
{
$/ = "\n";
($competition) = $addr =~ m"^Competition:\s+(.+)";
$addr =~ s/^(International|National) Competition\s*$//i; # not working
($gender, $first_name, $last_name) = $addr =~
/^(Mr\.?|Mrs\.?)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+\.?)?)\s+([A-Z][a-z]+)\s*$/;
# not working either
if ($gender)
{
if ($gender eq m/Mrs\.?/ )
{
$gender = "w";
}
elsif ($gender eq m/Mr\.?/ )
{
$gender = "m";
}
elsif ($gender == 'undef' )
{
$gender = "u";
}
}
($email) = $addr =~ m"((&#\d+;)+)";
$email = decode_entities($email) if $email;

}

print OUT1 join ("\n\n", @competitions);
print OUT1 "\n\n";
print OUT2 "\\addrentry\n";
print OUT2 "\t{$first_name}\n" if $first_name;
print OUT2 "\t{$last_name}\n" if $last_name;
print OUT2 "\t{$competition}\n";
print OUT2 "\t{$gender}\n" if $gender;
print OUT2 "\t{$email}\n" if $email;

close OUT1;
close OUT2;
 

Peter J. Holzer

The problem is: @complete_address contains strings that each span several
lines, which I would like to read line by line. So I set $/ = "\n";
but this does not seem to work.

And a construct of

foreach my $addr (@competitions)
{ $/ = "\n";
while (<$addr>)
{
...
}
}

does not seem to be valid Perl.

The <> operator reads from a file handle, not a string. You probably
want to use the split function here:

foreach my $addr (@competitions)
{
foreach (split(/\n/, $addr))
{
...
}
}
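
As a minimal, self-contained sketch of the split approach (the sample string below is made up for illustration; the real entries come from @competitions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample entry; not taken from the real address file.
my $addr = "Competition: Example Prize\nMrs. Jane Doe\nMain Street 1";

# split returns one list element per line; the "\n" separators are dropped.
my @lines = split /\n/, $addr;

foreach my $line (@lines) {
    print "line: $line\n";
}
```

No filehandle and no change to $/ is involved; the string is simply cut at every newline.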


hp
 

Tad McClellan

Marek Stepanek said:
for a series (?) letter (=the same letter with many different addresses).
^^^^^^^^^^^^^^^^^


I think the term you are looking for is "mail merge".
 

Marek Stepanek

Michele said:
foreach my $addr (@competitions) {
open my $fh, '<', \$addr or die "D'Oh! $!\n";
local $/ = "\n";
while (<$fh>) {
# ...
}
}

BUT BEFORE I GET CHASTISED FOR POINTING YOU TO THIS "SOLUTION", let me
tell you that you DON'T want to do so. You most probably want to
split() on \n, instead.


Michele

:) Your trick looks funny!

Thank you Michele, thank you Peter,


for your answers. Something is not working. I understand what you
mean by split(/\n/, $addr). But my script hangs now! So I suppose
the global variable $_ that I inserted is not working on it?

I am sure there is an obvious mistake; sorry to bother you again with this
long script:


#! /usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

$/ = undef;

my (@competitions, @complete_address);

while (<>)
{
foreach my $entry (m"<dd>(.+?)</dd>"g)
{
push (@complete_address, $entry);
}
}

foreach my $e (@complete_address)
{
$e =~ s!<span\s+class="comp2">([^<]+)</span>!"Competition: " . $1 .
"\n\n"!ge;
$e =~ s!<br />!\n!g;
$e =~ s!<[^>]+>!!g;
push (@competitions, $e);
}

my $out_file1 = 'letter_comp_addr_01.adr';
open OUT1, "> $out_file1" or die "Cannot create your out_file: $!";
my $out_file2 = 'letter_comp_addr_02.adr';
open OUT2, ">> $out_file2" or die "Cannot create your out_file: $!";

my ($competition, $email, $first_name, $last_name, $gender);
foreach my $addr (@competitions)
{
foreach (split(/\n/, $addr)) #<-- did I understand it well?
{
($competition) = $_ =~ m"^Competition:\s+(.+)";
$_ =~ s/^(International|National) Competition\s*$//i;
($gender, $first_name, $last_name) = $_ =~
/^(Mr\.?|Mrs\.?)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+\.?)?)\s+([A-Z][a-z]+)\s*$/;
if ($gender)
{
if ( $gender eq m/Mrs\.?/ )
{
$gender = "w";
}
elsif ( $gender eq m/Mr\.?/ )
{
$gender = "m";
}
elsif ( $gender == 'undef' )
{
$gender = "u";
}
}
($email) = $_ =~ m"((&#\d+;)+)";
$email = decode_entities($email) if $email;
}
}

print OUT1 join ("\n\n", @competitions);
print OUT1 "\n\n";
print OUT2 "\\addrentry\n";
print OUT2 "\t{$first_name}\n" if $first_name;
print OUT2 "\t{$last_name}\n" if $last_name;
print OUT2 "\t{$competition}\n" if $competition;
print OUT2 "\t{$gender}\n" if $gender;
print OUT2 "\t{$email}\n" if $email;

close OUT1;
close OUT2;
 

Peter J. Holzer

:) Your trick looks funny!

Thank you Michele, thank you Peter,


for your answers. Something is not working. I understand what you
mean by split(/\n/, $addr). But my script hangs now!

I don't see where your script could "hang" except while reading its
input file and you didn't change that. It terminates just fine if I
invoke it as

../marek competitionsfunds.htm

Of course if you omit the file, it will read from STDIN, so you will
have to type in the html file :).

So I suppose the global variable $_ that I inserted is not working
on it?

I don't think I understand that sentence.
foreach (split(/\n/, $addr)) #<-- did I understand it well?
{ ....
}

split returns a list of the lines in $addr. The loop will run once for
each line, with $_ set to each line in turn (a "for (@list)" loop without
an explicit loop variable localizes $_ and aliases it to each element in
turn). So if $addr contains something
like

"Competition: Gradus ad Parnassum


National competition
Mrs. Barbara Schierl
Musik der Jugend
Promenade 37
A-4021 Linz
Fon: +43 732772015483
...."

$_ will be "Competition: Gradus ad Parnassum" during the first run of
the loop, "" during the second and third, "National competition" during
the fourth, etc.
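
A small sketch of that behaviour, using a shortened version of the entry above; the "next unless length" line is an addition for illustration, not part of the original script:

```perl
use strict;
use warnings;

# Shortened entry with two blank lines after the first line.
my $addr = "Competition: Gradus ad Parnassum\n\n\nNational competition\nMrs. Barbara Schierl";

my @seen;
foreach (split /\n/, $addr) {
    next unless length;   # blank lines arrive as empty strings; skip them
    push @seen, $_;
}
print scalar(@seen), " non-empty lines\n";
```

The two empty strings come from the blank lines between the competition name and the rest of the entry.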

I am sure there is an obvious mistake;

There are a few obvious mistakes in your script, but none that would
cause it to hang.
my ($competition, $email, $first_name, $last_name, $gender);
foreach my $addr (@competitions)
{ [...]
}

print OUT1 join ("\n\n", @competitions);
print OUT1 "\n\n";
print OUT2 "\\addrentry\n";
print OUT2 "\t{$first_name}\n" if $first_name;
print OUT2 "\t{$last_name}\n" if $last_name;
print OUT2 "\t{$competition}\n" if $competition;
print OUT2 "\t{$gender}\n" if $gender;
print OUT2 "\t{$email}\n" if $email;

There are a lot of addresses in your input file, yet you write only one to your
output file. Since you wrote earlier that you wanted to create a serial
letter (question to native speakers: is serial letter the right word?),
I guess you want all of them, so you have to move the print statements
into the loop.

foreach (split(/\n/, $addr)) #<-- did I understand it well?
{
($competition) = $_ =~ m"^Competition:\s+(.+)";
$_ =~ s/^(International|National) Competition\s*$//i;
($gender, $first_name, $last_name) = $_ =~
/^(Mr\.?|Mrs\.?)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+\.?)?)\s+([A-Z][a-z]+)\s*$/;

You assign a value to the variables $competition, $gender, etc. on every
run through the loop. After the loop you will have only the information
from the last line. You should assign these variables only if the
information you look for is in the line you are currently processing,
e.g.:

$competition = $1 if /^Competition:\s+(.+)/;
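
To make the difference visible, here is a small sketch with two made-up lines; the variable names $clobbered and $kept are only for illustration:

```perl
use strict;
use warnings;

# Hypothetical lines of one address entry.
my @lines = ("Competition: Example Prize", "Mrs. Jane Doe");

my ($clobbered, $kept);
foreach (@lines) {
    # A failed match in list context returns the empty list, so this
    # resets the variable to undef on every non-matching line.
    ($clobbered) = /^Competition:\s+(.+)/;

    # Assigning only after a successful match keeps the earlier value.
    $kept = $1 if /^Competition:\s+(.+)/;
}

print "clobbered: ", (defined $clobbered ? $clobbered : 'undef'), "\n";
print "kept: $kept\n";
```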

I think you are making it more difficult by splitting the
address block into lines. Just use regexps to extract the data you are
interested in from $addr.
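
A rough sketch of what that could look like, assuming the /m modifier and simplified patterns (both are assumptions for illustration, not a complete parser for the real data):

```perl
use strict;
use warnings;

# Shortened address block, shaped like the sample in this thread.
my $addr = "Competition: Gradus ad Parnassum\n\nNational competition\n"
         . "Mrs. Barbara Schierl\nFon: +43 732772015483";

# With /m, ^ and $ match at the start and end of every line in the block.
my ($competition)  = $addr =~ /^Competition:\s+(.+)$/m;
my ($title, $name) = $addr =~ /^(Mrs?\.?)\s+(.+)$/m;
my ($phone)        = $addr =~ /^Fon:\s*(.+)$/m;

print "$competition | $title | $name | $phone\n";
```

Each match runs against the whole block, so there is no inner loop and no value gets overwritten by a later line.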

if ($gender)
{ [...]
elsif ( $gender == 'undef' )

There are two errors in this line. First, the undefined value is not the
same as the string 'undef'. To check if $gender is undef you would have
to write

elsif ( !defined($gender) )

Second, if you really wanted to compare $gender to the string 'undef',
you would have to use the string comparison operator eq, not the
numerical comparison operator ==.

Oh, and third, if $gender is true, it has to be defined, so the test is
useless as it can never succeed.
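
A short sketch of the distinction between the undefined value and the string 'undef' (variable names are illustrative only):

```perl
use strict;
use warnings;

my $gender;                 # never assigned, so it holds the undefined value

# defined() is the test for the undefined value.
my $is_undef = !defined($gender);

# The string 'undef' is an ordinary, defined, true value.
my $string            = 'undef';
my $string_is_defined = defined($string);

# eq compares strings; == would compare both sides as numbers
# and warn that 'undef' is not numeric.
my $same_string = ($string eq 'undef');

print "is_undef=$is_undef string_is_defined=$string_is_defined same_string=$same_string\n";
```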

hp
 

Tad McClellan

Marek Stepanek said:
$e =~ s!<span\s+class="comp2">([^<]+)</span>!"Competition: " . $1 .
"\n\n"!ge;


The replacement string part of s/// is "double quotish" so you
get backslash escapes (\n) and interpolation ($1) for free.

No need for the eval (e) modifier:

$e =~ s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;
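
A quick way to convince yourself that both forms give the same result, using a one-line sample string made up for this sketch:

```perl
use strict;
use warnings;

my $html = '<span class="comp2">Gradus ad Parnassum</span>';

# Version with /e: the replacement is evaluated as Perl code.
(my $with_e = $html) =~ s!<span\s+class="comp2">([^<]+)</span>!"Competition: " . $1 . "\n\n"!ge;

# Version without /e: the replacement is treated like a double-quoted string.
(my $plain = $html) =~ s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;

print $with_e eq $plain ? "same result\n" : "different result\n";
```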
 

Marek Stepanek

$e =~ s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;

Thank you all for all the answers! I get only the competitions, and here and
there some email addresses, in my output file. But tomorrow I will probably
find the mistake(s) myself. (I am online only in the evening.) I am learning
an enormous amount from all your hints. So far my script looks as follows:

#! /usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

$/ = undef;

my (@competitions, @complete_address);

while (<>)
{
foreach my $entry (m"<dd>(.+?)</dd>"g)
{
push (@complete_address, $entry);
}
}

foreach my $e (@complete_address)
{
$e =~ s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;
$e =~ s!<br />!\n!g;
$e =~ s!<[^>]+>!!g;
push (@competitions, $e);
}

my $out_file1 = 'letter_comp_addr_01.adr';
open OUT1, "> $out_file1" or die "Cannot create your out_file: $!";
my $out_file2 = 'letter_comp_addr_02.adr';
open OUT2, ">> $out_file2" or die "Cannot create your out_file: $!";

print OUT1 join ("\n\n", @competitions);
print OUT1 "\n\n";

my ($competition, $email, $first_name, $last_name, $gender, $phone);
foreach my $addr (@competitions)
{
foreach (split(/\n/, $addr))
{
($competition) = $1 if m/^Competition:\s+(.+)/;
s/^(International|National) Competition\s*$//i;
($gender, $first_name, $last_name) = $_ =~
/^(Mr\.?|Mrs\.?\s+)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+\.?)?)\s+([A-Z][a-z]+(?:[-A-Z][a-z]+)?)\s*$/;
# this regex needs some refinement ... in work ...
# need some ideas for address and phone numbers again ... in work ...
if ($gender)
{
if ( $gender =~ m/Mrs\.?/ )
{
$gender = "w";
}
elsif ( $gender =~ m/Mr\.?/ )
{
$gender = "m";
}
else
{
$gender = "u";
}
}
($email) = $_ =~ m"((&#\d+;)+)";
$email = decode_entities($email) if $email;
}
if ($competition)
{
print OUT2 "\\addrentry\n";
print OUT2 "\t{$first_name}\n" if $first_name;
print OUT2 "\t{$last_name}\n" if $last_name;
print OUT2 "\t{$competition}\n";
print OUT2 "\t{$gender}\n" if $gender;
print OUT2 "\t{$email}\n" if $email;
}
}

close OUT1;
close OUT2;
 

Tad McClellan

Marek Stepanek said:
foreach my $entry (m"<dd>(.+?)</dd>"g)
{
push (@complete_address, $entry);
}


You can replace that whole foreach loop with:

push @complete_address, m"<dd>(.+?)</dd>"g;

There is no need to put them in one-at-a-time.
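
A minimal sketch with a made-up two-entry input; the /s modifier is an addition here so that an entry may span several lines, and whether the real file needs it is an assumption:

```perl
use strict;
use warnings;

# Hypothetical input in the same <dd>...</dd> shape as the real page.
my $html = "<dl><dd>first entry</dd>\n<dd>second\nentry</dd></dl>";

my @complete_address;

# In list context, m//g returns every capture from every match at once.
push @complete_address, $html =~ m"<dd>(.+?)</dd>"gs;

print scalar(@complete_address), " entries\n";
```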


open OUT1, "> $out_file1" or die "Cannot create your out_file: $!";


The error message should contain the name of the file:

open OUT1, '>', $out_file1 or die "Cannot create your file '$out_file1': $!";
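
The same idea as a self-contained sketch. The lexical filehandle ($out) and the temporary directory are choices made for this example, not something the original script uses:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Write into a throwaway directory so the sketch does not touch real files.
my $dir      = tempdir(CLEANUP => 1);
my $out_file = File::Spec->catfile($dir, 'letter_comp_addr_01.adr');

# Three-argument open; the error message names the file that failed.
open my $out, '>', $out_file
    or die "Cannot create file '$out_file': $!";
print {$out} "\\addrentry\n";
close $out
    or die "Cannot close file '$out_file': $!";
```

With the file name in the message, a failure tells you at once which of the two output files could not be opened.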
 

Brian McCauley

Marek said:
foreach my $e (@complete_address)
{
$e =~ s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;
$e =~ s!<br />!\n!g;
$e =~ s!<[^>]+>!!g;
push (@competitions, $e);
}

Note that the loop control variable, $e, is an _alias_ to the elements of
@complete_address, not a copy of them, so at the end of that loop
@competitions and @complete_address will have the same content (given
that @competitions was empty initially).

You can therefore discard one of them.
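
A tiny sketch of the aliasing, with made-up data:

```perl
use strict;
use warnings;

my @complete_address = ('a<br />b', 'c<br />d');
my @competitions;

foreach my $e (@complete_address) {
    $e =~ s!<br />!\n!g;      # changes the array element itself
    push @competitions, $e;   # pushes a copy of the already changed value
}

print "arrays match\n" if "@complete_address" eq "@competitions";
```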

This is also a case where using the implicit $_ would look tidier.

foreach (@complete_address)
{
s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;
s!<br />!\n!g;
s!<[^>]+>!!g;
}

But really, unless you have very tight control over the input file
format, you should be using a real HTML parser.
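
One possible sketch, assuming HTML::TokeParser (it ships in the same HTML-Parser distribution as HTML::Entities); the input string is made up, and this is only one of several parser modules that could be used:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input in the same <dd>...</dd> shape as the page.
my $html = '<dl><dd><span class="comp2">Example Prize</span><br />Mrs. Jane Doe</dd></dl>';

my $p = HTML::TokeParser->new(\$html)
    or die "Cannot set up the parser: $!";

my @entries;
while ($p->get_tag('dd')) {
    # get_text('/dd') collects the text up to the closing </dd>;
    # the tags in between are dropped and entities are decoded.
    push @entries, $p->get_text('/dd');
}

print "$_\n" for @entries;
```

In this simple form the <br /> tags vanish without leaving a line break, so keeping the line structure would need extra handling.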
 

Marek Stepanek

I don't know how to thank you for all your input. I came back this evening
intending to submit my new script once again, but now I realize I first
have to digest all your suggestions. So probably until tomorrow evening.

greetings from Munich


marek
 

Marek Stepanek

I am sorry, but I am still stuck with my script. Could a kind soul check
why some regexes are not producing anything (see the comment in the script)?
Once again, the original that has to be read in:

http://podiuminternational.org/htdocs/addresses/competitions/competitionsfunds.htm

thank you again for your patience and your great help


marek


#! /usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

$/ = undef;

my (@competitions);

while (<>)
{
push @competitions, m"<dd>(.+?)</dd>"g;
}


foreach (@competitions)
{
s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;
s!<br />!\n!g;
s!<p [^>]+>!\n!g;
s!<[^>]+>!!g;
}

my $out_file1 = 'letter_comp_addr_01.adr';
open OUT1, '>', $out_file1 or die "Cannot create your out_file $out_file1: $!";
my $out_file2 = 'letter_comp_addr_02.adr';
open OUT2, '>', $out_file2 or die "Cannot create your out_file $out_file2: $!";

print OUT1 join ("\n\n", @competitions);
print OUT1 "\n\n";

my ($competition, $email, $first_name, $last_name, $gender, $phone,
$comment);
foreach my $addr (@competitions)
{
foreach (split(/\n/, $addr))
{
($competition) = $1 if m/^Competition:\s+(.+)/;
if ($comment and m/^\s?((?:International|National) Competition)\s*$/i)
{
$comment .= "$1. ";
}
else
{
$comment = "$1. " if m/^\s?((?:International|National) Competition)\s*$/i;
}
if ($comment and
m/^(Categories.+|Age.+|Application.+|mentioned.+)/i)
{
$comment .= "$1. ";
}
else
{
$comment = "$1. " if
m/^(Categories.+|Age.+|Application.+|mentioned.+)/i;
}
($gender) = $& if m/(Mrs(?:\.|\b)|Mr(?:\.|\b))/i; # here nothing
# is read in ... why? A longer regex follows, meant to read in the names too,
# which was not working either. I was stuck here, so I have not attacked the
# problem with the addresses yet ...
#
# ($gender, $first_name, $last_name) = $_ =~
# /^ ?((?:Mr\.?|Mrs\.?)\s+)?([A-Z][a-z]+(?:\s+[A-Z][a-z]*\.?)?)\s+([A-Z][a-z]+(?:[-A-Z][a-z]+)?)\s?$/;
if ($phone and m/^\s*(?:(Fon:? .+)|(Fon:? .+))/i)
{
$phone .= "$1 ";
}
else
{
$phone = "$1 " if m/^\s*(?:(Fon:? .+)|(Fon:? .+))/i;
}
if ($gender)
{
if ( $gender =~ m/Mrs/ )
{
$gender = "w";
}
elsif ( $gender =~ m/Mr/ )
{
$gender = "m";
}
else
{
$gender = "u";
}
}
($email) = $_ =~ m"((&#\d+;)+)";
$email = decode_entities($email) if $email;
}
if ($competition)
{
print OUT2 "\\addrentry\n";
if ($last_name)
{
print OUT2 "\t{$last_name}\n";
}
else
{
print OUT2 "\t{last_name}\n";
}
if ($first_name)
{
print OUT2 "\t{$first_name}\n";
}
else
{
print OUT2 "\t{first_name}\n";
}
if ($phone)
{
print OUT2 "\t{$phone}\n";
}
else
{
print OUT2 "\t{phone}\n";
}
print OUT2 "\t{$competition}\n";
if ($gender)
{
print OUT2 "\t{$gender}\n";
}
else
{
print OUT2 "\t{gender}\n";
}
if ($email)
{
print OUT2 "\t{$email}\n";
}
else
{
print OUT2 "\t{email}\n";
}
if ($comment)
{
print OUT2 "\t{$comment}\n";
}
else
{
print OUT2 "\t{comment}\n"
}

}
($comment, $email, $gender, $competition, $last_name, $first_name,
$phone) = '';
}

close OUT1;
close OUT2;
 

Marek Stepanek

All in all clumsy clumsy stuff. How 'bout an $addr example along with
a description of how you want it to be parsed?


OK, dear Michele, I see I abused your time and patience. I will try this
weekend to get this script running. Meanwhile I changed it as follows, but
$first_name, $last_name, $gender and $address are not working, or are filled
in only in a wacky way. Probably the data are not regular enough to automate
the output with Perl ...

If you run this script

% ./addresses_comp.pl competitionsfunds.htm

on the page at:

http://podiuminternational.org/htdocs/addresses/competitions/competitionsfunds.htm

you will get the example you are asking about.


Thanks again, Michele and all who responded. I learned a lot with this
script, even if I did not succeed in the end ...


marek


#! /usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

$/ = undef;

my (@competitions);

while (<>)
{
push @competitions, (m!<a name="([^"]+)">.+?<dd>(.+?)</dd>!g);
}


foreach (@competitions)
{
s!<span\s+class="comp2">([^<]+)</span>!Competition: $1\n\n!g;
s!<br />!\n!g;
s!<p [^>]+>!\n!g;
s!<[^>]+>!!g;
}

my $out_file1 = 'letter_comp_addr_01.adr';
open OUT1, '>', $out_file1 or die "Cannot create your out_file $out_file1: $!";
my $out_file2 = 'letter_comp_addr_02.adr';
open OUT2, '>', $out_file2 or die "Cannot create your out_file $out_file2: $!";

print OUT1 join ("\n\n", @competitions);
print OUT1 "\n\n";

my ($last_name, $first_name, $address, $phone, $gender, $competition,
$email, $comment);
foreach my $addr (@competitions)
{
foreach (split(/\n/, $addr))
{
($first_name, $last_name) = $_ =~
/^([A-Z][a-z]+(?:\s+[A-Z][a-z]*\.?)?)\s+([A-Z][a-z]+(?:[-A-Z][a-z]+)?)\s?$/;
if ($address and m/^[-,.\s\w]+$/)
{
$address .= "$&";
}
else
{
$address = "$& \\\\" if m/^[-,.\s\w]+$/;
}
if ($phone and m/^\s*(Fon:? .+|Fax:? .+|http:.+)/i)
{
$phone .= "$& ";
}
else
{
$phone = "$& " if m/^\s*(?:(Fon:? .+)|(Fon:? .+))/i;
}
($gender) = $& if m/(Mrs(?:\.|\b)|Mr(?:\.|\b).+)/i;
if ($gender)
{
if ( $gender =~ m/Mrs.?/ )
{
$gender = "w";
}
elsif ( $gender =~ m/Mr.?/ )
{
$gender = "m";
}
else
{
$gender = "u";
}
}
($competition) = $1 if m/^Competition:\s+(.+)/;
($email) = $_ =~ m"((&#\d+;)+)";
$email = decode_entities($email) if $email;
if ($comment and m/^\s?((?:International|National) Competition)\s*$/i)
{
$comment .= "$1. ";
}
else
{
$comment = "$1. " if m/^\s?((?:International|National) Competition)\s*$/i;
}
if ($comment and
m/^(Categories.+|Age.+|Application.+|mentioned.+)/i)
{
$comment .= "$1. ";
}
else
{
$comment = "$1. " if
m/^(Categories.+|Age.+|Application.+|mentioned.+)/i;
}
}
if ($competition)
{
print OUT2 "\\addrentry\n";
print OUT2 $last_name ? "\t{$last_name}\n" : "\t{last_name}\n";
print OUT2 $first_name ? "\t{$first_name}\n" : "\t{first_name}\n";
print OUT2 $address ? "\t{$address}\n" : "\t{address}\n";
print OUT2 $phone ? "\t{$phone}\n" : "\t{phone}\n";
print OUT2 "\t{$competition}\n";
print OUT2 $gender ? "\t{$gender}\n" : "\t{gender}\n";
print OUT2 $email ? "\t{$email}\n" : "\t{email}\n";
print OUT2 $comment ? "\t{$comment}\n" : "\t{comment}\n";
}
($last_name, $first_name, $address, $phone, $gender, $competition,
$email, $comment) = '';
}

close OUT1;
close OUT2;
 

Peter J. Holzer

my ($competition, $email, $first_name, $last_name, $gender, $phone,
$comment);
foreach my $addr (@competitions)
{
foreach (split(/\n/, $addr))
{
($competition) = $1 if m/^Competition:\s+(.+)/;
if ($comment and m/^\s?((?:International|National)
Competition)\s*$/i)
[snip]

All in all clumsy clumsy stuff.

Clumsiness is the mark of the beginner. Elegance comes with practice.

And maybe we should have formulated our suggestions in a more clumsy
manner, because while Marek has copied pieces of code we suggested he
may not have understood them. For example I suggested the line

($competition) = $1 if m/^Competition:\s+(.+)/;

(although without the first set of parentheses, IIRC), but of course my
advice didn't apply only to this line but to all lines where he extracts
variables in a similar manner. Maybe it would have been more clear what
his error was (and still is in other places of the code) if I had
suggested

if (m/^Competition:\s+(.+)/) {
$competition = $1;
}

instead.


Instead of giving more advice on the code, I'd like to suggest that Marek
learn to use the Perl debugger (perl -d script.pl). Step through your
script line by line and see what values the variables have: That way
you will get a better understanding of how your program works and where
the errors are.

hp
 
