Assigning pattern matches to an array

Graham Stow

The following is a crude attempt at matching occurrences of email addresses
within files in a directory. However, I can't figure out why line 15 doesn't
assign the pattern matches to the @matches array. Any ideas gang, or have I
been eating too much turkey?

#!/usr/local/bin/perl
use File::Find;
@directories = ("c:/email2");
find (\&wanted, @directories);
sub wanted {
    $filename=$File::Find::name;
    if ($filename =~ /\.\w{3}$/) {
        push(@files, $filename);
    }
}
foreach $file (@files) {
    open (DATA, "$file") || die "Error opening $file\n";
    @whole_file = <DATA>;
    foreach $line (@whole_file) {
        @matches = /\b\w+@\w+\b/g;
    }
    close DATA || die "Unable to close $file\n";
    # closes the current file
}
foreach $match (@matches) {
    print "$match\n";
}
$count += @matches;
print "$count matches\n";
 
John W. Krahn

Graham said:
The following is a crude attempt at matching occurrences of email addresses
within files in a directory. However, I can't figure out why line 15 doesn't
assign the pattern matches to the @matches array. Any ideas gang, or have I
been eating too much turkey?

#!/usr/local/bin/perl

use warnings;
use strict;
use File::Find;
@directories = ("c:/email2");
find (\&wanted, @directories);
sub wanted {
$filename=$File::Find::name;
if ($filename =~ /\.\w{3}$/) {
push(@files, $filename);
}
}
foreach $file (@files) {
open (DATA, "$file") || die "Error opening $file\n";
@whole_file = <DATA>;
foreach $line (@whole_file) {
@matches = /\b\w+@\w+\b/g;

That line is short for:

@matches = $_ =~ /\b\w+@\w+\b/g;

But the current line is in $line not in $_ so you have to do:

@matches = $line =~ /\b\w+@\w+\b/g;

}
close DATA || die "Unable to close $file\n";
# closes the current file
}
foreach $match (@matches) {
print "$match\n";
}
$count += @matches;
print "$count matches\n";




John
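
A minimal sketch of the point above, using hypothetical sample lines rather than anything from the original files. A bare match tests $_, so when the loop variable is named $line the pattern never sees the line unless it is bound with =~. The variable names here are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the thread.
my @lines = ("contact alice\@example here\n", "no address on this line\n");

my (@implicit, @explicit);
{
    # Put an unrelated value in $_, since the named loop variable leaves it alone.
    local $_ = "nothing to see";
    for my $line (@lines) {
        push @implicit, /\b\w+\@\w+\b/g;           # matches against $_
        push @explicit, $line =~ /\b\w+\@\w+\b/g;  # matches against $line
    }
}

print "implicit: ", scalar(@implicit), "\n";   # implicit: 0
print "explicit: @explicit\n";                 # explicit: alice@example
```

The @ is written as \@ here only to keep it unambiguous under strict; the unescaped form in the thread behaves the same way in this pattern.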
 
Graham Stow

Makes sense, John, but it doesn't work: I still get 0 matches (and I'm certain I
should be getting some).
Graham
 
Uri Guttman

JWK> That line is short for:

JWK> @matches = $_ =~ /\b\w+@\w+\b/g;

JWK> But the current line is in $line not in $_ so you have to do:

JWK> @matches = $line =~ /\b\w+@\w+\b/g;

and that will overwrite any matches for the previous line. so the print
loop will only see the matches on the last line of a file. push is
needed here. or a map can be used which will remove the loop:

@matches = map /\b\w+@\w+\b/g, @lines ;

if that is in a file loop, use push also.

uri
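
A small sketch of the overwrite-versus-push behaviour described above, assuming three hypothetical lines in place of a real file. It also shows the map form, which puts each element in $_ so the bare match works without a loop.

```perl
use strict;
use warnings;

# Hypothetical input lines standing in for one file's contents.
my @lines = (
    "first bob\@host1 and carol\@host2\n",
    "second dave\@host3\n",
    "last line has no address\n",
);

my (@overwritten, @accumulated);
for my $line (@lines) {
    @overwritten = $line =~ /\b\w+\@\w+\b/g;      # replaced on every pass
    push @accumulated, $line =~ /\b\w+\@\w+\b/g;  # grows on every pass
}

# map sets $_ to each element and evaluates the match in list context.
my @mapped = map /\b\w+\@\w+\b/g, @lines;

printf "overwritten: %d, accumulated: %d, mapped: %d\n",
    scalar @overwritten, scalar @accumulated, scalar @mapped;
# overwritten: 0, accumulated: 3, mapped: 3
```

Because the last line has no match, plain assignment leaves the array empty, which is consistent with the "0 matches" result reported earlier in the thread.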
 
Graham Stow

Thanks Uri!
push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
The pattern doesn't match an email address, but I can work on that...
Graham
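
For reference, here is one way the whole script could look with the fixes from this thread applied. It is a sketch, not the poster's final code: the sub name collect_matches, the lexical filehandle, the -f check and taking directories from @ARGV are all assumptions added for illustration, and the pattern is still the original crude one.

```perl
use strict;
use warnings;
use File::Find;

# Collect every pattern match from files with a three-character extension
# under the given directories (the same name filter as the original script).
sub collect_matches {
    my @directories = @_;

    my @files;
    find(sub {
        # -f skips directories; this check is an addition to the original.
        push @files, $File::Find::name
            if -f $_ && $File::Find::name =~ /\.\w{3}$/;
    }, @directories);

    my @matches;
    for my $file (@files) {
        open my $fh, '<', $file or die "Error opening $file: $!\n";
        while (my $line = <$fh>) {
            # push keeps matches from earlier lines and earlier files.
            push @matches, $line =~ /\b\w+\@\w+\b/g;
        }
        close $fh or die "Unable to close $file: $!\n";
    }
    return @matches;
}

# Run only when directories are given on the command line,
# for example: perl script.pl c:/email2
if (@ARGV) {
    my @matches = collect_matches(@ARGV);
    print "$_\n" for @matches;
    print scalar(@matches), " matches\n";
}
```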
 
DJ Stunks

Graham said:
push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
The pattern doesn't match an email address, but I can work on that...

well, for one thing, the \w metacharacter doesn't match a literal .

don't roll your own email address regexp.

perldoc Email::Address

-jp
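
To see what the \w point means in practice, here is a tiny sketch with a hypothetical address. Because \w matches neither the dot nor the hyphen, the pattern keeps only the fragment immediately around the @.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the thread.
my $text = 'mail jane.doe@mail.example.org today';

my @found = $text =~ /\b\w+\@\w+\b/g;
print "@found\n";   # doe@mail
```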
 
Graham Stow

DJ Stunks said:
well, for one thing, the \w metacharacter doesn't match a literal .

don't roll your own email address regexp.

perldoc Email::Address

-jp

Email::Address doesn't grab me.
I ran a quick test comparing
use Email::Address;
push(@matches, Email::Address->parse($line));
with
push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
The latter pulled up a number of correct email addresses, while the former
pulled these up plus other strings that weren't true email addresses.
Graham
 
Paul Lalli

Graham said:
Email::Address doesn't grab me
Done a quick test between
use Email::Address
push(@matches, Email::Address->parse($line));
and
push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
The latter pulled up a number of correct email addresses, while the former
pulled these up plus other stuff that weren't true email addresses

Says you. I trust Email::Address's belief of what a "true" email
address is a hell of a lot better than yours. Just because they don't
look like what you might consider "normal" addresses doesn't mean they
aren't valid. Email::Address follows the RFC. Your handrolled
solution does not.

Paul Lalli
 
Gunnar Hjalmarsson

Graham said:
Done a quick test between
use Email::Address
push(@matches, Email::Address->parse($line));
and
push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
The latter pulled up a number of correct email addresses, while the former
pulled these up plus other stuff that weren't true email addresses

Could you post some example data showing that?

Paul said:
Says you. I trust Email::Address's belief of what a "true" email
address is a hell of a lot better than yours. Just because they don't
look like what you might consider "normal" addresses doesn't mean they
aren't valid. Email::Address follows the RFC. Your handrolled
solution does not.

I suspect that a library that accepts _all_ RFC 822 compliant addresses
isn't an adequate tool for parsing out substrings from any document that
are likely email addresses.
 
DJ Stunks

Graham said:
Email::Address doesn't grab me
Done a quick test between
use Email::Address
push(@matches, Email::Address->parse($line));
and
push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
The latter pulled up a number of correct email addresses, while the former
pulled these up plus other stuff that weren't true email addresses

try (untested):

push @matches, map { $_->address } Email::Address->parse($line);

-jp
 
Graham Stow

DJ Stunks said:
try (untested):

push @matches, map { $_->address } Email::Address->parse($line);

-jp
Using the above line of
push @matches, map { $_->address } Email::Address->parse($line);
on a directory including one 'Word' document containing four email addresses, I got the output:-
(e-mail address removed)

(e-mail address removed)}}}{\f1\fs20\lang1033\langfe1033\langnp1033

(e-mail address removed)

(e-mail address removed)}}}{\f1\fs20\lang1033\langfe1033\langnp1033

(e-mail address removed)

(e-mail address removed)}}}{\f1\fs20\lang1033\langfe1033\langnp1033

(e-mail address removed)

(e-mail address removed)}}}{\f1\fs20\lang1033\langfe1033\langnp1033

8 matches

Using 'my' line of
push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
on a directory including one textfile and one 'Word' document, both containing a few email addresses, I got the output:-
(e-mail address removed)

(e-mail address removed)

(e-mail address removed)

(e-mail address removed)

(e-mail address removed)

5 matches

Interestingly, neither is perfect (neither resolves edward_woodward@ correctly), but at least mine doesn't produce the additional characters after the email address that Email::Address produces.
 
Gunnar Hjalmarsson

Graham said:
Using the above line of
push @matches, map { $_->address } Email::Address->parse($line);
on a directory including one 'Word' document containing four email
addresses, I got the output:-

(e-mail address removed)

(e-mail address removed)}}}{\f1\fs20\lang1033\langfe1033\langnp1033

If I have understood it correctly, RFC 822 does not accept backslashes
in the domain part of an address. A bug in Email::Address?
 
Ben Bacarisse

Gunnar Hjalmarsson said:
If I have understood it correctly, RFC 822 does not accept backslashes
in the domain part of an address. A bug in Email::Address?

The governing RFC is now 2822 and, no, it does not allow \ in the
domain. A quick look at the source suggests that this is a simple
omission. While the RFC defines what *is* allowed in a "dot-atom",
Email::Address lists what is to be excluded (control characters and
"special" characters) and \ is not there. The bug seems to be in the
line:

my $special = q[()<>\\[\\]:;@\\,."];

the intent being, presumably, to have \\\\ rather than to quote the
comma.

To the OP: fixing this will not solve your problem as curly brackets
*are* allowed so even a corrected Email::Address will parse more than
you'd like. If you were to use a package that can pull apart a Word
document so that you match only in the text parts you might not see
this problem since I imagine that the {}s are part of the document
structure rather than its text. Otherwise your only option is to
match less than the RFC allows. A reasonable heuristic might be that
no TLDs contain { or }.
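
As a rough sketch of the "match less than the RFC allows" idea, the pattern below stops the domain at the last dot-separated label made only of letters, so braces and backslashes can never be part of a match. The pattern and the sample line are illustrative assumptions, not a validator and not anything from Email::Address, and the pattern will reject some addresses that the RFC accepts.

```perl
use strict;
use warnings;

# Deliberately narrower than RFC 2822. This is an illustrative heuristic only.
my $likely_address = qr/
    [\w.+-]+                 # local part: word characters, dot, plus, hyphen
    \@
    (?:[A-Za-z0-9-]+\.)+     # one or more domain labels, each followed by a dot
    [A-Za-z]{2,}             # TLD: letters only, so no braces or backslashes
/x;

# Hypothetical RTF-like line, modelled on the output shown in the thread.
my $line = 'mailto:someone@example.com}}}{\f1\fs20\lang1033 text';

my @found = $line =~ /($likely_address)/g;
print "$_\n" for @found;   # someone@example.com
```

Extracting the text from the Word document first, as suggested above, would still be the cleaner fix. A heuristic like this only limits the damage when matching against raw file contents.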
 
