Assigning pattern matches to an array

Discussion in 'Perl Misc' started by Graham Stow, Dec 30, 2006.

  1. Graham Stow

    Graham Stow Guest

    The following is a crude attempt at matching occurrences of email addresses
    within files in a directory. However, I can't figure out why line 15 doesn't
    assign the pattern matches to the @matches array. Any ideas gang, or have I
    been eating too much turkey?

    #!/usr/local/bin/perl
    use File::Find;
    @directories = ("c:/email2");
    find (\&wanted, @directories);
    sub wanted {
    $filename=$File::Find::name;
    if ($filename =~ /\.\w{3}$/) {
    push(@files, $filename);
    }
    }
    foreach $file (@files) {
    open (DATA, "$file") || die "Error opening $file\n";
    @whole_file = <DATA>;
    foreach $line (@whole_file) {
    @matches = /\b\w+@\w+\b/g;
    }
    close DATA || die "Unable to close $file\n";
    # closes the current file
    }
    foreach $match (@matches) {
    print "$match\n";
    }
    $count += @matches;
    print "$count matches\n";
     
    Graham Stow, Dec 30, 2006
    #1

  2. Graham Stow wrote:
    > The following is a crude attempt at matching occurrences of email addresses
    > within files in a directory. However, I can't figure out why line 15 doesn't
    > assign the pattern matches to the @matches array. Any ideas gang, or have I
    > been eating too much turkey?
    >
    > #!/usr/local/bin/perl


    use warnings;
    use strict;

    > use File::Find;
    > @directories = ("c:/email2");
    > find (\&wanted, @directories);
    > sub wanted {
    > $filename=$File::Find::name;
    > if ($filename =~ /\.\w{3}$/) {
    > push(@files, $filename);
    > }
    > }
    > foreach $file (@files) {
    > open (DATA, "$file") || die "Error opening $file\n";
    > @whole_file = <DATA>;
    > foreach $line (@whole_file) {
    > @matches = /\b\w+@\w+\b/g;


    That line is short for:

    @matches = $_ =~ /\b\w+@\w+\b/g;

    But the current line is in $line not in $_ so you have to do:

    @matches = $line =~ /\b\w+@\w+\b/g;
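
    A minimal sketch of the difference, assuming two made-up sample lines and
    an arbitrary value sitting in $_ (neither comes from the original script):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from a file.
my @lines = ('mail alice@example today', 'then bob@example tomorrow');

$_ = 'nothing useful';    # stand-in for whatever $_ happens to hold

my (@implicit, @explicit);
for my $line (@lines) {
    @implicit = /\b\w+@\w+\b/g;             # no =~, so this tests $_
    @explicit = $line =~ /\b\w+@\w+\b/g;    # binds the match to $line
}

print scalar(@implicit), " implicit, ", scalar(@explicit), " explicit\n";
```

    The unbound match never looks at $line, so @implicit stays empty no
    matter what the lines contain.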


    > }
    > close DATA || die "Unable to close $file\n";
    > # closes the current file
    > }
    > foreach $match (@matches) {
    > print "$match\n";
    > }
    > $count += @matches;
    > print "$count matches\n";





    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, Dec 30, 2006
    #2

  3. Graham Stow

    Graham Stow Guest

    "John W. Krahn" <> wrote in message
    news:Vsxlh.96488$YV4.8365@edtnps89...
    > Graham Stow wrote:
    >> The following is a crude attempt at matching occurrences of email
    >> addresses
    >> within files in a directory. However, I can't figure out why line 15
    >> doesn't
    >> assign the pattern matches to the @matches array. Any ideas gang, or have
    >> I
    >> been eating too much turkey?
    >>
    >> #!/usr/local/bin/perl

    >
    > use warnings;
    > use strict;
    >
    >> use File::Find;
    >> @directories = ("c:/email2");
    >> find (\&wanted, @directories);
    >> sub wanted {
    >> $filename=$File::Find::name;
    >> if ($filename =~ /\.\w{3}$/) {
    >> push(@files, $filename);
    >> }
    >> }
    >> foreach $file (@files) {
    >> open (DATA, "$file") || die "Error opening $file\n";
    >> @whole_file = <DATA>;
    >> foreach $line (@whole_file) {
    >> @matches = /\b\w+@\w+\b/g;

    >
    > That line is short for:
    >
    > @matches = $_ =~ /\b\w+@\w+\b/g;
    >
    > But the current line is in $line not in $_ so you have to do:
    >
    > @matches = $line =~ /\b\w+@\w+\b/g;
    >
    >
    >> }
    >> close DATA || die "Unable to close $file\n";
    >> # closes the current file
    >> }
    >> foreach $match (@matches) {
    >> print "$match\n";
    >> }
    >> $count += @matches;
    >> print "$count matches\n";

    >
    >
    >
    >
    > John
    > --
    > Perl isn't a toolbox, but a small machine shop where you can special-order
    > certain sorts of tools at low cost and in short order. -- Larry Wall

    Makes sense, John, but it doesn't work - I still get 0 matches (and I'm
    certain I should be getting some).
    Graham
     
    Graham Stow, Dec 30, 2006
    #3
  4. Uri Guttman

    Uri Guttman Guest

    >>>>> "JWK" == John W Krahn <> writes:


    JWK> That line is short for:

    JWK> @matches = $_ =~ /\b\w+@\w+\b/g;

    JWK> But the current line is in $line not in $_ so you have to do:

    JWK> @matches = $line =~ /\b\w+@\w+\b/g;

    and that will overwrite any matches for the previous line. so the print
    loop will only see the matches on the last line of a file. push is
    needed here. or a map can be used which will remove the loop:

    @matches = map /\b\w+@\w+\b/g, @lines ;

    if that is in a file loop, use push also.
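
    a rough sketch of the three variants, using invented sample lines (the
    data and variable names are assumptions, not taken from the script above):

```perl
use strict;
use warnings;

# Hypothetical lines standing in for the contents of one file.
my @lines = (
    'mail alice@example today',
    'no address here',
    'then bob@example and carol@example',
);

my (@overwritten, @pushed);
for my $line (@lines) {
    @overwritten = $line =~ /\w+@\w+/g;    # replaced on every pass
    push @pushed, $line =~ /\w+@\w+/g;     # accumulates across lines
}

# map runs the match once per element and flattens the results,
# so no explicit loop is needed.
my @mapped = map /\w+@\w+/g, @lines;

print scalar(@overwritten), " overwritten, ",
      scalar(@pushed), " pushed, ",
      scalar(@mapped), " mapped\n";
```

    only the last line's matches survive in @overwritten; the push and map
    versions keep all of them. inside a loop over several files, the map
    result would itself need to be pushed onto a list declared outside
    that loop.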

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Dec 30, 2006
    #4
  5. Graham Stow

    Graham Stow Guest

    "Uri Guttman" <> wrote in message
    news:...
    >>>>>> "JWK" == John W Krahn <> writes:

    >
    >
    > JWK> That line is short for:
    >
    > JWK> @matches = $_ =~ /\b\w+@\w+\b/g;
    >
    > JWK> But the current line is in $line not in $_ so you have to do:
    >
    > JWK> @matches = $line =~ /\b\w+@\w+\b/g;
    >
    > and that will overwrite any matches for the previous line. so the print
    > loop will only see the matches on the last line of a file. push is
    > needed here. or a map can be used which will remove the loop:
    >
    > @matches = map /\b\w+@\w+\b/g, @lines ;
    >
    > if that is in a file loop, use push also.
    >
    > uri
    >
    > --
    > Uri Guttman ------ --------
    > http://www.stemsystems.com
    > --Perl Consulting, Stem Development, Systems Architecture, Design and
    > Coding-
    > Search or Offer Perl Jobs ----------------------------
    > http://jobs.perl.org


    Thanks Uri!
    push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
    The pattern doesn't match an email address, but I can work on that...
    Graham
     
    Graham Stow, Dec 30, 2006
    #5
  6. DJ Stunks

    DJ Stunks Guest

    Graham Stow wrote:
    > push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
    > The pattern doesn't match an email address, but I can work on that...


    well, for one thing, the \w metacharacter doesn't match a literal .

    don't roll your own email address regexp.

    perldoc Email::Address
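
    a quick illustration of the \w point, using a made-up address (the
    example.com address below is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical text with a dotted local part and a dotted domain.
my $text = 'contact graham.stow@mail.example.com please';

# \w covers letters, digits and underscore only, so the match
# cannot cross a '.' on either side of the '@'.
my @found = $text =~ /\b\w+@\w+\b/g;

print "matched: @found\n";
print "dot is a word character: ", ('.' =~ /\w/ ? 'yes' : 'no'), "\n";
```

    the pattern stops at the nearest dot in both directions, so only a
    fragment of the address is captured.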

    -jp
     
    DJ Stunks, Dec 30, 2006
    #6
  7. Graham Stow

    Graham Stow Guest

    "DJ Stunks" <> wrote in message
    news:...
    > Graham Stow wrote:
    >> push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
    >> The pattern doesn't match an email address, but I can work on that...

    >
    > well, for one thing, the \w metacharacter doesn't match a literal .
    >
    > don't roll your own email address regexp.
    >
    > perldoc Email::Address
    >
    > -jp
    >


    Email::Address doesn't grab me.
    I did a quick test between
    use Email::Address;
    push(@matches, Email::Address->parse($line));
    and
    push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
    The latter pulled up a number of correct email addresses, while the former
    pulled these up plus other stuff that weren't true email addresses.
    Graham
     
    Graham Stow, Dec 30, 2006
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    Graham Stow schreef:

    > /\b\w+@\w+\b/g;



    The \b's are superfluous there.
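
    A small check of that claim on a few invented strings (the sample data
    is an assumption; agreement on samples is an illustration, not a proof).
    The reasoning is that \w+ is greedy and the engine takes the leftmost
    match, so a match already starts and ends at a word boundary:

```perl
use strict;
use warnings;

# Hypothetical test strings, including punctuation next to the words.
my @samples = (
    'mail alice@example today',
    'x.bob_1@host-name, carol@example.',
    '@nobody nobody@ a@b',
);

my $all_same = 1;
for my $s (@samples) {
    my @with    = $s =~ /\b\w+@\w+\b/g;    # original pattern
    my @without = $s =~ /\w+@\w+/g;        # same pattern minus the \b's
    my $same = "@with" eq "@without";
    $all_same &&= $same;
    print "[@with] vs [@without]: ", ($same ? 'same' : 'different'), "\n";
}
```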

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Dec 30, 2006
    #8
  9. Paul Lalli

    Paul Lalli Guest

    Graham Stow wrote:

    > Email::Address doesn't grab me
    > Done a quick test between
    > use Email::Address
    > push(@matches, Email::Address->parse($line));
    > and
    > push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
    > The latter pulled up a number of correct email address, while the former
    > pulled these up plus other stuff that weren't true email addresses


    Says you. I trust Email::Address's belief of what a "true" email
    address is a hell of a lot better than yours. Just because they don't
    look like what you might consider "normal" addresses doesn't mean they
    aren't valid. Email::Address follows the RFC. Your handrolled
    solution does not.

    Paul Lalli
     
    Paul Lalli, Dec 30, 2006
    #9
  10. Paul Lalli wrote:
    > Graham Stow wrote:
    >>Done a quick test between
    >> use Email::Address
    >> push(@matches, Email::Address->parse($line));
    >>and
    >> push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
    >>The latter pulled up a number of correct email address, while the former
    >>pulled these up plus other stuff that weren't true email addresses


    Could you post some example data showing that?

    > Says you. I trust Email::Address's belief of what a "true" email
    > address is a hell of a lot better than yours. Just because they don't
    > look like what you might consider "normal" addresses doesn't mean they
    > aren't valid. Email::Address follows the RFC. Your handrolled
    > solution does not.


    I suspect that a library that accepts _all_ RFC 822 compliant addresses
    isn't an adequate tool for parsing out substrings from any document that
    are likely email addresses.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 30, 2006
    #10
  11. DJ Stunks

    DJ Stunks Guest

    Graham Stow wrote:
    > "DJ Stunks" <> wrote in message
    > news:...
    > > Graham Stow wrote:
    > >> push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
    > >> The pattern doesn't match an email address, but I can work on that...

    > >
    > > well, for one thing, the \w metacharacter doesn't match a literal .
    > >
    > > don't roll your own email address regexp.
    > >
    > > perldoc Email::Address

    >
    > Email::Address doesn't grab me
    > Done a quick test between
    > use Email::Address
    > push(@matches, Email::Address->parse($line));
    > and
    > push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
    > The latter pulled up a number of correct email address, while the former
    > pulled these up plus other stuff that weren't true email addresses


    try (untested):

    push @matches, map { $_->address } Email::Address->parse($line);

    -jp
     
    DJ Stunks, Dec 30, 2006
    #11
  12. Graham Stow

    Graham Stow Guest

    "DJ Stunks" <> wrote in message news:...
    > Graham Stow wrote:
    >> "DJ Stunks" <> wrote in message
    >> news:...
    >> > Graham Stow wrote:
    >> >> push(@matches, $line=~/\b\w+@\w+\b/g); did it for me
    >> >> The pattern doesn't match an email address, but I can work on that...
    >> >
    >> > well, for one thing, the \w metacharacter doesn't match a literal .
    >> >
    >> > don't roll your own email address regexp.
    >> >
    >> > perldoc Email::Address

    >>
    >> Email::Address doesn't grab me
    >> Done a quick test between
    >> use Email::Address
    >> push(@matches, Email::Address->parse($line));
    >> and
    >> push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
    >> The latter pulled up a number of correct email address, while the former
    >> pulled these up plus other stuff that weren't true email addresses

    >
    > try (untested):
    >
    > push @matches, map { $_->address } Email::Address->parse($line);
    >
    > -jp
    >

    Using the above line of
    push @matches, map { $_->address } Email::Address->parse($line);
    on a directory including one 'Word' document containing four email addresses, I got the output:-


    }}}{\f1\fs20\lang1033\langfe1033\langnp1033



    }}}{\f1\fs20\lang1033\langfe1033\langnp1033



    }}}{\f1\fs20\lang1033\langfe1033\langnp1033



    }}}{\f1\fs20\lang1033\langfe1033\langnp1033

    8 matches

    Using 'my' line of
    push(@matches, $line=~/\b[.-\w]*@[-\w]*\.+[-\w]*\.*[-\w]*\b/g);
    on a directory including one text file and one 'Word' document, both containing a few email addresses, I got the output:-










    5 matches

    Interestingly, neither is perfect (neither resolves edward_woodward@ correctly), but at least mine doesn't produce the extra characters beyond the email address that Email::Address does.
     
    Graham Stow, Dec 31, 2006
    #12
  13. Graham Stow wrote:
    > "DJ Stunks" wrote:
    > >
    > > try (untested):
    > >
    > > push @matches, map { $_->address } Email::Address->parse($line);
    > >

    > Using the above line of
    > push @matches, map { $_->address } Email::Address->parse($line);
    > on a directory including one 'Word' document containing four email
    > addresses, I got the output:-
    >
    >
    >
    > }}}{\f1\fs20\lang1033\langfe1033\langnp1033


    If I have understood it correctly, RFC 822 does not accept backslashes
    in the domain part of an address. A bug in Email::Address?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 31, 2006
    #13
  14. Gunnar Hjalmarsson <> writes:

    > Graham Stow wrote:
    >> "DJ Stunks" wrote:
    >> >
    >> > try (untested):
    >> >
    >> > push @matches, map { $_->address } Email::Address->parse($line);
    >> >

    >> Using the above line of
    >> push @matches, map { $_->address } Email::Address->parse($line);
    >> on a directory including one 'Word' document containing four email
    >> addresses, I got the output:-
    >>
    >> }}}{\f1\fs20\lang1033\langfe1033\langnp1033

    >
    > If I have understood it correctly, RFC 822 does not accept backslashes
    > in the domain part of an address. A bug in Email::Address?


    The governing RFC is now 2822 and, no, it does not allow \ in the
    domain. A quick look at the source suggests that this is a simple
    omission. While the RFC defines what *is* allowed in a "dot-atom",
    Email::Address lists what is to be excluded (control characters and
    "special" characters) and \ is not there. The bug seems to be in the
    line:

    my $special = q[()<>\\[\\]:;@\\,."];

    the intent being, presumably, to have \\\\ rather than to quote the
    comma.
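
    To illustrate the quoting issue in isolation, here is a sketch that
    drops both strings into a plain character class. How the module actually
    uses $special is not shown here, so the character-class usage and the
    variable names below are my assumptions:

```perl
use strict;
use warnings;

# The string as quoted above, and the presumed intended version.
my $as_written = q[()<>\\[\\]:;@\\,."];
my $suggested  = q[()<>\\[\\]:;@\\\\,."];

# Assumed usage: the string is interpolated into a character class.
my $written_class   = qr/[$as_written]/;
my $suggested_class = qr/[$suggested]/;

for my $char ('\\', ',') {
    printf "%s  as written: %s  suggested: %s\n",
        $char,
        ($char =~ $written_class   ? 'special' : 'not special'),
        ($char =~ $suggested_class ? 'special' : 'not special');
}
```

    In the first string, q[] turns \\, into a backslash followed by a
    comma, which a character class reads as an escaped comma, so a literal
    backslash is never listed. Doubling it again leaves a real backslash
    in the class.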

    To the OP: fixing this will not solve your problem as curly brackets
    *are* allowed so even a corrected Email::Address will parse more than
    you'd like. If you were to use a package that can pull apart a Word
    document so that you match only in the text parts you might not see
    this problem since I imagine that the {}s are part of the document
    structure rather than its text. Otherwise your only option is to
    match less than the RFC allows. A reasonable heuristic might be that
    no TLDs contain { or }.
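
    One way to sketch the "match less than the RFC allows" option is a
    deliberately narrow pattern. The pattern, the sample RTF-like line and
    the example.com / example.org addresses are all assumptions made up
    for illustration; it will miss some valid addresses and is not a
    validator:

```perl
use strict;
use warnings;

# Hypothetical line resembling RTF markup around two addresses.
my $line = '}}}{\f1\fs20\lang1033 fred.bloggs@example.com}{\f1 jane_doe@mail.example.org\par}';

# Narrow heuristic: a simple local part, dot-separated domain labels,
# and a letters-only TLD, so braces and backslashes end the match.
my $likely_address = qr/[\w.+-]+\@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}/;

my @matches = $line =~ /$likely_address/g;
print "$_\n" for @matches;
```

    Because the TLD part accepts letters only, the trailing } and \par are
    left out of each match.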

    --
    Ben.
     
    Ben Bacarisse, Dec 31, 2006
    #14
  15. Ben Bacarisse wrote:
    > Gunnar Hjalmarsson <> writes:
    >>Graham Stow wrote:
    >>>
    >>>}}}{\f1\fs20\lang1033\langfe1033\langnp1033

    >>
    >>If I have understood it correctly, RFC 822 does not accept backslashes
    >>in the domain part of an address. A bug in Email::Address?

    >
    > The governing RFC is now 2822 and, no, it does not allow \ in the
    > domain.


    Thanks, Ben, for confirming my suspicion and for reporting the bug.

    http://rt.cpan.org/Public/Bug/Display.html?id=24161

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jan 1, 2007
    #15
  16. Gunnar Hjalmarsson <> writes:

    > Ben Bacarisse wrote:
    >> Gunnar Hjalmarsson <> writes:
    >>>Graham Stow wrote:
    >>>>
    >>>>}}}{\f1\fs20\lang1033\langfe1033\langnp1033
    >>>
    >>>If I have understood it correctly, RFC 822 does not accept backslashes
    >>>in the domain part of an address. A bug in Email::Address?

    >> The governing RFC is now 2822 and, no, it does not allow \ in the
    >> domain.

    >
    > Thanks, Ben, for confirming my suspicion and for reporting the bug.
    >
    > http://rt.cpan.org/Public/Bug/Display.html?id=24161


    You are welcome and, sorry, I should have posted back here that I had
    reported it.

    --
    Ben.
     
    Ben Bacarisse, Jan 1, 2007
    #16
