regex help!

Discussion in 'Perl Misc' started by Geoff Cox, Sep 13, 2003.

  1. Geoff Cox

    Geoff Cox Guest

    Hello,

    I am trying to extract email addresses from about 1000 htm files.

    So far I am trying

    if ($line =~ /Mailto:(.*)"/) {
        print OUT ("$1 \n");
    }

    where the line is

    <a href="mailto:"

    The problem is the " after the email address: the "greedy" regex
    matches up to another " further along the line ...

    Can I stop at the first " mark?

    Cheers

    Geoff
     
    Geoff Cox, Sep 13, 2003
    #1
  2. Geoff Cox wrote:
    > Hello,
    >
    > I am trying to extract email addresses from about 1000 htm files.


    E-mail address harvesting in your spare time, are you?

    > if ($line =~ /Mailto:(.*)"/) {
    > print OUT ("$1 \n");

    [cut]
    > problem is with the " after the email address and the "greedy" regex
    > characteristic which finds other " further along the line ...


    Read the perlre manual about changing the "greediness" of a
    quantifier with "?".
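
    As a rough illustration of what that "?" changes, here is a small
    sketch. The sample line, the address and the title attribute are all
    made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample line; the address and the title attribute are invented.
my $line = '<a href="mailto:someone@example.com" title="Write to us">Contact</a>';

# Greedy: .* takes as much as it can, so the match ends at the LAST quote.
my ($greedy) = $line =~ /mailto:(.*)"/;

# Non-greedy: .*? takes as little as it can, so the match ends at the FIRST quote.
my ($lazy) = $line =~ /mailto:(.*?)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

    The greedy capture runs on into the title attribute, while the
    non-greedy one stops at the end of the address.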


    --
    Andreas Kähäri
     
    Andreas Kahari, Sep 13, 2003
    #2
  3. Geoff Cox wrote:

    > Hello,
    >
    > I am trying to extract email addresses from about 1000 htm files.
    >
    > So far am trying
    >
    > if ($line =~ /Mailto:(.*)"/) {
    > print OUT ("$1 \n");
    >
    > where the line is
    >
    > <a href="mailto:"
    >
    > problem is with the " after the email address and the "greedy" regex
    > characteristic which finds other " further along the line ...
    >
    > can I stop at the first " mark?


    /Mailto:(.*?)"/

    You know that won't match your example, don't you? Not unless you add the 'i'
    flag (for 'i'gnore case):


    /Mailto:(.*?)"/i
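
    A quick way to see the effect of the flag, assuming an input line
    with a lowercase "mailto:" (the address is invented for the sketch):

```perl
use strict;
use warnings;

# Hypothetical input: the HTML says "mailto:", the pattern says "Mailto:".
my $line = '<a href="mailto:someone@example.com">Contact</a>';

# Without /i the capital M has to match literally, so this match fails.
my $case_sensitive = $line =~ /Mailto:(.*?)"/ ? $1 : undef;

# With /i the pattern accepts "mailto:", "Mailto:", "MAILTO:" and so on.
my $case_insensitive = $line =~ /Mailto:(.*?)"/i ? $1 : undef;

print defined $case_sensitive ? "matched: $case_sensitive\n" : "no match without /i\n";
print "matched with /i: $case_insensitive\n";
```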

    hth-

    --
    Michael Budash
     
    Michael Budash, Sep 13, 2003
    #3
  4. Geoff Cox

    Geoff Cox Guest

    On Sat, 13 Sep 2003 07:33:31 GMT, Michael Budash wrote:

    >/Mailto:(.*?)"/
    >
    >you know that won't match your example don't you? unless you add the 'i'
    >flag (for 'i'gnore case):


    Michael,

    Thanks for the help - the following code works now, but I get the error
    message "uninitialized value in string ne at ..." on the line marked with **
    below. Do you know why?

    Cheers

    Geoff

    use warnings;
    use strict;

    use File::Find;

    open (OUT, ">>out");

    my $dir = 'c:/atemp1/directory';

    find ( sub {

        open (IN, "$_");
        my $line = <IN>;
    **  while ($line ne "") {
            if ($line =~ /Mailto:(.*?)"/i) {
                print OUT ("$1 \n");
            }
            $line = <IN>;
        }

    }, $dir);

    close (OUT);


    >
    >/Mailto:(.*?)"/i
    >
    >hth-
     
    Geoff Cox, Sep 13, 2003
    #4
  5. Geoff Cox wrote:
    [cut]
    > Thanks for the help - the following code works now, but I get the error
    > message "uninitialized value in string ne at ..." on the line marked with **
    > below. Do you know why?

    [cut]
    > open (IN, "$_");
    > my $line = <IN>;
    > ** while ($line ne "") {
    > if ($line =~ /Mailto:(.*?)"/i) {
    > print OUT ("$1 \n");

    [cut]


    What happens at the end of a file? Well, <IN> will give you an
    undefined value. This will also happen if the open() call failed.
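
    To see this in isolation, here is a small sketch that reads past
    the end of a one-line "file". The in-memory filehandle and the
    variable names are only there for the demonstration.

```perl
use strict;
use warnings;

# An in-memory "file" with a single line stands in for one of the .htm files.
my $data = "only line\n";
open my $in, '<', \$data or die "Failed in open(): $!";

my $first  = <$in>;   # the one line in the file
my $second = <$in>;   # nothing left to read, so this is undef, not ""

print "first read:  ", (defined $first  ? "defined" : "undef"), "\n";
print "second read: ", (defined $second ? "defined" : "undef"), "\n";

# Comparing $second with ne "" at this point is what produces the
# "uninitialized value" warning under "use warnings".
close $in;
```

    The original loop still stops, because undef is treated as an empty
    string in the comparison, but not before perl has warned about it.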


    --
    Andreas Kähäri
     
    Andreas Kahari, Sep 13, 2003
    #5
  6. Geoff Cox

    Geoff Cox Guest

    On Sat, 13 Sep 2003 08:21:39 +0000 (UTC), Andreas Kahari wrote:

    >Geoff Cox wrote:
    >[cut]
    >> Thanks for the help - the following code works now, but I get the error
    >> message "uninitialized value in string ne at ..." on the line marked with **
    >> below. Do you know why?

    >[cut]
    >> open (IN, "$_");
    >> my $line = <IN>;
    >> ** while ($line ne "") {
    >> if ($line =~ /Mailto:(.*?)"/i) {
    >> print OUT ("$1 \n");

    >[cut]
    >
    >
    >What happens at the end of a file? Well, <IN> will give you an
    >undefined value. This will also happen if the open() call failed.


    Andreas,

    Ah! Well, the open call works, so it must be the end-of-file part. Is
    there a better way than using while ($line ne "")? eof?

    Geoff
     
    Geoff Cox, Sep 13, 2003
    #6
  7. Geoff Cox wrote:
    > On Sat, 13 Sep 2003 08:21:39 +0000 (UTC), Andreas Kahari wrote:
    >> Geoff Cox wrote:

    [cut]
    >>> open (IN, "$_");
    >>> my $line = <IN>;
    >>> ** while ($line ne "") {
    >>> if ($line =~ /Mailto:(.*?)"/i) {
    >>> print OUT ("$1 \n");

    >>[cut]
    >>
    >>
    >>What happens at the end of a file? Well, <IN> will give you an
    >>undefined value. This will also happen if the open() call failed.

    >
    > Andreas,
    >
    > ah! well the open call works so must be the end of file part - is
    > there a better way than using while ($line ne "" ) ? eof?


    Yes, a much much better way:

    while(defined($line = <IN>)) {
    ... code ...
    }

    And personally I would say

    open(IN, $_) or die "Failed in open(): $!";
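
    Put together, the two suggestions might look roughly like the
    sketch below. It swaps in a lexical filehandle and three-argument
    open, which nobody in this thread used, and extract_addresses is a
    made-up helper name.

```perl
use strict;
use warnings;

# Return every mailto: address found in one file.
# extract_addresses is a hypothetical helper name, not from the thread.
sub extract_addresses {
    my ($path) = @_;
    my @found;

    # Three-argument open with a lexical filehandle; stop if the file can't be read.
    open(my $in, '<', $path) or die "Failed in open() for $path: $!";

    # The loop ends when <$in> returns undef at end of file,
    # so $line is never compared while it is undefined.
    while (defined(my $line = <$in>)) {
        if ($line =~ /mailto:(.*?)"/i) {
            push @found, $1;
        }
    }
    close($in);
    return @found;
}
```

    Inside the File::Find callback this could be called as
    extract_addresses($_). The callback is also invoked for
    directories, so a guard such as "return unless -f $_;" before the
    call is probably worth having.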


    Cheers,
    Andreas

    --
    Andreas Kähäri
     
    Andreas Kahari, Sep 13, 2003
    #7
  8. Geoff Cox

    Geoff Cox Guest

    On Sat, 13 Sep 2003 08:39:03 +0000 (UTC), Andreas Kahari wrote:

    >Yes, a much much better way:
    >
    > while(defined($line = <IN>)) {
    > ... code ...
    > }
    >
    >And personally I would say
    >
    > open(IN, $_) or die "Failed in open(): $!";


    will use both - thanks!

    Geoff

    >
    >
    >Cheers,
    >Andreas
     
    Geoff Cox, Sep 13, 2003
    #8
  9. Geoff Cox wrote:

    > I am trying to extract email addresses from about 1000 htm files.
    >
    > So far am trying
    >
    > if ($line =~ /Mailto:(.*)"/) {
    > print OUT ("$1 \n");
    >
    > where the line is
    >
    > <a href="mailto:"
    >
    > problem is with the " after the email address and the "greedy" regex
    > characteristic which finds other " further along the line ...
    >
    > can I stop at the first " mark?


    Change your thinking a bit. Instead of matching "Mailto:" followed by as
    many characters as possible followed by a quote, match "Mailto:" followed
    by as many non-quote characters as possible followed by a quote:

    if ($line =~ /Mailto:([^"]*)"/)

    Also consider making it case-insensitive with the i modifier.
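
    A small sketch of the negated character class, with the i modifier
    added. The /g modifier, used here to pick up more than one address
    per line, is an extra assumption, and both addresses are invented.

```perl
use strict;
use warnings;

# Hypothetical line with two links; both addresses are made up.
my $line = '<a href="mailto:first@example.com">A</a> <a href="MAILTO:second@example.org">B</a>';

# [^"]* can never cross a double quote, so each capture stops at the closing
# quote of its own href. /i ignores case, /g collects every match on the line.
my @addresses = $line =~ /mailto:([^"]*)"/gi;

print "$_\n" for @addresses;
```

    Because the class itself excludes the quote, the capture cannot run
    past the end of the attribute however the rest of the line looks.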

    --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

     
    Eric J. Roode, Sep 13, 2003
    #9
  10. Geoff Cox

    Geoff Cox Guest

    On Sat, 13 Sep 2003 09:22:06 -0500, "Eric J. Roode" wrote:


    >Change your thinking a bit. Instead of matching "Mailto:" followed by as
    >many characters as possible followed by a quote, match "Mailto:" followed
    >by as many non-quote characters as possible followed by a quote:
    >
    > if ($line =~ /Mailto:([^"]*)"/)


    Thanks Eric - will give it a try...

    Cheers

    Geoff

    >
    >Also consider making it case-insensitive with the i modifier.
     
    Geoff Cox, Sep 13, 2003
    #10