regex help!

Geoff Cox

Hello,

I am trying to extract email addresses from about 1000 htm files.

So far I am trying

if ($line =~ /Mailto:(.*)"/) {
    print OUT ("$1 \n");
}

where the line is

<a href="mailto:[email protected]"

The problem is with the " after the email address: the "greedy" regex
matches up to another " further along the line ...

Can I stop at the first " mark?

Cheers

Geoff
 
Andreas Kahari

Geoff Cox said:
I am trying to extract email addresses from about 1000 htm files.

E-mail address harvesting in your spare time, are you?

Geoff Cox said:
problem is with the " after the email address and the "greedy" regex
characteristic which finds other " further along the line ...

Read the perlre manual about changing the "greediness" of a
quantifier with "?".
 
Michael Budash

Geoff Cox said:
can I stop at the first " mark?

/Mailto:(.*?)"/

You know that won't match your example, don't you? Your line has a
lowercase "mailto:", so you need to add the 'i' flag (for 'i'gnore case):


/Mailto:(.*?)"/i

hth-
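
For anyone who wants to see the difference side by side, here is a minimal sketch. The sample line, the address and the variable names are made up for illustration; only the two patterns come from the posts above.

```perl
use strict;
use warnings;

# Hypothetical sample line with a second quoted attribute after the address.
my $line = '<a href="mailto:someone@example.com" class="contact">Mail us</a>';

# Greedy: .* runs to the end of the line, then backtracks to the LAST quote.
my ($greedy) = $line =~ /Mailto:(.*)"/i;

# Non-greedy: .*? stops at the FIRST quote after "mailto:".
my ($lazy) = $line =~ /Mailto:(.*?)"/i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture should drag in the class attribute, while the non-greedy one should hold just the address.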
 
Geoff Cox

Michael,

Thanks for the help - the following code works now, but I get the error
message "uninitialized value in string ne at ..." on the line marked **
below - do you know why?

Cheers

Geoff

use warnings;
use strict;

use File::Find;

open (OUT, ">>out");

my $dir = 'c:/atemp1/directory';

find ( sub {

    open (IN, "$_");
    my $line = <IN>;
**  while ($line ne "") {
        if ($line =~ /Mailto:(.*?)"/i) {
            print OUT ("$1 \n");
        }
        $line = <IN>;
    }

}, $dir);

close (OUT);
 
Andreas Kahari

Geoff Cox said:
Thanks for the help - following code works now but I get the error
message "uninitialized value in string ne at ..." on the line with a **
below - do you know why? [cut]
open (IN, "$_");
my $line = <IN>;
** while ($line ne "") {
if ($line =~ /Mailto:(.*?)"/i) {
print OUT ("$1 \n");
[cut]


What happens at the end of a file? Well, <IN> will give you an
undefined value, and comparing that with ne is what triggers the
warning. This will also happen if the open() call failed.
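
A small sketch of that behaviour, assuming an in-memory filehandle and a lexical handle name ($in) that are not in the original code; they are used here only so the example runs without a real file.

```perl
use strict;
use warnings;

# Read from an in-memory filehandle so the sketch needs no real file.
my $data = "first line\nsecond line\n";
open(my $in, '<', \$data) or die "Failed in open(): $!";

my @results;
for my $attempt (1 .. 3) {
    my $line = <$in>;
    # The third read hits end of file and yields undef, not "".
    push @results, defined($line) ? 'defined' : 'undef';
}
close($in);

print join(', ', @results), "\n";
```

Two reads succeed and the third comes back undefined, which is the value the ne comparison complains about.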
 
Geoff Cox

Andreas,

ah! well the open call works so must be the end of file part - is
there a better way than using while ($line ne "" ) ? eof?

Geoff
 
Andreas Kahari

Yes, a much much better way:

while(defined($line = <IN>)) {
... code ...
}

And personally I would say

open(IN, $_) or die "Failed in open(): $!";


Cheers,
Andreas
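
Put together, the two suggestions might look roughly like this. It is only a sketch: the helper name extract_addresses, the lexical filehandles, the three-argument open and the temporary demo file are assumptions added for illustration, not part of the code posted earlier.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return every mailto: address found in one file.
sub extract_addresses {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Failed in open() for $path: $!";
    my @found;
    # The loop ends cleanly when <$in> returns undef at end of file.
    while (defined(my $line = <$in>)) {
        push @found, $1 if $line =~ /Mailto:(.*?)"/i;
    }
    close($in);
    return @found;
}

# Demo on a temporary file with made-up content.
my ($fh, $tmp) = tempfile(UNLINK => 1);
print {$fh} qq{<a href="mailto:someone\@example.com" title="x">a</a>\n};
print {$fh} qq{<p>no address here</p>\n};
close($fh);

print "$_\n" for extract_addresses($tmp);
```

Inside the File::Find callback the same helper could be called with $_ as the path.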
 
Geoff Cox

will use both - thanks!

Geoff
 
Eric J. Roode

Geoff Cox said:
problem is with the " after the email address and the "greedy" regex
characteristic which finds other " further along the line ...

can I stop at the first " mark?

Change your thinking a bit. Instead of matching "Mailto:" followed by as
many characters as possible followed by a quote, match "Mailto:" followed
by as many non-quote characters as possible followed by a quote:

if ($line =~ /Mailto:([^"]*)"/)

Also consider making it case-insensitive with the i modifier.
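
A minimal sketch of the negated character class in action. The sample lines and addresses are made up, and the /g variant for several links on one line is an extra that goes beyond what was asked.

```perl
use strict;
use warnings;

# Hypothetical sample line; the address and attributes are made up.
my $line = '<a href="mailto:someone@example.com" class="contact">Mail us</a>';

# [^"]* can never cross a quote, so the capture ends at the first one.
my ($address) = $line =~ /Mailto:([^"]*)"/i;
print defined($address) ? "$address\n" : "no match\n";

# With /g in list context the same pattern collects every address on a line.
my $two = '<a href="mailto:a@example.com">A</a> <a href="mailto:b@example.com">B</a>';
my @all = $two =~ /Mailto:([^"]*)"/gi;
print "$_\n" for @all;
```

Because the class excludes the quote itself, no backtracking past the closing quote is possible, whatever else is on the line.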

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Geoff Cox

Thanks Eric - will give it a try...

Cheers

Geoff
 
