/^From:.*?([\w.-]+@[\w.-]+)/

April

We know that you are new to Perl. What people want to see is some
indication that you are putting out some effort yourself, and that you
don't expect us to spoon feed it to you. If the answer to your question
is in the docs, hopefully you have made some effort to find the answer
there.

Thanks .. I had this piece of program that was not working, and what I was
expecting was a kind of reassurance, so I could rule out certain things and
focus on the others.

In my original post I was trying to ask specific questions, and then
in following posts I asked for reassurance, which is important
as it gives me confidence.

I'm not expecting lengthy responses (though I appreciate them), just
some quick confirmation and a short explanation; direction would be
a plus.
 
Tad J McClellan

Scott Bryce said:
The regulars here expect a certain posting style because it communicates
effectively in this medium.


What the regulars expect is that you will quote enough to provide
necessary background, and that YOU will skip the unnecessary parts by
trimming them from your post.

Which illustrates the effectiveness mentioned above.

Q: How many people spend time on the unnecessary parts if the
poster trims them?
A: One

Q: How many people spend time on the unnecessary parts if the
poster leaves them in?
A: Dozens or even hundreds.


Shifting the work from one person to hundreds of other people will
likely annoy hundreds of other people.
 
Jürgen Exner

April said:
if test:  elsif ($header && /^From:.*?([\w.-]+@[\w.-]+)/)
as && requires the matching needs to be done in the header ($header)

what I meant was as $header also needs to be evaluated true (&&), so

And this is exactly the main problem that I personally have with your
postings. Computers are the most notorious nitpicking bastards you can
possibly imagine. One character in the wrong place and your program may
do something completely different from what you intended it to do.
Very often your articles and questions are vague and ambiguous or at the
very least confusing. Please, if you want help then at least try to be
as exact as possible.

Above you wrote verbatim "the matching needs to be done in the header".
So Scott correctly pointed out that code doesn't do that, instead it
matches in $_. And now you are changing your story and saying that's not
what you meant. Well, thank you for leading Scott and everyone else on a
wild goose chase.

Now, having said that, things like this happen to everyone every now and
then. However in your case it happens very, very often. And that makes
it very frustrating for your audience.
..
this test must be done in the header section which is what $header
stands for.

And that is the second main problem I personally have with your
postings: the information you provide is incomplete.
How are we supposed to know what $header stands for in your program? By
all typical programming patterns and naming conventions it's most likely
a string containing the (or a) header from whatever your program is
processing. There is no way to guess that you are using it as some sort
of flag.
We do not, cannot, and don't want to know all of your program. But those
parts that you post need to be self-contained, such that other people
can understand what is going on. There is a very good reason why the
posting guidelines strongly suggest a minimal self-contained program
instead of an incomplete excerpt from a larger program.

jue
 
April

April said:
April wrote:
if test:  elsif ($header && /^From:.*?([\w.-]+@[\w.-]+)/)
as && requires the matching needs to be done in the header ($header)
what I meant was as $header also needs to be evaluated true (&&), so

And this is exactly the main problem that I personally have with your
postings. Computers are the most notorious nitpicking bastards you can
possibly imagine. One character in the wrong place and your program may
do something completely different from what you intended it to do.
Very often your articles and questions are vague and ambiguous or at the
very least confusing. Please, if you want help then at least try to be
as exact as possible.

Above you wrote verbatim "the matching needs to be done in the header".
So Scott correctly pointed out that code doesn't do that, instead it
matches in $_. And now you are changing your story and saying that's not
what you meant. Well, thank you for leading Scott and everyone else on a
wild goose chase.

Now, having said that, things like this happen to everyone every now and
then. However in your case it happens very, very often. And that makes
it very frustrating for your audience.
.
this test must be done in the header section which is what $header
stands for.

And that is the second main problem I personally have with your
postings: the information you provide is incomplete.
How are we supposed to know what $header stands for in your program? By
all typical programming patterns and naming conventions it's most likely
a string containing the (or a) header from whatever your program is
processing. There is no way to guess that you are using it as some sort
of flag.
We do not, cannot, and don't want to know all of your program. But those
parts that you post need to be self-contained, such that other people
can understand what is going on. There is a very good reason why the
posting guidelines strongly suggest a minimal self-contained program
instead of an incomplete excerpt from a larger program.

jue

I'm not changing my story at all .. the reason I mentioned the header
part is actually to make the line I mentioned self-contained:

"The reason I wasn't sure is that the following cannot be picked up
by an if test:

From (e-mail address removed) Tue Apr 24 11:02:41 2002


if test: elsif ($header && /^From:.*?([\w.-]+@[\w.-]+)/)


I modified From: to From in the above and also checked that it is still in the
header section."

I still don't understand where your accusations come from, why you are
so upset, and what exactly you are expecting ..
 
Jürgen Exner

April said:
I still don't understand where your accusations come from, why you are
so upset, and what exactly you are expecting ..

I guess our communication is running on very very different frequencies.
I recognize and appreciate that you are _NOT_ acting like a troll or
some of the morons we had here over the years. I believe you are a
decent person and are trying to fit in.

However, I also realize that our communication styles are vastly
different and incompatible. I just can't make much sense of your
postings, sorry, my weakness. But because of this there isn't much
benefit to either of us if I continue to read your postings, therefore I
will stop doing so.

Good luck

jue
 
Tad J McClellan

April said:
what exactly you are expecting ..


A short and complete program *that we can run*.

No full-quoting of articles.

No quoting of .sigs.

Have you seen the Posting Guidelines that are posted here frequently?

They spell out in quite some detail what we are expecting.
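
For illustration only, a short and complete program of that kind might look like the sketch below. The sample lines, the sub name header_from_addresses, and the idea that $header is a flag that is true while reading the header are all assumptions, since the real program was never posted.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real mailbox data was never posted.
my @lines = (
    "From: someone\@example.com\n",
    "Subject: test\n",
    "\n",
    "From: body\@example.com\n",
);

# Return the addresses found on From: lines inside the header only.
sub header_from_addresses {
    my @input  = @_;
    my $header = 1;      # assumed to be a flag: true while in the header
    my @found;
    for (@input) {       # each line is aliased to $_
        if (/^$/) {      # a blank line ends the header
            $header = 0;
        }
        elsif ($header && /^From:.*?([\w.-]+@[\w.-]+)/) {
            push @found, $1;
        }
    }
    return @found;
}

print "$_\n" for header_from_addresses(@lines);
```

Anyone can paste this into a file and run it, which is the point: the data, the flag, and the regex are all visible in one place.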
 
April

A short and complete program *that we can run*.

No full-quoting of articles.

No quoting of .sigs.

Have you seen the Posting Guidelines that are posted here frequently?

this is very helpful ..
 
sln

[snip]
Thanks Tim, Tad and Jue .. now it's much clearer to me! One thing left
is whether .*? simply means anything that appears before (...)? April

Since you are asking this question, it is not clear to you at all April.

Look at that expression. '[\w.-]+@' is a hard anchor. That must be satisfied
first, especially the '@'.

The fact is '[\w.-]+' can be satisfied with a single character.
The other fact is '.*?' can get by with one character but it is at the bottom
of precedence.
However '.*' wants to take as much of the string as possible, but it is
subservient to '[\w.-]+@'.

Here is the hierarchy from top down:

1 - '@' is GOD
2 - '[\w.-]+' is CHRIST
3 - '.*' is the greedy HOLY GHOST
4 - '.*?' is the single ANGEL

use strict;
use warnings;

my $email = "From: -2ame\@yahoo.com";

if ($email =~ /^From:.*?([\w.-]+@[\w.-]+)/)
{
print "$1\n";
}
if ($email =~ /^From:.*([\w.-]+@[\w.-]+)/)
{
print "$1\n";
}
__END__

church services produce:

-2ame@yahoo.com
e@yahoo.com


Please notice the regexp distinctions that produced these results.
As well, notice that '.*?' was put in as a trap to filter out extraneous
characters that are not alphanumeric.

Let us pray.
sln
 
April

Since you are asking this question, it is not clear to you at all April.

you really know me, however with your inspiration, I'm pretty sure
I'll be getting better sooner.
Look at that expression. '[\w.-]+@' is a hard anchor. That must be satisfied
first, especially the '@'.

not sure I agree with this and the following Church ranking thing ...
The fact is '[\w.-]+' can be satisfied with a single character.
agree.

The other fact is '.*?' can get by with one character but it is in the bottom
of precedence.

I believe '.*?' can get by with zero characters too.
However '.*' wants to take as much of the string as possible,

agree, but will still check to see whether that will allow the
following [\w.-] to be satisfied.
Here is the hierarchy from top down:

1 - '@' is GOD
2 - '[\w.-]+' is CHRIST
3 - '.*' is the greedy HOLY GHOST
4 - '.*?' is the single ANGEL

use strict;
use warnings;

my $email = "From: -2ame\@yahoo.com";

if ($email =~ /^From:.*?([\w.-]+@[\w.-]+)/)
{
        print "$1\n";
}

if ($email =~ /^From:.*([\w.-]+@[\w.-]+)/)
{
        print "$1\n";
}

__END__

church services produce:

-2ame@yahoo.com
e@yahoo.com

Please notice the regexp distinctions that produced these results.
As well, notice that '.*?' was put in as a trap to filter out extraneous
characters that are not alphanumeric.

Let us pray.

Just found and read "Regular Expression Tutorial Part 5: Greedy and
Non-Greedy Quantification" by Andrew Johnson (which can be found on the
Internet by searching). Andrew provides a pretty convincing
explanation of how '.*?' works. I believe the use of '.*?' will take
care of nothing at all, or one or more other characters, including space, tab,
etc., that appear before the real email address but are not matched
by [\w.-].
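
A quick way to check that belief is a sketch like the following. The sub name from_address and the sample lines are made up for illustration; only the regex comes from the thread.

```perl
use strict;
use warnings;

# Return the address captured after "From:", or undef if there is no match.
sub from_address {
    my ($line) = @_;
    return $line =~ /^From:.*?([\w.-]+@[\w.-]+)/ ? $1 : undef;
}

# Hypothetical sample lines: nothing, a space, a tab, and other
# characters in front of the address.
for my $line (
    "From:user\@example.com",
    "From: user\@example.com",
    "From:\t user\@example.com",
    "From: <user\@example.com>",
) {
    my $found = from_address($line) // 'no match';
    print "$found\n";
}
```

With these inputs the captured address is the same each time, because '.*?' gives up as few characters as it can while still letting the rest of the pattern match.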

I've started to love this place and you guys .. :)
 
Scott Bryce

April said:
I still don't understand where your accusations come from, why you
are so upset, and what exactly you are expecting ..

He is not making accusations, and he is not upset. He is trying to help
you communicate your questions in a way that makes it easier for the
regulars here to answer them.

For example, in this line of code:

elsif ($header && /^From:.*?([\w.-]+@[\w.-]+)/)

We have no way of knowing what $header is used for, or what data it
might contain.

We have no way of knowing what $_ might contain.

We don't know what you mean by "header section." Of an email? Or is your
program divided up into "sections" that we are unaware of?

Without knowing that we cannot know if $header contains a value that is
different from what you think it holds, if $_ contains a different value
than you think it does, or if the regex is wrong for what you are trying
to accomplish.

If you haven't already, take a look at the posting guidelines for this
group. See if they help you ask your questions in a better way in the
future. There is a lot of Perl talent here. (I normally don't respond to
posts here, since there are others who give much better answers than I
do.) If you are willing to work within their guidelines, there is a lot
you can learn here.
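
To make the $header versus $_ point concrete, here is a minimal sketch. The sub names match_default and match_header and the sample values are hypothetical; the thread never shows the real data. A bare /.../ is matched against $_, and $header on the left of && is only tested for truth; matching against $header itself needs an explicit =~.

```perl
use strict;
use warnings;

# A bare match operator tests $_; $header is only checked for truth.
sub match_default {
    my ($header, $line) = @_;
    for ($line) {                       # alias $_ to $line
        return $1 if $header && /^From:.*?([\w.-]+@[\w.-]+)/;
    }
    return;
}

# To match against $header itself, bind it explicitly with =~.
sub match_header {
    my ($header) = @_;
    return $1 if $header =~ /^From:.*?([\w.-]+@[\w.-]+)/;
    return;
}

# Made-up values for illustration.
my $header = "From: in-header\@example.com";
my $line   = "From: in-line\@example.com";

my $from_line   = match_default($header, $line);   # address taken from $line
my $from_header = match_header($header);           # address taken from $header
print "$from_line\n$from_header\n";
```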
 
Tim Greer

April said:
Just found and read "Regular Expression Tutorial Part 5: Greedy and
Non-Greedy Quantification" by Andrew Johnson (which can be found on
the Internet by searching).  Andrew provides a pretty convincing
explanation of how '.*?' works.  I believe the use of '.*?' will take
care of nothing at all, or one or more other characters, including space, tab,
etc.,  that appear before the real email address but are not matched
by [\w.-].

I've started to love this place and you guys .. :)

BTW, if you know it'll only be white space (space, tabs, etc.) between
the ^From:? and email@address, then \s+ would probably be a better
idea... unless you suspect other non \w, ., and - characters will exist
between it and don't want to try and predict them.
 
April

BTW, if you know it'll only be white space (space, tabs, etc.) between
the ^From:? and email@address, then \s+ would probably be a better
idea... unless you suspect other non \w, ., and - characters will exist
between it and don't want to try and predict them.

You mean '^From:\s+?'? How about '^From:\s*?', to also cover the case
of no white space or anything at all?
 
April

He is not making accusations, and he is not upset. He is trying to help
you communicate your questions in a way that makes it easier for the
regulars here to answer them.

I assume that's the case now ..
For example, in this line of code:

elsif ($header && /^From:.*?([\w.-]+@[\w.-]+)/)

We have no way of knowing what $header is used for, or what data it
might contain.

We have no way of knowing what $_ might contain.

We don't know what you mean by "header section." Of an email? Or is your
program divided up into "sections" that we are unaware of?

Without knowing that we cannot know if $header contains a value that is
different from what you think it holds, if $_ contains a different value
than you think it does, or if the regex is wrong for what you are trying
to accomplish.

I guess I could have done better .. I indicated this was an email
matching issue in the original post, and I remembered to use a
meaningful variable name, and the && is a logical and, so I
assumed people would have the same thing in mind as I did and would understand I
was trying to take out any doubt related to $header, so they could focus on
anything wrong with the part after &&.
If you haven't already, take a look at the posting guidelines for this
group. See if they help you ask your questions in a better way in the
future. There is a lot of Perl talent here. (I normally don't respond to
posts here, since there are others who give much better answers than I
do.) If you are willing to work within their guidelines, there is a lot
you can learn here.

I have taken note, thanks Scott.
 
Tim Greer

April said:
you mean '^From:\s+?', how about '^From:\s*?', to also cover the case
no white space or anything at all?

Yes, if it might have white space or might have none at all, then \s*
for zero or more is what you want. \s*? isn't necessary here, since
\s* is already zero or more, so making it an optional match doesn't
matter, since if it doesn't exist, it's already "zero". Be sure to
make : optional on From though, since your examples don't have it each
time.
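
As a rough sketch of that suggestion (the sub name sender and the example.com addresses are placeholders, since the real address was removed from the thread):

```perl
use strict;
use warnings;

# Optional colon after From, then optional whitespace, then the address.
sub sender {
    my ($line) = @_;
    return $line =~ /^From:?\s*([\w.-]+@[\w.-]+)/ ? $1 : undef;
}

# Placeholder lines in the two shapes seen in the thread.
my $with_colon    = sender("From: user\@example.com");
my $without_colon = sender("From user\@example.com Tue Apr 24 11:02:41 2002");
print "$with_colon\n$without_colon\n";
```

Note that, unlike the '.*?' version, this will not skip a display name or angle brackets in front of the address, so it only fits if nothing but white space can appear there.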
 
John W. Krahn

Tim said:
Yes, if it might have white space or might have none at all, then \s*
for zero or more is what you want. \s*? isn't necessary here, since
\s* is already zero or more, so making it an optional match doesn't
matter,

The ? on the end of \s*? changes \s* to non-greedy, the * makes it optional.
since if it doesn't exist, it's already "zero". Be sure to
make : optional on From though, since your examples don't have it each
time.



John
 
Tim Greer

John said:
The ? on the end of \s*? changes \s* to non-greedy, the * makes it
optional.

Right, I know. However, they'll never want to capture any of the white
space between From:? and ([\w.-]+@[\w.-]+), so \s* should suffice and
doesn't require non greedy to function as expected.
 
Tim Greer

Tim said:
John said:
The ? on the end of \s*? changes \s* to non-greedy, the * makes it
optional.

Right, I know. However, they'll never want to capture any of the
white space between From:? and ([\w.-]+@[\w.-]+), so \s* should
suffice and doesn't require non greedy to function as expected.

Pardon, to be more specific, I misused the word capture (that's
obvious). I simply mean that they appear to want to match all white
space (zero or as many as there is), and don't need to use a non greedy
match there. Not that it would matter, but it's not necessary to use
from what I can see.
 
sln

Since you are asking this question, it is not clear to you at all April.

you really know me, however with your inspiration, I'm pretty sure
I'll be getting better sooner.
Look at that expression. '[\w.-]+@' is a hard anchor. That must be satisfied
first, especially the '@'.

not sure I agree with this and the following Church ranking thing ...
The fact is '[\w.-]+' can be satisfied with a single character.
agree.

The other fact is '.*?' can get by with one character but it is in the bottom
of precedence.

I believe '.*?' can get by with zero characters too.
Yes, it will take no character, it's a filter, but it's a one character
at a time filter.
However '.*' wants to take as much of the string as possible,

agree, but will still check to see whether that will allow the
following [\w.-] to be satisfied.
Yes, but '[\w.-]+' can be satisfied with a single character.
Thus '.*' will grab all before that single character in a greedy fashion.
This will take as long as the non-greedy but will not get the right results.
Here is the hierarchy from top down:

1 - '@' is GOD
2 - '[\w.-]+' is CHRIST
3 - '.*' is the greedy HOLY GHOST
4 - '.*?' is the single ANGEL
[snip]

Just found and read "Regular Expression Tutorial Part 5: Greedy and
Non-Greedy Quantification" by Andrew Johnson (which can be found on the
Internet by searching). Andrew provides a pretty convincing
explanation of how '.*?' works. I believe the use of '.*?' will take
care of nothing at all, or one or more other characters, including space, tab,
etc., that appear before the real email address but are not matched
by [\w.-].

I never read that book. It's probably good.

In my experience, negative greed (a greedy negated character class such
as '[^\w.-]*') is one of the most useful concepts.
I always look to add negative greed to expressions.

In terms of greed, once the engine knows what not to look for, it will
grab all it can up to that point. Then it will look at the next term
in the regex expression.

This is the same as non-greedy, but the greedy one grabs a chunk of
matched data at a time, whereas the non-greedy will grab one occurrence
at a time. They both then check the next term for a match.

Knowing this, you can shorten the time the data takes to process.

Example:
$data = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";
$data =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/;
is about 130% faster (2-3x faster) than this
$data =~ /^From:.*?([\w.-]+@[\w.-]+)/;

The reason is that the engine grabs the greedy chunk first.
It just so happens we stopped the greed at a boundary where
the next character satisfies the next term '[\w.-]+'.

Non-greedy will only get one character at a time between checks if the
next character will satisfy the next term '[\w.-]+'. The repeated iteration
consumes a very large chunk of processing time.

The more a non-greedy term has to process the longer it takes. It could be
non-linear as well, not sure.

If the above were '$data = "From: -2ame\@yahoo.com";', the processing times
would be equal. The more characters '.*?' has to cross, the longer it takes.

There are times when you don't know where exactly to stop the greed,
but by all means possible, try to let the greed be there. You just have
to think about it and test all possible scenarios.

In a looping scenario, say like a parser, where everything is processed in a
repeating fashion, there is usually a sink/filter that picks up waste/comments
or formatting data, and it typically takes on the '.*?' form. This typically gives
the patterns a chance to match on the next character.

If there are 1, 2 or 3 characters that start out the pattern matches, a greedy
term (negative) can take you up to them quickly, giving the patterns a chance
to match without checking at character intervals. In that case you can use negative
greed and just simply have to know how to get past those characters in case the
patterns don't match.
Typically:

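my $lcbpos = 0;
while (/<($pat1|$pat2|$pat3)>|([^<]*)(<?)/g) {
    if (defined $2) {
        # Filler text was consumed, possibly followed by a '<'.
        if (length($3) && $lcbpos != pos($_)) {
            # Back up one character so the '<' is retried against the patterns.
            $lcbpos = pos($_);
            pos($_) = $lcbpos - 1;
        }
        next;
    }
    # found pattern
}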

So negative greed is a good thing indeed. It's advisable to always try to be greedy.
But, this is unavoidable sometimes: /ANCHOR's.*?AWAY/

Here are some benchmarks concerning greed and your email regexp.
--------------------------
use strict;
use warnings;
use Benchmark ':hireswallclock';

my $email = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";
my ($t0, $t1, $tdif);

### Non-Greedy '.*?'
$t0 = Benchmark->new;
for (1 .. 10000)
{
    $email =~ /^From:.*?([\w.-]+@[\w.-]+)/;
}
$t1 = Benchmark->new;
$tdif = timediff($t1, $t0);
print "\nNon-greedy '.*?' --\n the code took:",timestr($tdif),"\n";

### Greedy '[^\w.-]*'
$t0 = Benchmark->new;
for (1 .. 10000)
{
    $email =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/;
}
$t1 = Benchmark->new;
$tdif = timediff($t1, $t0);
print "\nGreedy '[^\\w.-]*' --\n the code took:",timestr($tdif),"\n";

__END__

Non-greedy '.*?' --
the code took:0.03332 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)

Greedy '[^\w.-]*' --
the code took:0.016902 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
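
For anyone who wants to repeat the comparison, the same two patterns can also be timed with Benchmark's cmpthese, which prints a rate table instead of raw wallclock times. This is only a sketch: the labels lazy_dot and negated_class are made up here, and the numbers will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same test string as in the benchmark above.
my $email = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";

my %candidates = (
    lazy_dot      => sub { $email =~ /^From:.*?([\w.-]+@[\w.-]+)/      ? $1 : undef },
    negated_class => sub { $email =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/ ? $1 : undef },
);

# Confirm both patterns capture the same address before timing them.
for my $name (sort keys %candidates) {
    printf "%-14s captures %s\n", $name, $candidates{$name}->();
}

# A negative count asks Benchmark to run each sub for at least
# that many CPU seconds.
cmpthese(-1, \%candidates);
```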
 
John W. Krahn

Tim said:
Tim said:
John said:
Tim Greer wrote:

Yes, if it might have white space or might have none at all, then \s*
for zero or more is what you want. \s*? isn't necessary here, since
\s* is already zero or more, so making it an optional match doesn't
matter,

The ? on the end of \s*? changes \s* to non-greedy, the * makes it
optional.

Right, I know. However, they'll never want to capture any of the
white space between From:? and ([\w.-]+@[\w.-]+), so \s* should
suffice and doesn't require non greedy to function as expected.

Pardon, to be more specific, I misused the word capture (that's
obvious). I simply mean that they appear to want to match all white
space (zero or as many as there is), and don't need to use a non greedy
match there. Not that it would matter, but it's not necessary to use
from what I can see.

Right, because the (optional) whitespace is anchored by 'm:?' on the
left and '[\w.-]+' on the right the greediness is irrelevant.


John
 
April

Yes, if it might have white space or might have none at all, then \s*
for zero or more is what you want.  \s*? isn't necessary here, since
\s* is already zero or more, so making it an optional match doesn't
matter, since if it doesn't exist, it's already "zero".  Be sure to
make : optional on From though, since your examples don't have it each
time.

That's right, greediness makes no difference here, as the matching is
only with white space.

By the way, using ? to make : optional is also good thinking.

? seems a pretty interesting quantifier in regular expressions: it relates
to both optional and non-greedy.
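
The two roles can be seen side by side in a short sketch (the sample strings are made up): on its own, ? means "zero or one" of the preceding item, while placed after another quantifier it makes that quantifier non-greedy.

```perl
use strict;
use warnings;

# Role 1: ? as a quantifier, making the colon optional.
for my $line ("From: x", "From x") {
    print "optional colon matched: $line\n" if $line =~ /^From:?\s/;
}

# Role 2: ? after another quantifier makes that quantifier non-greedy.
my $text = "<a><b>";
my ($greedy) = $text =~ /<(.+)>/;    # runs to the last '>'
my ($lazy)   = $text =~ /<(.+?)>/;   # stops at the first '>'
print "greedy=$greedy lazy=$lazy\n";
```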
 
