Can this be combined into one statement?

R

Rainer Weikusat

Charles DeRykus said:
Consider a string like "foo bar baz ". For /\S+\s*$/ perl tries the
following sequence of matches:
[...]

I thought a possessive quantifier would help with this more intuitive
alternative: (\S+)\s*$ -> (\S++)\s*+$. But, unless there's some basic
error on my part, then the possessive replacement ate the proverbial
dust even of the backtracking regex.

BTW:
-----------
use Benchmark qw(cmpthese);

use constant MAX_WORDS => 15;
use constant MAX_LEN => 20;
use constant LINES => 100;

my (@lines, $line, $n, $nw, $x);

srand(0xdeafbabe); # repeatable

$n = LINES;
do {
$nw = int(rand(MAX_WORDS - 2)) + 2;
$line = '';
do {
$line .= 'x' x (int(rand(MAX_LEN - 1)) + 1);
$line .= ' ' x (rand(5) + 1) if $nw > 1;
} while --$nw;

push(@lines, $line);
# print STDERR ("'$line'\n");
} while --$n;

cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
}});
 
T

Tim McDaniel

Ben, thanks for the detailed explanation. This is good stuff to keep
in mind when in a performance critical loop, but if I were doing this
again, I would still go with /(\S+)\s*$/ because it is (to me) much
more clear about what its doing.

I think the split(...) version is clearer still, and that's what I
would use.
 
C

C.DeRykus

Charles DeRykus said:
On 10/30/2013 11:29 AM, Ben Morrow wrote:
Why does /(\S+)\s*$/ have to backtrack over "the whole string" whereas
/.*\s(\S+)/ does not?
I'm sure I don't undertand regex backtracking...

Consider a string like "foo bar baz ". For /\S+\s*$/ perl tries the
following sequence of matches:


[...]



I thought a possessive quantifier would help with this more intuitive
alternative: (\S+)\s*$ -> (\S++)\s*+$. But, unless there's some basic
error on my part, then the possessive replacement ate the proverbial
dust even of the backtracking regex.

BTW:
-----------
use Benchmark qw(cmpthese);

use constant MAX_WORDS => 15;
use constant MAX_LEN => 20;
use constant LINES => 100;

my (@lines, $line, $n, $nw, $x);
srand(0xdeafbabe); # repeatable
$n = LINES;

do {
$nw = int(rand(MAX_WORDS - 2)) + 2;
$line = '';
do {
$line .= 'x' x (int(rand(MAX_LEN - 1)) + 1);
$line .= ' ' x (rand(5) + 1) if $nw > 1;
} while --$nw;
push(@lines, $line);
# print STDERR ("'$line'\n");
} while --$n;

cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
}});


'rere' proves going 'reverse' can be a great idea.

(even does well as MAX_LENGTH grows... of course
split wins if you get crazy big)
 
R

Rainer Weikusat

I think the split(...) version is clearer still, and that's what I
would use.

But that's really nothing except /.*\s(\S+)\s*/ written in a different
way: It parses the string from left to right in order to create a list
of words aka

split(/\s+/, $line)

this has to happen in list context, otherwise, it won't work as intended
which makes

(split(/\s+/, $line))

this will look very much like a spurious pair of parentheses to someone
who is not familiar with the split documentation. Lastly, after the list
has been 'captured' in this way, it is indexed by -1

(split(/\s+/, $line))[-1]

In Perl, this means 'count backwards from the end of the list'. In other
languages, it means something different or is invalid altogether.

All of this combined doesn't exactly strike me as a particularly clear
way to express 'get the last word from a string'. Imagine these were
words written on paper and you were asked to cut the last word off the
sentence with a pair of scissors. The idea to start cutting the whole
strip of paper into pieces and drop them all on the floor except the
last one could be made into a nice sketch (skit?) if the piece of paper
was long enough and most people watching this would consider the person
doing it less-than-intellectually-gifted and "but's that what I always
do" (aka 'the construct is familiar to me') wouldn't improve this.
 
R

Rainer Weikusat

C.DeRykus said:
]
rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
}});


'rere' proves going 'reverse' can be a great idea.

(even does well as MAX_LENGTH grows... of course
split wins if you get crazy big)

It is mostly a demonstration that at least some matches anchored at the
end of the string are theoretically trivial to optimize -- revert the
pattern and match backwards from the end.
 
C

Charlton Wilbur

[wrapping a function in parentheses to induce list context]

RW> this will look very much like a spurious pair of parentheses to
RW> someone who is not familiar with the split
RW> documentation.

[indexing a list with -1]

RW> All of this combined doesn't exactly strike me as a particularly
RW> clear way to express 'get the last word from a string'.

My audience, when am writing Perl, is generally people who are competent
in Perl. If I insisted on avoiding constructions that depended on
context or on Perl's builtin data structures, I might as well be writing
in Java.

At a former place of employment, writing in C, I wrote the comment

/* breadth-first search rather than depth-first search because
average-case performance is the same but worst-case performance is
far better */

and this was called out as needing more explanation by a (thoroughly
incompetent) manager who had never encountered the terms "breadth-first"
and "depth-first."[1] I retorted "A programmer who has never encountered
the concept 'breadth-first' or 'depth-first' and who can't be bothered
to look them up and learn has no business modifying this code."

RW> Imagine these were words written on paper and you were asked to
RW> cut the last word off the sentence with a pair of scissors. The
RW> idea to start cutting the whole strip of paper into pieces and
RW> drop them all on the floor except the last one could be made
RW> into a nice sketch (skit?) if the piece of paper was long enough
RW> and most people watching this would consider the person doing it
RW> less-than-intellectually-gifted

Imagine that you put a bunch of words on index cards and told a person
to sort them. He would not bother with mergesort or quicksort or bubble
sort, but would use insertion sort; and if you explained that quicksort
would sort the cards in-place while mergesort had better average-case
performance, he would look at you as less-than-intellectually-gifted.
While he may be justified in this, the simple fact is that physical
solutions have different constraints than digital solutions, and
attempting to construct an argument by analogy where you pretend that
the digital solution has the same constraints as the physical solution
is, frankly, moronic.

Charlton


[1] His degree was in music, which explains his ignorance but not his
incompetence; my degree is also in music.
 
C

C.DeRykus

It is mostly a demonstration that at least some matches anchored at the
end of the string are theoretically trivial to optimize -- revert the
pattern and match backwards from the end.

Yes, definitely a potential speedup although the more familiar regex anchored at the end is likely the most intuitive and popular solution.

[A very long time ago, there was a discussion of a nice-to-have regex modifier that'd search backwards
and handle the messy reversals automatically. IMO that would be potentially faster and an intuitive win too ]
 
G

gamo

El 01/11/13 22:34, Rainer Weikusat escribió:
use Benchmark qw(cmpthese);

use constant MAX_WORDS => 15;
use constant MAX_LEN => 20;
use constant LINES => 100;

my (@lines, $line, $n, $nw, $x);

srand(0xdeafbabe); # repeatable

$n = LINES;
do {
$nw = int(rand(MAX_WORDS - 2)) + 2;
$line = '';
do {
$line .= 'x' x (int(rand(MAX_LEN - 1)) + 1);
$line .= ' ' x (rand(5) + 1) if $nw > 1;
} while --$nw;

push(@lines, $line);
# print STDERR ("'$line'\n");
} while --$n;

cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
},
gamo => sub {
s/\s+$//;
$x = substr($_, rindex($_, " ") );

If this is correct, consider it.

Best regards
 
G

gamo

El 05/11/13 00:08, gamo escribió:
El 01/11/13 22:34, Rainer Weikusat escribió:
cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
},
gamo => sub {
### s/\s+$//;
### $x = substr($_, rindex($_, " ") );

$x = substr($_,1+rindex($_," ")) for @lines;
If this is correct, consider it.

Of course it's not. It's slower, but if no \s is allowed at end of
lines, it could be faster. My apologies.
 
R

Rainer Weikusat

gamo said:
El 05/11/13 00:08, gamo escribió:
El 01/11/13 22:34, Rainer Weikusat escribió:
cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
},
gamo => sub {
### s/\s+$//;
### $x = substr($_, rindex($_, " ") );

$x = substr($_,1+rindex($_," ")) for @lines;
If this is correct, consider it.

Of course it's not. It's slower, but if no \s is allowed at end of
lines, it could be faster. My apologies.

The example code doesn't produce lines with spaces at the end. If it
did, the s/// would be a dubious idea because it would change the input
data. Apart from that, using rindex here is a good idea speedwise
provided that words are really separated by spaces and not by
'whitespace characters'. In an ASCII/ISO8859-1 environment, that would
be [ \t\v] (written as regex character class). I have no idea what
exactly a vertical tab (\v) is supposed to be and until recently, Perl
didn't support it, anyway, but horizontal tabs (\t) are not that
uncommon.
 
G

gamo

El 05/11/13 12:39, Rainer Weikusat escribió:
gamo said:
El 05/11/13 00:08, gamo escribió:
El 01/11/13 22:34, Rainer Weikusat escribió:
cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
},
gamo => sub {
### s/\s+$//;
### $x = substr($_, rindex($_, " ") );

$x = substr($_,1+rindex($_," ")) for @lines;
}});


If this is correct, consider it.

Of course it's not. It's slower, but if no \s is allowed at end of
lines, it could be faster. My apologies.

The example code doesn't produce lines with spaces at the end. If it
did, the s/// would be a dubious idea because it would change the input
data. Apart from that, using rindex here is a good idea speedwise
provided that words are really separated by spaces and not by
'whitespace characters'. In an ASCII/ISO8859-1 environment, that would
be [ \t\v] (written as regex character class). I have no idea what
exactly a vertical tab (\v) is supposed to be and until recently, Perl
didn't support it, anyway, but horizontal tabs (\t) are not that
uncommon.

So, I win :)

jesus@casa:~/src$ perl cmp.pl
Rate split renb rere gamo
split 13395/s -- -6% -28% -80%
renb 14306/s 7% -- -23% -79%
rere 18620/s 39% 30% -- -72%
gamo 66776/s 399% 367% 259% --

Well, no big deal. I could make a program 15 times _slower_ using threads.
 
R

Rainer Weikusat

gamo said:
El 05/11/13 12:39, Rainer Weikusat escribió:
gamo said:
El 05/11/13 00:08, gamo escribió:
El 01/11/13 22:34, Rainer Weikusat escribió:
cmpthese(-3,
{
split => sub {
$x = (split(/\s+/, $_))[-1] for @lines;
},

renb => sub {
/.*\s(\S+)$/ and $x = $1 for @lines;
},

rere => sub {
reverse($_) =~ /^(\S+)/ and $x = reverse($1) for @lines;
},
gamo => sub {
### s/\s+$//;
### $x = substr($_, rindex($_, " ") );

$x = substr($_,1+rindex($_," ")) for @lines;

}});


If this is correct, consider it.


Of course it's not. It's slower, but if no \s is allowed at end of
lines, it could be faster. My apologies.

The example code doesn't produce lines with spaces at the end. If it
did, the s/// would be a dubious idea because it would change the input
data. Apart from that, using rindex here is a good idea speedwise
provided that words are really separated by spaces and not by
'whitespace characters'. In an ASCII/ISO8859-1 environment, that would
be [ \t\v] (written as regex character class). I have no idea what
exactly a vertical tab (\v) is supposed to be and until recently, Perl
didn't support it, anyway, but horizontal tabs (\t) are not that
uncommon.

So, I win :)

rindex doesn't handle 'whitespace', only the exact string passed to
it. Taking this into account, if you're posting was not supposed to
present another useful idea but to communicated something about you, it
would be "you cheat".

HTH.
 
G

gamo

El 05/11/13 17:11, Rainer Weikusat escribió:
rindex doesn't handle 'whitespace', only the exact string passed to
it. Taking this into account, if you're posting was not supposed to
present another useful idea but to communicated something about you, it
would be "you cheat".

HTH.

I swear that I want to cheat in puzzles as many times as I could. The
only problem is that there are problems in which I don't manage how to
do it. Would you mind to read again the MCE post and help me to cheat
in it? I could post a real problem, with a linear heuristic solution.

Thanks ;-)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top