regexp for matching a string with mandatory underscores

David Filmer · Dec 27, 2011

I want to be able to match the string foo1_bar2_baz3 as having
multiple underscore characters (with no intervening whitespace), but
not match foo1_bar2 which has only one underscore. I want to ignore
one match, but not two or more.

This would be easy if \w did not ALSO match underscores. But it
does. There does not seem to be a character class for alphanumeric
ONLY.

How can I match continuous alphanumeric strings which contain more
than one underscore?

Thanks!

Ilya Zakharevich · Dec 27, 2011

This would be easy if \w did not ALSO match underscores. But it
does. There does not seem to be a character class for alphanumeric
ONLY.

??? [^\W_]
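
A minimal sketch of how that class behaves, assuming plain ASCII input; the sub name is_alnum_only is made up for illustration. [^\W_] is a double negative: "not a non-word character and not an underscore", which leaves the word characters minus '_'.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string consists of
# alphanumeric characters only (word characters minus '_').
sub is_alnum_only {
    my ($s) = @_;
    return $s =~ /\A[^\W_]+\z/ ? 1 : 0;
}

for my $s ('foo1', 'foo1_bar2') {
    printf "%-10s %s\n", $s, is_alnum_only($s) ? 'alnum only' : 'not alnum only';
}
```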

Ilya

Tim McDaniel · Dec 27, 2011

How can I match continuous alphanumeric strings which contain more
than one underscore?

Is it OK to use more than one regexp? If so, I might try
/^\w+$/ && /_.*_/
It's a bit brute-force, but it's also very clear. The second could be
optimized to /_[^_]*_/, but unless you're evaluating it lots of times,
"micro-optimizations lead to micro-results".

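A rough sketch of the two-regexp test above, wrapped in a sub whose name (has_two_underscores) is invented here. The third sample string contains whitespace, so the /^\w+$/ half rejects it even though /_.*_/ alone would accept it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two tests from the post.
sub has_two_underscores {
    my ($s) = @_;
    # First test: word characters only, so no whitespace anywhere.
    # Second test: an underscore, anything, then another underscore.
    return ($s =~ /^\w+$/ && $s =~ /_.*_/) ? 1 : 0;
}

for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux') {
    printf "%-18s %s\n", $s, has_two_underscores($s) ? 'match' : 'no match';
}
```
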
David Filmer · Dec 27, 2011

/_.*_/ is a _clear_ way to say "more than one underscore" ?

Yes, but it also would match "foo_bar baz_quux" which contains an
intervening whitespace. This would not satisfy the original
requirements, which stipulate finding multiple underscores within
continuous alphanumeric characters with no intervening whitespace.

Willem · Dec 28, 2011

David Filmer wrote:
) I want to be able to match the string foo1_bar2_baz3 as having
) multiple underscore characters (with no intervening whitespace), but
) not match foo1_bar2 which has only one underscore. I want to ignore
) one match, but not two or more.
)
) This would be easy if \w did not ALSO match underscores. But it
) does. There does not seem to be a character class for alphanumeric
) ONLY.

How would that make it easy?

Doesn't the following work? : m/\w*_\w*_\w*/
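
One way to try that pattern out, as a sketch only: since it is unanchored, it can pull qualifying tokens out of a longer line. The sub name tokens_with_two_underscores and the sample text are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of word characters that
# contains at least two underscores.
sub tokens_with_two_underscores {
    my ($text) = @_;
    # In list context with /g and no capture groups, the match
    # returns each complete match.
    my @found = $text =~ /\w*_\w*_\w*/g;
    return @found;
}

my @hits = tokens_with_two_underscores('x foo1_bar2_baz3 foo1_bar2 foo_bar baz_quux');
print "$_\n" for @hits;
```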

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Tim McDaniel · Dec 28, 2011

Very clear to me, at least.

Yes, but it also would match "foo_bar baz_quux" which contains an
intervening whitespace. This would not satisfy the original
requirements, which stipulate finding multiple underscores within
continuous alphanumeric characters with no intervening whitespace.

Which is why I wrote /^\w+$/ && in front of it.

Tim McDaniel · Dec 28, 2011

The way

There are times to apply the phrase "The way" to Perl, but I don't
know yet that this is one of them.

to count characters is with tr///, not regexes:

/^\w+$/ && tr/_// > 1

What are your reasons to think one better than the other?

Unless the expression is being evaluated many times, efficiency isn't
so important.

How many people are familiar with tr/// versus plain m//? I rarely
use tr///. I don't remember ever using the return value of tr///.
I've never used the empty RHS feature except with /d (indeed, I had to
check the man page to see that you hadn't trashed $_).

Rainer Weikusat · Dec 28, 2011

The way

[...]

to count characters is with tr///, not regexes:

/^\w+$/ && tr/_// > 1

What are your reasons to think one better than the other?

Unless the expression is being evaluated many times, efficiency isn't
so important.

A subroutine I encountered in the past in some script written by
someone else was

sub mod($$) { return $_[0] - $_[1] * int($_[0] / $_[1]); }

Actually, it wasn't a subroutine but an inline calculation. Provided
the language provides a more direct way to achieve the same result
(the % operator), the question is not 'why should the built-in way be
preferred' but 'why should something other than the built-in way be
used' and ...

How many people are familiar with tr/// versus plain m//?

... "But I didn't know about it!" is only a suitable justification
until this problem has been remedied.
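
A small sketch comparing the hand-rolled calculation with the built-in; the name manual_mod is made up here. As far as I can tell, the two agree for non-negative operands but differ for a negative left operand, because int() truncates toward zero while Perl's % (without "use integer") takes the sign of the right operand.

```perl
use strict;
use warnings;

# Hypothetical name for the hand-rolled calculation quoted above.
sub manual_mod {
    my ($m, $n) = @_;
    return $m - $n * int($m / $n);
}

for my $pair ([7, 3], [-7, 3]) {
    my ($m, $n) = @$pair;
    printf "%3d mod %d: manual=%d builtin=%d\n",
        $m, $n, manual_mod($m, $n), $m % $n;
}
```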

Tim McDaniel · Dec 28, 2011

Provided the language provides a more direct way to achieve the same
result ..., the question is not 'why should the built-in way be
preferred' but 'why should something other than the built-in way be
used'

In the current case, it's
/_.*_/
versus
tr/_// > 1
They both use builtins pretty directly and they are both short.
Personally, I find the former to be clearer than the latter, which
uses an operator that usually causes side effects but doesn't in this
case, and I still don't know how many people know its details.

... "But I didn't know about it!" is only a suitable justification
until this problem has been remedied.

To some extent I agree, but if someone is coding for other people, one
of the factors that the coder should consider is what is
comprehensible at a glance, in addition to other factors like brevity,
efficiency where needed, robustness, and such. For example, for my
own programs I have no problems with
my %key_lookup;
@key_lookup{@keys} = (1) x @keys;
But since I don't know how many people know that idiom, I might
hesitate to use it when coding for others, and if I did I would likely
comment it clearly.
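
For anyone who hasn't met that idiom, here is a commented sketch; the sub name build_lookup and the sample keys are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of keys into a lookup hash.
sub build_lookup {
    my @keys = @_;
    my %key_lookup;
    # Hash slice on the left; on the right, the list (1) repeated
    # once per key, so every key gets the value 1.
    @key_lookup{@keys} = (1) x @keys;
    return \%key_lookup;
}

my $lookup = build_lookup(qw(apple banana cherry));
print exists $lookup->{banana} ? "banana: known\n" : "banana: unknown\n";
print exists $lookup->{durian} ? "durian: known\n" : "durian: unknown\n";
```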

John W. Krahn · Dec 29, 2011

Tim said:
In the current case, it's
/_.*_/
versus
tr/_// > 1
They both use builtins pretty directly and they are both short.
Personally, I find the former to be clearer than the latter, which
uses an operator that usually causes side effects but doesn't in this
case, and I still don't know how many people know its details.

tr/_// is pretty simple. It is actually short for tr/_/_/ which
replaces every '_' character with a '_' character and returns the number
of replacements made. It has the advantages that it doesn't interpolate
and it only does one thing, and does it well.
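
A short sketch of the counting use, with a made-up sub name (count_underscores). It also checks that the string comes out unchanged, which is the behaviour described above for an empty replacement list without /d.

```perl
use strict;
use warnings;

# Hypothetical helper: count the underscores in a string.
sub count_underscores {
    my ($s) = @_;
    # Empty replacement list, no /d: tr returns the number of
    # matching characters and leaves the string as it was.
    return ($s =~ tr/_//);
}

my $str = 'foo1_bar2_baz3';
my $n   = ($str =~ tr/_//);
print "count=$n string=$str\n";
print "more than one underscore\n" if $str =~ /^\w+$/ && count_underscores($str) > 1;
```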

John

C.DeRykus · Dec 29, 2011

I want to be able to match the string foo1_bar2_baz3 as having
multiple underscore characters (with no intervening whitespace), but
not match foo1_bar2 which has only one underscore. I want to ignore
one match, but not two or more.

This would be easy if \w did not ALSO match underscores. But it
does. There does not seem to be a character class for alphanumeric
ONLY.

How can I match continuous alphanumeric strings which contain more
than one underscore?

Maybe,

print '>1' if (()= /\G [[:alnum:]]+ _/gx) > 1;

Tim McDaniel · Dec 29, 2011

How can I match continuous alphanumeric strings which contain more
than one underscore?

Maybe,

print '>1' if (()= /\G [[:alnum:]]+ _/gx) > 1;

For anyone else who is wondering about the use of ()=, please see "man
perldata".

List assignment in scalar context returns the number of elements
produced by the expression on the right side of the assignment:

$x = (($foo,$bar) = (3,2,1)); # set $x to 3, not 2
$x = (($foo,$bar) = f()); # set $x to f()'s return count

This is handy when you want to do a list assignment in a Boolean
context, because most list functions return a null list when
finished, which when assigned produces a 0, which is interpreted
as FALSE.

It's also the source of a useful idiom for executing a function or
performing an operation in list context and then counting the
number of return values, by assigning to an empty list and then
using that assignment in scalar context. For example, this code:

$count = () = $string =~ /\d+/g;

will place into $count the number of digit groups found in
$string. This happens because the pattern match is in list
context (since it is being assigned to the empty list), and will
therefore return a list of all matching parts of the string. The
list assignment in scalar context will translate that into the
number of elements (here, the number of times the pattern matched)
and assign that to $count. Note that simply using

$count = $string =~ /\d+/g;

would not have worked, since a pattern match in scalar context
will only return true or false, rather than a count of matches.
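
To tie the quoted idiom back to the \G pattern in the post being answered, here is a sketch with an invented sub name (count_leading_segments). It counts how many "alphanumerics followed by an underscore" segments match back to back from the start of the string; it makes no claim about what follows the last segment.

```perl
use strict;
use warnings;

# Hypothetical helper built on the "= () =" counting idiom.
sub count_leading_segments {
    my ($s) = @_;
    # The match runs in list context because it is assigned to ();
    # the outer scalar assignment then yields the number of matches.
    my $count = () = $s =~ /\G [[:alnum:]]+ _/gx;
    return $count;
}

for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux') {
    printf "%-18s %d\n", $s, count_leading_segments($s);
}
```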

Ilya Zakharevich · Dec 31, 2011

Very clear to me, at least.

Which is why I wrote

The first one is not completely equivalent to !/\W/, but when ANDed
with the second one it is (ignoring the issue with trailing \n, of
course). Is it more clear? I'm not sure...

Ilya

Ilya Zakharevich · Dec 31, 2011

tr/_// is pretty simple.

tr is extremely complicated.

It is actually short for tr/_/_/ which
replaces every '_' character with a '_' character and returns the number
of replacements made. It has the advantages that it doesn't interpolate
and it only does one thing, and does it well.

For which value of "well"? If it is applied to 2GB string, would it
make a copy of it? If the string is tied to a database entry, would
it cause a database update? If the string is shared between fork()ed
processes, would it become unshared after the operation?

In short: Do you know what you are talking about?

Best wishes for the new year,
Ilya

Rainer Weikusat · Jan 1, 2012

Ilya Zakharevich said:
[...]

It is actually short for tr/_/_/ which
replaces every '_' character with a '_' character and returns the number
of replacements made. It has the advantages that it doesn't interpolate
and it only does one thing, and does it well.

For which value of "well"? If it is applied to 2GB string, would it
make a copy of it?

Not when counting or replacing characters in a non-UTF8 string.

If the string is tied to a database entry, would
it cause a database update?

Maybe, maybe not. That would depend on the implementation of the tying mechanism.

If the string is shared between fork()ed
processes, would it become unshared after the operation?

Strings are not shared between forked processes, memory pages are. As
soon as any process tries to write to a shared page, it will get its
own copy for usual COW-implementations.

Tim McDaniel · Jan 3, 2012

Ilya Zakharevich said:
Ilya Zakharevich said:

[...]

It is actually short for tr/_/_/ which replaces every '_'
character with a '_' character and returns the number of
replacements made. It has the advantages that it doesn't
interpolate and it only does one thing, and does it well.

For which value of "well"? If it is applied to 2GB string, would
it make a copy of it?

Not when counting or replacing characters in a non-UTF8 string.

If the string is tied to a database entry, would
it cause a database update?

Maybe, maybe not. That would depend on the implementation of the tying
mechanism.

[and a forking question]

I remember Dennis Ritchie's use of the phrase "unwarranted chumminess
with the C implementation" (in a far more dubious situation). I'm
hesitant to depend on implementation details unless they're guaranteed
in the documentation. Particularly with Perl: systems I'm on have
versions variously between 5.8 and 5.14, so I wonder which versions
have which optimizations, or indeed if they are done at all.

On the other hand, when you write the scripts yourself (I do that a
lot with Perl), you can know whether it does ties, large strings, or
other unusual cases.

Rainer Weikusat · Jan 3, 2012

Ilya Zakharevich said:
Ilya Zakharevich said:

[...]

It is actually short for tr/_/_/ which replaces every '_'
character with a '_' character and returns the number of
replacements made. It has the advantages that it doesn't
interpolate and it only does one thing, and does it well.

For which value of "well"? If it is applied to 2GB string, would
it make a copy of it?

Not when counting or replacing characters in a non-UTF8 string.

If the string is tied to a database entry, would
it cause a database update?

Maybe, maybe not. That would depend on the implementation of the tying
mechanism.

[and a forking question]

I remember Dennis Ritchie's use of the phrase "unwarranted chumminess
with the C implementation" (in a far more dubious situation). I'm
hesitant to depend on implementation details unless they're guaranteed
in the documentation.

What is guaranteed in the documentation today will be 'accidentally
still in the documentation' tomorrow and 'a deprecated feature which
must not be used under any circumstances' (on threat of immediate
excommunication from the universe of all the just and beautiful
people) two days later, so that doesn't really buy you anything :->.

OTOH, it is sensible to assume that - usually - the people who wrote
the implementation will have tried to make it behave sensibly and in
this case, that tr/// will neither copy nor modify the string except
if this is necessary to perform the requested operation.

Re: tied scalars

What will happen when an operation is performed on a scalar tied to
something depends on the class/module used to provide the tied
semantics, and this can be anything, so the question didn't really make
sense: this class or module may well cause 'a database update' even
though perl didn't modify the data.

sln · Jan 4, 2012

If I understand you correctly from your example, this may work.
/^[^\W_]+(?:_[^\W_]+){2,}$/
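
A quick sketch exercising that pattern on the strings from the original question; the sub name matches_strict is invented. Note that, as written, it also requires non-empty alphanumeric segments on both sides of every underscore, so a doubled or leading underscore is rejected.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern from the post.
sub matches_strict {
    my ($s) = @_;
    return $s =~ /^[^\W_]+(?:_[^\W_]+){2,}$/ ? 1 : 0;
}

for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux', 'a__b', '_a_b') {
    printf "%-18s %s\n", $s, matches_strict($s) ? 'match' : 'no match';
}
```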

-sln

Ilya Zakharevich · Jan 10, 2012

So I read it as: "it will" (with certain exceptions).

Again...

OTOH, it is sensible to assume that - usually - the people who wrote
the implementation will have tried to make it behave sensibly and in
this case, that tr/// will neither copy nor modify the string except
if this is necessary to perform the requested operation.

Not applicable to Perl (in general). A lot of stuff is majorly pessimized.

Re: tied scalars

What will happen when an operation is performed on a scalar tied to
something depends on the class/module used to provide the tied
semantics, and this can be anything, so the question didn't really make
sense: this class or module may well cause 'a database update' even
though perl didn't modify the data.

This is true "literally", but AFAIK, not applicable to any situation I
know.

Essentially, for me all this boils down to: do not use tr/// unless
you can't avoid it, or know EXACTLY how and when your code is going to
be used...

Ilya

Rainer Weikusat · Jan 10, 2012

[...]

Everyone now knows that using UTF-8 was a mistake,

That's not something "everyone knows" and in fact, some people were so
convinced that UTF-8 would be a sensible choice that they implemented
complete operating systems based on using UTF-8 as native character
encoding (that would be "Plan9"). This should rather be "every member
of some small group of people" (people currently working on Perl
Unicode support?) are strongly convinced that choosing UTF-8 was a
mistake (and I'd wager a bet that the base reason for this is "that's
not what Microsoft did and consequently, it must be WRONG !!1").
