regexp for matching a string with mandatory underscores

Discussion in 'Perl Misc' started by David Filmer, Dec 27, 2011.

  1. David Filmer

    David Filmer Guest

    I want to be able to match the string foo1_bar2_baz3 as having
    multiple underscore characters (with no intervening whitespace), but
    not match foo1_bar2 which has only one underscore. I want to ignore
    one match, but not two or more.

    This would be easy if \w did not ALSO match underscores. But it
    does. There does not seem to be a character class for alphanumeric
    ONLY.

    How can I match continuous alphanumeric strings which contain more
    than one underscore?

    Thanks!
    David Filmer, Dec 27, 2011
    #1

  2. On 2011-12-27, David Filmer <> wrote:
    > This would be easy if \w did not ALSO match underscores. But it
    > does. There does not seem to be a character class for alphanumeric
    > ONLY.


    ??? [^\W_]

    Ilya
    Ilya Zakharevich, Dec 27, 2011
    #2
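A minimal sketch of what the suggested class does, assuming plain ASCII input. The helper name is_alnum_only is made up for illustration and does not come from the thread.

```perl
use strict;
use warnings;

# [^\W_] is a double negative: "not a non-word character, and not an
# underscore", which leaves the word characters other than "_".
sub is_alnum_only {
    my ($s) = @_;
    return $s =~ /\A[^\W_]+\z/ ? 1 : 0;
}

for my $s ('foo1', 'foo1_bar2', 'foo bar') {
    printf "%-10s %s\n", $s, is_alnum_only($s) ? 'alnum only' : 'not alnum only';
}
```

The POSIX class [[:alnum:]] is another way to spell a similar set.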

  3. David Filmer

    Tim McDaniel Guest

    In article
    <>,
    David Filmer <> wrote:
    >How can I match continuous alphanumeric strings which contain more
    >than one underscore?


    Is it OK to use more than one regexp? If so, I might try
    /^\w+$/ && /_.*_/
    It's a bit brute-force, but it's also very clear. The second could be
    optimized to /_[^_]*_/, but unless you're evaluating it lots of times,
    "micro-optimizations lead to micro-results".

    --
    Tim McDaniel,
    Tim McDaniel, Dec 27, 2011
    #3
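A short sketch of the two-pattern test from the post above, wrapped in a hypothetical helper (the name two_step_match is an assumption made here) so it can be tried on the strings mentioned in the thread.

```perl
use strict;
use warnings;

# First pattern: the whole string is word characters only.
# Second pattern: somewhere in it there are two underscores.
sub two_step_match {
    my ($s) = @_;
    return ($s =~ /^\w+$/ && $s =~ /_.*_/) ? 1 : 0;
}

for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux') {
    printf "%-18s %s\n", $s, two_step_match($s) ? 'match' : 'no match';
}
```

The first pattern is what rejects the string with whitespace; the second pattern alone would accept it.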
  4. David Filmer

    David Filmer Guest

    On Dec 27, 3:04 pm, Tad McClellan <> wrote:

    > /_.*_/ is a _clear_ way to say "more than one underscore" ?


    Yes, but it also would match "foo_bar baz_quux" which contains an
    intervening whitespace. This would not satisfy the original
    requirements, which stipulate finding multiple underscores within
    continuous alphanumeric characters with no intervening whitespace.
    David Filmer, Dec 28, 2011
    #4
  5. David Filmer

    Willem Guest

    David Filmer wrote:
    ) I want to be able to match the string foo1_bar2_baz3 as having
    ) multiple underscore characters (with no intervening whitespace), but
    ) not match foo1_bar2 which has only one underscore. I want to ignore
    ) one match, but not two or more.
    )
    ) This would be easy if \w did not ALSO match underscores. But it
    ) does. There does not seem to be a character class for alphanumeric
    ) ONLY.

    How would that make it easy?

    Doesn't the following work? : m/\w*_\w*_\w*/


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
    Willem, Dec 28, 2011
    #5
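A sketch of how the single pattern from the post above behaves when it is left unanchored and applied to longer text. The capture group and the helper name first_multi_underscore_word are additions for illustration only.

```perl
use strict;
use warnings;

# Because \w also matches "_", the pattern can only succeed inside one
# unbroken run of word characters that holds at least two underscores.
sub first_multi_underscore_word {
    my ($text) = @_;
    return $text =~ /(\w*_\w*_\w*)/ ? $1 : undef;
}

for my $text ('see foo1_bar2_baz3 here', 'foo1_bar2', 'foo_bar baz_quux') {
    my $hit = first_multi_underscore_word($text);
    printf "%-25s %s\n", $text, defined $hit ? "found $hit" : 'nothing found';
}
```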
  6. David Filmer

    Tim McDaniel Guest

    In article <>,
    David Filmer <> wrote:
    >On Dec 27, 3:04 pm, Tad McClellan <> wrote:
    >
    >> /_.*_/ is a _clear_ way to say "more than one underscore" ?


    Very clear to me, at least.

    >Yes, but it also would match "foo_bar baz_quux" which contains an
    >intervening whitespace. This would not satisfy the original
    >requirements, which stipulate finding multiple underscores within
    >continuous alphanumeric characters with no intervening whitespace.


    Which is why I wrote
    >>> /^\w+$/ && /_.*_/


    --
    Tim McDaniel,
    Tim McDaniel, Dec 28, 2011
    #6
  7. David Filmer

    Tim McDaniel Guest

    In article <>,
    Tad McClellan <> wrote:
    >The way


    There are times to apply the phrase "The way" to Perl, but I don't
    know yet that this is one of them.

    >to count characters is with tr///, not regexes:
    >
    > /^\w+$/ && tr/_// > 1


    What are your reasons to think one better than the other?

    Unless the expression is being evaluated many times, efficiency isn't
    so important.

    How many people are familiar with tr/// versus plain m//? I rarely
    use tr///. I don't remember ever using the return value of tr///.
    I've never used the empty RHS feature except with /d (indeed, I had to
    check the man page to see that you hadn't trashed $_).

    --
    Tim McDaniel,
    Tim McDaniel, Dec 28, 2011
    #7
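For readers who, like the poster, rarely use the return value of tr///, here is a small sketch of the counting form under discussion. The helper names count_underscores and tr_match are made up for this example.

```perl
use strict;
use warnings;

# With an empty replacement list and no /d, tr/// returns how many
# characters matched and leaves the string as it was.
sub count_underscores {
    my ($s) = @_;
    return ($s =~ tr/_//);
}

# The combined test quoted in the post, applied to an argument
# instead of $_.
sub tr_match {
    my ($s) = @_;
    return ($s =~ /^\w+$/ && ($s =~ tr/_//) > 1) ? 1 : 0;
}

my $name = 'foo1_bar2_baz3';
my $n    = count_underscores($name);
print "$n underscores; the string is still '$name'\n";
```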
  8. (Tim McDaniel) writes:
    > In article <>,
    > Tad McClellan <> wrote:
    >>The way


    [...]

    >>to count characters is with tr///, not regexes:
    >>
    >> /^\w+$/ && tr/_// > 1

    >
    > What are your reasons to think one better than the other?
    >
    > Unless the expression is being evaluated many times, efficiency isn't
    > so important.


    A subroutine I encountered in the past in some script written by
    someone else was

    sub mod($$) { return $_[0] - $_[1] * int($_[0] / $_[1]); }

    Actually, it wasn't a subroutine but an inline calculation. Provided
    the language provides a more direct way to achieve the same result
    (the % operator), the question is not 'why should the built-in way be
    preferred' but 'why should something other than the built-in way be
    used' and ...

    > How many people are familiar with tr/// versus plain m//?


    ... "But I didn't know about it!" is only a suitable justification
    until this problem has been remedied.
    Rainer Weikusat, Dec 28, 2011
    #8
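A sketch comparing the hand-written calculation quoted above with the built-in % operator. The name mod_by_hand is an assumption, and the ($$) prototype from the quoted code is dropped here for simplicity. The two agree for non-negative operands but differ when the left operand is negative, because int() truncates toward zero.

```perl
use strict;
use warnings;

# The inline calculation from the post, as a plain subroutine.
sub mod_by_hand {
    my ($m, $n) = @_;
    return $m - $n * int($m / $n);
}

for my $pair ([7, 3], [10, 5], [-7, 3]) {
    my ($m, $n) = @$pair;
    printf "%3d mod %d: by hand %2d, %% operator %2d\n",
        $m, $n, mod_by_hand($m, $n), $m % $n;
}
```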
  9. David Filmer

    Tim McDaniel Guest

    In article <>,
    Rainer Weikusat <> wrote:
    >Provided the language provides a more direct way to achieve the same
    >result ..., the question is not 'why should the built-in way be
    >preferred' but 'why should something other than the built-in way be
    >used'


    In the current case, it's
    /_.*_/
    versus
    tr/_// > 1
    They both use builtins pretty directly and they are both short.
    Personally, I find the former to be clearer than the latter, which
    uses an operator that usually causes side effects but doesn't in this
    case, and I still don't know how many people know its details.

    >> How many people are familiar with tr/// versus plain m//?

    >
    >... "But I didn't know about it!" is only a suitable justification
    >until this problem has been remedied.


    To some extent I agree, but if someone is coding for other people, one
    of the factors that the coder should consider is what is
    comprehensible at a glance, in addition to other factors like brevity,
    efficiency where needed, robustness, and such. For example, for my
    own programs I have no problems with
    my %key_lookup;
    @key_lookup{@keys} = (1) x @keys;
    But since I don't know how many people know that idiom, I might
    hesitate to use it when coding for others, and if I did I would likely
    comment it clearly.

    --
    Tim McDaniel,
    Tim McDaniel, Dec 29, 2011
    #9
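For anyone who has not met the hash-slice idiom mentioned in the post above, a commented sketch follows. The helper name build_lookup and the sample keys are invented for the example.

```perl
use strict;
use warnings;

# @h{@keys} is a hash slice: it addresses several keys at once.
# (1) x @keys builds a list of as many 1s as there are keys, so every
# key ends up mapped to 1 in a single assignment.
sub build_lookup {
    my @keys = @_;
    my %key_lookup;
    @key_lookup{@keys} = (1) x @keys;
    return \%key_lookup;
}

my $lookup = build_lookup(qw(apple banana cherry));
for my $candidate (qw(banana durian)) {
    printf "%-7s %s\n", $candidate, $lookup->{$candidate} ? 'known' : 'unknown';
}
```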
  10. Tim McDaniel wrote:
    > In article <>,
    > Rainer Weikusat <> wrote:
    >> Provided the language provides a more direct way to achieve the same
    >> result ..., the question is not 'why should the built-in way be
    >> preferred' but 'why should something other than the built-in way be
    >> used'

    >
    > In the current case, it's
    > /_.*_/
    > versus
    > tr/_// > 1
    > They both use builtins pretty directly and they are both short.
    > Personally, I find the former to be clearer than the latter, which
    > uses an operator that usually causes side effects but doesn't in this
    > case, and I still don't know how many people know its details.


    tr/_// is pretty simple. It is actually short for tr/_/_/ which
    replaces every '_' character with a '_' character and returns the number
    of replacements made. It has the advantages that it doesn't interpolate
    and it only does one thing, and does it well.



    John
    --
    Any intelligent fool can make things bigger and
    more complex... It takes a touch of genius -
    and a lot of courage to move in the opposite
    direction. -- Albert Einstein
    John W. Krahn, Dec 29, 2011
    #10
  11. David Filmer

    C.DeRykus Guest

    On Dec 27, 2:04 am, David Filmer <> wrote:
    > I want to be able to match the string foo1_bar2_baz3 as having
    > multiple underscore characters (with no intervening whitespace), but
    > not match foo1_bar2 which has only one underscore.  I want to ignore
    > one match, but not two or more.
    >
    > This would be easy if \w did not ALSO match underscores.  But it
    > does.  There does not seem to be a character class for alphanumeric
    > ONLY.
    >
    > How can I match continuous alphanumeric strings which contain more
    > than one underscore?
    >



    Maybe,

    print '>1' if (()= /\G [[:alnum:]]+ _/gx) > 1;

    --
    Charles DeRykus
    C.DeRykus, Dec 29, 2011
    #11
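A sketch of the one-liner above turned into a function that returns the count instead of printing. The name count_alnum_underscore_runs is hypothetical. The \G anchor makes each match start exactly where the previous one ended, so counting stops at the first character that is not part of an "alphanumerics followed by underscore" run.

```perl
use strict;
use warnings;

# Counts consecutive "alphanumerics then underscore" runs starting at
# the beginning of the string. The /x flag lets the pattern contain
# spaces for readability.
sub count_alnum_underscore_runs {
    my ($s) = @_;
    my $count = () = $s =~ /\G [[:alnum:]]+ _/gx;
    return $count;
}

for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux') {
    my $n = count_alnum_underscore_runs($s);
    printf "%-18s runs=%d %s\n", $s, $n, $n > 1 ? '(>1)' : '';
}
```

Note that this is stricter than simply counting underscores: the second underscore in "foo_bar baz_quux" is never reached because the space breaks the chain.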
  12. David Filmer

    Tim McDaniel Guest

    In article <>,
    C.DeRykus <> wrote:
    >On Dec 27, 2:04 am, David Filmer <> wrote:
    >> How can I match continuous alphanumeric strings which contain more
    >> than one underscore?

    >
    >Maybe,
    >
    >print '>1' if (()= /\G [[:alnum:]]+ _/gx) > 1;


    For anyone else who is wondering about the use of ()=, please see "man
    perldata".

    List assignment in scalar context returns the number of elements
    produced by the expression on the right side of the assignment:

    $x = (($foo,$bar) = (3,2,1)); # set $x to 3, not 2
    $x = (($foo,$bar) = f()); # set $x to f()'s return count

    This is handy when you want to do a list assignment in a Boolean
    context, because most list functions return a null list when
    finished, which when assigned produces a 0, which is interpreted
    as FALSE.

    It's also the source of a useful idiom for executing a function or
    performing an operation in list context and then counting the
    number of return values, by assigning to an empty list and then
    using that assignment in scalar context. For example, this code:

    $count = () = $string =~ /\d+/g;

    will place into $count the number of digit groups found in
    $string. This happens because the pattern match is in list
    context (since it is being assigned to the empty list), and will
    therefore return a list of all matching parts of the string. The
    list assignment in scalar context will translate that into the
    number of elements (here, the number of times the pattern matched)
    and assign that to $count. Note that simply using

    $count = $string =~ /\d+/g;

    would not have worked, since a pattern match in scalar context
    will only return true or false, rather than a count of matches.

    --
    Tim McDaniel,
    Tim McDaniel, Dec 29, 2011
    #12
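The perldata excerpt above can be tried directly. This sketch wraps both forms in hypothetical helpers (count_digit_groups and scalar_match_result are names chosen here) to show the difference between the list-context count and the scalar-context result.

```perl
use strict;
use warnings;

# Assigning to the empty list forces list context on the match; the
# outer scalar assignment then yields the number of matches.
sub count_digit_groups {
    my ($string) = @_;
    my $count = () = $string =~ /\d+/g;
    return $count;
}

# Without "() =", the match runs in scalar context and only reports
# success or failure.
sub scalar_match_result {
    my ($string) = @_;
    my $result = $string =~ /\d+/g;
    return $result;
}

my $string = 'a1 b22 c333';
printf "with () = : %d\n", count_digit_groups($string);
printf "without   : %d\n", scalar_match_result($string);
```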
  13. On 2011-12-28, Tim McDaniel <> wrote:
    >>> /_.*_/ is a _clear_ way to say "more than one underscore" ?

    >
    > Very clear to me, at least.
    >
    >>Yes, but it also would match "foo_bar baz_quux" which contains an
    >>intervening whitespace. This would not satisfy the original
    >>requirements, which stipulate finding multiple underscores within
    >>continuous alphanumeric characters with no intervening whitespace.

    >
    > Which is why I wrote
    >>>> /^\w+$/ && /_.*_/


    The first one is not completely equivalent to !/\W/, but when ANDed
    with the second one it is (ignoring the issue with trailing \n, of
    course). Is it more clear? I'm not sure...

    Ilya
    Ilya Zakharevich, Dec 31, 2011
    #13
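A small sketch of the two differences alluded to above between /^\w+$/ and !/\W/: the empty string, and a trailing newline. The helper names anchored_word and no_nonword are invented for the comparison.

```perl
use strict;
use warnings;

sub anchored_word { my ($s) = @_; return $s =~ /^\w+$/ ? 1 : 0 }
sub no_nonword    { my ($s) = @_; return $s !~ /\W/    ? 1 : 0 }

# Each case is [string, printable label].
for my $case (['foo_bar', 'foo_bar'], ['', '(empty)'], ["foo\n", 'foo\n']) {
    my ($s, $label) = @$case;
    printf '%-8s  /^\w+$/: %d   !/\W/: %d' . "\n",
        $label, anchored_word($s), no_nonword($s);
}
```

The empty string passes !/\W/ but not /^\w+$/; once the check is ANDed with /_.*_/, the empty string fails either way, which leaves only the trailing-newline case ($ is allowed to match just before a final newline).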
  14. On 2011-12-29, John W. Krahn <> wrote:
    >> In the current case, it's
    >> /_.*_/
    >> versus
    >> tr/_// > 1
    >> They both use builtins pretty directly and they are both short.
    >> Personally, I find the former to be clearer than the latter, which
    >> uses an operator that usually causes side effects but doesn't in this
    >> case, and I still don't know how many people know its details.

    >
    > tr/_// is pretty simple.


    tr is extremely complicated.

    > It is actually short for tr/_/_/ which
    > replaces every '_' character with a '_' character and returns the number
    > of replacements made. It has the advantages that it doesn't interpolate
    > and it only does one thing, and does it well.


    For which value of "well"? If it is applied to 2GB string, would it
    make a copy of it? If the string is tied to a database entry, would
    it cause a database update? If the string is shared between fork()ed
    processes, would it become unshared after the operation?

    In short: Do you know what you are talking about?

    Best wishes for the new year,
    Ilya
    Ilya Zakharevich, Dec 31, 2011
    #14
  15. Ilya Zakharevich <> writes:
    > On 2011-12-29, John W. Krahn <> wrote:


    [...]

    >> It is actually short for tr/_/_/ which
    >> replaces every '_' character with a '_' character and returns the number
    >> of replacements made. It has the advantages that it doesn't interpolate
    >> and it only does one thing, and does it well.

    >
    > For which value of "well"? If it is applied to 2GB string, would it
    > make a copy of it?


    Not when counting or replacing characters in a non-UTF8 string.

    > If the string is tied to a database entry, would
    > it cause a database update?


    Maybe, maybe not. That would depend on the implementation of the
    tying mechanism.

    > If the string is shared between fork()ed
    > processes, would it become unshared after the operation?


    Strings are not shared between forked processes, memory pages are. As
    soon as any process tries to write to a shared page, it will get its
    own copy for usual COW-implementations.
    Rainer Weikusat, Jan 1, 2012
    #15
  16. David Filmer

    Tim McDaniel Guest

    In article <>,
    Rainer Weikusat <> wrote:
    >Ilya Zakharevich <> writes:
    >> On 2011-12-29, John W. Krahn <> wrote:

    >
    >[...]
    >
    >>> It is actually short for tr/_/_/ which replaces every '_'
    >>> character with a '_' character and returns the number of
    >>> replacements made. It has the advantages that it doesn't
    >>> interpolate and it only does one thing, and does it well.

    >>
    >> For which value of "well"? If it is applied to 2GB string, would
    >> it make a copy of it?

    >
    >Not when counting or replacing characters in a non-UTF8 string.
    >
    >> If the string is tied to a database entry, would
    >> it cause a database update?

    >
    >Maybe, maybe not. That would depend on the implementation of the
    >tying mechanism.
    >
    > [and a forking question]


    I remember Dennis Ritchie's use of the phrase "unwarranted chumminess
    with the C implementation" (in a far more dubious situation). I'm
    hesitant to depend on implementation details unless they're guaranteed
    in the documentation. Particularly with Perl: systems I'm on have
    versions variously between 5.8 and 5.14, so I wonder which versions
    have which optimizations, or indeed if they are done at all.

    On the other hand, when you write the scripts yourself (I do that a
    lot with Perl), you can know whether it does ties, large strings, or
    other unusual cases.

    --
    Tim McDaniel,
    Tim McDaniel, Jan 3, 2012
    #16
  17. (Tim McDaniel) writes:
    > In article <>,
    > Rainer Weikusat <> wrote:
    >>Ilya Zakharevich <> writes:
    >>> On 2011-12-29, John W. Krahn <> wrote:

    >>
    >>[...]
    >>
    >>>> It is actually short for tr/_/_/ which replaces every '_'
    >>>> character with a '_' character and returns the number of
    >>>> replacements made. It has the advantages that it doesn't
    >>>> interpolate and it only does one thing, and does it well.
    >>>
    >>> For which value of "well"? If it is applied to 2GB string, would
    >>> it make a copy of it?

    >>
    >>Not when counting or replacing characters in a non-UTF8 string.
    >>
    >>> If the string is tied to a database entry, would
    >>> it cause a database update?

    >>
    >>Maybe, maybe not. That would depend on the implementation of the
    >>tying mechanism.
    >>
    >> [and a forking question]

    >
    > I remember Dennis Ritchie's use of the phrase "unwarranted chumminess
    > with the C implementation" (in a far more dubious situation). I'm
    > hesitant to depend on implementation details unless they're guaranteed
    > in the documentation.


    What is guaranteed in the documentation today will be 'accidentally
    still in the documentation' tomorrow and 'a deprecated feature which
    must not be used under any circumstances' (on threat of immediate
    excommunication from the universe of all the just and beautiful
    people) two days later, so that doesn't really buy you anything :->.

    OTOH, it is sensible to assume that - usually - the people who wrote
    the implementation will have tried to make it behave sensibly and in
    this case, that tr/// will neither copy nor modify the string except
    if this is necessary to perform the requested operation.

    Re: tied scalars

    What will happen when an operation is performed on a scalar tied to
    something depends on the class/ module used to provide the tied
    semantics and this can be anything, so the question didn't really make
    sense: This class or module may well cause 'a database update' even
    though perl didn't modify the data.
    Rainer Weikusat, Jan 3, 2012
    #17
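Since the behaviour with tied scalars is argued here from general principles, one way to check a particular perl is to count the tie callbacks directly. The sketch below is only a probe, not a statement about what any given version does: CountingScalar and probe_tr_count are hypothetical names, and the only library pieces used are the core Tie::Scalar module and its Tie::StdScalar base class.

```perl
use strict;
use warnings;
use Tie::Scalar;    # core module; also defines Tie::StdScalar

{
    # CountingScalar is a made-up class that counts tie callbacks.
    package CountingScalar;
    our @ISA = ('Tie::StdScalar');
    our ($fetches, $stores) = (0, 0);

    sub FETCH { my $self = shift; $fetches++; return $self->SUPER::FETCH(@_) }
    sub STORE { my $self = shift; $stores++;  return $self->SUPER::STORE(@_) }
}

# Returns (underscore count, FETCH calls, STORE calls) for one tr/_//.
sub probe_tr_count {
    my ($text) = @_;
    tie my $s, 'CountingScalar', $text;
    ($CountingScalar::fetches, $CountingScalar::stores) = (0, 0);
    my $n = ($s =~ tr/_//);
    return ($n, $CountingScalar::fetches, $CountingScalar::stores);
}

my ($n, $fetches, $stores) = probe_tr_count('foo1_bar2_baz3');
print "underscores=$n FETCH=$fetches STORE=$stores\n";
```

A STORE count above zero would mean the counting form writes back through the tie on that perl; zero would mean it does not. The result is deliberately not stated here, since it is exactly the implementation detail under dispute.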
  18. David Filmer

    Guest

    On Tue, 27 Dec 2011 12:40:04 +0000, Ben Morrow <> wrote:

    >
    >Quoth David Filmer <>:
    >> I want to be able to match the string foo1_bar2_baz3 as having
    >> multiple underscore characters (with no intervening whitespace), but
    >> not match foo1_bar2 which has only one underscore. I want to ignore
    >> one match, but not two or more.
    >>
    >> This would be easy if \w did not ALSO match underscores. But it
    >> does. There does not seem to be a character class for alphanumeric
    >> ONLY.

    >


    If I understand you correctly from your example, this may work.
    /^[^\W_]+(?:_[^\W_]+){2,}$/

    -sln
    , Jan 4, 2012
    #18
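A sketch exercising the pattern above on a few inputs. The helper name strict_segments_match and the extra test strings are assumptions added for illustration. Note that {2,} requires at least two underscore-separated segments after the first, and that each segment must be non-empty, so doubled, leading, or trailing underscores are rejected.

```perl
use strict;
use warnings;

# One alphanumeric segment, then two or more "_segment" groups,
# anchored at both ends.
sub strict_segments_match {
    my ($s) = @_;
    return $s =~ /^[^\W_]+(?:_[^\W_]+){2,}$/ ? 1 : 0;
}

for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'a_b_c_d', 'foo__bar_baz', '_a_b') {
    printf "%-16s %s\n", $s, strict_segments_match($s) ? 'match' : 'no match';
}
```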
  19. On 2012-01-03, Rainer Weikusat <> wrote:
    >>>> For which value of "well"? If it is applied to 2GB string, would
    >>>> it make a copy of it?
    >>>
    >>>Not when counting or replacing characters in a non-UTF8 string.


    So I read it as: "it will" (with certain exceptions).

    >>>> If the string is tied to a database entry, would
    >>>> it cause a database update?
    >>>
    >>>Maybe, maybe not. That would depend on the implementation of the
    >>>tying mechanism.


    Again...

    > OTOH, it is sensible to assume that - usually - the people who wrote
    > the implementation will have tried to make it behave sensibly and in
    > this case, that tr/// will neither copy nor modify the string except
    > if this is necessary to perform the requested operation.


    Not applicable to Perl (in general). A lot of stuff is majorly pessimized.

    > Re: tied scalars
    >
    > What will happen when an operation is performed on a scalar tied to
    > something depends on the class/ module used to provide the tied
    > semantics and this can be anything, so the question didn't really make
    > sense: This class or module may well cause 'a database update' even
    > though perl didn't modify the data.


    This is true "literally", but AFAIK, not applicable to any situation I
    know.

    Essentially, for me all this boils down to: do not use tr/// unless
    you can't avoid it, or know EXACTLY how and when your code is going to
    be used...

    Ilya
    Ilya Zakharevich, Jan 10, 2012
    #19
  20. Ben Morrow <> writes:

    [...]

    > Everyone now knows that using UTF-8 was a mistake,


    That's not something "everyone knows" and in fact, some people were so
    convinced that UTF-8 would be a sensible choice that they implemented
    complete operating systems based on using UTF-8 as native character
    encoding (that would be "Plan9"). This should rather be "every member
    of some small group of people" (people currently working on Perl
    Unicode support?) is strongly convinced that choosing UTF-8 was a
    mistake (and I'd wager a bet that the base reason for this is "that's
    not what Microsoft did and consequently, it must be WRONG !!1").
    Rainer Weikusat, Jan 10, 2012
    #20
