regexp for matching a string with mandatory underscores

Discussion in 'Perl Misc' started by David Filmer, Dec 27, 2011.

  1. David Filmer

    David Filmer Guest

    I want to be able to match the string foo1_bar2_baz3 as having
    multiple underscore characters (with no intervening whitespace), but
    not match foo1_bar2 which has only one underscore. I want to ignore
    one match, but not two or more.

    This would be easy if \w did not ALSO match underscores. But it
    does. There does not seem to be a character class for alphanumeric
    ONLY.

    How can I match continuous alphanumeric strings which contain more
    than one underscore?

    Thanks!
     
    David Filmer, Dec 27, 2011
    #1

  2. ??? [^\W_]

    Ilya
     
    Ilya Zakharevich, Dec 27, 2011
    #2
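
    A minimal sketch of the character class suggested above. `[^\W_]` reads as "not a non-word character and not an underscore", which leaves letters and digits. The helper name `is_alnum_only` and the sample strings are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: true if the string consists only of
    # word characters other than the underscore.
    sub is_alnum_only {
        my ($s) = @_;
        return $s =~ /\A[^\W_]+\z/ ? 1 : 0;
    }

    for my $s (qw(foo1 foo1_bar2 _)) {
        printf "%-10s %s\n", $s,
            is_alnum_only($s) ? 'alphanumeric only' : 'contains something else';
    }
    ```

    The POSIX class [[:alnum:]], used later in this thread, is another way to spell the same idea.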

  3. Tim McDaniel

    Tim McDaniel Guest

    Is it OK to use more than one regexp? If so, I might try
    /^\w+$/ && /_.*_/
    It's a bit brute-force, but it's also very clear. The second could be
    optimized to /_[^_]*_/, but unless you're evaluating it lots of times,
    "micro-optimizations lead to micro-results".
     
    Tim McDaniel, Dec 27, 2011
    #3
  4. David Filmer

    David Filmer Guest

    Yes, but it also would match "foo_bar baz_quux" which contains an
    intervening whitespace. This would not satisfy the original
    requirements, which stipulate finding multiple underscores within
    continuous alphanumeric characters with no intervening whitespace.
     
    David Filmer, Dec 28, 2011
    #4
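
    For reference, a small sketch of how the two-pattern test from post #3 behaves. It assumes each candidate is tested as a whole string, not searched for inside a longer line of text. Under that assumption the anchored /^\w+$/ already rejects anything containing whitespace. The helper name is hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: both patterns are applied to the same whole string.
    sub has_multiple_underscores {
        my ($s) = @_;
        return ($s =~ /^\w+$/ && $s =~ /_.*_/) ? 1 : 0;
    }

    for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux') {
        printf "%-18s %s\n", $s,
            has_multiple_underscores($s) ? 'match' : 'no match';
    }
    ```

    If the goal is instead to find such tokens inside running text, the anchors no longer help and a per-token approach is needed.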
  5. Willem

    Willem Guest

    David Filmer wrote:
    ) I want to be able to match the string foo1_bar2_baz3 as having
    ) multiple underscore characters (with no intervening whitespace), but
    ) not match foo1_bar2 which has only one underscore. I want to ignore
    ) one match, but not two or more.
    )
    ) This would be easy if \w did not ALSO match underscores. But it
    ) does. There does not seem to be a character class for alphanumeric
    ) ONLY.

    How would that make it easy?

    Doesn't the following work? : m/\w*_\w*_\w*/


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Dec 28, 2011
    #5
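
    A short sketch of the unanchored pattern above applied to a line of text; the sample line is made up. Because \w never matches whitespace, a match cannot span two tokens. Note that the pattern also accepts leading, trailing, and adjacent underscores (a bare "__" matches), which may or may not be wanted.

    ```perl
    use strict;
    use warnings;

    # Sample line, made up for illustration.
    my $line = 'foo1_bar2_baz3 foo1_bar2 foo_bar baz_quux';

    # In list context with /g, the match returns every captured token.
    my @hits = $line =~ /(\w*_\w*_\w*)/g;

    print "$_\n" for @hits;
    ```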
  6. Tim McDaniel

    Tim McDaniel Guest

    Very clear to me, at least.
    Which is why I wrote
     
    Tim McDaniel, Dec 28, 2011
    #6
  7. Tim McDaniel

    Tim McDaniel Guest

    There are times to apply the phrase "The way" to Perl, but I don't
    know yet that this is one of them.
    What are your reasons to think one better than the other?

    Unless the expression is being evaluated many times, efficiency isn't
    so important.

    How many people are familiar with tr/// versus plain m//? I rarely
    use tr///. I don't remember ever using the return value of tr///.
    I've never used the empty RHS feature except with /d (indeed, I had to
    check the man page to see that you hadn't trashed $_).
     
    Tim McDaniel, Dec 28, 2011
    #7
  8. A subroutine I encountered in the past in some script written by
    someone else was

    sub mod($$) { return $_[0] - $_[1] * int($_[0] / $_[1]); }

    Actually, it wasn't a subroutine but an inline calculation. Provided
    the language provides a more direct way to achieve the same result
    (the % operator), the question is not 'why should the built-in way be
    preferred' but 'why should something other than the built-in way be
    used' and ...
    .... "But I didn't know about it!" is only a suitable justification
    until this problem has been remedied.
     
    Rainer Weikusat, Dec 28, 2011
    #8
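
    One caveat when swapping such a calculation for the built-in: the two agree for non-negative operands, but int() truncates toward zero while Perl's % (without "use integer") gives a result with the sign of the right operand. A small sketch, with the sub renamed to trunc_mod here purely for illustration:

    ```perl
    use strict;
    use warnings;

    # The hand-written calculation quoted above, under a hypothetical name.
    sub trunc_mod { return $_[0] - $_[1] * int($_[0] / $_[1]); }

    for my $pair ([7, 3], [-7, 3], [7, -3]) {
        my ($m, $n) = @$pair;
        printf "%3d mod %2d: hand-written %2d, builtin %2d\n",
            $m, $n, trunc_mod($m, $n), $m % $n;
    }
    ```

    So the replacement is only a drop-in one if negative operands cannot occur or the difference is intended.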
  9. Tim McDaniel

    Tim McDaniel Guest

    In the current case, it's
    /_.*_/
    versus
    tr/_// > 1
    They both use builtins pretty directly and they are both short.
    Personally, I find the former to be clearer than the latter, which
    uses an operator that usually causes side effects but doesn't in this
    case, and I still don't know how many people know its details.
    To some extent I agree, but if someone is coding for other people, one
    of the factors that the coder should consider is what is
    comprehensible at a glance, in addition to other factors like brevity,
    efficiency where needed, robustness, and such. For example, for my
    own programs I have no problems with
    my %key_lookup;
    @key_lookup{@keys} = (1) x @keys;
    But since I don't know how many people know that idiom, I might
    hesitate to use it when coding for others, and if I did I would likely
    comment it clearly.
     
    Tim McDaniel, Dec 29, 2011
    #9
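
    For readers who have not met the hash-slice idiom mentioned above, a commented sketch; the contents of @keys are made up for illustration.

    ```perl
    use strict;
    use warnings;

    my @keys = qw(alpha beta gamma);   # sample data

    my %key_lookup;
    # Hash slice assignment: every key in @keys gets the value 1.
    # (1) x @keys repeats the one-element list once per key.
    @key_lookup{@keys} = (1) x @keys;

    print "beta is a known key\n" if $key_lookup{beta};
    print "delta is not known\n"  unless $key_lookup{delta};
    ```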
  10. tr/_// is pretty simple. It is actually short for tr/_/_/ which
    replaces every '_' character with a '_' character and returns the number
    of replacements made. It has the advantages that it doesn't interpolate
    and it only does one thing, and does it well.



    John
     
    John W. Krahn, Dec 29, 2011
    #10
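
    A minimal sketch of counting with tr///, as described above. With an empty replacement list and no /d modifier the string is left unchanged and the return value is the number of characters matched. The helper name is hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: counts underscores without changing the string.
    sub count_underscores {
        my ($s) = @_;
        return ($s =~ tr/_//);
    }

    for my $s (qw(foo1_bar2_baz3 foo1_bar2 foo)) {
        my $n = count_underscores($s);
        printf "%-15s %d underscore(s), %s\n", $s, $n,
            $n > 1 ? 'multiple' : 'not multiple';
    }
    ```

    On its own this says nothing about whitespace; it counts underscores anywhere in the string, so it would need to be combined with a test such as /^\w+$/ to meet the original requirement.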
  11. C.DeRykus

    C.DeRykus Guest


    Maybe,

    print '>1' if (()= /\G [[:alnum:]]+ _/gx) > 1;
     
    C.DeRykus, Dec 29, 2011
    #11
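
    The one-liner above works on $_. A sketch of the same idea wrapped in a sub (the name is hypothetical) so it can be tried on several strings. \G makes each match start where the previous one ended, so counting stops at the first character that breaks the chain of alphanumerics followed by an underscore.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: counts leading "alphanumerics + underscore" runs.
    sub count_alnum_underscore_runs {
        my ($s) = @_;
        # List assignment in scalar context yields the number of matches.
        my $count = () = $s =~ /\G [[:alnum:]]+ _/gx;
        return $count;
    }

    for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux') {
        printf "%-18s %s\n", $s,
            count_alnum_underscore_runs($s) > 1 ? '>1' : 'not >1';
    }
    ```

    The pattern does not look at what follows the last underscore, so whether the rest of the string is still alphanumeric would need a separate check.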
  12. Tim McDaniel

    Tim McDaniel Guest

    For anyone else who is wondering about the use of ()=, please see "man
    perldata".

    List assignment in scalar context returns the number of elements
    produced by the expression on the right side of the assignment:

    $x = (($foo,$bar) = (3,2,1)); # set $x to 3, not 2
    $x = (($foo,$bar) = f()); # set $x to f()'s return count

    This is handy when you want to do a list assignment in a Boolean
    context, because most list functions return a null list when
    finished, which when assigned produces a 0, which is interpreted
    as FALSE.

    It's also the source of a useful idiom for executing a function or
    performing an operation in list context and then counting the
    number of return values, by assigning to an empty list and then
    using that assignment in scalar context. For example, this code:

    $count = () = $string =~ /\d+/g;

    will place into $count the number of digit groups found in
    $string. This happens because the pattern match is in list
    context (since it is being assigned to the empty list), and will
    therefore return a list of all matching parts of the string. The
    list assignment in scalar context will translate that into the
    number of elements (here, the number of times the pattern matched)
    and assign that to $count. Note that simply using

    $count = $string =~ /\d+/g;

    would not have worked, since a pattern match in scalar context
    will only return true or false, rather than a count of matches.
     
    Tim McDaniel, Dec 29, 2011
    #12
  13. The first one is not completely equivalent to !/\W/, but when ANDed
    with the second one it is (ignoring the issue with trailing \n, of
    course). Is it more clear? I'm not sure...

    Ilya
     
    Ilya Zakharevich, Dec 31, 2011
    #13
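
    A sketch of the differences alluded to above, using made-up sample strings and a hypothetical helper. /^\w+$/ accepts a string with one trailing newline because $ can match just before it. !/\W/ rejects that string but accepts the empty string. \A...\z is the strict form.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns three 0/1 flags for one string.
    sub word_checks {
        my ($s) = @_;
        return (
            ($s =~ /^\w+$/   ? 1 : 0),   # $ also matches before a final newline
            ($s =~ /\A\w+\z/ ? 1 : 0),   # \z matches only at the very end
            ($s !~ /\W/      ? 1 : 0),   # true for the empty string as well
        );
    }

    for my $s ("foo_bar_baz", "foo_bar_baz\n", "") {
        (my $shown = $s) =~ s/\n/\\n/g;   # make the newline visible in output
        printf "%-16s anchored=%d strict=%d negated=%d\n",
            "'$shown'", word_checks($s);
    }
    ```

    Once /_.*_/ is ANDed in, the empty string is excluded anyway, which leaves the trailing newline as the remaining difference.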
  14. tr is extremely complicated.
    For which value of "well"? If it is applied to 2GB string, would it
    make a copy of it? If the string is tied to a database entry, would
    it cause a database update? If the string is shared between fork()ed
    processes, would it become unshared after the operation?

    In short: Do you know what you are talking about?

    Best wishes for the new year,
    Ilya
     
    Ilya Zakharevich, Dec 31, 2011
    #14
    Not when counting or replacing characters in a non-UTF8 string.
    Maybe, maybe not. That would depend on the implementation of the tying mechanism.
    Strings are not shared between forked processes, memory pages are. As
    soon as any process tries to write to a shared page, it will get its
    own copy for usual COW-implementations.
     
    Rainer Weikusat, Jan 1, 2012
    #15
  16. Tim McDaniel

    Tim McDaniel Guest

    I remember Dennis Ritchie's use of the phrase "unwarranted chumminess
    with the C implementation" (in a far more dubious situation). I'm
    hesitant to depend on implementation details unless they're guaranteed
    in the documentation. Particularly with Perl: systems I'm on have
    versions variously between 5.8 and 5.14, so I wonder which versions
    have which optimizations, or indeed if they are done at all.

    On the other hand, when you write the scripts yourself (I do that a
    lot with Perl), you can know whether it does ties, large strings, or
    other unusual cases.
     
    Tim McDaniel, Jan 3, 2012
    #16
  17. What is guaranteed in the documentation today will be 'accidentally
    still in the documentation' tomorrow and 'a deprecated feature which
    must not be used under any circumstances' (on threat of immediate
    excommunication from the universe of all the just and beautiful
    people) two days later, so that doesn't really buy you anything :->.

    OTOH, it is sensible to assume that - usually - the people who wrote
    the implementation will have tried to make it behave sensibly and in
    this case, that tr/// will neither copy nor modify the string except
    if this is necessary to perform the requested operation.

    Re: tied scalars

    What will happen when an operation is performed on a scalar tied to
    something depends on the class/ module used to provide the tied
    semantics and this can be anything, so the question didn't really make
    sense: This class or module may well cause 'a database update' even
    though perl didn't modify the data.
     
    Rainer Weikusat, Jan 3, 2012
    #17
  18. sln

    sln Guest

    If I understand you correctly from your example, this may work.
    /^[^\W_]+(?:_[^\W_]+){2,}$/

    -sln
     
    sln, Jan 4, 2012
    #18
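
    A quick sketch exercising the single-pattern version above; the helper name and the last two sample strings are made up. This pattern is stricter than the others in the thread: it requires alphanumerics on both sides of every underscore, so leading, trailing, or doubled underscores are rejected. Whether that is desired is not stated in the original question.

    ```perl
    use strict;
    use warnings;

    # Alphanumerics, then at least two "_alphanumerics" groups.
    my $re = qr/^[^\W_]+(?:_[^\W_]+){2,}$/;

    # Hypothetical helper wrapping the pattern.
    sub matches_strict { return $_[0] =~ $re ? 1 : 0 }

    for my $s ('foo1_bar2_baz3', 'foo1_bar2', 'foo_bar baz_quux', 'a__b_c', '_a_b') {
        printf "%-18s %s\n", $s, matches_strict($s) ? 'match' : 'no match';
    }
    ```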
  19. So I read it as: "it will" (with certain exceptions).
    Not applicable to Perl (in general). A lot of stuff is majorly pessimized.
    This is true "literally", but AFAIK, not applicable to any situation I
    know.

    Essentially, for me all this boils down to: do not use tr/// unless
    you can't avoid it, or know EXACTLY how and when your code is going to
    be used...

    Ilya
     
    Ilya Zakharevich, Jan 10, 2012
    #19
  20. [...]
    That's not something "everyone knows" and in fact, some people were so
    convinced that UTF-8 would be a sensible choice that they implemented
    complete operating systems based on using UTF-8 as native character
    encoding (that would be "Plan9"). This should rather be "every member
    of some small group of people" (people currently working on Perl
    Unicode support?) are strongly convinced that choosing UTF-8 was a
    mistake (and I'd wager a bet that the base reason for this is "that's
    not what Microsoft did and consequently, it must be WRONG !!1").
     
    Rainer Weikusat, Jan 10, 2012
    #20