Regex: Any character in character class

Discussion in 'Java' started by Sebastian, Jan 30, 2013.

  1. Sebastian

    Sebastian Guest

    I want to match any sequence of characters, including line breaks, in a
    suffix of a multi-line string.

    I do not want to use Pattern.DOTALL, because line breaks are not
    permissible everywhere. I cannot write [.]* because dot loses its
    special meaning inside a character class.

    I have come up with [\S\s]*
    as meaning any sequence of non-whitespace or whitespace (incl.
    line-breaks). Is there a better way?

    -- Sebastian
    Sebastian, Jan 30, 2013
    #1
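A minimal sketch of the behaviour described in the question, using only java.util.regex. The class name AnyCharDemo and the helper matchesAll are made up for illustration; the patterns are the ones named in the post.

```java
import java.util.regex.Pattern;

class AnyCharDemo {
    // [\s\S] is the union of whitespace and non-whitespace, i.e. every character.
    static final Pattern ANY = Pattern.compile("[\\s\\S]*");
    // Inside a character class the dot is a literal '.'.
    static final Pattern LITERAL_DOTS = Pattern.compile("[.]*");
    // Without DOTALL, a bare dot does not match line terminators.
    static final Pattern PLAIN_DOT = Pattern.compile(".*");

    // True if the pattern matches the entire string.
    static boolean matchesAll(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";
        System.out.println(matchesAll(ANY, text));           // true
        System.out.println(matchesAll(PLAIN_DOT, text));     // false
        System.out.println(matchesAll(LITERAL_DOTS, text));  // false
        System.out.println(matchesAll(LITERAL_DOTS, "...")); // true
    }
}
```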

  2. What about [^]?
    Mikhail Vladimirov, Jan 30, 2013
    #2

  3. Another option is .|\n
    Mikhail Vladimirov, Jan 30, 2013
    #3
  4. Sebastian

    Arne Vajhøj Guest

    On 1/30/2013 5:05 AM, Mikhail Vladimirov wrote:
    > What about [^]?


    java.util.regex.PatternSyntaxException

    Arne
    Arne Vajhøj, Jan 31, 2013
    #4
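The exception reported above can be reproduced with a small check. This is only a sketch: the class EmptyNegationDemo and the helper compiles are hypothetical names, and the last line tries the `.|\n` alternative from post #3 wrapped in a non-capturing group.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EmptyNegationDemo {
    // Returns true if the regex compiles under java.util.regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Java does not accept an empty negated class.
        System.out.println(compiles("[^]"));      // false
        System.out.println(compiles("[\\s\\S]")); // true
        // The alternation form compiles and matches across '\n'.
        System.out.println(Pattern.compile("(?:.|\\n)*").matcher("a\nb").matches()); // true
    }
}
```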
  5. Sebastian

    Arne Vajhøj Guest

    On 1/30/2013 4:34 AM, Sebastian wrote:
    > I want to match any sequence of characters, including line breaks, in a
    > suffix of a multi-line string.
    >
    > I do not want to use Pattern.DOTALL, because line breaks are not
    > permissible everywhere. I cannot write [.]* because dot loses its
    > special meaning inside a character class.
    >
    > I have come up with [\S\s]*
    > as meaning any sequence of non-whitespace or whitespace (incl.
    > line-breaks). Is there a better way?


    Do you always want to accept line breaks or not? If not then when?

    Arne
    Arne Vajhøj, Jan 31, 2013
    #5
  6. On 01/30/2013 11:27 PM, Arne Vajhøj wrote:
    > On 1/30/2013 4:34 AM, Sebastian wrote:
    >> I want to match any sequence of characters, including line breaks, in a
    >> suffix of a multi-line string.
    >>
    >> I do not want to use Pattern.DOTALL, because line breaks are not
    >> permissible everywhere. I cannot write [.]* because dot loses its
    >> special meaning inside a character class.
    >>
    >> I have come up with [\S\s]*
    >> as meaning any sequence of non-whitespace or whitespace (incl.
    >> line-breaks). Is there a better way?

    >
    > Do you always want to accept line breaks or not? If not then when?
    >
    > Arne
    >
    >

    Good question.

    I take it the suffix is a generic last-N characters of the string
    (Assumption #1). I take it that line breaks are OK in the suffix, not
    necessarily so in the rest of the string (Assumption #2).

    If you don't mind me asking, why don't you just grab the suffix, the
    last N characters, with substring()? That *is* your match.

    AHS
    Arved Sandstrom, Feb 1, 2013
    #6
  7. Sebastian

    Sebastian Guest

    On 31.01.2013 04:27, Arne Vajhøj wrote:
    > On 1/30/2013 4:34 AM, Sebastian wrote:
    >> I want to match any sequence of characters, including line breaks, in a
    >> suffix of a multi-line string.
    >>
    >> I do not want to use Pattern.DOTALL, because line breaks are not
    >> permissible everywhere. I cannot write [.]* because dot loses its
    >> special meaning inside a character class.
    >>
    >> I have come up with [\S\s]*
    >> as meaning any sequence of non-whitespace or whitespace (incl.
    >> line-breaks). Is there a better way?

    >
    > Do you always want to accept line breaks or not? If not then when?
    >
    > Arne
    >
    >

    The string I want to match basically has two parts (a "protocol" and a
    "selection expression"). I want to allow line breaks anywhere in the
    selection expression, but not in the protocol.
    -- S.
    Sebastian, Feb 1, 2013
    #7
  8. Sebastian

    Lew Guest

    Sebastian wrote:
    > the string I want to match basically has two parts (a "protocol" and a
    > "selection expression"). I want to allow line breaks anywhere in the
    > selection expression, but not in the protocol.


    How do you tell which part is which?

    --
    Lew
    Lew, Feb 1, 2013
    #8
  9. Sebastian

    Arne Vajhøj Guest

    On 2/1/2013 3:14 PM, Sebastian wrote:
    > On 31.01.2013 04:27, Arne Vajhøj wrote:
    >> On 1/30/2013 4:34 AM, Sebastian wrote:
    >>> I want to match any sequence of characters, including line breaks, in a
    >>> suffix of a multi-line string.
    >>>
    >>> I do not want to use Pattern.DOTALL, because line breaks are not
    >>> permissible everywhere. I cannot write [.]* because dot loses its
    >>> special meaning inside a character class.
    >>>
    >>> I have come up with [\S\s]*
    >>> as meaning any sequence of non-whitespace or whitespace (incl.
    >>> line-breaks). Is there a better way?

    >>
    >> Do you always want to accept line breaks or not? If not then when?


    > the string I want to match basically has two parts (a "protocol" and a
    > "selection expression"). I want to allow line breaks anywhere in the
    > selection expression, but not in the protocol.


    Do you have a separator between the two parts like colon in URL's?

    If yes then something like:

    [.]+:[.|\n]+

    Arne
    Arne Vajhøj, Feb 1, 2013
    #9
  10. Sebastian

    markspace Guest

    On 2/1/2013 1:47 PM, Arne Vajhøj wrote:

    > [.]+:[.|\n]+



    Watch out for this. +, being greedy, will match a : in the selection
    expression (the 2nd part) if : is allowed in the second part.

    The reluctant modifier might be a better idea here:

    .+?:[.|\n]+

    Note that I don't think the initial brackets [] were needed. Also we're
    yet again starting to see the problem with regex: it always evolves into
    something that looks like your cat walked across the keyboard.
    markspace, Feb 1, 2013
    #10
  11. Sebastian

    Arne Vajhøj Guest

    On 2/1/2013 5:06 PM, markspace wrote:
    > On 2/1/2013 1:47 PM, Arne Vajhøj wrote:
    >
    >> [.]+:[.|\n]+

    >
    >
    > Watch out for this. +, being greedy, will match a : in the selection
    > expression (the 2nd part) if : is allowed in the second part.
    >
    > The reluctant modifier might be a better idea here:
    >
    > .+?:[.|\n]+
    >
    > Note that I don't think the initial brackets [] were needed. Also we're
    > yet again starting to see the problem with regex: it always evolves into
    > something that looks like your cat walked across the keyboard.


    You are absolutely right.

    Non greedy.

    No square brackets for first part.

    And also round brackets for the last part.

    .+?:(.|\n)+

    I think I must have set a new world record. 3 bugs in 12 characters.

    :-(

    Arne
    Arne Vajhøj, Feb 1, 2013
    #11
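The three fixes discussed above can be checked side by side. This is a sketch under assumptions: the input string, the class SeparatorDemo and the helper protocol are invented for illustration, and a capturing group is added around the first part so the difference is visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorDemo {
    // Returns the text captured before the separator, or null on mismatch.
    static String protocol(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "proto:sel:a\nsel:b"; // hypothetical input
        // Greedy: the first part runs up to the last ':' on the first line.
        System.out.println(protocol("(.+):(?:.|\\n)+", input));  // proto:sel
        // Reluctant: the first part stops at the first ':'.
        System.out.println(protocol("(.+?):(?:.|\\n)+", input)); // proto
        // [.|\n] is a class of three literal characters, so this fails.
        System.out.println(protocol("(.+?):[.|\\n]+", input));   // null
    }
}
```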
  12. On 01.02.2013 21:14, Sebastian wrote:
    > On 31.01.2013 04:27, Arne Vajhøj wrote:
    >> On 1/30/2013 4:34 AM, Sebastian wrote:
    >>> I want to match any sequence of characters, including line breaks, in a
    >>> suffix of a multi-line string.
    >>>
    >>> I do not want to use Pattern.DOTALL, because line breaks are not
    >>> permissible everywhere. I cannot write [.]* because dot loses its
    >>> special meaning inside a character class.
    >>>
    >>> I have come up with [\S\s]*
    >>> as meaning any sequence of non-whitespace or whitespace (incl.
    >>> line-breaks). Is there a better way?


    Yes.

    >> Do you always want to accept line breaks or not? If not then when?


    > the string I want to match basically has two parts (a "protocol" and a
    > "selection expression"). I want to allow line breaks anywhere in the
    > selection expression, but not in the protocol.


    Of course you can use DOTALL - as an embedded flag:

    package rx;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Dotty {

        // (?s:...) switches DOTALL on for the enclosed group only.
        private static final Pattern PAT =
            Pattern.compile("proto.*(?s:sel.*)");

        public static void main(String[] args) {
            test("protoPselS");
            test("protoPPselS\nS");
            test("protoP\nPselS\nS");
        }

        public static void test(final CharSequence cs) {
            System.out.println("cs=\"" + cs + "\"");
            final Matcher m = PAT.matcher(cs);

            if (m.matches()) {
                System.out.println("Match: \"" + m.group() + "\"");
            } else {
                System.out.println("Mismatch");
            }

            System.out.println();
        }

    }

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Feb 1, 2013
    #12
  13. Sebastian

    Sebastian Guest

    On 01.02.2013 23:13, Arne Vajhøj wrote:
    [snip]
    > And also round brackets for the last part.
    >
    > .+?:(.|\n)+
    >
    > I think I must have set a new world record. 3 bugs in 12 characters.
    >
    > :-(
    >
    > Arne
    >

    Here's a concrete example:

    SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
    company:company]


    The second part is everything after the first comma. I was using
    (.+?),[\s\S]+

    Arne's suggestion modified for my needs (comma as separator, and I only
    want to capture the first part as a group) will work fine as well:
    (.+?),(?:.|\n)+

    Can't say though that I find anything to prefer the one to the other.
    Perhaps the second looks even more like the result of a cat walk...

    -- Sebastian
    Sebastian, Feb 2, 2013
    #13
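Both patterns from the post above can be tried on the concrete example. This is a sketch, not the poster's code: the class SelectionDemo and the helper firstPart are hypothetical, the position of the line break inside the string is assumed from the way the example wraps, and the CRLF variant is an added test case. It suggests one possible reason to prefer `[\s\S]`: by default the dot does not match `\r`, so `(?:.|\n)+` stops at a Windows-style line break.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectionDemo {
    // The example from the post; where the line break falls is an assumption.
    static final String LF =
        "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\ncompany:company]";
    // Same text with a Windows-style line break (hypothetical variant).
    static final String CRLF = LF.replace("\n", "\r\n");

    // Returns group 1 if the whole input matches, otherwise null.
    static String firstPart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstPart("(.+?),[\\s\\S]+", LF));    // SCA:LIST
        System.out.println(firstPart("(.+?),(?:.|\\n)+", LF));   // SCA:LIST
        System.out.println(firstPart("(.+?),[\\s\\S]+", CRLF));  // SCA:LIST
        System.out.println(firstPart("(.+?),(?:.|\\n)+", CRLF)); // null
    }
}
```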
  14. Sebastian

    markspace Guest

    On 2/2/2013 11:45 AM, Sebastian wrote:
    > SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
    > company:company]


    For something this simple you might want to consider just String::split().

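    // Needs: import java.util.Arrays;
    String test =
        "SCA:LIST,select[werks_s:default_plant],values[bukrs:bukrs,company:company]";
    // Limit 2: split only at the first comma, dropping whitespace after it.
    String[] parse = test.split( ",\\s*", 2 );
    System.out.println( Arrays.toString( parse ) );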

    This could be faster since the second half of the regex, (?:.|\n)+,
    doesn't have to execute.
    markspace, Feb 2, 2013
    #14
  15. Sebastian

    Arne Vajhøj Guest

    On 2/2/2013 2:45 PM, Sebastian wrote:
    > On 01.02.2013 23:13, Arne Vajhøj wrote:
    > [snip]
    >> And also round brackets for the last part.
    >>
    >> .+?:(.|\n)+
    >>
    >> I think I must have set a new world record. 3 bugs in 12 characters.
    >>
    >> :-(
    >>

    > Here's a concrete example:
    >
    > SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
    > company:company]
    >
    >
    > The second part is everything after the first comma. I was using
    > (.+?),[\s\S]+
    >
    > Arne's suggestion modified for my needs (comma as separator, and I only
    > want to capture the first part as a group) will work fine as well:
    > (.+?),(?:.|\n)+
    >
    > Can't say though that I find anything to prefer the one to the other.
    > Perhaps the second looks even more like the result of a cat walk...


    It is not unusual that there is more than one regex that
    does the job.

    Arne
    Arne Vajhøj, Feb 2, 2013
    #15
  16. Sebastian

    Lew Guest

    Arne Vajhøj wrote:
    > Sebastian wrote:
    >> Arne Vajhøj wrote:
    >> [snip]
    >>> And also round brackets for the last part.
    >>>
    >>> .+?:(.|\n)+
    >>>
    >>> I think I must have set a new world record. 3 bugs in 12 characters.

    >
    >>> :-(

    >
    >> Here's a concrete example:
    >>
    >> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
    >> company:company]

    >
    >> The second part is everything after the first comma. I was using


    You mean 'expression.substring(expression.indexOf(',') + 1)'?
    (modulo the usual error checks, of course)

    > > (.+?),[\s\S]+


    >> Arne's suggestion modified for my needs (comma as separator, and I only
    >> want to capture the first part as a group) will work fine as well:


    You mean 'expression.substring(0, expression.indexOf(','))'?

    > > (.+?),(?:.|\n)+

    >
    >> Can't say though that I find anything to prefer the one to the other.
    >> Perhaps the second looks even more like the result of a cat walk...


    If all you need to do is split a string on a comma, why use regexes at all?

    > It is not unusual that there is more than one regex that
    > does the job.


    It is not unusual that there is more than one non-regex that does the job.

    --
    Lew
    Lew, Feb 2, 2013
    #16
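The non-regex approach suggested above might look like the following sketch, with one of "the usual error checks" filled in. The class CommaSplitDemo and the method splitAtFirstComma are hypothetical names, and returning null when no comma is present is just one possible convention.

```java
class CommaSplitDemo {
    // Returns {head, tail} split at the first comma, or null if there is none.
    static String[] splitAtFirstComma(String expression) {
        int comma = expression.indexOf(',');
        if (comma < 0) {
            return null; // no separator, so there is no second part
        }
        return new String[] {
            expression.substring(0, comma),
            expression.substring(comma + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstComma("SCA:LIST, select[a:b],\nvalues[c:d]");
        System.out.println(parts[0]);                     // SCA:LIST
        System.out.println(parts[1].contains("\n"));      // true
        System.out.println(splitAtFirstComma("SCA:LIST")); // null
    }
}
```

Line breaks in the second part need no special handling here, since substring() copies characters without interpreting them.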
  17. Sebastian

    Arne Vajhøj Guest

    On 2/2/2013 4:23 PM, Lew wrote:
    > Arne Vajhøj wrote:
    >> Sebastian wrote:
    >>> Can't say though that I find anything to prefer the one to the other.
    >>> Perhaps the second looks even more like the result of a cat walk...

    >
    > If all you need to do is split a string on a comma, why use regexes at all?
    >
    >> It is not unusual that there is more than one regex that
    >> does the job.

    >
    > It is not unusual that there is more than one non-regex that does the job.


    True.

    But less surprising.

    Arne
    Arne Vajhøj, Feb 3, 2013
    #17
  18. On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj wrote:

    [snip]

    >I think I must have set a new world record. 3 bugs in 12 characters.
    >
    >:-(


    I may be able to save your honour. <G>

    IBM had bugs in a one-instruction program that was two bytes long. The
    program was IEFBR14, and you can read about it on Wikipedia. There
    was a series of corrections which resulted in a program several times
    larger.

    Sincerely,

    Gene Wirchenko
    Gene Wirchenko, Feb 4, 2013
    #18