simple regex pattern sought

Discussion in 'Java' started by Roedy Green, May 25, 2012.

  1. Roedy Green

    Roedy Green Guest

    I often have to search for things of the form

    "xxxxx"
    or
    'xxxxx'

    where xxx is anything not " or '. It might be Russian or English or
    any other language.

    What is the cleanest way to do that?
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I would be quite surprised if the NSA (National Security Agency)
    did not have a computer program to scan bits of shredded
    documents and electronically put them back together like a giant
    jigsaw puzzle. This suggests you cannot just shred, you must also burn.
    ..
    Roedy Green, May 25, 2012
    #1

  2. Roedy Green

    markspace Guest

    On 5/25/2012 2:45 PM, Roedy Green wrote:
    > I often have to search for things of the form
    >
    > "xxxxx"
    > or
    > 'xxxxx'
    >
    > where xxx is anything not " or '. It might be Russian or English or
    > any other language.
    >
    > What is the cleanest way to do that?



    Would this work?

    '[^']+'|"[^"]+"
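    A minimal sketch of how that alternation could be exercised from Java. The class name AlternationDemo and the sample text are made up for illustration; the only thing taken from the post is the pattern itself, with the double quotes escaped for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class AlternationDemo {
    // What the regex engine sees: '[^']+'|"[^"]+"
    static final Pattern QUOTED = Pattern.compile("'[^']+'|\"[^\"]+\"");

    // Collect every quoted run, delimiters included.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The second quoted word is Cyrillic, written with Unicode escapes.
        System.out.println(findAll("say \"that's it\" or '\u043f\u0440\u0438\u0432\u0435\u0442'"));
    }
}
```

    Because each branch only excludes its own delimiter, an apostrophe inside "..." is accepted, and the + means an empty pair of quotes is skipped.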
    markspace, May 25, 2012
    #2

  3. Roedy Green

    Lew Guest

    Roedy Green wrote:
    > I often have to search for things of the form
    >
    > "xxxxx"
    > or
    > 'xxxxx'
    >
    > where xxx is anything not " or '. It might be Russian or English or
    > any other language.
    >
    > What is the cleanest way to do that?


    Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

    --
    Lew
    Lew, May 25, 2012
    #3
  4. Roedy Green

    Lew Guest

    On Friday, May 25, 2012 2:55:07 PM UTC-7, Lew wrote:
    > Roedy Green wrote:
    > > I often have to search for things of the form
    > >
    > > "xxxxx"
    > > or
    > > 'xxxxx'
    > >
    > > where xxx is anything not " or '. It might be Russian or English or
    > > any other language.
    > >
    > > What is the cleanest way to do that?

    >
    > Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.


    "([\"'])[^\"']+\\1"

    That way the closing quote has to be the same character as the opening quote.

    (The extra backslashes escape characters in the Java string literal. The regex engine sees one fewer backslash in each group.)

    --
    Lew
    Lew, May 25, 2012
    #4
  5. Roedy Green

    markspace Guest

    On 5/25/2012 2:55 PM, Lew wrote:

    >
    > Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
    >


    This would match "John's restaurant" as "John'.

    The first quote matches ", John does not contain either ' or " as
    specified, and the last character class matches the '. Not I think what
    is wanted.
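    A small sketch that reproduces this. MixedQuoteDemo and firstMatch are hypothetical names chosen for the example; the pattern is the one quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class MixedQuoteDemo {
    // What the regex engine sees: ["'][^"']+["']
    static final Pattern EITHER = Pattern.compile("[\"'][^\"']+[\"']");

    // Return the first match, or null when nothing matches.
    static String firstMatch(String text) {
        Matcher m = EITHER.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The opening " is paired with the apostrophe after John.
        System.out.println(firstMatch("\"John's restaurant\""));
    }
}
```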
    markspace, May 25, 2012
    #5
  6. On 25.05.2012 23:55, Lew wrote:
    > Roedy Green wrote:
    >> I often have to search for things of the form
    >>
    >> "xxxxx"
    >> or
    >> 'xxxxx'
    >>
    >> where xxx is anything not " or '. It might be Russian or English or
    >> any other language.
    >>
    >> What is the cleanest way to do that?

    >
    > Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.


    That does not match quoting properly. Better do something like

    "([\"'])[^\"']*\\1"

    Still I prefer

    "\"[^\"]*\"|'[^']*'"

    Because it allows for quotes of the other type inside quotes.

    With proper escaping (using \ as escape char, any other works, too) this
    becomes

    "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

    Kind regards

    robert


    package rx;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Quotes {

        private static final Pattern Q1 = Pattern.compile("([\"'])[^\"']*\\1");
        private static final Pattern Q2 = Pattern.compile("\"[^\"]*\"|'[^']*'");
        private static final Pattern Q3 =
            Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

        public static void main(String[] args) {
            System.out.println(Q1);
            for (final Matcher m = Q1.matcher("'a' \"b\" 'c'"); m.find();) {
                System.out.println(m.group());
            }

            System.out.println(Q2);
            for (final Matcher m = Q2.matcher("'a' \"b\" 'c'"); m.find();) {
                System.out.println(m.group());
            }

            System.out.println(Q3);
            for (final Matcher m = Q3.matcher("'a' \"\\\"b\" 'c'"); m.find();) {
                System.out.println(m.group());
            }
        }

    }


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, May 25, 2012
    #6
  7. Roedy Green

    markspace Guest

    On 5/25/2012 3:12 PM, Robert Klemme wrote:

    > "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"



    This looks overly baroque to me. You don't need to escape single
    quotes ' with a \ in a Java string, and I don't think you need to in a
    regex either (although I didn't check that). I'm also not seeing the
    need for the parentheses around the character classes [] (but again,
    without having tried it, I could be wrong). And the dot . inside the
    parentheses just looks wrong.

    Great post overall though.
    markspace, May 26, 2012
    #7
  8. Roedy Green

    Roedy Green Guest

    On Sat, 26 May 2012 00:12:34 +0200, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >On 25.05.2012 23:55, Lew wrote:
    >> Roedy Green wrote:
    >>> I often have to search for things of the form
    >>>
    >>> "xxxxx"
    >>> or
    >>> 'xxxxx'
    >>>
    >>> where xxx is anything not " or '. It might be Russian or English or
    >>> any other language.

    /*
    * [TestRegexFindQuotedString.java]
    *
    * Summary: Finding a quoted String with a regex.
    *
    * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
    *
    * Licence: This software may be copied and used freely for any purpose but military.
    * http://mindprod.com/contact/nonmil.html
    *
    * Requires: JDK 1.7+
    *
    * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
    *
    * Version History:
    * 1.0 2012-05-25 initial release
    */
    package com.mindprod.example;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import static java.lang.System.out;

    /**
    * Finding a quoted String with a regex.
    *
    * @author Roedy Green, Canadian Mind Products
    * @version 1.0 2012-05-25 initial release
    * @since 2012-05-25
    */
    public class TestRegexFindQuotedString
    {
        // ------------------------------ CONSTANTS ------------------------------

        private static final String lookIn = "George said \"that's the ticket\"." +
                                             " Jeb replied '\"ticket?\" what ticket'." +
                                             " \"How na\u00efve!\"." +
                                             " empty: \"\"" +
                                             " 'unbalanced\"";

        // -------------------------- STATIC METHODS --------------------------

        /**
         * exercise that pattern to see what it can find
         */
        static void exercisePattern( Pattern pattern )
        {
            out.println();
            out.println( "Pattern: " + pattern.toString() );
            // Matchers are used both for matching and finding.
            final Matcher m = pattern.matcher( lookIn );
            while ( m.find() )
            {
                out.println( m.group( 0 ) );
            }
        }

        // --------------------------- main() method ---------------------------

        /**
         * test harness
         *
         * @param args not used
         */
        public static void main( String[] args )
        {
            // We want to find Strings of the form "xx'xx" or 'xx"xx'
            // We want to avoid the following problems:
            // 1. Works even if String contains foreign languages, even Russian or accented letters.
            // 2. If starts with " must end with ", if starts with ' must end with '.
            // 3. ' is ok inside "...", and " is ok inside '...'
            // 4. We don't worry about how to use ' inside '...'.

            // here are some suggested techniques:

            exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3

            exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

            exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

            exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.

            exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.

            exercisePattern( Pattern.compile(
                    "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
            // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
        }
    }
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Roedy Green, May 26, 2012
    #8
  9. Roedy Green

    markspace Guest

    On 5/26/2012 6:19 AM, Roedy Green wrote:

    > exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); //
    > works, rejects empty strings by Mark Space.



    If you want it to accept empty strings, replace the +'s with *'s. You
    didn't specify empty strings in your original problem statement, so I
    decided to disallow them.

    Thanks for posting that SSCCE, btw. I was too lazy to cook one up.
    markspace, May 26, 2012
    #9
  10. On 26.05.2012 03:43, markspace wrote:
    > On 5/25/2012 3:12 PM, Robert Klemme wrote:
    >
    >> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

    >
    >
    > This looks overly baroque to me. You don't need to escape \ single
    > quotes ' in a Java string,


    I didn't.

    > and I don't think you need to in a regex
    > either (although I didn't check that).


    There is also no regexp escaping of single quotes either. The only
    regexp escaping you can see are the \\\\ which translate into \\ in the
    string which is a literal backslash for the regexp engine.

    > I'm also not seeing the need for
    > the parenthesis around the character classes [] (but again, without
    > having tried it, I could be wrong).


    It's not parenthesis around character classes but around the alternative
    of "match a backslash followed by any char" and "any char which is not
    backslash or the opening quote type of this string variant".
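    A sketch of just the double-quoted half, written to follow the description above. EscapedQuoteDemo is a made-up name, and the doubled backslash inside the character class is an assumption about the intended pattern (so that the class really excludes the backslash), not a copy of the literal as posted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class EscapedQuoteDemo {
    // What the regex engine sees: "(?:\\.|[^\\"])*"
    //   \\.     a backslash followed by any character
    //   [^\\"]  any character that is neither a backslash nor "
    static final Pattern DQUOTED = Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"");

    // Return the first match, or null when nothing matches.
    static String firstMatch(String text) {
        Matcher m = DQUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The text scanned is: say "a\"b" now
        System.out.println(firstMatch("say \"a\\\"b\" now"));
    }
}
```

    With the backslash excluded from the class, a string whose only closing quote is escaped is not matched at all, which seems to be the point of the group.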

    > And the dot . inside the parenthesis just looks wrong.


    It isn't - see above.

    > Great post overall though.


    Thank you! It does seem to need some time to sink in though... :)

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, May 26, 2012
    #10
  11. Roedy Green

    markspace Guest

    On 5/26/2012 6:19 AM, Roedy Green wrote:

    > exercisePattern( Pattern.compile(
    > "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
    > empty strings
    > // (?: ) is a non-capturing group. This is Robert Klemme's
    > contribution. I don't understand how it works.



    Ah, OK, so here's my contribution to your excellent SSCCE. First, this
    pattern is basically the same as mine: it uses alternation (the
    vertical bar |) to pick a string delimited by either ' or ".

    Here's his regex string without the extra escapes for Java:

    "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
    ^^^^^^^^^^^^^^^^

    Let's look at just the first half for a moment, without the (?:\\. part.

    "[^\"]*"
    ^^^^^^^^
    12     3
    Example for the first part:
    1. " string starts with double quote
    2. [^\"]* doesn't contain a "
    3. " ends with double quote

    Same for the second half of the string.

    Notice he's using * instead of +'s, which is why his matches 0 width
    strings.

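    The other part didn't appear in your problem statement, but many string
    syntaxes (JavaScript or Python literals, for instance) allow a backslash
    to escape characters, e.g., 'Bob\'s your uncle.' (HTML/XML themselves
    use entities such as &apos; instead.) So his inclusion is very reasonable.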

    So Robert adds (\\.|[^\"])* to the first part, which is

    (\\.|[^\"])*
    12 345     6

    1. Start a group
    2. A backslash. It needs to be escaped for regex, hence \\.
    3. . is regex "any character". 2 and 3 together mean "match \ followed
    by any character"
    4. OR (alternation again)
    5. character class, negated (the ^), meant to match anything except \ or ".
    I think this is a mistake: the \ needs to be escaped (written \\), because
    as written \" is just an escaped " and the class excludes only ".
    6. zero or more.

    Then after that mess, he does the obvious thing and makes the group
    non-capturing, so the regex does a little less work.

    "(?:\\.|[^\"])*"

    Phew! Next, he adds one alternation and does the same for a ' delimited
    string.

    |'(?:\\.|[^\'])*'

    Same thing, just ' instead of ".

    Finally I think this could be simplified slightly with Lew's
    back-reference idea.

    (['"])(?:\\.|[^\1\\])*

    (Untested.) This allows empty strings between delimiters; instead of a
    * use + for only non-empty strings between the quotes.



    My executive summary:

    Regex is a great rapid development tool, except when it isn't. You
    realize your problem is simple, and you could have hand-coded a parser
    to do this much quicker than all these news post exchanges?
    markspace, May 26, 2012
    #11
  12. Roedy Green

    markspace Guest

    On 5/26/2012 7:37 AM, Robert Klemme wrote:
    > On 26.05.2012 03:43, markspace wrote:
    >> On 5/25/2012 3:12 PM, Robert Klemme wrote:
    >>
    >>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

    >>

    ....
    >> and I don't think you need to in a regex
    >> either (although I didn't check that).

    >
    > There is also no regexp escaping of single quotes either. The only
    > regexp escaping you can see are the \\\\ which translate into \\ in the
    > string which is a literal backslash for the regexp engine.



    Yes, there is, although I think it's a typo. Both \\\" and \\' get
    passed to the regex as \" and \', which means just a single character "
    and ' respectively.

    You're right about the rest of it though. With so many \'s floating
    around, I have a hard time reading Java regex!


    > It's not parenthesis around character classes but around the alternative
    > of "match a backslash followed by any char" and "any char which is not
    > backslash or the opening quote type of this string variant".



    Yup, I totally missed this too. Thanks for pointing it out.
    markspace, May 26, 2012
    #12
  13. On 26.05.2012 16:57, markspace wrote:
    > On 5/26/2012 6:19 AM, Roedy Green wrote:
    >
    >> exercisePattern( Pattern.compile(
    >> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
    >> empty strings
    >> // (?: ) is a non-capturing group. This is Robert Klemme's
    >> contribution. I don't understand how it works.

    >
    >
    > Ah, OK, so here's my contribution to your excellent SSCCE. First this
    > pattern is basically the same as mine. It uses alternation (the vertical
    > bar |) to pick a string delimited by either ' or "
    >
    > Here's his regex string without the extra escapes for Java:
    >
    > "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
    > ^^^^^^^^^^^^^^^^
    >
    > Let's look at just the first half for a moment, without the (?:\\. part.
    >
    > "[^\"]*"
    > ^^^^^^^^
    > 12 3
    > Example for the first part:
    > 1. " string starts with double quote
    > 2. [^\"]* doesn't contain a "
    > 3. " ends with double quote
    >
    > Same for the second half of the string.
    >
    > Notice he's using * instead of +'s, which is why his matches 0 width
    > strings.
    >
    > The other part didn't appear in your problem statement, but in HTML/XML
    > it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
    > inclusion is very reasonable.
    >
    > So he Robert adds (\\.|[^\"])* to the first part, which is
    > 12 345 6
    >
    > 1. Start a group
    > 2. A slash. It needs to be escaped for regex, hence \\.
    > 3. . is regex "any character". 2 and 3 together mean "match \ followed
    > by any character"
    > 4. OR (alternation again)
    > 5. character class, negated (the ^), matches anything except \ or ". I
    > think this is a mistake: the \ needs to be quoted.


    Oh, right, thanks for finding that!

    > 6. zero or more.
    >
    > Then after that mess, he does the obvious thing and adds non-capturing
    > group, to make the regex do a little less work.
    >
    > "(?:\\.|[^\"])*"
    >
    > Phew! Next, he adds one alternation and does the same for a ' delimited
    > string.
    >
    > |'(?:\\.|[^\'])*'
    >
    > Same thing, just ' instead of ".
    >
    > Finally I think this could be simplified slightly with Lew's
    > back-reference idea.
    >
    > (['"])(?:\\.|[^\1\\])*
    >
    > (Untested.) This allows empty strings between delimiters; instead of a *
    > use + for only non-empty strings between the quotes.


    Interesting approach - but it doesn't work. Simple test with
    Pattern.compile("(.)[a\\1]"):

    Exception in thread "main" java.util.regex.PatternSyntaxException:
    Illegal/unsupported escape sequence near index 6
    (.)[a\1]
    ^
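    A minimal sketch that turns this check into a reusable probe. BackrefInClassDemo and compiles are hypothetical names; the only library behaviour relied on is that Pattern.compile throws PatternSyntaxException for a regex it rejects.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the name is not from the thread.
class BackrefInClassDemo {
    // True when java.util.regex accepts the regex, false when it rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Back-reference inside a character class: rejected.
        System.out.println(compiles("(.)[a\\1]"));
        // Back-reference outside a character class: accepted.
        System.out.println(compiles("(.)a\\1"));
    }
}
```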

    > My executive summary:
    >
    > Regex is a great rapid development tool, except when it isn't. You
    > realize your problem is simple, and you could have hand-coded a parser
    > to do this much quicker than all these news post exchanges?


    Maybe, maybe not.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, May 26, 2012
    #13
  14. On 26.05.2012 17:06, markspace wrote:
    > On 5/26/2012 7:37 AM, Robert Klemme wrote:
    >> On 26.05.2012 03:43, markspace wrote:
    >>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
    >>>
    >>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
    >>>

    > ...
    >>> and I don't think you need to in a regex
    >>> either (although I didn't check that).

    >>
    >> There is also no regexp escaping of single quotes either. The only
    >> regexp escaping you can see are the \\\\ which translate into \\ in the
    >> string which is a literal backslash for the regexp engine.

    >
    >
    > Yes, there is, although I think it's a typo. Both \\\" and \\' get
    > passed to the regex as \" and \', which means just a single character "
    > and ' respectively.


    Right you are - both times: there is regexp escaping and it was in fact
    a typo (missing \\)!

    > You're right about the rest of it though. With so many \'s floating
    > around, I have a hard time reading Java regex!


    That's true for other languages as well - the basic reason is that the
    same character is used for

    - escaping in strings
    - escaping in regular expressions
    - escaping in the source text (in this case we could pick another
    character)
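    To make the layers concrete, here is a small sketch (BackslashLayersDemo is a made-up name) showing how four backslashes in Java source end up matching a single backslash in the text, and how Pattern.quote can take over the regex layer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class BackslashLayersDemo {
    // Layer 1, Java source: four backslashes between the quotes.
    // Layer 2, the String object: two backslash characters.
    // Layer 3, the regex engine: one literal backslash to match.
    static final String REGEX_FOR_BACKSLASH = "\\\\";

    // A text consisting of exactly one backslash character.
    static final String ONE_BACKSLASH = "\\";

    public static void main(String[] args) {
        System.out.println(REGEX_FOR_BACKSLASH.length());                        // length of the regex string
        System.out.println(Pattern.matches(REGEX_FOR_BACKSLASH, ONE_BACKSLASH)); // full-match test
        // Pattern.quote does the regex-level escaping, leaving only the Java layer.
        System.out.println(Pattern.matches(Pattern.quote(ONE_BACKSLASH), ONE_BACKSLASH));
    }
}
```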

    >> It's not parenthesis around character classes but around the alternative
    >> of "match a backslash followed by any char" and "any char which is not
    >> backslash or the opening quote type of this string variant".

    >
    >
    > Yup, I totally missed this too. Thanks for pointing it out.


    You're welcome! Thank you again for finding the missing escape.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, May 26, 2012
    #14
  15. Roedy Green

    markspace Guest

    On 5/26/2012 8:13 AM, Robert Klemme wrote:
    > On 26.05.2012 16:57, markspace wrote:
    >> Finally I think this could be simplified slightly with Lew's
    >> back-reference idea.
    >>
    >> (['"])(?:\\.|[^\1\\])*
    >>
    >> (Untested.) This allows empty strings between delimiters; instead of a *
    >> use + for only non-empty strings between the quotes.

    >
    > Interesting approach - but it doesn't work. Simple test with
    > Pattern.compile("(.)[a\\1]"):
    >
    > Exception in thread "main" java.util.regex.PatternSyntaxException:
    > Illegal/unsupported escape sequence near index 6
    > (.)[a\1]
    > ^



    Yup, [] is for characters, and \1 could be a string. Gets rejected. I
    think you could use "negative lookahead" to say "not this string" when
    parsing. Gets kinda ugly though.

    <http://www.regular-expressions.info/conditional.html>

    Java:

    "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"

    Regex:

    (['"])(?:\\.|(?!\1|\\).)+\1
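    A short sketch of that pattern in isolation, before the full test program. LookaheadQuoteDemo and firstMatch are hypothetical names; the pattern literal is the Java form given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class LookaheadQuoteDemo {
    // What the regex engine sees: (['"])(?:\\.|(?!\1|\\).)+\1
    //   (['"])        capture the opening quote
    //   \\.           a backslash followed by any character
    //   (?!\1|\\).    any character that is neither the opening quote nor a backslash
    //   \1            the closing quote must equal the opening quote
    static final Pattern QUOTED = Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    // Return the first match, or null when nothing matches.
    static String firstMatch(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("George said \"that's the ticket\"."));
        System.out.println(firstMatch("escaped: 'Bob\\'s your uncle.'"));
        System.out.println(firstMatch(" 'unbalanced\""));
    }
}
```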

    I re-did Roedy's test program to be a bit more clear about what it was
    looking for, and the results. This could be even cleaner if it was run
    with a JUnit test harness.

    At this point though the regex is basically just a mess. Download antlr
    and get an XML/HTML grammar from online.



    package quicktest;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import static java.lang.System.out;

    /**
    *
    * @author Brenden
    */
    public class MindProdRegex {

    }

    /*
    * [TestRegexFindQuotedString.java]
    *
    * Summary: Finding a quoted String with a regex.
    *
    * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
    *
    * Licence: This software may be copied and used freely for any purpose but military.
    * http://mindprod.com/contact/nonmil.html
    *
    * Requires: JDK 1.7+
    *
    * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
    *
    * Version History:
    * 1.0 2012-05-25 initial release
    */

    /**
    * Finding a quoted String with a regex.
    *
    * @author Roedy Green, Canadian Mind Products
    * @version 1.0 2012-05-25 initial release
    * @since 2012-05-25
    */
    class TestRegexFindQuotedString
    {
        // ------------------------------ CONSTANTS ------------------------------

        // Pairs of (text to search, expected first match); "" means no match is expected.
        private static final String[] vectors =
            {"Basic: George said \"that's the ticket\".",
             "\"that's the ticket\"",
             "Nested: Jeb replied '\"ticket?\" what ticket'.",
             "'\"ticket?\" what ticket'",
             "Non-ASCII: \"How na\u00efve!\".",
             "\"How na\u00efve!\"",
             " empty: \"\"xx",
             "\"\"",
             " escaped: 'Bob\\'s your uncle.'",
             "'Bob\\'s your uncle.'",
             " 'unbalanced\"",
             "",
            };

        // -------------------------- STATIC METHODS --------------------------

        /**
         * exercise that pattern to see what it can find
         */
        static void exercisePattern( Pattern pattern )
        {
            out.println();
            out.println( "Pattern: " + pattern.toString() );
            for( int i = 0; i < vectors.length; i+=2 ) {
                String test = vectors[i];
                String result = vectors[i+1];
                final Matcher m = pattern.matcher( test );
                boolean found = m.find();
                // With no match, the pattern is correct only if none was expected.
                boolean correct = result.isEmpty();
                String groupString = null;
                if( found ) {
                    correct = m.group(0).equals( result );
                    groupString = m.group();
                }
                System.out.println( test+", found: "+ found +
                    ", correct: "+correct+" ("+groupString+")");
            }
        }

        // --------------------------- main() method ---------------------------

        /**
         * test harness
         *
         * @param args not used
         */
        public static void main( String[] args )
        {
            // We want to find Strings of the form "xx'xx" or 'xx"xx'
            // We want to avoid the following problems:
            // 1. Works even if String contains foreign languages, even Russian or accented letters.
            // 2. If starts with " must end with ", if starts with ' must end with '.
            // 3. ' is ok inside "...", and " is ok inside '...'
            // 4. We don't worry about how to use ' inside '...'.

            // here are some suggested techniques:

            exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3

            exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

            exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

            exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
            exercisePattern( Pattern.compile(
                "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); // works, rejects empty strings by Mark Space.

            exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
            exercisePattern( Pattern.compile(
                "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
            // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
        }
    }
    markspace, May 26, 2012
    #15
  16. Roedy Green

    Lew Guest

    markspace wrote:
    > Lew wrote:
    >> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
    >>

    > This would match "John's restaurant" as "John'.
    >
    > The first quote matches ", John does not contain either ' or " as specified,
    > and the last character class matches the '. Not I think what is wanted.


    As I corrected in my very next post.

    --
    Lew
    Honi soit qui mal y pense.
    http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
    Lew, May 26, 2012
    #16
  17. Roedy Green

    Roedy Green Guest

    On Sat, 26 May 2012 10:08:58 -0700, markspace <-@.> wrote, quoted or
    indirectly quoted someone who said :

    >I re-did Roedy's test program to be a bit more clear about what it was
    >looking for, and the results. This could be even cleaner if it was run
    >with a JUnit test harness.


    Thanks Brendan. I have incorporated your suggestions plus a bit more
    polishing.

    See http://mindprod.com/jgloss/regex.html#FINDQUOTED

    for a formatted listing + output.

    The next task, probably procrastinated, is to solve it with a little
    finite state automaton that decodes \x as well, and a simpler version
    without. If a newbie is interested in tackling that, they can look at
    my Java snippet parser as part of JPrep/JDisplay and strip it down.
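    For anyone who wants a starting point, here is a rough sketch of such an automaton. It is not the JPrep/JDisplay parser; QuoteScanner, its three states and the decoding rule (a backslash simply makes the next character literal) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner; names and decoding rule are illustrative only.
class QuoteScanner {
    private enum State { OUTSIDE, INSIDE, ESCAPED }

    // Return the contents of each quoted string, without delimiters,
    // with backslash escapes decoded by keeping the escaped character.
    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.OUTSIDE;
        char open = 0; // the quote character that opened the current string

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case OUTSIDE:
                    if (c == '"' || c == '\'') {
                        open = c;
                        current.setLength(0);
                        state = State.INSIDE;
                    }
                    break;
                case INSIDE:
                    if (c == '\\') {
                        state = State.ESCAPED;
                    } else if (c == open) {
                        found.add(current.toString());
                        state = State.OUTSIDE;
                    } else {
                        current.append(c); // includes the other quote type
                    }
                    break;
                case ESCAPED:
                    current.append(c);
                    state = State.INSIDE;
                    break;
            }
        }
        // A string still open at end of input is silently dropped.
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan("say \"that's it\" and 'Bob\\'s'"));
    }
}
```

    One difference from the regex versions: an unmatched opening quote swallows the rest of the input here, whereas a regex search retries from the next character. The simpler version without escapes would just drop the ESCAPED state.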
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Roedy Green, May 26, 2012
    #17
  18. Roedy Green

    markspace Guest

    On 5/26/2012 2:07 PM, Lew wrote:
    > markspace wrote:
    >> Lew wrote:
    >>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
    >>> don't know.
    >>>

    >> This would match "John's restaurant" as "John'.
    >>
    >> The first quote matches ", John does not contain either ' or " as
    >> specified,
    >> and the last character class matches the '. Not I think what is wanted.

    >
    > As I corrected in my very next post.
    >



    Unfortunately that one doesn't work either. The central part, [^"'],
    doesn't allow a match of a ' if the starting delimiter was a ", and that
    doesn't match Roedy's spec. "John's restaurant" wouldn't be matched at
    all, because the matcher couldn't match past the ' to get to the ".
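    A quick sketch to check this. BackrefDemo and findAll are made-up names; the pattern is the back-reference version from earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class BackrefDemo {
    // What the regex engine sees: (["'])[^"']+\1
    static final Pattern SAME_QUOTE = Pattern.compile("([\"'])[^\"']+\\1");

    // Collect every match, delimiters included.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SAME_QUOTE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Balanced quotes with no other quote character inside are found.
        System.out.println(findAll("\"plain\" 'text'"));
        // The apostrophe blocks the middle part, so nothing is found here.
        System.out.println(findAll("\"John's restaurant\""));
    }
}
```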

    I think the easiest is to write out a grammar for the expression, then
    translate to regex.

    QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

    SQUOTED_STRING := ' NON_S_QUOTE + '

    DQUOTED_STRING := " NON_D_QUOTE + "

    NON_S_QUOTE := [^']

    NON_D_QUOTE := [^"]

    At this point the grammar is very clear. (Note I haven't included
    Robert's \x escape sequences.) I think it's worth learning to use antlr
    rather than regex, which tends to obfuscate more than it helps.
    However, a literal translation into regex isn't hard, and a literal
    translation avoids mis-optimizations.
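    As a sketch of what a literal translation might look like in Java, each grammar rule can become a named string constant and the top rule a compiled Pattern. The class name GrammarRegexDemo is made up; the constant names mirror the grammar above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the thread.
class GrammarRegexDemo {
    // NON_S_QUOTE := [^']
    static final String NON_S_QUOTE = "[^']";
    // NON_D_QUOTE := [^"]
    static final String NON_D_QUOTE = "[^\"]";
    // SQUOTED_STRING := ' NON_S_QUOTE + '
    static final String SQUOTED_STRING = "'" + NON_S_QUOTE + "+'";
    // DQUOTED_STRING := " NON_D_QUOTE + "
    static final String DQUOTED_STRING = "\"" + NON_D_QUOTE + "+\"";
    // QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING
    static final Pattern QUOTED_STRING =
            Pattern.compile(SQUOTED_STRING + "|" + DQUOTED_STRING);

    public static void main(String[] args) {
        System.out.println(QUOTED_STRING.pattern());
        Matcher m = QUOTED_STRING.matcher("\"John's restaurant\"");
        if (m.find()) {
            System.out.println(m.group());
        }
    }
}
```

    The assembled pattern is the same alternation suggested at the start of the thread; the named pieces only make the derivation visible.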
    markspace, May 27, 2012
    #18
  19. Roedy Green

    Lew Guest

    markspace wrote:
    > Lew wrote:
    >> markspace wrote:
    >>> Lew wrote:
    >>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
    >>>> don't know.
    >>>>
    >>> This would match "John's restaurant" as "John'.
    >>>
    >>> The first quote matches ", John does not contain either ' or " as
    >>> specified,
    >>> and the last character class matches the '. Not I think what is wanted.

    >>
    >> As I corrected in my very next post.

    >
    > Unfortunately that one doesn't work either. The central part, [^"'], doesn't
    > allow a match of a ' if the starting delimiter was a ", and that doesn't match
    > Roedy's spec. "John's restaurant" wouldn't be matched at all, because the
    > matcher couldn't match past the ' to get to the ".
    >
    > I think the easiest is to write out a grammar for the expression, then
    > translate to regex.
    >
    > QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING
    >
    > SQUOTED_STRING := ' NON_S_QUOTE + '
    >
    > DQUOTED_STRING := " NON_D_QUOTE + "
    >
    > NON_S_QUOTE := [^']
    >
    > NON_D_QUOTE := [^"]
    >
    > At this point the grammar is very clear. (Note I haven't included Robert's \x
    > escape sequences.) I think it's worth learning to use antlr rather than regex,
    > which tends to obfuscate more than it helps. However, a literal translation
    > into regex isn't hard, and a literal translation avoids mis-optimizations.


    Very illuminating. Thank you.

    --
    Lew
    Honi soit qui mal y pense.
    http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
    Lew, May 27, 2012
    #19