How do I get String.split() to do what I want?

Discussion in 'Java' started by Phil Hühn, Jun 30, 2004.

  1. Hi, I'm using jdk 1.4.2 - last project I did in java was with jdk 1.3,
    so String.split(...) is new to me. Anyway, I've been trying to get the
    regex expression correct but not having much luck. This is probably very
    simple to everyone out there ;-)

    Anyway, I have something like:

    String myStr = "5,\'this,that\',8";
    String parts[] = myStr.split(",");

    Of course I get a split at ALL the commas (i.e. it sees the comma within
    the quotes), so I get 4 parts:

    "5", "'this", "that'", "8"

    But I want any quoted string left intact, ideally:

    "5", "'this,that'", "8"

    (And I'll get rid of the single quote marks).
    So pretty easy... what's the correct regex split argument to do this?

    TIA
    Phil Hühn, Jun 30, 2004
    #1
  2. Yu SONG Guest

    Phil Hühn wrote:
    > Hi, I'm using jdk 1.4.2 - last project I did in java was with jdk 1.3,
    > so String.split(...) is new to me. Anyway, I've been trying to get the
    > regex expression correct but not having much luck. This is probably very
    > simple to everyone out there ;-)
    >
    > Anyway, I have something like:
    >
    > String myStr = "5,\'this,that\',8";
    > String parts[] = myStr.split(",");
    >
    > Of course I get a split at ALL the commas (i.e it sees the comma within
    > the quotes) so I get 4 parts:
    >
    > "5", "'this", "that'", "8"
    >
    > But I want any quoted string left intact, ideally:
    >
    > "5", "'this,that'", "8"
    >
    > (And I'll get rid of the single quote marks).
    > So pretty easy... what's the correct regex split argument to do this?
    >
    > TIA


    We have to look at the pattern of these strings.

    Is it always in the form "digit, 'abc', digit"?

    Assuming there is no space in 'abc', you can split the "myStr" string into

    "5", "this,that", "8",

    using the following code:

    String myStr = "5,\'this,that\',8";

    // split the string on "'", since you want to get rid of the quotes anyway
    String parts[] = myStr.split("'");

    // for output
    StringBuffer output = new StringBuffer();

    for (int i = 0; i < parts.length; i++) {
        // replace ',' with a space so that trim() strips leading/trailing commas
        parts[i] = parts[i].replace(',', ' ').trim();

        // restore the remaining (inner) commas
        parts[i] = parts[i].replace(' ', ',');

        // for output
        output.append('\"');
        output.append(parts[i]);
        output.append("\", ");
    }
    // output
    System.out.println(output.toString());


    --
    Song

    /* E-mail.c */
    #define User "Yu.Song"
    #define At '@'
    #define Warwick "warwick.ac.uk"
    int main() {
    printf("Yu Song's E-mail: %s%c%s", User, At, Warwick);
    return 0;}

    Further Info. : http://www.dcs.warwick.ac.uk/~esubbn/
    _______________________________________________________
    Yu SONG, Jun 30, 2004
    #2
  3. Roedy Green Guest

    On Wed, 30 Jun 2004 22:46:50 +1200, Phil Hühn wrote or quoted:

    >
    > String myStr = "5,\'this,that\',8";
    > String parts[] = myStr.split(",");


    Try a CSVReader instead. It is a less generic tool, designed specifically
    to handle that problem.

    see http://mindprod.com/jgloss/csv.html

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 30, 2004
    #3
  4. Yu SONG wrote:
    > is it always in the form of "digit, 'abc', digit"?

    No, that was just a short example, there may be a varying number of
    arguments, eg

    "5,6,7,'oh, joy',15,'yes',3,7,3"
    or
    "'hello',97,45,'doe,john','bloggs,fred',3,8"
    Phil Hühn, Jun 30, 2004
    #4
  5. Roedy Green wrote:

    > Try a CSVReader instead. It is a less generic tool specially to
    > handle that problem.
    >
    > see http://mindprod.com/jgloss/csv.html
    >

    Thanks, I probably should have stated my problem in terms of parsing a
    CSV! ;-)
    Ta for the CSVreader, but I want to use the regex stuff if possible...
    if not I could simply have written a small method to tokenise the string
    at commas, ignoring those within single quotes, but surely I can do that
    with a regular expression..?
    Maybe I need a perl NG...
    Phil Hühn, Jun 30, 2004
    #5
  6. Phil Hühn wrote:
    > Yu SONG wrote:
    >
    >> is it always in the form of "digit, 'abc', digit"?

    >
    > No, that was just a short example, there may be a varying number of
    > arguments, eg
    >
    > "5,6,7,'oh, joy',15,'yes',3,7,3"
    > or
    > "'hello',97,45,'doe,john','bloggs,fred',3,8"


    Actually, I should say what else I tried. I had a string with 36 fields
    but using regex "," it only extracted 19 fields.

    With regex "\',\'|\d,\'|\',\d" it found 33.
    With regex "\',\'|\d,\'|\',\d|\d,\d" it went back to only 29! Weird.
    Phil Hühn, Jun 30, 2004
    #6
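A note on the field counts above: split() throws away everything the delimiter pattern matches, so a delimiter such as \d,\d eats the digit on each side of the comma, and a comma whose neighbouring digit was already consumed is never seen. One common workaround, which nobody in the thread proposed, is to match only the comma and add a lookahead requiring an even number of quotes after it. The sketch below assumes single quotes are always balanced and never escaped; the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class QuoteAwareSplit {
    // Matches a comma only when an even number of single quotes follows it,
    // i.e. when the comma lies outside any quoted section.
    // Assumption: quotes are balanced and never escaped.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^']*'[^']*')*[^']*$)";

    public static String[] splitOutsideQuotes(String s) {
        return s.split(COMMA_OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        // The whole delimiter match is discarded, digits included.
        System.out.println(Arrays.toString("12,34".split("\\d,\\d")));
        // prints [1, 4]

        // The lookahead consumes nothing, so only the comma is removed.
        System.out.println(Arrays.toString(splitOutsideQuotes("5,'this,that',8")));
        // prints [5, 'this,that', 8]
    }
}
```

The quotes stay in the result and would still have to be stripped afterwards. Note also that split() without a limit drops trailing empty fields (pass a negative limit to keep them), and that the lookahead rescans the rest of the string for every comma, so this is slow on long lines.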
  7. Chris Smith Guest

    Phil Hühn wrote:
    > Thanks, I probably should have stated my problem in terms of parsing a
    > CSV! ;-)
    > Ta for the CSVreader, but I want to use the regex stuff if possible...
    > if not I could simply have written a small method to tokenise the string
    > at commas, ignoring those within single quotes, but surely I can do that
    > with a regular expression..?
    > Maybe I need a perl NG...


    Regular expressions aren't magic, unfortunately, and CSV is a bit more
    complex than a regular expression. There are a lot of special cases to
    handle. Trust me; I wrote a CSV parser using regular expressions some
    time back. In the course of making modifications to handle variations
    on the format used by many popular software export features, it became
    obvious that regular expressions were more complex than they were worth.

    Besides, even if you implement a CSVReader in terms of regular
    expressions, it's a higher level of abstraction as a pseudo-specified
    file format, and deserves its own encapsulation.

    If you don't want to maintain dependence on a third-party (even free or
    open-source) utility to do this for you, I'm placing the following code
    into the public domain. Just don't sue me. :)

    Sorry for the wrapping; no easy way to fix it.

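    import java.io.IOException;
    import java.io.PushbackReader;
    import java.util.ArrayList;
    import java.util.List;

    public class CSVUtil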
    {
    /**
    * Reads the next logical line of the CSV file. Returns the next line
    * as a {@link java.lang.String} without the trailing newline, or
    * <code>null</code> if there is no more data to read before the end
    * of the file.
    *
    * <p>
    * The last line of a file is returned if it contains any characters,
    * but is ignored if it is empty. The contract for this method is
    * intended to approximate the contract of
    * {@link java.io.BufferedReader#readLine}.
    * </p>
    */
    public static String readLine(PushbackReader r)
    throws IOException
    {
    StringBuffer buf = new StringBuffer();
    int ch;
    boolean inQuote = false;

    while ((ch = r.read()) != -1)
    {
    if (ch == '\"')
    {
    inQuote = !inQuote;
    }
    else if (!inQuote && (ch == '\n'))
    {
    break;
    }
    else if (!inQuote && (ch == '\r'))
    {
    /*
    * See if this is a CRLF pair.
    */
    int ch2 = r.read();

    if (ch2 == '\n') break;
    else if (ch2 != -1) r.unread(ch2);
    }

    buf.append((char) ch);
    }

    if ((buf.length() == 0) && (ch == -1))
    {
    /*
    * Reached the end of the file, and there was nothing on the
    * line. This indicates an end of file, such that we should
    * return null, rather than the empty string.
    */
    return null;
    }
    else
    {
    /*
    * Return the line that was read.
    */
    return buf.toString();
    }
    }

    /**
    * Parses a logical line of CSV content into fields. An attempt is
    * made to reconstruct CSV data in a generally compatible way across
    * import sources.
    */
    public static String[] parse(String line)
    {
    List fields = new ArrayList();

    int pos = -1;

    while (pos < line.length())
    {
    pos++;

    /*
    * Determine the type of token. This could be plain,
    * single-quoted, double-quoted, or empty.
    */
    pos = skipWhitespace(pos, line);

    if ((pos >= line.length()) || (line.charAt(pos) == ','))
    {
    /*
    * Empty. When there is no non-whitespace between two
    * commas, the result is an empty field. Whitespace is
    * always ignored.
    */
    fields.add("");
    }
    else if (isQuoteCharacter(line.charAt(pos)))
    {
    char quoteChar = line.charAt(pos);

    /*
    * Quoted. The contents extend from here to any of
    * the following:
    *
    * 1. The end of the source string.
    * 2. The next occurrence of a double-quote that is NOT
    * followed by a second double-quote.
    */
    StringBuffer field = new StringBuffer();
    pos++; /* skip the quote */

    boolean done = false;

    while (!done)
    {
    int next = nextOccurrence(pos, line, quoteChar);
    field.append(line.substring(pos, next));
    pos = next;

    if (next >= line.length())
    {
    /*
    * Unterminated quote. Nevertheless, we'll take it.
    done = true;
    }
    else if (next == line.length() - 1)
    {
    /*
    * Quote is at the end of the line. It is, therefore,
    * not doubled.
    */
    done = true;
    pos++; /* skip the closing quote */
    }
    else if (line.charAt(next + 1) != quoteChar)
    {
    /*
    * Quote is not doubled.
    */
    done = true;
    pos++; /* skip the closing quote */
    }
    else
    {
    /*
    * Quote is doubled. It should be considered part of
    * the content, and does not end the field.
    */
    field.append(quoteChar);
    pos += 2; /* skip both doubled quotes */
    }
    }

    fields.add(field.toString());
    }
    else
    {
    /*
    * Plain. Non-quoted fields may not contain quotes,
    * commas, newlines, or leading or trailing whitespace.
    */
    int next = nextOccurrence(pos, line, ',');
    String field = line.substring(pos, next);
    pos = next;

    fields.add(field.trim());
    }

    /*
    * Skip to the next comma. Any text found after a complete
    * element (only possible when elements are quoted) is invalid,
    * and will be discarded.
    */
    pos = nextOccurrence(pos, line, ',');
    }

    return (String[]) fields.toArray(new String[fields.size()]);
    }

    /**
    * Determines if a character should be considered a quote. As far as
    * I can tell, only double-quotes are supposed to be used for quoting
    * in CSV. However, I can recall (but can't find documentation for)
    * some mention of single quotes. This method is provided to abstract
    * the identity of a quote in case more are possible.
    */
    private static boolean isQuoteCharacter(char ch)
    {
    return ch == '\"';
    }

    /**
    * Returns the next character index in <code>line</code> that is not
    * whitespace, beginning at <code>pos</code>. If there is no such
    * index, the method returns <code>line.length()</code>.
    */
    private static int skipWhitespace(int pos, String line)
    {
    while (
    (pos < line.length())
    && Character.isWhitespace(line.charAt(pos)))
    {
    pos++;
    }

    return pos;
    }

    /**
    * Returns the next character index in <code>line</code> that contains
    * the specified character, <code>ch</code>, beginning at
    * <code>pos</code>. If there is no such index, the method returns
    * <code>line.length()</code>.
    */
    private static int nextOccurrence(int pos, String line, char ch)
    {
    while ((pos < line.length()) && (line.charAt(pos) != ch))
    {
    pos++;
    }

    return pos;
    }

    /**
    * Forms a logical line of CSV content from separate fields into one
    * String in the CSV format, with appropriate quoting. An attempt is
    * made to be compatible with the most basic CSV possible.
    */
    public static String form(String[] elements)
    {
    StringBuffer line = new StringBuffer();

    for (int i = 0; i < elements.length; i++)
    {
    String element = elements[i];

    if (
    (element.indexOf('\"') != -1)
    || (element.indexOf('\'') != -1)
    || (element.indexOf(',') != -1)
    || (element.indexOf('\r') != -1)
    || (element.indexOf('\n') != -1)
    || !(element.trim().equals(element)))
    {
    /*
    * The element needs to be quoted. There are four cases
    * where an element must be quoted:
    *
    * 1. It contains a single or double quote.
    * 2. It contains a comma.
    * 3. It contains a newline.
    * 4. It has leading or trailing whitespace.
    */
    line.append('\"');

    for (int j = 0; j < element.length(); j++)
    {
    char ch = element.charAt(j);

    if (ch == '\"')
    {
    line.append('\"');
    line.append('\"');
    }
    else
    {
    line.append(ch);
    }
    }

    line.append('\"');
    }
    else
    {
    line.append(element);
    }

    if (i < elements.length - 1)
    {
    line.append(',');
    }
    }

    return line.toString();
    }
    }


    --
    www.designacourse.com
    The Easiest Way to Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Jun 30, 2004
    #7
  8. Roedy Green Guest

    On Thu, 01 Jul 2004 00:52:27 +1200, Phil Hühn wrote or quoted:

    >Ta for the CSVreader, but I want to use the regex stuff if possible...


    The CSV reader will do the job faster, with less RAM. It is a state
    machine particularly tuned to this problem. The Regex engine is
    general purpose.

    I can certainly see wanting to crack the problem with regex as an
    intellectual exercise. You almost "tricked" me into going for it.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 30, 2004
    #8
  9. Alan Moore Guest

    On Thu, 01 Jul 2004 00:52:27 +1200, Phil Hühn wrote:

    >Ta for the CSVreader, but I want to use the regex stuff if possible...
    >if not I could simply have written a small method to tokenise the string
    >at commas, ignoring those within single quotes, but surely I can do that
    >with a regular expression..?


    You can use regex to do what you want, just not by way of the split()
    method. Here's an example:

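    String myStr = "5,\'this,that\',8";
    List result = new ArrayList();
    // a field is either a quoted run or a run without commas or quotes,
    // bounded by a comma or by the start/end of the string
    Pattern p = Pattern.compile(
    "(?<=^|,)(?:'(?:[^']*)'|[^,']*)(?=,|$)");
    Matcher m = p.matcher(myStr);
    while (m.find())
    {
    result.add(m.group());
    }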

    But, as the others have said, this approach is much less efficient
    than a dedicated CSV parser. Regular expressions are just not the
    right tool for this job.

    >Maybe I need a perl NG...


    They'll just tell you the same thing.
    Alan Moore, Jun 30, 2004
    #9
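As a variation on the Matcher approach in the post above, capturing groups can strip the quote marks in the same pass. This is only a sketch: it assumes balanced, unescaped single quotes, is written for a current JDK (generics) rather than 1.4, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFieldMatcher {
    // Group 1: contents of a single-quoted field (quotes excluded).
    // Group 2: an unquoted field, possibly empty.
    // Assumption: quotes are balanced and never escaped.
    private static final Pattern FIELD = Pattern.compile(
            "(?<=^|,)(?:'([^']*)'|([^,']*))(?=,|$)");

    public static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups takes part in each match.
            result.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("'hello',97,45,'doe,john','bloggs,fred',3,8"));
    }
}
```

The lookbehind and lookahead keep the commas out of the match, so empty fields between two commas still show up as empty strings.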
  10. Roedy Green Guest

    On Wed, 30 Jun 2004 22:46:50 +1200, Phil Hühn wrote or quoted:

    >(And I'll get rid of the single quote marks).
    >So pretty easy... what's the correct regex split argument to do this?


    You can build a battleship with popsicle sticks, but there are easier
    ways.

    See http://mindprod.com/products.html#CSV


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 30, 2004
    #10
  11. Roedy Green Guest

    On Wed, 30 Jun 2004 20:32:57 GMT, Alan Moore wrote or quoted:

    >But, as the others have said, this approach is much less efficient
    >than a dedicated CSV parser. Regular expressions are just not the
    >right tool for this job.


    When you look at all the special cases in a CSV parser, it becomes
    clear it would take a heck of a long regex to specify all those rules.

    It is easier and faster to specify them as a simple state machine.
    Further, it is much easier to just specify a single method call to
    debugged code than to create some Rube Goldberg contraption with regexes.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 30, 2004
    #11
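For the simple single-quote case asked about at the top of the thread, the state-machine idea can be sketched in a few lines. This is a bare illustration, not the CSVReader mentioned above: it has one boolean state, assumes quotes are never escaped or doubled, and keeps whitespace as-is. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteStateTokenizer {
    public static List<String> tokenize(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false; // the machine's only state

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '\'') {
                // Toggle the state and drop the quote mark itself.
                inQuote = !inQuote;
            } else if (ch == ',' && !inQuote) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a comma inside quotes.
                current.append(ch);
            }
        }
        // The last field is not followed by a comma.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("5,6,7,'oh, joy',15,'yes',3,7,3"));
    }
}
```

Each character is examined once, which is where the speed advantage over a lookahead-based regex comes from.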