How do I get String.split() to do what I want?

  • Thread starter Phil Hühn

Phil Hühn

Hi, I'm using JDK 1.4.2 - the last project I did in Java was with JDK 1.3,
so String.split(...) is new to me. Anyway, I've been trying to get the
regex correct but not having much luck. This is probably very
simple to everyone out there ;-)

Anyway, I have something like:

String myStr = "5,\'this,that\',8";
String parts[] = myStr.split(",");

Of course I get a split at ALL the commas (i.e. it sees the comma within
the quotes), so I get 4 parts:

"5", "'this", "that'", "8"

But I want any quoted string left intact, ideally:

"5", "'this,that'", "8"

(And I'll get rid of the single quote marks).
So pretty easy... what's the correct regex split argument to do this?

TIA
 
Yu SONG

Phil said:
So pretty easy... what's the correct regex split argument to do this?

TIA

We have to look at the pattern of these strings: is it always in the
form of "digit, 'abc', digit"?

Assuming there is no space in 'abc', you can split the "myStr" string into

"5", "this,that", "8",

using the following code:

String myStr = "5,\'this,that\',8";

// split the string on "'", since you want to get rid of the quotes anyway
String parts[] = myStr.split("'");

// for output
StringBuffer output = new StringBuffer();

for (int i = 0; i < parts.length; i++) {
    // replace ',' with a space so that trim() removes the leading/trailing ones
    parts[i] = parts[i].replace(',', ' ').trim();

    // replace back, restoring the commas inside the field
    parts[i] = parts[i].replace(' ', ',');

    // for output
    output.append('\"');
    output.append(parts[i]);
    output.append("\", ");
}
// output
System.out.println(output.toString());


--
Song

/* E-mail.c */
#define User "Yu.Song"
#define At '@'
#define Warwick "warwick.ac.uk"
int main() {
printf("Yu Song's E-mail: %s%c%s", User, At, Warwick);
return 0;}

Further Info. : http://www.dcs.warwick.ac.uk/~esubbn/
_______________________________________________________
 
Phil Hühn

Yu said:
is it always in the form of "digit, 'abc', digit"?

No, that was just a short example; there may be a varying number of
arguments, e.g.

"5,6,7,'oh, joy',15,'yes',3,7,3"
or
"'hello',97,45,'doe,john','bloggs,fred',3,8"
 
Phil Hühn

Roedy said:
Try a CSVReader instead. It is a less generic tool, built specifically to
handle that problem.

see http://mindprod.com/jgloss/csv.html

Thanks, I probably should have stated my problem in terms of parsing a
CSV! ;-)
Ta for the CSVreader, but I want to use the regex stuff if possible...
if not I could simply have written a small method to tokenise the string
at commas, ignoring those within single quotes, but surely I can do that
with a regular expression..?
Maybe I need a perl NG...
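
For reference, the "small method" mentioned above could look roughly like the sketch below. It is only an illustration: the class and method names are made up, it assumes single quotes always come in matched pairs with no escaping, and it is written for a current JDK (generics, StringBuilder) rather than 1.4.2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the thread: splits on commas that are
// outside single quotes and drops the quote marks themselves.
class QuoteAwareSplitter {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '\'') {
                // Toggle quoting; the quote mark itself is not kept.
                inQuote = !inQuote;
            } else if (ch == ',' && !inQuote) {
                // An unquoted comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        // The last field has no trailing comma, so add it explicitly.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Prints one field per line for one of the sample strings.
        for (String field : split("5,6,7,'oh, joy',15,'yes',3,7,3")) {
            System.out.println(field);
        }
    }
}
```

Unlike String.split(), this keeps trailing empty fields, which may or may not be what you want.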
 
Phil Hühn

Phil said:
No, that was just a short example, there may be a varying number of
arguments, eg

"5,6,7,'oh, joy',15,'yes',3,7,3"
or
"'hello',97,45,'doe,john','bloggs,fred',3,8"

Actually, I should say what else I tried. I had a string with 36 fields
but using regex "," it only extracted 19 fields.

With regex "\',\'|\d,\'|\',\d" it found 33.
With regex "\',\'|\d,\'|\',\d|\d,\d" it went back to only 29! Weird.
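
Two documented behaviours of String.split() may be part of the reason for those odd counts, though without the actual 36-field string that is only a guess. First, whatever the delimiter regex matches is thrown away, so a pattern like \d,\d eats the digits on both sides of the comma. Second, the one-argument split() drops trailing empty strings. A small sketch (class and method names are made up for illustration):

```java
import java.util.Arrays;

// Hypothetical demo class, not from the thread.
class SplitPitfalls {

    // The delimiter includes the digits, so they vanish from the result.
    static String[] digitCommaDigit(String s) {
        return s.split("\\d,\\d");
    }

    // One-argument split(): trailing empty fields are removed.
    static String[] defaultSplit(String s) {
        return s.split(",");
    }

    // A negative limit keeps trailing empty fields.
    static String[] keepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        // "5,6" is consumed as one delimiter, leaving "" and ",7".
        System.out.println(Arrays.toString(digitCommaDigit("5,6,7")));   // [, ,7]
        System.out.println(defaultSplit("a,b,,").length);                // 2
        System.out.println(keepTrailing("a,b,,").length);                // 4
    }
}
```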
 
Chris Smith

Phil said:
Thanks, I probably should have stated my problem in terms of parsing a
CSV! ;-)
Ta for the CSVreader, but I want to use the regex stuff if possible...
if not I could simply have written a small method to tokenise the string
at commas, ignoring those within single quotes, but surely I can do that
with a regular expression..?
Maybe I need a perl NG...

Regular expressions aren't magic, unfortunately, and CSV is a bit more
complex than a regular expression can comfortably handle. There are a
lot of special cases. Trust me; I wrote a CSV parser using regular
expressions some time back. In the course of making modifications to
handle variations on the format used by many popular software export
features, it became obvious that the regular expressions were more
trouble than they were worth.

Besides, even if you implement a CSVReader in terms of regular
expressions, it's a higher level of abstraction as a pseudo-specified
file format, and deserves its own encapsulation.

If you don't want to maintain dependence on a third-party (even free or
open-source) utility to do this for you, I'm placing the following code
into the public domain. Just don't sue me. :)

Sorry for the wrapping; no easy way to fix it.

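import java.io.IOException;
import java.io.PushbackReader;
import java.util.ArrayList;
import java.util.List;

public class CSVUtil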
{
/**
* Reads the next logical line of the CSV file. Returns the next line
* as a {@link java.lang.String} without the trailing newline, or
* <code>null</code> if there is no more data to read before the end
* of the file.
*
* <p>
* The last line of a file is returned if it contains any characters,
* but is ignored if it is empty. The contract for this method is
* intended to approximate the contract of
* {@link java.io.BufferedReader#readLine}.
* </p>
*/
public static String readLine(PushbackReader r)
throws IOException
{
StringBuffer buf = new StringBuffer();
int ch;
boolean inQuote = false;

while ((ch = r.read()) != -1)
{
if (ch == '\"')
{
inQuote = !inQuote;
}
else if (!inQuote && (ch == '\n'))
{
break;
}
else if (!inQuote && (ch == '\r'))
{
/*
* See if this is a CRLF pair.
*/
int ch2 = r.read();

if (ch2 == '\n') break;
else if (ch2 != -1) r.unread(ch2);
}

buf.append((char) ch);
}

if ((buf.length() == 0) && (ch == -1))
{
/*
* Reached the end of the file, and there was nothing on the
* line. This indicates an end of file, such that we should
* return null, rather than the empty string.
*/
return null;
}
else
{
/*
* Return the line that was read.
*/
return buf.toString();
}
}

/**
* Parses a logical line of CSV content into fields. An attempt is
* made to reconstruct CSV data in a generally compatible way across
* import sources.
*/
public static String[] parse(String line)
{
List fields = new ArrayList();

int pos = -1;

while (pos < line.length())
{
pos++;

/*
* Determine the type of token. This could be plain,
* single-quoted, double-quoted, or empty.
*/
pos = skipWhitespace(pos, line);

if ((pos >= line.length()) || (line.charAt(pos) == ','))
{
/*
* Empty. When there is no non-whitespace between two
* commas, the result is an empty field. Whitespace is
* always ignored.
*/
fields.add("");
}
else if (isQuoteCharacter(line.charAt(pos)))
{
char quoteChar = line.charAt(pos);

/*
* Quoted. The contents extend from here to any of
* the following:
*
* 1. The end of the source string.
* 2. The next occurrence of a double-quote that is NOT
* followed by a second double-quote.
*/
StringBuffer field = new StringBuffer();
pos++; /* skip the quote */

boolean done = false;

while (!done)
{
int next = nextOccurrence(pos, line, quoteChar);
field.append(line.substring(pos, next));
pos = next;

if (next >= line.length())
{
/*
* Unterminated quote. Nevertheless, we'll take it.
*/
done = true;
}
else if (next == line.length() - 1)
{
/*
* Quote is at the end of the line. It is, therefore,
* not doubled.
*/
done = true;
pos++; /* skip the closing quote */
}
else if (line.charAt(next + 1) != quoteChar)
{
/*
* Quote is not doubled.
*/
done = true;
pos++; /* skip the closing quote */
}
else
{
/*
* Quote is doubled. It should be considered part of
* the content, and does not end the field.
*/
field.append(quoteChar);
pos += 2; /* skip both doubled quotes */
}
}

fields.add(field.toString());
}
else
{
/*
* Plain. Non-quoted fields may not contain quotes,
* commas, newlines, or leading or trailing whitespace.
*/
int next = nextOccurrence(pos, line, ',');
String field = line.substring(pos, next);
pos = next;

fields.add(field.trim());
}

/*
* Skip to the next comma. Any text found after a complete element
* (only possible when elements are quoted) is invalid, and will be
* discarded.
*/
pos = nextOccurrence(pos, line, ',');
}

return (String[]) fields.toArray(new String[fields.size()]);
}

/**
* Determines if a character should be considered a quote. As far as
* I can tell, only double-quotes are supposed to be used for quoting
* in CSV. However, I can recall (but can't find documentation for)
* some mention of single quotes. This method is provided to abstract
* the identity of a quote in case more are possible.
*/
private static boolean isQuoteCharacter(char ch)
{
return ch == '\"';
}

/**
* Returns the next character index in <code>line</code> that is not
* whitespace, beginning at <code>pos</code>. If there is no such
* index, the method returns <code>line.length()</code>.
*/
private static int skipWhitespace(int pos, String line)
{
while (
(pos < line.length())
&& Character.isWhitespace(line.charAt(pos)))
{
pos++;
}

return pos;
}

/**
* Returns the next character index in <code>line</code> that contains
* the specified character, <code>ch</code>, beginning at
* <code>pos</code>. If there is no such index, the method returns
* <code>line.length()</code>.
*/
private static int nextOccurrence(int pos, String line, char ch)
{
while ((pos < line.length()) && (line.charAt(pos) != ch))
{
pos++;
}

return pos;
}

/**
* Forms a logical line of CSV content from separate fields into one
* String in the CSV format, with appropriate quoting. An attempt is
* made to be compatible with the most basic CSV possible.
*/
public static String form(String[] elements)
{
StringBuffer line = new StringBuffer();

for (int i = 0; i < elements.length; i++)
{
String element = elements[i];

if (
(element.indexOf('\"') != -1)
|| (element.indexOf('\'') != -1)
|| (element.indexOf(',') != -1)
|| (element.indexOf('\r') != -1)
|| (element.indexOf('\n') != -1)
|| !(element.trim().equals(element)))
{
/*
* The element needs to be quoted. There are four cases
* where an element must be quoted:
*
* 1. It contains a single or double quote.
* 2. It contains a comma.
* 3. It contains a newline.
* 4. It has leading or trailing whitespace.
*/
line.append('\"');

for (int j = 0; j < element.length(); j++)
{
char ch = element.charAt(j);

if (ch == '\"')
{
line.append('\"');
line.append('\"');
}
else
{
line.append(ch);
}
}

line.append('\"');
}
else
{
line.append(element);
}

if (i < elements.length - 1)
{
line.append(',');
}
}

return line.toString();
}
}


--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Roedy Green

Phil said:
Ta for the CSVreader, but I want to use the regex stuff if possible...

The CSV reader will do the job faster, with less RAM. It is a state
machine particularly tuned to this problem. The Regex engine is
general purpose.

I can certainly see wanting to crack the problem with regex as an
intellectual exercise. You almost "tricked" me into going for it.
 
Alan Moore

Phil said:
Ta for the CSVreader, but I want to use the regex stuff if possible...
if not I could simply have written a small method to tokenise the string
at commas, ignoring those within single quotes, but surely I can do that
with a regular expression..?

You can use regex to do what you want, just not by way of the split()
method. Here's an example:

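import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String myStr = "5,\'this,that\',8";
List result = new ArrayList();
Pattern p = Pattern.compile(
"(?<=^|,)(?:'(?:[^']*)'|[^,']*)(?=,|$)");
Matcher m = p.matcher(myStr);
while (m.find())
{
result.add(m.group());
}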

But, as the others have said, this approach is much less efficient
than a dedicated CSV parser. Regular expressions are just not the
right tool for this job.

Phil said:
Maybe I need a perl NG...

They'll just tell you the same thing.
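
Since the original question also wanted the single quote marks removed, one possible variation on the pattern above is to add capturing groups and take whichever one matched. This is only a sketch: the class, method and constant names are invented here, generics assume a newer JDK than 1.4.2, and like the original pattern it makes no attempt to handle escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the find() loop, not from the thread.
class QuotedFieldMatcher {

    // Same shape as the pattern in the post, with capturing groups added:
    // group 1 = contents of a quoted field, group 2 = an unquoted field.
    private static final Pattern FIELD = Pattern.compile(
            "(?<=^|,)(?:'([^']*)'|([^,']*))(?=,|$)");

    static List<String> fields(String line) {
        List<String> result = new ArrayList<String>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups takes part in each match.
            String quoted = m.group(1);
            result.add(quoted != null ? quoted : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String field : fields("5,'this,that',8")) {
            System.out.println(field);
        }
    }
}
```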
 
Roedy Green

Alan said:
But, as the others have said, this approach is much less efficient
than a dedicated CSV parser. Regular expressions are just not the
right tool for this job.

When you look at all the special cases in a CSV parser, it becomes
clear it would take a heck of a long regex to specify all those rules.

It is easier and faster to specify them as a simple state machine.
Furthermore, it is much easier to make a single method call to
debugged code than to create some Rube Goldberg contraption with regexes.
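
To make the state-machine idea concrete, here is a minimal sketch with explicit states. It is not the CSVReader mentioned earlier in the thread: the class and state names are made up, it assumes double-quoted fields with doubled quotes as the escape, it does no whitespace trimming, and it works on a single line only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a CSV field splitter as a state machine.
class CsvStateMachine {

    // State names chosen for this sketch only.
    private enum State { UNQUOTED, QUOTED, QUOTE_SEEN }

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder field = new StringBuilder();
        State state = State.UNQUOTED;

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            switch (state) {
                case UNQUOTED:
                    if (ch == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                    } else if (ch == '"') {
                        state = State.QUOTED;
                    } else {
                        field.append(ch);
                    }
                    break;
                case QUOTED:
                    if (ch == '"') {
                        // Either the closing quote or half of a doubled one.
                        state = State.QUOTE_SEEN;
                    } else {
                        field.append(ch);
                    }
                    break;
                case QUOTE_SEEN:
                    if (ch == '"') {
                        // Doubled quote: a literal quote inside the field.
                        field.append('"');
                        state = State.QUOTED;
                    } else if (ch == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                        state = State.UNQUOTED;
                    } else {
                        // Text after a closing quote is kept as plain text.
                        field.append(ch);
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        // The final field is not followed by a comma.
        fields.add(field.toString());
        return fields;
    }
}
```

Each special case becomes one branch in one state, which is the main attraction compared with growing a single regex.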
 
