Phil said:
Thanks, I probably should have stated my problem in terms of parsing a
CSV! ;-)
Ta for the CSVreader, but I want to use the regex stuff if possible...
if not I could simply have written a small method to tokenise the string
at commas, ignoring those within single quotes, but surely I can do that
with a regular expression..?
Maybe I need a perl NG...
Regular expressions aren't magic, unfortunately, and CSV is a bit more
complex than a single regular expression handles comfortably. There are
a lot of special cases: quoted fields, commas and newlines inside
quotes, doubled quotes, stray whitespace. Trust me; I wrote a CSV parser
using regular expressions some time back. In the course of modifying it
to handle the format variations produced by the export features of many
popular programs, it became obvious that the regular expressions were
more trouble than they were worth.
Besides, even if you implement a CSVReader in terms of regular
expressions, CSV is a higher level of abstraction (a pseudo-specified
file format) and deserves its own encapsulation.
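To make the special cases concrete, here is a minimal sketch of what goes wrong when a line is tokenised at every comma with no notion of quoting. The class name NaiveSplitDemo and the sample line are made up for illustration; the sample uses double quotes, as in ordinary CSV.

```java
public class NaiveSplitDemo
{
    /** Splits on every comma, with no awareness of quoting. */
    static String[] naiveSplit(String line)
    {
        /* The -1 limit keeps trailing empty fields. */
        return line.split(",", -1);
    }

    public static void main(String[] args)
    {
        /* Three logical fields: 1 | Smith, Chris | said "hi" */
        String line = "1,\"Smith, Chris\",\"said \"\"hi\"\"\"";
        String[] pieces = naiveSplit(line);

        /* The comma inside the quoted name is treated as a separator. */
        System.out.println(pieces.length); // prints 4
    }
}
```

The quoted name is cut in two, and the surrounding and doubled quotes are left in the pieces, so every one of those cases still has to be handled by hand afterwards.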
If you don't want to maintain dependence on a third-party (even free or
open-source) utility to do this for you, I'm placing the following code
into the public domain. Just don't sue me.
Sorry for the wrapping; no easy way to fix it.
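import java.io.IOException;
import java.io.PushbackReader;
import java.util.ArrayList;
import java.util.List;

public class CSVUtil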
{
/**
* Reads the next logical line of the CSV file. Returns the next line
* as a {@link java.lang.String} without the trailing newline, or
* <code>null</code> if there is no more data to read before the end
* of the file.
*
* <p>
* The last line of a file is returned if it contains any characters,
* but is ignored if it is empty. The contract for this method is
* intended to approximate the contract of
* {@link java.io.BufferedReader#readLine}.
* </p>
*/
public static String readLine(PushbackReader r)
throws IOException
{
StringBuffer buf = new StringBuffer();
int ch;
boolean inQuote = false;
while ((ch = r.read()) != -1)
{
if (ch == '\"')
{
inQuote = !inQuote;
}
else if (!inQuote && (ch == '\n'))
{
break;
}
else if (!inQuote && (ch == '\r'))
{
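/*
* A CR ends the line, whether it stands alone or is the
* first half of a CRLF pair. Whatever follows a lone CR
* is pushed back for the next call.
*/
int ch2 = r.read();
if ((ch2 != '\n') && (ch2 != -1)) r.unread(ch2);
break;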
}
buf.append((char) ch);
}
if ((buf.length() == 0) && (ch == -1))
{
/*
* Reached the end of the file, and there was nothing on the
* line. This indicates an end of file, such that we should
* return null, rather than the empty string.
*/
return null;
}
else
{
/*
* Return the line that was read.
*/
return buf.toString();
}
}
/**
* Parses a logical line of CSV content into fields. An attempt is made
* to reconstruct CSV data in a generally compatible way across import
* sources.
*/
public static String[] parse(String line)
{
List fields = new ArrayList();
int pos = -1;
while (pos < line.length())
{
pos++;
/*
* Determine the type of token. This could be plain, quoted
* (see isQuoteCharacter), or empty.
*/
pos = skipWhitespace(pos, line);
if ((pos >= line.length()) || (line.charAt(pos) == ','))
{
/*
* Empty. When there is no non-whitespace between two
* commas, the result is an empty field. Whitespace is
* always ignored.
*/
fields.add("");
}
else if (isQuoteCharacter(line.charAt(pos)))
{
char quoteChar = line.charAt(pos);
/*
* Quoted. The contents extend from here to any of
* the following:
*
* 1. The end of the source string.
* 2. The next occurrence of a double-quote that is NOT
* followed by a second double-quote.
*/
StringBuffer field = new StringBuffer();
pos++; /* skip the quote */
boolean done = false;
while (!done)
{
int next = nextOccurrence(pos, line, quoteChar);
field.append(line.substring(pos, next));
pos = next;
if (next >= line.length())
{
/*
* Unterminated quote. Nevertheless, we'll take it.
*/
done = true;
}
else if (next == line.length() - 1)
{
/*
* Quote is at the end of the line. It is, therefore,
* not doubled.
*/
done = true;
pos++; /* skip the closing quote */
}
else if (line.charAt(next + 1) != quoteChar)
{
/*
* Quote is not doubled.
*/
done = true;
pos++; /* skip the closing quote */
}
else
{
/*
* Quote is doubled. It should be considered part of
* the content, and does not end the field.
*/
field.append(quoteChar);
pos += 2; /* skip both doubled quotes */
}
}
fields.add(field.toString());
}
else
{
/*
* Plain. Non-quoted fields may not contain quotes,
* commas, newlines, or leading or trailing whitespace.
*/
int next = nextOccurrence(pos, line, ',');
String field = line.substring(pos, next);
pos = next;
fields.add(field.trim());
}
/*
* Skip to the next comma. Any text found after a complete
* element (only possible when elements are quoted) is invalid,
* and will be discarded.
*/
pos = nextOccurrence(pos, line, ',');
}
return (String[]) fields.toArray(new String[fields.size()]);
}
/**
* Determines if a character should be considered a quote. As far as
* I can tell, only double-quotes are supposed to be used for quoting
* in CSV. However, I can recall (but can't find documentation for)
* some mention of single quotes. This method is provided to abstract
* the identity of a quote in case more are possible.
*/
private static boolean isQuoteCharacter(char ch)
{
return ch == '\"';
}
/**
* Returns the next character index in <code>line</code> that is not
* whitespace, beginning at <code>pos</code>. If there is no such
* index, the method returns <code>line.length()</code>.
*/
private static int skipWhitespace(int pos, String line)
{
while (
(pos < line.length())
&& Character.isWhitespace(line.charAt(pos)))
{
pos++;
}
return pos;
}
/**
* Returns the next character index in <code>line</code> that contains
* the specified character, <code>ch</code>, beginning at
* <code>pos</code>. If there is no such index, the method returns
* <code>line.length()</code>.
*/
private static int nextOccurrence(int pos, String line, char ch)
{
while ((pos < line.length()) && (line.charAt(pos) != ch))
{
pos++;
}
return pos;
}
/**
* Forms a logical line of CSV content from separate fields into one
* String in the CSV format, with appropriate quoting. An attempt is
* made to be compatible with the most basic CSV possible.
*/
public static String form(String[] elements)
{
StringBuffer line = new StringBuffer();
for (int i = 0; i < elements.length; i++)
{
String element = elements[i];
if (
(element.indexOf('\"') != -1)
|| (element.indexOf('\'') != -1)
|| (element.indexOf(',') != -1)
|| (element.indexOf('\r') != -1)
|| (element.indexOf('\n') != -1)
|| !(element.trim().equals(element)))
{
/*
* The element needs to be quoted. There are four cases
* where an element must be quoted:
*
* 1. It contains a single or double quote.
* 2. It contains a comma.
* 3. It contains a newline.
* 4. It has leading or trailing whitespace.
*/
line.append('\"');
for (int j = 0; j < element.length(); j++)
{
char ch = element.charAt(j);
if (ch == '\"')
{
line.append('\"');
line.append('\"');
}
else
{
line.append(ch);
}
}
line.append('\"');
}
else
{
line.append(element);
}
if (i < elements.length - 1)
{
line.append(',');
}
}
return line.toString();
}
}
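
The quoting rule used by form() and undone by parse() can also be seen in isolation. The following is only a sketch of that one rule, not a replacement for the class above; QuoteRuleDemo, quote and unquote are hypothetical names, and unquote assumes its argument really is wrapped in a pair of double quotes.

```java
public class QuoteRuleDemo
{
    /** Wraps a value in double quotes, doubling any embedded double quote. */
    static String quote(String value)
    {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /**
     * Reverses quote(): strips the outer quotes and collapses each doubled
     * quote back to a single one. Assumes the input starts and ends with
     * a double quote.
     */
    static String unquote(String quoted)
    {
        String inner = quoted.substring(1, quoted.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args)
    {
        String original = "He said \"hi\", then left";
        String quoted = quote(original);

        /* The field now survives commas and quotes in its content. */
        System.out.println(quoted);
        System.out.println(unquote(quoted).equals(original)); // prints true
    }
}
```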
--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation