parsing a tab delimited or CSV, but keep the delimiter

S

Sideswipe

I know this question has been asked before, and believe me I checked
the newsgroup and web extensively before asking, but I think my needs
are slightly different.

I need to parse either a CSV or a Tab delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from excel and the user expects them to process EXACTLY as it appears
in the spreadsheet.


I am cross posting this in the Perl and Java groups because, my
implementation is in Java, but Perl users use regexp far more
frequently.

Here are the 3 different REGEX expressions I have found /created but
none are correct. The only certainty I can get is to get rid of all
the delimiters. I have to maintain the delimiters because the
information I am accessing is column based (and thus fixed)

private final Pattern COLUMN_PATTERN = Pattern.compile("(\"[^\"]*\",,|
[^,]+)"); // I think this close
private final Pattern COLUMN_PATTERN = Pattern.compile("([^\",]*|\"([^
\"]|\"\")+\")(,)");
private final Pattern COLUMN_PATTERN = Pattern.compile(",(?=(?:[^\\\"]*
\\\"[^\\\"]*\\\")*(?![^\\\"]*\\\"))");

So, you have the cases of:

1) continuous string or with space -> single ',' (comma) separated
2) String has a comma in it, and is "" -> it is followed by a ",,"
double comma token. So the string in "" is a token and the double
comma is also a token
3) blank cells are just a single comma ,

That's my understanding of the cases. The logic should be IDENTICAL
for tab delimited and simply substitute characters
 
S

Sideswipe

It also looks like double comma before a "" like this ,,"" -- the ,,
counts as 1 token
 
C

Chris Uppal

Sideswipe said:
I need to parse either a CSV or a Tab delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from excel and the user expects them to process EXACTLY as it appears
in the spreadsheet.

I am cross posting this in the Perl and Java groups because, my
implementation is in Java, but Perl users use regexp far more
frequently.

I do not believe that regular expressions (even the weird "enhanced" regular
expressions that Java has apparently borrowed from Perl) are a suitable tool
for this job. Even if you could find some way of expressing the complex logic
of CSV parsing using the extended regexps, your resulting code would be totally
incomprehensible.

A useful test case for your parser is that /any/ CSV file can (treated as
a single string) be used as the contents of a field within another CSV.
Embedded newlines, arbitrarily deep nesting, etc, are all allowed. That power
hardly makes it any harder to parse CVS /unless/ you've started out by relying
on an inadequate tool like regexps.

Also, don't forget to test your code with files generated from Excel in places
where the numeric thousands separator is not ','.

-- chris


P.S. Although I don't much like Java's regexps (I think they are /grossly/
overused), it seems that I like them rather better than OE's spellchecker,
which suggests that replace "regexp" with "rage" and "regexps" with "creeps".
True AI is clearly only just around the corner ;-)
 
D

Daniel Pitts

I know this question has been asked before, and believe me I checked
the newsgroup and web extensively before asking, but I think my needs
are slightly different.

I need to parse either a CSV or a Tab delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from excel and the user expects them to process EXACTLY as it appears
in the spreadsheet.

I am cross posting this in the Perl and Java groups because, my
implementation is in Java, but Perl users use regexp far more
frequently.
Cross posting is fine, but you should add a follow up header. Which I
have (to comp.lang.java.programmer)
Here are the 3 different REGEX expressions I have found /created but
none are correct. The only certainty I can get is to get rid of all
the delimiters. I have to maintain the delimiters because the
information I am accessing is column based (and thus fixed)

private final Pattern COLUMN_PATTERN = Pattern.compile("(\"[^\"]*\",,|
[^,]+)"); // I think this close
private final Pattern COLUMN_PATTERN = Pattern.compile("([^\",]*|\"([^
\"]|\"\")+\")(,)");
private final Pattern COLUMN_PATTERN = Pattern.compile(",(?=(?:[^\\\"]*
\\\"[^\\\"]*\\\")*(?![^\\\"]*\\\"))");

So, you have the cases of:

1) continuous string or with space -> single ',' (comma) separated
2) String has a comma in it, and is "" -> it is followed by a ",,"
double comma token. So the string in "" is a token and the double
comma is also a token
3) blank cells are just a single comma ,

That's my understanding of the cases. The logic should be IDENTICAL
for tab delimited and simply substitute characters

I'm not sure that a regex is good enough to do everything...

Anyway, here are the cases that I can think of, ignoring the
delimiters.

Field value:
Field value: ,
Field value: "
Field value: a,b
Field value: "a and b"
Field value: 6"3

What are the encodings of this?

I'm guessing that

Field value:
Encoded value:
Field value: ,
Encoded value: ","
Field value: "
Encoded value: ""
Field value: a,b
Encoded value: "a,b"
Field value: "a and b"
Encoded value: ""a and b""
Field value: 6"3
Encoded value: 6""3


You can verify these cases in excel.
If those ARE the correct cases, then this would work:

import java.util.List;
import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ParseCSV {
final static String quoted = "\":)?[^\"]|\"\")+\"";
public static List<String> parseCSV(String csv, String delim) {
final Pattern NEXT_COLUMN = nextColumnRegex(delim);
final List<String> strings = new ArrayList<String>();
final Matcher matcher = NEXT_COLUMN.matcher(csv);
while (!matcher.hitEnd() && matcher.find()) {
String match = matcher.group(1);
if (match.matches(quoted)) {
match = match.substring(1, match.length() - 1);
}
match = match.replaceAll("\"\"", "\"");
strings.add(match);
}
return strings;
}

private static Pattern nextColumnRegex(String comma) {
String unquoted = ":)?[^\"" + comma + "]|\"\")*";
String ending = ":)?" + comma +"|$)";
return Pattern.compile('(' + quoted + '|' + unquoted + ')' +
ending);
}


public static void main(String[] args) {
String csv = ",\",\",\"\",\"a,b\",\"\"a and b\"\",6\"\"3";
List<String> result = parseCSV(csv, ",");
for (String col : result) {
System.out.println("Field value:" + col);
}
}

}
 
J

Jürgen Exner

Sideswipe said:
I need to parse either a CSV or a Tab delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from excel and the user expects them to process EXACTLY as it appears
in the spreadsheet.

Text::CSV should do the job quite nicely for regular CSV files.
Here are the 3 different REGEX expressions I have found /created but
none are correct. [...]

That doesn't suprise me. While REs are quite powerful indeed, there are
many, many tasks, where you need a more complex logic than REs can provide.
Even if you have custom requirements and you need a custom version of a CSV
parser I would still start with one of the Text::CSV modules and customize
that instead of rolling my own code from scratch and repeating all the
mistakes that have been fixed in the ready-made modules a long time ago.

jue
 
L

Lew

Jürgen Exner said:
No, why would I?

There were so many that came up when I googled that I just picked one at
random, on the basis that it was eight out of the first ten sites that popped
up in my search.
Of course I am referring to the module Text::CSV (or one of its cousins):
http://search.cpan.org/search?query=text::csv&mode=all

Of course! I should have /known/ that it was one of the other 68790 hits,
instead of the one that was eight out of the first ten.

-- Lew
 
J

Jürgen Exner

Lew said:
There were so many that came up when I googled that I just picked one
at random, on the basis that it was eight out of the first ten sites
that popped up in my search.


Of course! I should have /known/ that it was one of the other 68790
hits, instead of the one that was eight out of the first ten.

Well, sorry, but when you are talking Perl, then CPAN is the one and only
repository for modules. There may be others, there may even be commercial
ones. But CPAN is just plain the default and nobody would assume otherwise
unless you mention a different source explicitely.

jue
 
L

Lew

Jürgen Exner said:
Well, sorry, but when you are talking Perl, then CPAN is the one and only
repository for modules. There may be others, there may even be commercial
ones. But CPAN is just plain the default and nobody would assume otherwise
unless you mention a different source explicitely.

You're talking Perl. I'm talking Java. That's where I read the message, in
clj.programmer, and that's what I googled. There are a lot of "registries" for
code in Java; we're not limited to just one.

Sorry us Java folks aren't always hep to the Perl culture.

-- Lew
 
L

Lew

Jürgen Exner said:
Well, sorry, but when you are talking Perl, then CPAN is the one and only
repository for modules. There may be others, there may even be commercial
ones. But CPAN is just plain the default and nobody would assume otherwise
unless you mention a different source explicitely.

Well, sorry, but you're the one talking Perl. I'm talking Java. I read the
post in clj.programmer, and that's what informed my googling. In the Java
world we're not limited to only one repository for useful code. I guess us
Java folks aren't hep to Perl culture.

-- Lew
 
P

Patricia Shanahan

Jürgen Exner said:
No, why would I?
Of course I am referring to the module Text::CSV (or one of its cousins):
http://search.cpan.org/search?query=text::csv&mode=all

There is some confusion here because of the cross posting between Java
and Perl newsgroups.

The original message says "I am cross posting this in the Perl and Java
groups because, my implementation is in Java, but Perl users use regexp
far more frequently."

I don't think that was a really good idea because of the confusion,
because it appears to assume a regular expression solution, and because
in any case many Java programmers know regular expressions.

Text::CSV is a Perl module, which I don't think will help given the
question. On the other hand, I think the principle of looking for an
existing implementation is a good one. Parsing CSV is hardly a unique
requirement.

Google for 'java csv' got several promising looking hits - I think the
OP should do the search and compare the results to the requirements.

Patricia
 
J

Jürgen Exner

Lew said:
Well, sorry, but you're the one talking Perl. I'm talking Java.

The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer...

jue
 
S

Stefan Ram

Sideswipe said:
I need to parse either a CSV or a Tab delimited file, BUT I need to

Whatever you want to parse, you need a specification of
the syntax - which usually is given in EBNF.
generated from excel

This is not such a specification.
none are correct

Correct accordint to which syntax specification?
1) continuous string or with space

What is a »continous string«? Is »continuous string or with
space« supposed to be English?

You do not seem to be able to describe what you need to parse
in English, nor in ENBF. So how are you ever going to implement it?
That's my understanding of the cases.

My (German) version of Excel does not use the comma as a
separator at all, but the semicolon. So, a CSV file generated
by it for two cells containing »1« and »2«, respectively, is:

1;2

I suggest the following reading:

http://secretgeek.net/csv_trouble.asp
 
T

Tad McClellan

["Followup-To:" header set to comp.lang.perl.misc.]

Well, sorry, but you're the one talking Perl.


That can happen in articles posted to the Perl newsgroup...
 
L

Lew

Jürgen Exner said:
The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer...

The OP cross-posted to a Perl group and a Java group, so /of course/ we have
an obligation to speak to both audiences.

-- Lew
 
L

Lew

Jürgen Exner said:
The OP posted to comp.lang.PERL.misc. So of course he gets a Perl answer...

jue

The OP cross-posted to a Perl group and a Java group, so /of course/ we have
an obligation to speak to both audiences.

Furthermore, the OP stated that his implementation was in Java, so /of course/
a Perl answer was irrelevant.

/Of course/ based on that I expected answers to focus on Java solutions and
/of course/ went looking in that space for references.

-- Lew
 
S

Sherm Pendley

Lew said:
Furthermore, the OP stated that his implementation was in Java, so /of
course/ a Perl answer was irrelevant.

If he didn't want a Perl answer, he shouldn't have posted to a Perl group.

sherm--
 
A

Alex

There were so many that came up when I googled that I just picked one at
random, on the basis that it was eight out of the first ten sites that popped
up in my search.
Of course! I should have /known/ that it was one of the other 68790 hits,
instead of the one that was eight out of the first ten.

If you search for "Text::CSV" on Google, the very first hit you get is a
link to the CPAN-module. The hxtt.com-link you send doesn't even contain
the search string and does not appear in the first hundred hits. So yes,
you should have /known/ that you should at least check your spelling.
 
L

Lew

Alex said:
If you search for "Text::CSV" on Google, the very first hit you get is a
link to the CPAN-module. The hxtt.com-link you send doesn't even contain
the search string and does not appear in the first hundred hits. So yes,
you should have /known/ that you should at least check your spelling.

I searched on "Java Text::CSV" and got the hits I said I got - none of the
first ten hits were CPAN, eight of them were HXTT. Those were my results. I
said what I said based on what I experienced. You can make all the claims you
want on what I would get if I searched; I told you what I did get when I searched.

-- Lew
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,015
Latest member
AmbrosePal

Latest Threads

Top