Best way to parse a CSV that has CRLF in the fields

sso

Any suggestions as to the best way to parse through a csv file that
has carriage returns in some of the fields? It's in an ods file that I
save to csv. I'm lost....
 
Knute Johnson

sso said:
Any suggestions as to the best way to parse through a csv file that
has carriage returns in some of the fields? It's in an ods file that I
save to csv. I'm lost....

Is the CRLF a delimiter? In any case, you can use the Scanner class to
do that sort of thing.
 
Mark Space

Knute said:
Is the CRLF a delimiter? In any case, you can use the Scanner class to
do that sort of thing.


I think he's saying the CRLF is part of the data, and the program has to
distinguish between LF as part of a field, and LF when it ends a line.

Not really easy with Scanner. I can't think of a good way to do it
offhand...
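
A minimal sketch of that distinction, assuming the export wraps any field containing a line break in double quotes and writes a doubled quote for a literal quote (the RFC 4180 style convention; the thread does not confirm how this particular file is quoted). The class name QuotedCsvParser and its parse method are hypothetical, not part of the JDK or of any library mentioned here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits CSV text into records, keeping line breaks inside quoted fields. */
final class QuotedCsvParser {

    /**
     * Assumes fields containing the delimiter, a quote, or a line break are
     * wrapped in double quotes, and that "" inside a quoted field means one
     * literal quote. Blank lines between records are skipped.
     */
    static List<List<String>> parse(String text, char delimiter) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false; // true once the current record has any content

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    field.append(c); // CR and LF are ordinary data here
                }
            } else if (c == '"') {
                inQuotes = true;
                pending = true;
            } else if (c == delimiter) {
                record.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (c == '\r' || c == '\n') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++; // treat CRLF as a single terminator
                }
                if (pending) { // an unquoted line break ends the record
                    record.add(field.toString());
                    records.add(record);
                    record = new ArrayList<>();
                    field.setLength(0);
                    pending = false;
                }
            } else {
                field.append(c);
                pending = true;
            }
        }
        if (pending) { // last record without a trailing line break
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Calling QuotedCsvParser.parse(text, ',') on the whole file content returns one list per record. The only state that matters is whether the parser is currently inside quotes, which is what makes a line break either data or a terminator.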
 
Roedy Green

sso said:
Any suggestions as to the best way to parse through a csv file that
has carriage returns in some of the fields? It's in an ods file that I
save to csv. I'm lost....

Use my CSVReader class. It has an allowMultilineFields boolean in the
constructor.

See http://mindprod.com/products1.html#CSV

Other possibilities are listed at http://mindprod.com/jgloss/csv.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
~ Charles Darwin
 
sso

Is the CRLF a delimiter? In any case, you can use the Scanner class to
do that sort of thing.

This is definitely working better. Thanks!

Scanner doesn't seem to like my Chinese characters. | is the delim.
Example:

AI YE
$BghMU(B
Folium Artemisiae Argyi
Wormwood/ MOXA|
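
One possible cause, not confirmed in this thread, is that Scanner is decoding the file with a default charset that differs from the one the file was saved in. The sketch below assumes the CSV was exported as UTF-8 and passes that charset name to the Scanner constructor explicitly. The class name CharsetAwareScan is hypothetical.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical helper: reads '|'-separated tokens using an explicit charset. */
final class CharsetAwareScan {

    static List<String> tokens(InputStream in, String charsetName) {
        List<String> out = new ArrayList<>();
        // Scanner(InputStream, String) decodes the bytes with the named
        // charset instead of falling back to the default charset.
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\|"); // '|' is a regex metacharacter, so escape it
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }
}
```

If the file was saved in a different encoding, the charset name has to match that encoding instead of "UTF-8".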
 
sso

Which begs the question, how does he differentiate between a CRLF  
terminating a line of input, and one that's in a field.

The most obvious answer is that the CRLF is quoted.  But whatever the  
indicator, I'd guess that a suitable regex could distinguish the  
individual fields without picking up the CRLF as a terminator for the line  
(you'd have to disable the end-of-line processing for the regex, of  
course).


I'm not familiar with Scanner, but it looks to me as though you can use a  
custom regex to tell it how to break apart the input line.  Assuming he  
can come up with an appropriate regex to do the job, it should be  
relatively easy to move from that to using Scanner for the actual input  
processing.

As far as the exact regex goes, well...that'd be for someone else to  
figure out.  I'm not good enough with regular expressions to come up with  
that easily, and don't have the time or interest to work it out myself.  :)

Pete

I could regex it, but there are about 400 records in the file.
Perhaps that would be cumbersome? As far as the LF being a delimiter,
well it is part of the data, but the records always have the same
number of fields. I will try this CSVReader class. :)
 
Mark Space

Peter said:
As far as the exact regex goes, well...that'd be for someone else to
figure out.


That's what I'm saying. Sure, as long as one can be determined. I
can't. I saw the regex delimiters on Scanner, I just can't come up with
an actual regex to make it work.

I'm at least somewhat interested, because CSV is common and handy.
There are third-party libraries (like Roedy's) but it would be nice if I
didn't have to download any extra jar files. However, that may not be
possible.
 
Mayeul

sso said:
I could regex it, but there are about 400 records in the file.
Perhaps that would be cumbersome? As far as the LF being a delimiter,
well it is part of the data, but the records always have the same
number of fields. I will try this CSVReader class. :)

Obvious question:

Is the last field of each record terminated with a delimiter, or does it
guarantee it does _not_ contain a CRLF?
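
If the answer is that every field, including the last one, ends with the delimiter (the `Wormwood/ MOXA|` sample earlier in the thread suggests this, but it is an assumption), then a fixed field count is enough to rebuild the records with Scanner alone. This is a rough sketch under that assumption; the class name FixedFieldRecords is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical helper: groups '|'-terminated fields into records of a fixed size. */
final class FixedFieldRecords {

    static List<List<String>> read(Readable source, int fieldsPerRecord) {
        if (fieldsPerRecord < 1) {
            throw new IllegalArgumentException("fieldsPerRecord must be positive");
        }
        List<List<String>> records = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("\\|"); // line breaks are NOT delimiters here
            while (scanner.hasNext()) {
                String token = scanner.next();
                if (current.isEmpty()) {
                    // The line break that ended the previous record is glued to the
                    // front of this record's first field, so strip it. Caveat: a first
                    // field that really starts with a line break loses that break.
                    token = token.replaceFirst("^\\r?\\n", "");
                    if (token.isEmpty() && !scanner.hasNext()) {
                        break; // only the final line break of the file was left
                    }
                }
                current.add(token);
                if (current.size() == fieldsPerRecord) {
                    records.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            throw new IllegalArgumentException("incomplete final record: " + current);
        }
        return records;
    }
}
```

Line breaks inside any field are preserved because only `|` separates tokens. If the last field is not delimiter-terminated, this approach cannot tell a record-ending line break from one inside the last field, which is the point of the question above.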
 
Stefan Ram

sso said:
Any suggestions as to the best way to parse through a csv file that
has carriage returns in some of the fields? It's in an ods file that I
save to csv. I'm lost....

To write a parser, I need a specification of the language
used.

The name »CSV« is not such a specification, because there are
several different languages in the world that are referred to
by »CSV«.

Given a specification, writing a parser often is
straightforward (for those having learned how to write
parsers).

(There are some languages, for example, C++, that are
difficult to parse, even with proper education and a proper
specification. But most languages named »CSV« should be easy
to parse.)
 
Lew

Mark said:
I'm at least somewhat interested, because CSV is common and handy. There
are third-party libraries (like Roedy's) but it would be nice if I didn't
have to download any extra jar files. However, that may not be possible.

So it's nicer to reinvent the wheel than to use someone else's tried-and-true
solution?
 
Andreas Leitgeb

Lew said:
So it's nicer to reinvent the wheel than to use someone else's tried-and-true
solution?

I'd say it depends on the wheel...

If you need a pneumatic tire on ball bearings, then using/buying someone
else's solution appears reasonable. If a circle cut out of cardboard,
with a center hole punched out with a pencil, suffices, then I'd go for
that and do it myself.
 
John B. Matthews

Mark said:
I'm at least somewhat interested, because CSV is common and handy.
There are third-party libraries (like Roedy's) but it would be nice
if I didn't have to download any extra jar files. However, that
may not be possible.

Lew said:
So it's nicer to reinvent the wheel than to use someone else's
tried-and-true solution?

That difficult decision would rest on a host of factors, possibly
including an assessment of the license terms. IANAL.

I have had positive experience with the CSV utilities that are part of
the H2 Database: <http://www.h2database.com/>. I was intrigued to see
CSV support in the PostgreSQL COPY command: <http://www.postgresql.org/>
 
Mark Space

Lew said:
So it's nicer to reinvent the wheel than to use someone else's
tried-and-true solution?

I don't follow. Neither using Scanner properly nor using a third-party
jar seems to be reinventing the wheel to me. Can you clarify?
 
sso

This is definitely working better. Thanks!

Scanner doesn't seem to like my Chinese characters. | is the delim.
Example:

AI YE
$BghMU(B
Folium Artemisiae Argyi
Wormwood/ MOXA|

Dumb question: How do I get NetBeans to recognize the import
statement for opencsv? I think this would be obvious, but I can't seem
to find an answer.
 
John B. Matthews

[...]
Dumb question: How do I get NetBeans to recognize the import
statement for opencsv? I think this would be obvious, but I can't
seem to find an answer.

Tools > Libraries > New Library.
 
Lew

Mark said:
I don't follow. Neither using Scanner properly nor using a third-party
jar seems to be reinventing the wheel to me. Can you clarify?

The comment "it would be nice if I didn't have to download any extra jar
files" seemed like it decried the use of third-party libraries (like Roedy's)
in favor of writing one's own code. It seemed to me that it makes more sense
to use third-party libraries (like Roedy's), which of course means having to
download JAR files. Unless you meant that there's a better way to acquire
those libraries?
 
Mark Space

Lew said:
The comment "it would be nice if I didn't have to download any extra jar
files" seemed like it decried the use of third-party libraries (like
Roedy's) in favor of writing one's own code. It seemed to me that it
makes more sense to use third-party libraries (like Roedy's), which of
course means having to download JAR files. Unless you meant that
there's a better way to acquire those libraries?


No, I meant using a built-in class (like Scanner) was the preferred
option. Third-party libraries are choice number two. Writing your own
comes in third.
 
Lew

Mark said:
No, I meant using a built-in class (like Scanner) was the preferred
option. Third-party libraries are choice number two. Writing your own
comes in third.

I see. But Scanner doesn't have a complete CSV solution, unless I misread the
Javadocs. So use of Scanner becomes door number 3 - roll your own. I should
think it would be tricky with Scanner to get things just right, for example,
to deal with the OP's original problem:
 
Roedy Green

(There are some languages, for example, C++, that are
difficult to parse, even with proper education and a proper
specification. But most languages named »CSV« should be easy
to parse.)

There are many implementations out there. You don't have to roll your
own unless you find writing finite state automata and parsers fun.

My own has configurable magic letters for comment intro, separator,
quote char and a few other variations.

It turns out making code configurable and using enums work at cross
purposes. You can't create an object containing a customised enum.
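
To illustrate the point about configurability (this is not Roedy's actual code, only a hypothetical sketch): since an enum's constants are fixed at compile time, per-instance settings such as the separator or quote character can be carried as plain fields of a small immutable object. The class name CsvDialect, its fields, and the default characters are assumptions.

```java
/** Hypothetical immutable settings object for a configurable CSV reader. */
final class CsvDialect {
    final char separator;
    final char quote;
    final char commentIntro;

    CsvDialect(char separator, char quote, char commentIntro) {
        if (separator == quote) {
            throw new IllegalArgumentException("separator and quote must differ");
        }
        this.separator = separator;
        this.quote = quote;
        this.commentIntro = commentIntro;
    }

    /** Illustrative defaults: comma, double quote, '#' for comments. */
    static CsvDialect standard() {
        return new CsvDialect(',', '"', '#');
    }

    /** Returns a copy with a different separator, leaving this instance unchanged. */
    CsvDialect withSeparator(char newSeparator) {
        return new CsvDialect(newSeparator, quote, commentIntro);
    }
}
```

For the pipe-delimited file discussed in this thread, CsvDialect.standard().withSeparator('|') would describe the format without needing a new enum constant.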
 
