Best way to parse a CSV that has CRLF in the fields

Discussion in 'Java' started by sso, Apr 24, 2009.

  1. sso

    sso Guest

    Any suggestions as to the best way to parse through a CSV file that
    has carriage returns in some of the fields? It's in an ODS file that I
    save to CSV. I'm lost....
    sso, Apr 24, 2009
    #1

  2. sso wrote:
    > Any suggestions as to the best way to parse through a csv file that
    > has carriage returns in some of the fields? Its in an ods file that I
    > save to csv. I'm lost....


    Is the CRLF a delimiter? In any case, you can use the Scanner class to
    do that sort of thing.

    --

    Knute Johnson
    email s/nospam/knute2009/

    Knute Johnson, Apr 24, 2009
    #2

  3. sso

    Mark Space Guest

    Knute Johnson wrote:
    > sso wrote:
    >> Any suggestions as to the best way to parse through a csv file that
    >> has carriage returns in some of the fields? Its in an ods file that I
    >> save to csv. I'm lost....

    >
    > Is the CRLF a delimiter? In any case, you can use the Scanner class to
    > do that sort of thing.
    >



    I think he's saying the CRLF is part of the data, and the program has to
    distinguish between LF as part of a field, and LF when it ends a line.

    Not really easy with Scanner. I can't think of a good way to do it
    offhand...
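
    To make the difficulty concrete, here is a minimal sketch of what goes
    wrong with line-oriented reading. It assumes the embedded line break sits
    inside a double-quoted field, which the thread has not confirmed; the
    class and method names are hypothetical.

```java
import java.util.Scanner;

class LineSplitProblemSketch {

    // Counts physical lines the way a naive line-by-line reader sees them.
    static int countPhysicalLines(String text) {
        Scanner scanner = new Scanner(text);
        int lines = 0;
        while (scanner.hasNextLine()) {
            scanner.nextLine();
            lines++;
        }
        scanner.close();
        return lines;
    }

    public static void main(String[] args) {
        // One logical record with three fields; the middle field holds a CRLF.
        String oneRecord = "a,\"x\r\ny\",c\r\n";
        // The reader reports more lines than there are records,
        // because it cannot tell the quoted CRLF from a record terminator.
        System.out.println(countPhysicalLines(oneRecord));
    }
}
```

    A line count that exceeds the record count is the symptom: the reader has
    to track whether it is inside a field before it can trust a line break.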
    Mark Space, Apr 24, 2009
    #3
  4. sso

    Roedy Green Guest

    On Thu, 23 Apr 2009 20:45:09 -0700 (PDT), sso
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Any suggestions as to the best way to parse through a csv file that
    >has carriage returns in some of the fields? Its in an ods file that I
    >save to csv. I'm lost....

    use my CSVReader class. It has an allowMultilineFields boolean in the
    constructor.

    See http://mindprod.com/products1.html#CSV

    Other possibilities are listed at http://mindprod.com/jgloss/csv.html
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
    ~ Charles Darwin
    Roedy Green, Apr 24, 2009
    #4
  5. sso

    sso Guest

    On Apr 24, 12:03 am, Knute Johnson <>
    wrote:
    > sso wrote:
    > > Any suggestions as to the best way to parse through a csv file that
    > > has carriage returns in some of the fields? Its in an ods file that I
    > > save to csv. I'm lost....

    >
    > Is the CRLF a delimiter? In any case, you can use the Scanner class to
    > do that sort of thing.
    >
    > --
    >
    > Knute Johnson
    > email s/nospam/knute2009/


    This is definitely working better. Thanks!

    Scanner doesn't seem to like my Chinese characters. The delimiter is |.
    Example:

    AI YE
    艾葉
    Folium Artemisiae Argyi
    Wormwood/ MOXA|
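
    One common cause of this symptom is that Scanner decodes with the platform
    default charset when none is given. A minimal sketch, assuming the file
    was exported as UTF-8 (the thread does not say which encoding was used)
    and using a hypothetical class name:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerCharsetSketch {

    // Reads '|'-delimited fields, decoding the bytes as UTF-8 instead of
    // the platform default. UTF-8 is an assumption about the export.
    static List<String> readFields(InputStream in) {
        List<String> fields = new ArrayList<String>();
        Scanner scanner = new Scanner(in, "UTF-8");
        scanner.useDelimiter("\\|"); // the delimiter is a regex, so '|' is escaped
        while (scanner.hasNext()) {
            fields.add(scanner.next());
        }
        scanner.close();
        return fields;
    }

    public static void main(String[] args) {
        // \u827E\u8449 are the two Chinese characters from the example above,
        // written as escapes so the source compiles under any file encoding.
        String sample = "AI YE\n\u827E\u8449\nFolium Artemisiae Argyi|next field|";
        byte[] data = sample.getBytes(StandardCharsets.UTF_8);
        for (String field : readFields(new ByteArrayInputStream(data))) {
            System.out.println("[" + field + "]");
        }
    }
}
```

    If the characters still look wrong on screen, the console encoding may be
    the culprit rather than the parsing, so it is worth checking the decoded
    strings directly.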
    sso, Apr 24, 2009
    #5
  6. sso

    sso Guest

    On Apr 24, 12:52am, "Peter Duniho" <>
    wrote:
    > On Thu, 23 Apr 2009 21:30:00 -0700, Mark Space <>
    > wrote:
    >
    > > Knute Johnson wrote:
    > >> sso wrote:
    > >>> Any suggestions as to the best way to parse through a csv file that
    > >>> has carriage returns in some of the fields? Its in an ods file that I
    > >>> save to csv. I'm lost....

    >
    > >> Is the CRLF a delimiter? In any case, you can use the Scanner class
    > >> to do that sort of thing.

    >
    > > I think he's say the CRLF is part of the data, and the program has to
    > > distinguish between LF as part of a field, and LF when it ends a line.

    >
    > Which begs the question, how does he differentiate between a CRLF
    > terminating a line of input, and one that's in a field.
    >
    > The most obvious answer is that the CRLF is quoted. But whatever the
    > indicator, I'd guess that a suitable regex could distinguish the
    > individual fields without picking up the CRLF as a terminator for the line
    > (you'd have to disable the end-of-line processing for the regex, of
    > course).
    >
    > > Not really easy with Scanner. I can't think of a good way to do it off
    > > hand...

    >
    > I'm not familiar with Scanner, but it looks to me as though you can use a
    > custom regex to tell it how to break apart the input line. Assuming he
    > can come up with an appropriate regex to do the job, it should be
    > relatively easy to move from that to using Scanner for the actual input
    > processing.
    >
    > As far as the exact regex goes, well...that'd be for someone else to
    > figure out. I'm not good enough with regular expressions to come up with
    > that easily, and don't have the time or interest to work it out myself. :)
    >
    > Pete


    I could regex it, but there are about 400 records in the file.
    Perhaps that would be cumbersome? As far as the LF being a delimiter,
    well it is part of the data, but the records always have the same
    number of fields. I will try this CSVReader class. :)
    sso, Apr 24, 2009
    #6
  7. sso

    Mark Space Guest

    Peter Duniho wrote:

    >
    > As far as the exact regex goes, well...that'd be for someone else to
    > figure out.



    That's what I'm saying. Sure, as long as one can be determined. I
    can't. I saw the regex delimiters on Scanner, I just can't come up with
    an actual regex to make it work.

    I'm at least somewhat interested, because CSV is common and handy.
    There are third-party libraries (like Roedy's) but it would be nice if I
    didn't have to download any extra jar files. However, that may not be
    possible.
    Mark Space, Apr 24, 2009
    #7
  8. sso

    Mayeul Guest

    sso wrote:
    > On Apr 24, 12:52 am, "Peter Duniho" <>
    > wrote:
    >> On Thu, 23 Apr 2009 21:30:00 -0700, Mark Space <>
    >> wrote:
    >>
    >>> Knute Johnson wrote:
    >>>> sso wrote:
    >>>>> Any suggestions as to the best way to parse through a csv file that
    >>>>> has carriage returns in some of the fields? Its in an ods file that I
    >>>>> save to csv. I'm lost....
    >>>> Is the CRLF a delimiter? In any case, you can use the Scanner class
    >>>> to do that sort of thing.
    >>> I think he's say the CRLF is part of the data, and the program has to
    >>> distinguish between LF as part of a field, and LF when it ends a line.

    >> Which begs the question, how does he differentiate between a CRLF
    >> terminating a line of input, and one that's in a field.
    >>
    >> The most obvious answer is that the CRLF is quoted. But whatever the
    >> indicator, I'd guess that a suitable regex could distinguish the
    >> individual fields without picking up the CRLF as a terminator for the line
    >> (you'd have to disable the end-of-line processing for the regex, of
    >> course).
    >>
    >>> Not really easy with Scanner. I can't think of a good way to do it off
    >>> hand...

    >> I'm not familiar with Scanner, but it looks to me as though you can use a
    >> custom regex to tell it how to break apart the input line. Assuming he
    >> can come up with an appropriate regex to do the job, it should be
    >> relatively easy to move from that to using Scanner for the actual input
    >> processing.
    >>
    >> As far as the exact regex goes, well...that'd be for someone else to
    >> figure out. I'm not good enough with regular expressions to come up with
    >> that easily, and don't have the time or interest to work it out myself. :)
    >>
    >> Pete

    >
    > I could regex it, but there are about 400 records in the file.
    > Perhaps that would be cumbersome? As far as the LF being a delimiter,
    > well it is part of the data, but the records always have the same
    > number of fields. I will try this CSVreader class. :)


    Obvious question:

    Is the last field of each record terminated with a delimiter, or does it
    guarantee it does _not_ contain a CRLF?
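
    The answer matters because it decides whether counting fields is enough.
    A minimal sketch, assuming every field including the last one ends with
    '|' and every record has the same number of fields (both are assumptions
    about the file; the class and method names are hypothetical):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class FixedFieldCountSketch {

    // Splits on '|' only and groups every fieldsPerRecord tokens into one
    // record, so line breaks inside fields are kept as data.
    static List<String[]> readRecords(Readable in, int fieldsPerRecord) {
        List<String[]> records = new ArrayList<String[]>();
        Scanner scanner = new Scanner(in);
        scanner.useDelimiter("\\|");
        String[] current = new String[fieldsPerRecord];
        int filled = 0;
        while (scanner.hasNext()) {
            String field = scanner.next();
            if (filled == 0) {
                // The line break that ended the previous record arrives
                // glued to the front of the next record's first field.
                field = field.replaceFirst("^\\r?\\n", "");
            }
            current[filled++] = field;
            if (filled == fieldsPerRecord) {
                records.add(current);
                current = new String[fieldsPerRecord];
                filled = 0;
            }
        }
        scanner.close();
        // An incomplete trailing group (for example the final line break)
        // is dropped.
        return records;
    }

    public static void main(String[] args) {
        String text = "AI YE\r\nFolium|leaf|warm|\r\nBAI ZHI|root|hot|\r\n";
        for (String[] record : readRecords(new StringReader(text), 3)) {
            System.out.println(record.length + " fields, first = " + record[0]);
        }
    }
}
```

    If the last field is not terminated by the delimiter and may itself
    contain a CRLF, this approach cannot tell where a record ends, and some
    form of quoting is needed instead.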

    --
    Mayeul
    Mayeul, Apr 24, 2009
    #8
  9. sso

    Stefan Ram Guest

    sso <> writes:
    >Any suggestions as to the best way to parse through a csv file that
    >has carriage returns in some of the fields? Its in an ods file that I
    >save to csv. I'm lost....


    To write a parser, I need a specification of the language
    used.

    The name CSV is not such a specification, because there are
    several different languages in the world that are referred to
    by CSV.

    Given a specification, writing a parser often is
    straightforward (for those having learned how to write
    parsers).

    (There are some languages, for example, C++, that are
    difficult to parse, even with proper education and a proper
    specification. But most languages named CSV should be easy
    to parse.)
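
    As an illustration of how small such a parser can be once a dialect is
    fixed, here is a minimal sketch. It assumes an RFC 4180-style dialect:
    fields may be wrapped in double quotes, a doubled quote inside a quoted
    field stands for one literal quote, and line breaks inside quotes are
    data. Whether the original poster's file follows these rules is not
    stated in the thread; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsvSketch {

    static List<List<String>> parse(String text, char separator) {
        List<List<String>> records = new ArrayList<List<String>>();
        List<String> record = new ArrayList<String>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false; // true once the current record has content

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote: one literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // CR and LF are plain data here
                }
            } else if (c == '"') {
                inQuotes = true;
                pending = true;
            } else if (c == separator) {
                record.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (c == '\r' || c == '\n') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;                   // treat CRLF as a single terminator
                }
                if (pending) {             // blank lines are skipped
                    record.add(field.toString());
                    field.setLength(0);
                    records.add(record);
                    record = new ArrayList<String>();
                    pending = false;
                }
            } else {
                field.append(c);
                pending = true;
            }
        }
        if (pending) {                     // last record without a line break
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "name,notes\r\nAI YE,\"line one\r\nline two\"\r\n";
        for (List<String> r : parse(text, ',')) {
            System.out.println(r.size() + " fields: " + r);
        }
    }
}
```

    The separator is a parameter so the same loop can be tried with '|'. A
    maintained library will cover more edge cases than this sketch does.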
    Stefan Ram, Apr 24, 2009
    #9
  10. sso

    Lew Guest

    Mark Space wrote:
    > I'm at least somewhat interested, because CSV is common and handy. There
    > are third part libraries (like Roedy's) but it would be nice if I didn't
    > have to download any extra jar files. However, that may not be possible.


    So it's nicer to reinvent the wheel than to use someone else's tried-and-true
    solution?

    --
    Lew
    Lew, Apr 24, 2009
    #10
  11. Lew <> wrote:
    > So it's nicer to reinvent the wheel than to use someone else's tried-and-true
    > solution?


    I'd say it depends on the wheel...

    If you need a pneumatic tire on ball bearings, then using/buying someone
    else's solution appears reasonable. If a circle cut out from
    cardboard and a centric hole punched out with a pencil suffices,
    then I'd go for that, and do it myself.
    Andreas Leitgeb, Apr 24, 2009
    #11
  12. In article <gssc7m$ipa$>, Lew <>
    wrote:

    > Mark Space wrote:
    > > I'm at least somewhat interested, because CSV is common and handy.
    > > There are third part libraries (like Roedy's) but it would be nice
    > > if I didn't have to download any extra jar files. However, that
    > > may not be possible.

    >
    > So it's nicer to reinvent the wheel than to use someone else's
    > tried-and-true solution?


    That difficult decision would rest on a host of factors, possibly
    including an assessment of the license terms. IANAL.

    I have had positive experience with the CSV utilities that are part of
    the H2 Database: <http://www.h2database.com/>. I was intrigued to see
    CSV support in the PostgreSQL COPY command: <http://www.postgresql.org/>

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
    John B. Matthews, Apr 24, 2009
    #12
  13. On Fri, 24 Apr 2009 09:43:09 -0400, John B. Matthews wrote:

    >
    > I have had positive experience with the CSV utilities that are part of
    > the H2 Database: <http://www.h2database.com/>. I was intrigued to see
    > CSV support in the PostgreSQL COPY command: <http://www.postgresql.org/>
    >

    I've yet to use an RDBMS that didn't offer some variation on CSV import/
    export capabilities.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
    Martin Gregorie, Apr 24, 2009
    #13
  14. sso

    Mark Space Guest

    Lew wrote:
    > Mark Space wrote:
    >> I'm at least somewhat interested, because CSV is common and handy.
    >> There are third part libraries (like Roedy's) but it would be nice if
    >> I didn't have to download any extra jar files. However, that may not
    >> be possible.

    >
    > So it's nicer to reinvent the wheel than to use someone else's
    > tried-and-true solution?
    >


    I don't follow. Neither using Scanner properly nor using a third-party
    jar seems to be reinventing the wheel to me. Can you clarify?
    Mark Space, Apr 24, 2009
    #14
  15. sso

    sso Guest

    On Apr 24, 12:51 am, sso <> wrote:
    > On Apr 24, 12:03 am, Knute Johnson <>
    > wrote:
    >
    >
    >
    > > sso wrote:
    > > > Any suggestions as to the best way to parse through a csv file that
    > > > has carriage returns in some of the fields? Its in an ods file that I
    > > > save to csv. I'm lost....

    >
    > > Is the CRLF a delimiter? In any case, you can use the Scanner class to
    > > do that sort of thing.

    >
    > > --

    >
    > > Knute Johnson
    > > email s/nospam/knute2009/


    >
    > This is definitely working better. Thanks!
    >
    > Scanner doesn't seem to like my Chinese characters. | is the delim.
    > Example:
    >
    > AI YE
    > 艾葉
    > Folium Artemisiae Argyi
    > Wormwood/ MOXA|


    Dumb question: How do I get NetBeans to recognize the import
    statement for opencsv? I think this would be obvious, but I can't seem
    to find an answer.
    sso, Apr 25, 2009
    #15
  16. In article
    <>,
    sso <> wrote:

    [...]

    > Dumb question: How do I get netbeans to recognize the import
    > statement for opencsv. I think this would be obvious, but I can't
    > seem to find an answer.


    Tools > Libraries > New Library.

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
    John B. Matthews, Apr 25, 2009
    #16
  17. sso

    Lew Guest

    Mark Space wrote:
    > Lew wrote:
    >> Mark Space wrote:
    >>> I'm at least somewhat interested, because CSV is common and handy.
    >>> There are third part libraries (like Roedy's) but it would be nice if
    >>> I didn't have to download any extra jar files. However, that may not
    >>> be possible.

    >>
    >> So it's nicer to reinvent the wheel than to use someone else's
    >> tried-and-true solution?
    >>

    >
    > I don't follow. Neither using Scanner properly or using a third party
    > jar seem to be reinventing the wheel to me. Can you clarify?


    The comment "it would be nice if I didn't have to download any extra jar
    files" seemed like it decried the use of third-party libraries (like Roedy's)
    in favor of writing one's own code. It seemed to me that it makes more sense
    to use third-party libraries (like Roedy's), which of course means having to
    download JAR files. Unless you meant that there's a better way to acquire
    those libraries?

    --
    Lew
    Lew, Apr 25, 2009
    #17
  18. sso

    Mark Space Guest

    Lew wrote:

    > The comment "it would be nice if I didn't have to download any extra jar
    > files" seemed like it decried the use of third-party libraries (like
    > Roedy's) in favor of writing one's own code. It seemed to me that it
    > makes more sense to use third-party libraries (like Roedy's), which of
    > course means having to download JAR files. Unless you meant that
    > there's a better way to acquire those libraries?



    No, I meant using a built-in class (like Scanner) was the preferred
    option. Third-party libraries are choice number two. Writing your own
    comes in third.
    Mark Space, Apr 25, 2009
    #18
  19. sso

    Lew Guest

    Mark Space wrote:
    > Lew wrote:
    >
    >> The comment "it would be nice if I didn't have to download any extra
    >> jar files" seemed like it decried the use of third-party libraries
    >> (like Roedy's) in favor of writing one's own code. It seemed to me
    >> that it makes more sense to use third-party libraries (like Roedy's),
    >> which of course means having to download JAR files. Unless you meant
    >> that there's a better way to acquire those libraries?

    >
    >
    > No, I meant using a built-in class (like Scanner) was the preferred
    > option. Third partly libraries are choice number two. Writing your own
    > comes in third.


    I see. But Scanner doesn't have a complete CSV solution, unless I misread the
    Javadocs. So use of Scanner becomes door number 3 - roll your own. I should
    think it would be tricky with Scanner to get things just right, for example,
    to deal with the OP's original problem:
    > parse through a csv file that has carriage returns in some of the fields


    --
    Lew
    Lew, Apr 25, 2009
    #19
  20. sso

    Roedy Green Guest

    On 24 Apr 2009 09:50:06 GMT, -berlin.de (Stefan Ram)
    wrote, quoted or indirectly quoted someone who said :

    > (There are some languages, for example, C++, that are
    > difficult to parse, even with proper education and a proper
    > specification. But most languages named CSV should be easy
    > to parse.)


    There are many implementations out there. You don't have to roll your
    own unless you find writing finite state automata and parsers fun.

    My own has configurable magic letters for comment intro, separator,
    quote char and a few other variations.

    It turns out that making code configurable and using enums work at cross
    purposes. You can't create an object containing a customised enum.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Creationists tell me all the dinosaurs died out because there was not room enough for them in the ark. However, most dinosaurs were quite small.
    Roedy Green, Apr 26, 2009
    #20
