CSV Parsing algorithms in Java

Discussion in 'Java' started by Jeffrey Spoon, Nov 3, 2006.

  1. Hello, has anybody seen well-known/good practice CSV parsing algorithms
    in Java? I've been googling about but can't see anything suitable so
    far. I'm not interested in using library functions, rather implementing
    the algorithm myself (or at least learning how to).

    Any pointers appreciated, thanks.



    --
    Jeffrey Spoon
     
    Jeffrey Spoon, Nov 3, 2006
    #1

  2. David Segall Guest

    Jeffrey Spoon <> wrote:

    >
    >
    >
    >Hello, has anybody seen well-known/good practice CSV parsing algorithms
    >in Java? I've been googling about but can't see anything suitable so
    >far. I'm not interested in using library functions, rather implementing
    >the algorithm myself (or at least learning how to).
    >
    >Any pointers appreciated, thanks.

    Roedy Green has assembled some useful information on this topic.
    <http://mindprod.com/jgloss/csv.html>
     
    David Segall, Nov 3, 2006
    #2

  3. JanTheKing Guest

    You may get some ideas from the Apache POI project.

    On Nov 3, 9:32 pm, David Segall <> wrote:
    > Jeffrey Spoon <> wrote:
    >
    > >Hello, has anybody seen well-known/good practice CSV parsing algorithms
    > >in Java? I've been googling about but can't see anything suitable so
    > >far. I'm not interested in using library functions, rather implementing
    > >the algorithm myself (or at least learning how to).

    >
    > >Any pointers appreciated, thanks.
    >
    > Roedy Green has assembled some useful information on this topic.

    > <http://mindprod.com/jgloss/csv.html>
     
    JanTheKing, Nov 4, 2006
    #3
  4. Jeffrey Spoon wrote:

    > Hello, has anybody seen well-known/good practice CSV parsing algorithms
    > in Java? I've been googling about but can't see anything suitable so
    > far. I'm not interested in using library functions, rather implementing
    > the algorithm myself (or at least learning how to).
    >
    > Any pointers appreciated, thanks.


    take a look at my project
    http://csvtosql.sourceforge.net

    --
    Davide Consonni <> http://csvtosql.sourceforge.net
    Linux: basta con le clessidre sullo schermo! -- By Zuse
     
    Davide Consonni, Nov 4, 2006
    #4
  5. In message <>, David Segall
    <> writes
    >Jeffrey Spoon <> wrote:
    >
    >>
    >>
    >>
    >>Hello, has anybody seen well-known/good practice CSV parsing algorithms
    >>in Java? I've been googling about but can't see anything suitable so
    >>far. I'm not interested in using library functions, rather implementing
    >>the algorithm myself (or at least learning how to).
    >>
    >>Any pointers appreciated, thanks.

    >Roedy Green has assembled some useful information on this topic.
    ><http://mindprod.com/jgloss/csv.html>



    Thanks, I had a look. The reason I'm asking is that I had a graduate
    role interview and they asked me to write one as a question. I
    didn't know how to anyway, but looking at Roedy's, just the get() method
    is 200 lines. Am I really expected to know this stuff off by
    heart?


    Thanks to the others who made suggestions as well; I'll get around to them.



    --
    Jeffrey Spoon
     
    Jeffrey Spoon, Nov 4, 2006
    #5
  6. Stefan Ram Guest

    Jeffrey Spoon <> writes:
    >Thanks, I had a look. The reason I'm asking is because I had a graduate
    >role interview and they asked this as a question, as in to write one. I
    >didn't know how to anyway, but looking at Roedy's, just the get() method
    >is 200 hundred lines, am I really expected to know this stuff off by
    >heart?


    The correct answer would have been:

    »There are dozens of different formal languages, all
    referred to by the name of "CSV". Some differ only by
    minor details, but these are important, when one wants to
    write a parser. So, I would like to invite you to join me
    in a process to figure out the exact specifications of the
    language you want me to parse or - if available - please
    give me a language specification«.

    Once all such questions had been cleared up, I would have
    been able to write a parser from scratch, provided the interviewer
    had the patience to wait for me to finish it. Having the Java
    SE API documentation at hand might be helpful during this.
     
    Stefan Ram, Nov 4, 2006
    #6
  7. In message <-berlin.de>, Stefan Ram
    <-berlin.de> writes
    >Jeffrey Spoon <> writes:
    >>Thanks, I had a look. The reason I'm asking is because I had a graduate
    >>role interview and they asked this as a question, as in to write one. I
    >>didn't know how to anyway, but looking at Roedy's, just the get() method
    >>is 200 hundred lines, am I really expected to know this stuff off by
    >>heart?

    >
    > The correct answer would have been:
    >
    > ›There are dozens of different formal languages, all
    > referred to by the name of "CSV". Some differ only by
    > minor details, but these are important, when one wants to
    > write a parser. So, I would like to invite you to join me
    > in a process to figure out the exact specifications of the
    > language you want me to parse or - if available - please
    > give me a language specification‹.
    >
    > After all such questions would have been cleared, I would have
    > been able to write a parser from scratch if the interviewer
    > would have the patience to wait for me to finish it. The Java
    > SE API documentation at hand might be helpful during this.
    >


    So that's a no then? :)

    They did specify that some of the values may contain double quotes.
    I had two other questions to do as well, in 30 minutes. One was a fairly
    advanced SQL question (for me anyway) and the other was easy enough,
    about client/server stuff. They left me to write the answers down with
    no references other than the question sheet. Oh, and there were some
    other multiple choice questions, but they were fairly straightforward.




    --
    Jeffrey Spoon
     
    Jeffrey Spoon, Nov 4, 2006
    #7
  8. Stefan Ram Guest

    Jeffrey Spoon <> writes:
    >So that's a no then? :)
    >They did specify that some of the values may contain double quotes.
    >I had two other questions to do as well, in 30 minutes.


    Assuming that there are only about 10 minutes to write such a
    parser on paper without any reference, it is difficult, indeed.

    Let me see what I can write in 10 minutes without a
    reference:

    // 2006-11-04T17:48:18+01:00

    public class CsvParser
    { private CsvScanner tokenSource;
    public CsvParser( final CsvScanner tokenSource )
    { this.tokenSource = tokenSource; }

    // 2006-11-04T17:50:09+01:00

    public void parseAll()
    { while( tokenSource.isMoreInSource() )parseLine(); }

    // 2006-11-04T17:51:26+01:00

    public void parseLine()
    { while( tokenSource.isMoreInLine() )parseValue(); }

    // 2006-11-04T17:54:43+01:00

    public void parseValue()
    { final Token token = tokenSource.getToken();
    token.to( new TokenProcessor()
    { public void processNumericStart(){ /* todo */ }
    public void processTextStart(){ /* todo */ }
    /* here my time limit was reached */

    // 2006-11-04T17:58:31+01:00

    Sometimes an interviewer might give you an "impossible"
    task just to see how you cope with that.
     
    Stefan Ram, Nov 4, 2006
    #8
  9. Simon Brooke Guest

    in message <>, Jeffrey Spoon
    ('') wrote:

    > In message <>, David Segall
    > <> writes
    >>Jeffrey Spoon <> wrote:
    >>
    >>>Hello, has anybody seen well-known/good practice CSV parsing algorithms
    >>>in Java? I've been googling about but can't see anything suitable so
    >>>far. I'm not interested in using library functions, rather implementing
    >>>the algorithm myself (or at least learning how to).
    >>>
    >>>Any pointers appreciated, thanks.

    >>Roedy Green has assembled some useful information on this topic.
    >><http://mindprod.com/jgloss/csv.html>

    >
    > Thanks, I had a look. The reason I'm asking is because I had a graduate
    > role interview and they asked this as a question, as in to write one. I
    > didn't know how to anyway, but looking at Roedy's, just the get() method
    > is 200 hundred lines, am I really expected to know this stuff off by
    > heart?
    >
    > Thanks to the others who suggested as well, I'll get around to them.


    Heavens, writing a CSV parser is trivial. It's simply a case of a
    StringTokenizer in a for loop:

    // needs java.io.* and java.util.StringTokenizer
    public ResultClass parse( InputStream in, String separatorChars)
            throws IOException
    {
        ResultClass result = new ResultClass();
        BufferedReader buffy =
                new BufferedReader( new InputStreamReader( in));

        for ( String line = buffy.readLine(); line != null;
                line = buffy.readLine())
        {
            StringTokenizer tok =
                    new StringTokenizer( line, separatorChars);

            while ( tok.hasMoreTokens())
            {
                String value = tok.nextToken();
                // do something with result and value
            }
        }
        /* consider (and document) whether it's your or the caller's
         * responsibility to close the stream; since you were passed the
         * stream I suggest it's the caller's */

        return result;
    }

    As to what that ResultClass object should be, if the first line in your CSV
    may be column headers and each value in the first row is distinct then
    probably what you want is a vector of maps where the keys of the maps are
    the corresponding values from the first line; otherwise I'd probably just
    return a vector of vectors.
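    As a rough sketch of the vector-of-maps idea, assuming the rows have
    already been split into fields: the names HeaderRows and toMaps are made
    up for illustration, and List/Map stand in for the Vector mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderRows {
    /**
     * Turns already-split rows into one map per data row, keyed by the
     * values of the first (header) row. Assumes header values are distinct.
     */
    static List<Map<String, String>> toMaps(List<List<String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            // LinkedHashMap keeps the columns in header order.
            Map<String, String> record = new LinkedHashMap<>();
            // Short rows leave trailing keys absent; extra values are ignored.
            for (int i = 0; i < header.size() && i < row.size(); i++) {
                record.put(header.get(i), row.get(i));
            }
            result.add(record);
        }
        return result;
    }
}
```

    How to treat short or overlong rows is a choice; this sketch silently
    tolerates both, which may or may not be what you want.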

    Obviously you may not want to schlurp a whole CSV file into core memory at
    one go; it may be better to produce a parser to which you can add
    callbacks/listeners for the fields or patterns you are interested in. But
    the general pattern is as given.

    --
    (Simon Brooke) http://www.jasmine.org.uk/~simon/
    ;; Let's have a moment of silence for all those Americans who are stuck
    ;; in traffic on their way to the gym to ride the stationary bicycle.
    ;; Rep. Earl Blumenauer (Dem, OR)
     
    Simon Brooke, Nov 4, 2006
    #9
  10. Karl Uppiano Guest

    "Simon Brooke" <> wrote in message
    news:...
    > in message <>, Jeffrey Spoon
    > ('') wrote:
    >
    >> In message <>, David Segall
    >> <> writes
    >>>Jeffrey Spoon <> wrote:
    >>>
    >>>>Hello, has anybody seen well-known/good practice CSV parsing algorithms
    >>>>in Java? I've been googling about but can't see anything suitable so
    >>>>far. I'm not interested in using library functions, rather implementing
    >>>>the algorithm myself (or at least learning how to).
    >>>>
    >>>>Any pointers appreciated, thanks.
    >>>Roedy Green has assembled some useful information on this topic.
    >>><http://mindprod.com/jgloss/csv.html>

    >>
    >> Thanks, I had a look. The reason I'm asking is because I had a graduate
    >> role interview and they asked this as a question, as in to write one. I
    >> didn't know how to anyway, but looking at Roedy's, just the get() method
    >> is 200 hundred lines, am I really expected to know this stuff off by
    >> heart?
    >>
    >> Thanks to the others who suggested as well, I'll get around to them.

    >
    > Heavens, writing a CSV parser is trivial. It's simply a case of a
    > StringTokenizer in a for loop:
    >
    > public ResultClass parse( InputStream in, String separatorChars)
    > throws IOException
    > {
    > ResultClass result = new ResultClass();
    > BufferedReader buffy =
    > new BufferedReader( new InputStreamReader( in));
    >
    > for ( String line = buffy.readLine(); line != null;
    > line = buffy.readLine())
    > {
    > StringTokenizer tok =
    > new StringTokenizer( line, separatorChars);
    >
    > while ( tok.hasMoreTokens())
    > {
    > // do something with result and tok.nextToken()
    > }
    > }
    > /* consider (and document) whether it's your or the caller's
    > * responsibility to close the stream; since you were passed the
    > * stream I suggest it's the caller's */
    >
    > return result;
    > }
    >
    > As to what that ResultClass object should be, if the first line in your
    > CSV
    > may be column headers and each value in the first row is distinct then
    > probably what you want is a vector of maps where the keys of the maps are
    > the corresponding values from the first line; otherwise I'd probably just
    > return a vector of vectors.
    >
    > Obviously you may not want to schlurp a whole CSV file into core memory at
    > one go; it may be better to produce a parser to which you can add
    > callbacks/listeners for the fields or patterns you are interested in. But
    > the general pattern is as given.
    >
    > --
    > (Simon Brooke) http://www.jasmine.org.uk/~simon/
    > ;; Let's have a moment of silence for all those Americans who are stuck
    > ;; in traffic on their way to the gym to ride the stationary bicycle.
    > ;; Rep. Earl Blumenauer (Dem, OR)



    or this:

    String[] columnData = rowData.split("[,]");
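    A quick check of where that one-liner stops working, as a sketch only:
    String.split is fine for rows with no quoting, but it cuts a quoted field
    at its embedded comma and drops trailing empty columns. The class name
    SplitDemo and the sample rows are made up for illustration.

```java
import java.util.Arrays;

class SplitDemo {
    /** Splits a row on every comma, exactly as the one-liner above does. */
    static String[] naiveSplit(String rowData) {
        return rowData.split("[,]");
    }

    public static void main(String[] args) {
        // Unquoted row: three pieces, as intended.
        System.out.println(Arrays.toString(naiveSplit("a,b,c")));
        // Quoted row: meant as two fields, but split() produces three pieces.
        System.out.println(Arrays.toString(naiveSplit("\"Brooke, Simon\",21 Elm Street")));
        // Trailing empty columns are discarded by split(), leaving two pieces.
        System.out.println(Arrays.toString(naiveSplit("a,b,,")));
    }
}
```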
     
    Karl Uppiano, Nov 4, 2006
    #10
  11. Simon Brooke wrote:
    > Heavens, writing a CSV parser is trivial. It's simply a case of a
    > StringTokenizer in a for loop:


    Hmmm.

    In the real world programmers usually have to deal with
    item separators (typically , or ;) inside strings (typically
    delimited by ""), and with a convention for string delimiters
    inside strings.

    Arne
     
    Arne Vajhøj, Nov 4, 2006
    #11
  12. Eric Sosman Guest

    Simon Brooke wrote:
    >
    > Heavens, writing a CSV parser is trivial. It's simply a case of a
    > StringTokenizer in a for loop:
    > [...]


    There is no one official "CSV format," but even the simple
    version described at http://www.wotsit.org/ is not parseable by
    a mere StringTokenizer (which the JavaDoc calls a "legacy class"
    whose use in new code is "discouraged," by the way).

    Brooke, 21 Elm Street
    // space before '2' should vanish but embedded spaces
    // should remain

    "Brooke, Simon" , 21 Elm Street
    // first comma does not end a field, quotes disappear,
    // both spaces surrounding second comma disappear

    "Brooke, Simon" , """The Beeches"", Herts"
    // doubled quotes become singles, only one of the three
    // commas is a field separator, more disappearing and
    // retained spaces

    "Brooke, Simon" , "21 Elm Street
    Apartment 3B"
    // embedded newline in second field

    Parsing CSV -- even allowing for some variations beyond the
    wotsit description -- is not difficult, but not trivial. My own
    CSVReader class runs to 376 lines, including JavaDoc. (It could
    probably be tightened a bit; I wrote it as an exercise when I was
    new to Java and would likely do things differently nowadays.)
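    A minimal sketch of a hand-written scanner that copes with the four
    examples above, assuming one particular dialect: comma separator,
    double-quote delimiter, doubled quotes inside quoted fields, and blanks
    around fields dropped. This is not the CSVReader mentioned above; the
    class name SimpleCsv and its parse method are hypothetical, and other
    CSV variants will need different rules.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleCsv {
    /** True at a comma or line break, i.e. where an unquoted field must stop. */
    private static boolean endsField(char c) {
        return c == ',' || c == '\n' || c == '\r';
    }

    /** Parses the whole text into records, each a list of field values. */
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        int n = text.length();
        int pos = 0;
        while (pos < n) {
            List<String> record = new ArrayList<>();
            boolean moreFields = true;
            while (moreFields) {
                // Blanks before a field are not part of its value.
                while (pos < n && (text.charAt(pos) == ' ' || text.charAt(pos) == '\t')) {
                    pos++;
                }
                if (pos < n && text.charAt(pos) == '"') {
                    StringBuilder field = new StringBuilder();
                    pos++; // skip the opening quote
                    while (pos < n) {
                        char c = text.charAt(pos++);
                        if (c != '"') {
                            field.append(c); // commas and newlines included
                        } else if (pos < n && text.charAt(pos) == '"') {
                            field.append('"'); // doubled quote -> one literal quote
                            pos++;
                        } else {
                            break; // closing quote
                        }
                    }
                    // Skip whatever sits between the closing quote and the separator.
                    while (pos < n && !endsField(text.charAt(pos))) {
                        pos++;
                    }
                    record.add(field.toString());
                } else {
                    int start = pos;
                    while (pos < n && !endsField(text.charAt(pos))) {
                        pos++;
                    }
                    // trim() removes the blanks after an unquoted value.
                    record.add(text.substring(start, pos).trim());
                }
                moreFields = pos < n && text.charAt(pos) == ',';
                if (moreFields) {
                    pos++; // step over the comma
                }
            }
            // Accept \r\n, \n or \r as the record terminator.
            if (pos < n && text.charAt(pos) == '\r') pos++;
            if (pos < n && text.charAt(pos) == '\n') pos++;
            records.add(record);
        }
        return records;
    }
}
```

    The sketch reads the whole input as one String and does no error
    reporting: an unterminated quote simply runs to the end of the text, and
    an empty line yields a record with one empty field.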

    --
    Eric Sosman
     
    Eric Sosman, Nov 4, 2006
    #12
  13. Jeffrey Spoon wrote:


    > Hello, has anybody seen well-known/good practice CSV parsing algorithms
    > in Java? I've been googling about but can't see anything suitable so
    > far. I'm not interested in using library functions, rather implementing
    > the algorithm myself (or at least learning how to).
    >
    > Any pointers appreciated, thanks.


    use regex, watch this:
    http://tinyurl.com/ska4z

    --
    Davide Consonni <> http://csvtosql.sourceforge.net
    "Avremo un bambino. Sara' il mio regalo di Natale." "Ma io mi sarei
    accontentato di una cravatta!" -- Woody Allen, da "Prendi i soldi e scappa"
     
    Davide Consonni, Nov 4, 2006
    #13
  14. Chris Uppal Guest

    Jeffrey Spoon wrote:

    > Hello, has anybody seen well-known/good practice CSV parsing algorithms
    > in Java? I've been googling about but can't see anything suitable so
    > far. I'm not interested in using library functions, rather implementing
    > the algorithm myself (or at least learning how to).


    There is no real specification for CSV. Some places to look for information on
    what people /think/ CSV files are like:

    http://www.ietf.org/rfc/rfc4180.txt
    http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
    http://www.pobox.com/~qed/bcsv.zip

    Note: I'm pretty sure that the RFC's suggested handling of spaces around fields
    is wrong -- everybody else seems to think that leading/trailing spaces are
    ignored.

    -- chris
     
    Chris Uppal, Nov 5, 2006
    #14
  15. Chris Uppal Guest

    Simon Brooke wrote:

    > for ( String line = buffy.readLine(); line != null;
    > line = buffy.readLine)


    CSV fields (and hence CSV records) may span more than one line.


    > StringTokenizer tok =
    > new StringTokenizer( line,
    > separatorChars);


    Nothing based on naive use of pattern matching can possibly parse CSV since
    fields may contain separator tokens. Indeed a field may contain an entire
    CSV-format sub-file (and so on recursively).
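    To illustrate how a field can carry separators, or even a whole CSV
    file, here is a sketch of the writer side under an assumed convention:
    wrap the field in double quotes and double any embedded quote, as
    described in RFC 4180. CsvQuote and quote are illustrative names only.

```java
class CsvQuote {
    /** Quotes one field so that separators, quotes and line breaks survive. */
    static String quote(String field) {
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0
                || field.startsWith(" ")
                || field.endsWith(" ");
        if (!needsQuotes) {
            return field;
        }
        // Double every embedded quote, then wrap the whole value in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // A two-line CSV document stored as a single field of an outer record.
        String inner = "x,\"y,z\"\n1,2";
        System.out.println("id," + quote(inner));
    }
}
```

    Because the inner quotes are doubled on the way in, a reader that undoes
    the doubling gets the inner file back unchanged, which is why no
    separator-based split can find the real field boundaries.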

    If /I/ had set this exercise then my (hidden) purpose would have been to filter
    out candidates who don't realise that this is a reasonably complex parsing
    task, and not solvable with simple minded tools like regexps[*].

    The probability (I think) is that the OP's interviewer was someone who would
    have failed my test ;-)

    Mind you, I wouldn't have set this task -- too challenging for the context.
    Unless, perhaps, I were interviewing for very senior engineers and I was
    expecting them to show that they could think realistically under pressure by
    answering "that's too complicated to do here and now".

    -- chris

    ([*] Using regexps is nearly always a sign that the program is broken -- there
    are not many tasks for which they are (part of) the correct solution.)
     
    Chris Uppal, Nov 5, 2006
    #15
  16. Stefan Ram wrote:
    > Jeffrey Spoon <> writes:
    >> So that's a no then? :)
    >> They did specify that some of the values may contain double quotes.
    >> I had two other questions to do as well, in 30 minutes.

    >
    > Assuming that there are only about 10 minutes to write such a
    > parser on paper without any reference, it is difficult, indeed.
    >
    > Let me try to see, what I can write in 10 minutes without a
    > reference
    >
    > // 2006-11-04T17:48:18+01:00
    >
    > public class CsvParser
    > { private CsvScanner tokenSource;
    > public CsvParser( final CsvScanner tokenSource )
    > { this.tokenSource = tokenSource; }
    >
    > // 2006-11-04T17:50:09+01:00
    >
    > public void parseAll()
    > { while( tokenSource.isMoreInSource() )parseLine(); }
    >
    > // 2006-11-04T17:51:26+01:00
    >
    > public void parseLine()
    > { while( tokenSource.isMoreInLine() )parseValue(); }
    >
    > // 2006-11-04T17:54:43+01:00
    >
    > public void parseValue()
    > { final Token token = tokenSource.getToken();
    > token.to( new TokenProcessor()
    > { public void processNumericStart(){ /* todo */ }
    > public void processTextStart(){ /* todo */ }
    > /* here my time limit was reached */
    >
    > // 2006-11-04T17:58:31+01:00
    >
    > Sometimes an interviewer might give you an "impossible"
    > task just to see how you cope with that.
    >

    Clever clogs solution:

    - write down the BNF notation for the CSV syntax (about 6 statements)
    - say you're going to feed that through a parser generator, e.g. Coco/R
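    For comparison, a hand-written equivalent of that idea, as a sketch
    only: the grammar in the comment is one assumed CSV dialect (not an
    official one, and not Coco/R input syntax), and each production becomes
    one method of a small recursive-descent parser. The name CsvDescent is
    invented for this example, and "\n" is assumed as the only line ending.

```java
import java.util.ArrayList;
import java.util.List;

/*
 * Informal BNF for the assumed dialect:
 *   file   ::= record { NEWLINE record }
 *   record ::= field { "," field }
 *   field  ::= quoted | bare
 *   quoted ::= '"' { any char except '"' | '""' } '"'
 *   bare   ::= { any char except ',' '"' NEWLINE }
 */
class CsvDescent {
    private final String src;
    private int pos = 0;

    CsvDescent(String src) {
        this.src = src;
    }

    static List<List<String>> parse(String src) {
        return new CsvDescent(src).file();
    }

    /** True if the next unread character is c. */
    private boolean at(char c) {
        return pos < src.length() && src.charAt(pos) == c;
    }

    private List<List<String>> file() {
        List<List<String>> records = new ArrayList<>();
        records.add(record());
        while (at('\n')) {
            pos++;
            records.add(record());
        }
        if (pos < src.length()) {
            throw new IllegalArgumentException("unexpected character at " + pos);
        }
        return records;
    }

    private List<String> record() {
        List<String> fields = new ArrayList<>();
        fields.add(field());
        while (at(',')) {
            pos++;
            fields.add(field());
        }
        return fields;
    }

    private String field() {
        return at('"') ? quoted() : bare();
    }

    private String quoted() {
        StringBuilder sb = new StringBuilder();
        pos++; // opening quote
        while (pos < src.length()) {
            char c = src.charAt(pos++);
            if (c != '"') {
                sb.append(c);
            } else if (at('"')) {
                sb.append('"'); // '""' stands for one literal quote
                pos++;
            } else {
                return sb.toString(); // closing quote
            }
        }
        throw new IllegalArgumentException("unterminated quoted field");
    }

    private String bare() {
        int start = pos;
        while (pos < src.length() && !at(',') && !at('"') && !at('\n')) {
            pos++;
        }
        return src.substring(start, pos);
    }
}
```

    Unlike a lenient scanner, this version rejects input the grammar does
    not cover, such as a stray quote in the middle of a bare field. A
    trailing newline produces a final record with one empty field.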


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Nov 5, 2006
    #16
  17. In message <>, Simon
    Brooke <> writes

    >>
    >> Thanks to the others who suggested as well, I'll get around to them.

    >
    >Heavens, writing a CSV parser is trivial. It's simply a case of a
    >StringTokenizer in a for loop:
    >


    Except I wasn't allowed to use StringTokenizer; as I said in the
    original post, "I'm not interested in using library functions".



    --
    Jeffrey Spoon
     
    Jeffrey Spoon, Nov 5, 2006
    #17
  18. In message <-berlin.de>, Stefan Ram
    <-berlin.de> writes

    >
    >// 2006-11-04T17:48:18+01:00
    >
    >public class CsvParser
    >{ private CsvScanner tokenSource;
    > public CsvParser( final CsvScanner tokenSource )
    > { this.tokenSource = tokenSource; }
    >
    >// 2006-11-04T17:50:09+01:00
    >
    > public void parseAll()
    > { while( tokenSource.isMoreInSource() )parseLine(); }
    >
    >// 2006-11-04T17:51:26+01:00
    >
    > public void parseLine()
    > { while( tokenSource.isMoreInLine() )parseValue(); }
    >
    >// 2006-11-04T17:54:43+01:00
    >
    > public void parseValue()
    > { final Token token = tokenSource.getToken();
    > token.to( new TokenProcessor()
    > { public void processNumericStart(){ /* todo */ }
    > public void processTextStart(){ /* todo */ }
    > /* here my time limit was reached */
    >
    >// 2006-11-04T17:58:31+01:00
    >
    > Sometimes an interviewer might give you an "impossible"
    > task just to see how you cope with that.
    >


    Interesting, thanks. I certainly have to do some reading on parsing in
    general anyway.

    Cheers all,



    --
    Jeffrey Spoon
     
    Jeffrey Spoon, Nov 5, 2006
    #18
  19. Simon Brooke Guest

    in message <>, Jeffrey Spoon
    ('') wrote:

    > In message <>, Simon
    > Brooke <> writes
    >
    >>>
    >>> Thanks to the others who suggested as well, I'll get around to them.

    >>
    >>Heavens, writing a CSV parser is trivial. It's simply a case of a
    >>StringTokenizer in a for loop:

    >
    > Except I wasn't allowed to use String Tokenizer, as I said in the
    > original post, "I'm not interested in using library functions".


    Then write your own; it's a trivial thing to do. Here, in fact, is one I
    wrote earlier:

    /**
     * MIDP does not provide a StringTokenizer. Because this has to be
     * compatible with MIDP we'll provide our own. If you have access to a real
     * StringTokenizer don't use this one - it is minimal and possibly
     * inefficient.
     */
    public class StringTokenizer
    {
        //~ Instance fields -----------------------------------------------

        /** the source string, which I tokenize */
        private String source = null;

        /** the separator character which I split it on */
        private char sep = ' ';

        /** my current cursor into the string */
        private int cursor = 0;

        //~ Constructors --------------------------------------------------

        /**
         * @param source the source string to separate into tokens
         * @param sep the separator which separates tokens in this source
         */
        public StringTokenizer( String source, char sep )
        {
            super( );
            this.sep = sep;
            this.source = source;
        }

        //~ Methods -------------------------------------------------------

        /**
         * @return true if this tokenizer still has more tokens, else false
         */
        public boolean hasMoreTokens( )
        {
            return ( ( source != null ) && ( cursor < source.length( ) ) );
        }

        /**
         * Test harness only - do not use
         *
         * @param args the string to split, then the separator
         */
        public static void main( String[] args )
        {
            if ( args.length == 2 )
            {
                StringTokenizer tock =
                    new StringTokenizer( args[0], args[1].charAt( 0 ) );

                System.out.println( "String is: '" + args[0] + "'" );
                System.out.println( "Separator is: '" + args[1].charAt( 0 ) + "'" );

                for ( int i = 0; tock.hasMoreTokens( ); i++ )
                {
                    System.out.println( Integer.toString( i ) + ": '" +
                        tock.nextToken( ) + "'" );
                }
            }
        }

        /**
         * @return the next token from this string tokenizer, or null if there are
         * no more.
         */
        public synchronized String nextToken( )
        {
            String result = null;

            // hasMoreTokens() also guards against a null source
            if ( hasMoreTokens( ) )
            {
                int end = source.indexOf( sep, cursor );

                if ( end > -1 )
                {
                    result = source.substring( cursor, end );
                    cursor = end + 1;
                }
                else
                {
                    result = source.substring( cursor );
                    cursor = source.length( );
                }
            }

            return result;
        }
    }


    --
    (Simon Brooke) http://www.jasmine.org.uk/~simon/
     
    Simon Brooke, Nov 5, 2006
    #19