[org.relique.jdbc.csv.CsvDriver] spaces / trim

I

Ingo R. Homann

Hi,

I'm using the CSV/JDBC-driver mentioned above. It works, but I've got a
problem with spaces. Consider the following (human-readable) table:

name ;surname
Homann;Ingo
Smith ;John

When reading the fields, the spaces are returned as well, that means, I
will get "Smith ". This can be solved by calling trim() on the String,
but even worse, when I try to read the column, rs.getString("name")
fails because rs.getString("name ") would be 'correct'.

So, my question is, is there a property I can set to avoid this? (I
didn't find any apropriate docu.)

Or does someone know a different driver that can deal with this problem?

Ciao,
Ingo
 
R

Roedy Green

I'm using the CSV/JDBC-driver mentioned above. It works, but I've got a
problem with spaces. Consider the following (human-readable) table:

Here another approach to the problem. Use a different CSV parser.

See http://mindprod.com/jgloss/csv.html

that will treat spaces the way you want.

I don't know how the JDBC CSV thing works, but presumably you can use
the parser to find individual fields for JDBC or just store CLOBS
containing CSV streams.
 
I

Ingo R. Homann

Hi Roedy,

Roedy said:
Here another approach to the problem. Use a different CSV parser.

See http://mindprod.com/jgloss/csv.html

that will treat spaces the way you want.

I don't know how the JDBC CSV thing works, but presumably you can use
the parser to find individual fields for JDBC or just store CLOBS
containing CSV streams.

The CSV/JDBC-driver allows me to access a CSV-file via
standard-SQL-statements like "select * from test.csv" and to retrieve
the data over a java.sql.ResultSet.

I think this is a great idea, and I have some code that already uses
this mechanism. So, I do not want to change the whole code. (Before
doing that, I would write into the docu of our software "Please ensure
that there are no unnecessary whitespaces.")

Ciao,
Ingo
 
R

Roedy Green

The CSV/JDBC-driver allows me to access a CSV-file via
standard-SQL-statements like "select * from test.csv" and to retrieve
the data over a java.sql.ResultSet.

that definitely requires using their parser.

What you might do then is tidy the file beforehand.
See http://mindprod.com/jgloss/csv.html

I am trying to figure out where that approach would work better than
importing the CSV file into a traditional table. I gather the inquiry
you do is one off and you never look at the CSV file again and the CSV
files are relatively short?
 
I

Ingo R. Homann

Hi,

Roedy said:
I am trying to figure out where that approach would work better than
importing the CSV file into a traditional table.

What do you mean with "traditional"? Reading the file using a new
StringTokenizer(line,";");?

The advantage is that I have well-defined headers and can access the
data independant of the exact file format very easily: No matter, if the
file contains (name,surname,birthday) or
(birthday,haircolor,name,surname), I always can access the data using
rs.getString("name") instead of counting the columns. Ideally, I also do
not have to care about quoting ("Homann";Ingo;"some comment with a
semicolon ;";"blond").

I think, this is a nice way to access a csv file.

I know it is not too difficult to implement it yourself, but why do so,
if it already exist (and especially if your code already depends on that)?
I gather the inquiry
you do is one off and you never look at the CSV file again and the CSV
files are relatively short?

Well, I simply want to import a CSV file. It contains perhaps 30 cols
and a few hundret rows. Why do you think this is important? (The query,
as I said, is indeed very simple: "select * from csv;")

Ciao,
Ingo
 
R

Roedy Green

What do you mean with "traditional"? Reading the file using a new
StringTokenizer(line,";");?

I mean read it with a CSVReader class, and use JDBC to put the data
into your SQL database to be stored in its native internal format int
tables. Perhaps your database has an import function to do that. Then
you do ordinary SQL queries.

CSV is not a friendly format for doing queries. It has no indexes. It
has no formal system for labelling the column or describing their
types. You have to do everything by scanning or by temporarily
bringing the whole file into RAM.
 
L

Lothar Kimmeringer

I mean read it with a CSVReader class, and use JDBC to put the data
into your SQL database to be stored in its native internal format int
tables. Perhaps your database has an import function to do that. Then
you do ordinary SQL queries.

You miss the point, the CSV-file *is* the "database". Why introducing
another (somethimes not existant) database if a simple CSV-file do the
trick as well?
CSV is not a friendly format for doing queries. It has no indexes. It
has no formal system for labelling the column or describing their
types. You have to do everything by scanning or by temporarily
bringing the whole file into RAM.

So what? For small amounts of data it's an easy way of handling
data and the switch to a normal database is much more easy that
way compared to the change from a CSVReader to a JDBC-based
database-access.

I'm using this kind of way for parts of my websites for years
(just in Perl rather than Java but the idea behind is the same).


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
T

Thomas Kellerer

Hi,

I'm using the CSV/JDBC-driver mentioned above. It works, but I've got a
problem with spaces. Consider the following (human-readable) table:

name ;surname
Homann;Ingo
Smith ;John

When reading the fields, the spaces are returned as well, that means, I
will get "Smith ". This can be solved by calling trim() on the String,
but even worse, when I try to read the column, rs.getString("name")
fails because rs.getString("name ") would be 'correct'.

So, my question is, is there a property I can set to avoid this? (I
didn't find any apropriate docu.)

Or does someone know a different driver that can deal with this problem?
The only thing I can think of, is to connect to the file using
suppressHeaders=true, then read the first line, loop over all returned
columns (getString(int)) and store a mapping of trimmed column names to
real index, and then always use the index to access the columns instead
of the column name.

Apart from that, as the sourcecode is available at sourceforge, it seems
quite easy to implement a trimming of values in there (had a quick
glance at it, and it doesn't seem to be too complicated)

Regards
Thomas
 
I

Ingo R. Homann

Hi Lothar,

Lothar said:
You miss the point, the CSV-file *is* the "database". Why introducing
another (somethimes not existant) database if a simple CSV-file do the
trick as well?

Well, the *real* point is that the CSV-file is used for import and
export. Unfortunately, the data cannot be inserted plain into the
(SQL-)Database (which would indeed be easy by means of the database),
but there have to be some conversions before that. So, I have to read
the CSV-file, process it, and then write it to the SQL-Database.

Or shortly (as I said in my OP): I have to read the CSV-file in java.
And, the idea to access the csv-data via a CSV/JDBC-driver seems a very
nice idea to me. (And - furthermore, there already exists much code
relying on that.)

Ciao,
Ingo
 
R

Roedy Green

Or shortly (as I said in my OP): I have to read the CSV-file in java.
And, the idea to access the csv-data via a CSV/JDBC-driver seems a very
nice idea to me. (And - furthermore, there already exists much code
relying on that.)

The point I am making is a CSV file is a very poor format to maintain
a database compared with the internal structures of a typical SQL
engine. If your CSV file for example had a million records, the only
way it could be treated as a database is by scanning it linearly for
every query. In contrast if you imported the CSV file then you could
have auxiliary indexes to help you find what you wanted, going
directly to them.

Obviously using the CSV file directly as your database must be
advantageous in some circumstances or someone would not have gone to
all the work of creating a search engine that uses CSV as its base
file structure.

The only time I can see where this JDBC-CSV approach might prove
faster than importing and using a traditional SQL engine is if you
did a single query per file and then never looked at it again, or
perhaps if your CSV files were quite short.. That appears to be what
you are doing.
 
A

Andrea Desole

Ingo said:
Hi,

I'm using the CSV/JDBC-driver mentioned above. It works, but I've got a
problem with spaces. Consider the following (human-readable) table:

name ;surname
Homann;Ingo
Smith ;John

When reading the fields, the spaces are returned as well, that means, I
will get "Smith ". This can be solved by calling trim() on the String,
but even worse, when I try to read the column, rs.getString("name")
fails because rs.getString("name ") would be 'correct'.

So, my question is, is there a property I can set to avoid this? (I
didn't find any apropriate docu.)

Or does someone know a different driver that can deal with this problem?

It sounds strange that you don't know what columns you are going to
read. Can't you use getString(int), or you really need the name?
If you really need the name, can't you use the meta data to get it?
 
I

Ingo R. Homann

Hi,

Andrea said:
It sounds strange that you don't know what columns you are going to
read.

I think you misunderstood me. I do know the name of the column. In the
example above, I want to access the first column using getString("name")
but that simply does not work (*).
Can't you use getString(int), or you really need the name?

I just don't want to use the indices, because they can differ. But the
name of the column is unique.
If you really need the name, can't you use the meta data to get it?

Does not make sense in this context! :)

Ciao,
Ingo

(*) because the driver expects getString("name ")
 
I

Ingo R. Homann

Hi,

perhaps my last posting was not very clear (sorry for my english! :)
The point I am making is a CSV file is a very poor format to maintain
a database compared with the internal structures of a typical SQL
engine. If your CSV file for example had a million records, the only
way it could be treated as a database is by scanning it linearly for
every query. In contrast if you imported the CSV file then you could
have auxiliary indexes to help you find what you wanted, going
directly to them.

I do *not* use the CSV file *instead* of a database. In fact, I *have* a
database (a MySQL-Database).

I just use the CSV file for importing and exporting data to different
programs.

I use the CSV/JDBC-driver only *one* time, namely when *importing* a
file. There, I want to iterate over all lines of the CSV file. But now,
the different CSV-files have the columns in a different order and
different (optional) columns. Some have (name, surname, birthday), some
have (id, birthday, name) and so on. So, reading the csv file in the old
fashioned way (using StringTokenizer), I would have to implement the
mapping of the column-names to the column-indices for myself (and
furthermore, I would have to implement the quoting of the values for
myself and so on). The CSV/JDBC-driver (which is quite simple and does
not support much SQL) does that. It is not much more than a way of
reading the csv and getting a Map from which you can easily read the
values of a dataset by having the name of the column: getString("name"),
getDate("birthday")

By the way: This has nothing to do with performance.

I hope it's clearer now.

Ciao,
Ingo
 
A

Andrea Desole

Ingo said:
I think you misunderstood me. I do know the name of the column. In the
example above, I want to access the first column using getString("name")
but that simply does not work (*).

sorry, I think I was not clear. When I said "know what columns you are
going to read" I was talking about their position, not their names
I just don't want to use the indices, because they can differ. But the
name of the column is unique.
Nice


Does not make sense in this context! :)

Because it's a CSV? I can imagine a CSV can be a bit different. Still,
it's a table with fields; it should have meta data.
If you really can't do it, how about reading the CSV yourself? You can
read the first line to get the column names
 
C

Chris Uppal

Ingo said:
I think you misunderstood me. I do know the name of the column. In the
example above, I want to access the first column using getString("name")
but that simply does not work (*). [...]
(*) because the driver expects getString("name ")

Then the driver is broken. At least in so far as there /are/ any "rules" for
CSV, the rules for trimming whitespace are, I believe[*], quite definite, and
the CsvDriver isn't implementing them. If you don't want to replace the parser
then you'll have to fix it yourself.

Either fix org.relique.jdbc.csv.CsvReader.parseCsvLine() or override it with a
fixed version. BTW if you do choose to fix it, don't forget cases like

xxx , " yyy " , zzz

which should produce strings:

"xxx"
" yyy "
"zzz"

-- chris

([*] At least that's how I implemented it in /my/ CSV parser, and I didn't make
that aspect configurable so I must have thought I'd found a clear concensus on
the issue.)
 
R

Roedy Green

So, reading the csv file in the old
fashioned way (using StringTokenizer), I would have to implement the
mapping of the column-names to the column-indices for myself (and
furthermore, I would have to implement the quoting of the values for
myself and so on).

There is yet another option. Use the light weight CSV reader I wrote.
If you want, it will hand you each line as an array of Strings. You
then have to map the strings in the order you want.

It should be pretty easy to write a generic importer that takes a list
of column numbers.

As long as you are not trying to do queries on the CSV file, you are
fine. It sounds like you are using a fairly heavyweight tool for a
lightweight job.
 
R

Roedy Green

It should be pretty easy to write a generic importer that takes a list
of column numbers.

And if your file has the columns labelled in the first row, you can
generate that list of column numbers easily.
 
I

Ingo R. Homann

Hi Andrea,

Andrea said:
... I can imagine a CSV can be a bit different. Still,
it's a table with fields; it should have meta data.
If you really can't do it, how about reading the CSV yourself? You can
read the first line to get the column names

Of course I could reimplement everything. But why, if someone already
has done it? :)

Ciao,
Ingo
 
I

Ingo R. Homann

Hi,

Roedy said:
There is yet another option. Use the light weight CSV reader I wrote.
If you want, it will hand you each line as an array of Strings. You
then have to map the strings in the order you want.

It should be pretty easy to write a generic importer that takes a list
of column numbers.

Yes. Re-Implementation of a package is always a solution. But if it is a
good one...? :-/

But it looks like I will simply modify the sources of the package.
It sounds like you are using a fairly heavyweight tool for a
lightweight job.

Not really. The driver has 22kB.

Ciao,
Ingo
 
I

Ingo R. Homann

Hi,

Thomas said:
Apart from that, as the sourcecode is available at sourceforge, it seems
quite easy to implement a trimming of values in there (had a quick
glance at it, and it doesn't seem to be too complicated)

OK, I'll do so.

Ciao,
Ingo
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,071
Latest member
MetabolicSolutionsKeto

Latest Threads

Top