CSV(???)


David C. Ullrich

Is there a csvlib out there somewhere?

And/or does anyone see any problems with
the code below?

What csvline does is straightforward: fields
is a list of strings. csvline(fields) returns
the strings concatenated into one string
separated by commas. Except that if a field
contains a comma or a double quote then the
double quote is escaped to a pair of double
quotes and the field is enclosed in double
quotes.
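
A quick illustration of that rule, using the csvline defined
below:

    fields = ['plain', 'has,comma', 'says "hi"']
    print csvline(fields)
    # prints: plain,"has,comma","says ""hi"""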

The part that seems somewhat hideous is
parsecsvline. The intention is that
parsecsvline(csvline(fields)) should be
the same as fields. Haven't attempted
to deal with parsecsvline(data) where
data is in an invalid format - in the
intended application data will always
be something that was returned by
csvline. It seems right after some
testing... also seems blechitudinous.

(Um: Believe it or not I'm _still_ using
python 1.5.7. So comments about iterators,
list comprehensions, string methods, etc
are irrelevant. Comments about errors in
the algorithm would be great. Thanks.)

The code:

from string import replace, join

def csvescape(s):
    if ',' in s or '"' in s or '\n' in s:
        res = replace(s, '"', '""')
        return '"%s"' % res
    else:
        return s

def csvline(fields):
    return join(map(csvescape, fields), ',')

class indexedstring:
    def __init__(self, s):
        self.s = s
        self.index = 0

    def current(self):
        return self[self.index]

    def inc(self):
        self.index = self.index + 1

    def next(self):
        self.inc()
        return self.current()

    def __getitem__(self, j):
        return self.s[j]

    def __len__(self):
        return len(self.s)

    def eos(self):
        return self.index >= len(self)

    def lookahead(self):
        return self[self.index + 1]

    def getfield(self):
        if self.eos():
            return None
        if self.current() == '"':
            return self.quotedfield()
        else:
            return self.rawfield()

    def rawfield(self):
        """Read until comma or eos."""
        start = self.index
        while not (self.eos() or (self.current() == ',')):
            self.inc()

        res = self.s[start:self.index]

        self.inc()

        return res

    def quotedfield(self):
        """Read until '",' or '"' followed by eos.
        Replace "" in result with "."""

        start = self.index

        while 1:
            self.inc()
            if self.current() == '"':
                self.inc()
                if (self.eos() or (self.current()==',')):
                    break

        res = self.s[start + 1:self.index - 1]

        self.inc()

        return replace(res, '""', '"')

def parsecsvline(csvline):
    """Inverts csvline(). Assumes csvline is valid, ie
    is something as returned by csvline(); output undefined
    if csvline is in invalid format"""

    s = indexedstring(csvline)
    res = []

    while not s.eos():
        res.append(s.getfield())

    return res
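
A minimal round-trip check of the parsecsvline(csvline(fields)) ==
fields property, for a few representative inputs:

    tests = [
        [],
        ['a', 'b', 'c'],
        ['plain', 'has,comma', 'says "hi"'],
        ['contains\nnewline'],
    ]
    for fields in tests:
        assert parsecsvline(csvline(fields)) == fields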

************************

David C. Ullrich
 

Philipp Pagel

David C. Ullrich said:
Is there a csvlib out there somewhere?

How about csv in the standard library?
(Um: Believe it or not I'm _still_ using
python 1.5.7.

I have no idea if csv was part of the standard library back in those
days...

But even if not: either upgrade to something less outdated or see if you
can get today's csv to work with the oldtimer.
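
For reference, roughly what that looks like with the stdlib csv
module (Python 2.3 or later, so it won't run on 1.5.x as-is; the
file name is just made up):

    import csv

    # write one record; the module handles quoting and escaping
    f = open('out.csv', 'wb')
    csv.writer(f).writerow(['plain', 'has,comma', 'says "hi"'])
    f.close()

    # read it back as lists of fields
    for row in csv.reader(open('out.csv', 'rb')):
        print row    # ['plain', 'has,comma', 'says "hi"']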

cu
Philipp
 

Tim Golden

Philipp said:
How about csv in the standard library?


I have no idea if csv was part of the standard library back in those
days...

But even if not: either upgrade to something less outdated or see if you
can get today's csv to work with the oldtimer.

You might have a look at the Object Craft CSV module, from
which the stdlib one is loosely descended, but you'd have to
compile it yourself if you're on 1.5.2.

http://www.object-craft.com.au/projects/csv/

TJG
 

nmp

On Fri, 23 Feb 2007 11:45:54 +0000, nmp wrote:
On Fri, 23 Feb 2007 05:11:26 -0600, David C. Ullrich wrote:


Hey, cool! I am just beginning with Python but I may already be able to
help you ;)

Oops I missed the bit where you said you were using that very old Python
version...
 

John Machin

Is there a csvlib out there somewhere?

I can make available the following which should be capable of running
on 1.5.2 -- unless they've suffered bitrot :)

(a) a csv.py which does simple line-at-a-time hard-coded-delimiter-etc
pack and unpack i.e. very similar to your functionality *except* that
it doesn't handle newline embedded in a field. You may in any case be
interested to see a different way of writing this sort of thing: my
unpack does extensive error checking; it uses a finite state machine
so unexpected input in any state is automatically an error.

(b) an extension module (i.e. written in C) with the same API. The
python version (a) imports and uses (b) if it exists.

(c) an extension module which parameterises everything including the
ability to handle embedded newlines.

The two extension modules have never been compiled & tested on other
than Windows but they both should IIRC be compilable with both gcc
(MinGW) and the free Borland 5.5 compiler -- in other words vanilla C
which should compile OK on Linux etc.

If you are interested in any of the above, just e-mail me.
And/or does anyone see any problems with
the code below?

What csvline does is straightforward: fields
is a list of strings. csvline(fields) returns
the strings concatenated into one string
separated by commas. Except that if a field
contains a comma or a double quote then the
double quote is escaped to a pair of double
quotes and the field is enclosed in double
quotes.

The part that seems somewhat hideous is
parsecsvline. The intention is that
parsecsvline(csvline(fields)) should be
the same as fields. Haven't attempted
to deal with parsecsvline(data) where
data is in an invalid format - in the
intended application data will always
be something that was returned by
csvline.

"Always"? Famous last words :)
It seems right after some
testing... also seems blechitudinous.

I agree that it's bletchworthy, but only mildly so. If it'll make you
feel better, I can send you as a yardstick csv pack and unpack written
in awk -- that's definitely *not* a thing of beauty and a joy
forever :)

I presume that you don't write csvline() output to a file, using
newline as a record terminator and then try to read them back and pull
them apart with parsecsvline() -- such a tactic would of course blow
up on the first embedded newline. So as a matter of curiosity, where/
how are you storing multiple csvline() outputs?
(Um: Believe it or not I'm _still_ using
python 1.5.7. So comments about iterators,
list comprehensions, string methods, etc
are irrelevant. Comments about errors in
the algorithm would be great. Thanks.)

1.5.7 ?
[big snip]

Cheers,
John
 

Neil Cerutti

Is there a csvlib out there somewhere?

And/or does anyone see any problems with
the code below?

What csvline does is straightforward: fields
is a list of strings. csvline(fields) returns
the strings concatenated into one string
separated by commas. Except that if a field
contains a comma or a double quote then the
double quote is escaped to a pair of double
quotes and the field is enclosed in double
quotes.

The part that seems somewhat hideous is
parsecsvline. The intention is that
parsecsvline(csvline(fields)) should be
the same as fields. Haven't attempted
to deal with parsecsvline(data) where
data is in an invalid format - in the
intended application data will always
be something that was returned by
csvline. It seems right after some
testing... also seems blechitudinous.

(Um: Believe it or not I'm _still_ using python 1.5.7. So
comments about iterators, list comprehensions, string methods,
etc are irrelevant. Comments about errors in the algorithm
would be great. Thanks.)

Two member functions of indexedstring are not used: next and
lookahead. __len__ and __getitem__ appear to serve no real
purpose.
def parsecsvline(csvline):
    """Inverts csvline(). Assumes csvline is valid, ie
    is something as returned by csvline(); output undefined
    if csvline is in invalid format"""

    s = indexedstring(csvline)
    res = []

    while not s.eos():
        res.append(s.getfield())

    return res

You'll be happy to know that iterators and list comprehensions
will make your code better after you upgrade. ;-)

In the meantime, I think your (relative lack of) error handling
is OK. GIGO, as they say (garbage in, garbage out).
 

David C. Ullrich

How about csv in the standard library?


I have no idea if csv was part of the standard library back in those
days...
Nope.

But even if not: either upgrade to something less outdated

Thanks. Actually, for reasons I could explain if you really
wanted to know, going to 2.x would cost me some
money. Since I'm not getting paid for this...
or see if you
can get today's csv to work with the oldtimer.

cu
Philipp


************************

David C. Ullrich
 

David C. Ullrich

On Fri, 23 Feb 2007 11:45:54 +0000, nmp wrote:


Oops I missed the bit where you said you were using that very old Python
version...

No problem. Actually if there were a csv.py in my version I would
have found it before posting, but you didn't know that.


************************

David C. Ullrich
 

David C. Ullrich

Is there a csvlib out there somewhere?

And/or does anyone see any problems with
the code below?

[...]

(Um: Believe it or not I'm _still_ using python 1.5.7. So
comments about iterators, list comprehensions, string methods,
etc are irrelevant. Comments about errors in the algorithm
would be great. Thanks.)

Two member functions of indexedstring are not used: next and
lookahead. __len__ and __getitem__ appear to serve no real
purpose.

Hey, thanks! I didn't realize that using an object with
methods that were never called could cause an algorithm
to fail... shows how much I know.

(Seriously, all I really wanted to know was whether anyone
noticed something I overlooked, so that
parsecsvline(csvline(fields)) might under some condition
not come out the same as fields...)
def parsecsvline(csvline):
    """Inverts csvline(). Assumes csvline is valid, ie
    is something as returned by csvline(); output undefined
    if csvline is in invalid format"""

    s = indexedstring(csvline)
    res = []

    while not s.eos():
        res.append(s.getfield())

    return res

You'll be happy to know that iterators and list comprehensions
will make your code better after you upgrade. ;-)

Uh, thanks again. You're right, knowing that makes me so happy
I could just burst.
In the meantime, I think your (relative lack of) error handling
is OK. GIGO, as they say (garbage in, garbage out).

_I_ don't think it's ok.

But (i) the code I posted was not supposed to be the final
version! It was a preliminary version, posted hoping that
someone would notice any errors in the _algorithm_ that
existed. (ii) in the intended application parsecsvline
will only be applied to the output of csvline, so if
the former is indeed a left inverse of the latter there
should be no error unless something else has already
gone wrong elsewhere. Not that that makes it ok...

************************

David C. Ullrich
 

David C. Ullrich

I can make available the following which should be capable of running
on 1.5.2 -- unless they've suffered bitrot :)

(a) a csv.py which does simple line-at-a-time hard-coded-delimiter-etc
pack and unpack i.e. very similar to your functionality *except* that
it doesn't handle newline embedded in a field. You may in any case be
interested to see a different way of writing this sort of thing: my
unpack does extensive error checking; it uses a finite state machine
so unexpected input in any state is automatically an error.

Actually a finite-state machine was the first thing I thought of.
Then while I was thinking about what states would be needed, etc,
it occurred to me that I could get something working _now_ by
just noticing that (assuming valid input) a quoted field
would be terminated by '",' or '"[eos]'.

A finite-state machine seems like the "right" way to do it,
but there are plenty of other parts of the project where
doing it right is much more important - yes, in my experience
doing it "right" saves time in the long run, but that
finite-state machine would have taken more time
_yesterday_.
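
For concreteness, a toy sketch of that style of unpacker -- not
John's code, just an illustration of the state-machine idea:

    # States: start of field, unquoted field, inside quotes, and
    # "just saw a quote inside a quoted field".
    START, RAW, QUOTED, QUOTE = 0, 1, 2, 3

    def fsm_parsecsvline(line):
        fields, field, state = [], '', START
        for ch in line:
            if state == START:
                if ch == '"':
                    state = QUOTED
                elif ch == ',':
                    fields.append('')        # empty field
                else:
                    field, state = ch, RAW
            elif state == RAW:
                if ch == ',':
                    fields.append(field)
                    field, state = '', START
                elif ch == '"':
                    raise ValueError('quote inside unquoted field')
                else:
                    field = field + ch
            elif state == QUOTED:
                if ch == '"':
                    state = QUOTE
                else:
                    field = field + ch
            else:                            # QUOTE
                if ch == '"':                # doubled quote
                    field, state = field + '"', QUOTED
                elif ch == ',':              # field terminated by '",'
                    fields.append(field)
                    field, state = '', START
                else:
                    raise ValueError('unexpected character after quote')
        if state == QUOTED:
            raise ValueError('unterminated quoted field')
        fields.append(field)
        return fields
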
(b) an extension module (i.e. written in C) with the same API. The
python version (a) imports and uses (b) if it exists.

(c) an extension module which parameterises everything including the
ability to handle embedded newlines.

The two extension modules have never been compiled & tested on other
than Windows but they both should IIRC be compilable with both gcc
(MinGW) and the free Borland 5.5 compiler -- in other words vanilla C
which should compile OK on Linux etc.

If you are interested in any of the above, just e-mail me.
Keen.


"Always"? Famous last words :)

Heh.

Otoh, having read about all the existing variations
in csv files, I don't think I'd attempt to write
something that parses csv provided from an
external source.
I agree that it's bletchworthy, but only mildly so. If it'll make you
feel better, I can send you as a yardstick csv pack and unpack written
in awk -- that's definitely *not* a thing of beauty and a joy
forever :)

I presume that you don't write csvline() output to a file, using
newline as a record terminator and then try to read them back and pull
them apart with parsecsvline() -- such a tactic would of course blow
up on the first embedded newline.

Indeed. Thanks - this is exactly the sort of problem I was hoping
people would point out (although in fact this one is irrelevant,
since I already realized this). In fact the fields will not
contain linefeeds (the data is coming from <INPUT type="text">
on an html form, which means that unless someone's _trying_
to cause trouble a linefeed is impossible, right? Regardless,
incoming data is filtered. Fields containing newlines are
quoted just to make the thing usable in other situations - I
wouldn't use parsecsvline without being very careful, but there's
no reason csvline shouldn't have general applicability.) And in
any case, no, I don't intend to be parsing multi-record csv files.

Although come to think of it one could modify the above
to do that without too much trouble, at least assuming
valid input - end-of-field followed by linefeed must
be end-of-record, right?
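
In the simple case where the fields themselves contain no newlines
(which, as above, is all I actually need) it's just a split --
handling embedded newlines would need that end-of-field-then-linefeed
rule instead:

    from string import split

    def parsecsvfile(data):
        # One record per line; assumes no newlines embedded in fields.
        lines = filter(None, split(data, '\n'))   # drop blank lines, e.g. a trailing newline
        return map(parsecsvline, lines)
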
So as a matter of curiosity, where/
how are you storing multiple csvline() outputs?

Since you ask: the project is to allow alumni to
store contact information on a web site, and then
let office staff access the information for various
purposes. So each alumnus' data is stored as a
csvline in an anydbm "database" - when someone in
the office requests the information it's dumped
into a csv file, the idea being that the office
staff opens that in Excel or whatever.

(Why not simply provide a suitable interface
to the data instead of just giving them the
csv file? So they can use the data in ways I
haven't anticipated. Why not give them access
to a real database? They know how to use Excel.

I do think I'll provide a few access thingies
in addition to the csv file, for example an
automatic mass mailer...)

So why put csv data into an anydbm thing instead
of using shelve or something? Laughably or not,
the reason is to speed up what seems like the
main bottleneck:

If I use my parsecsvline() that will be very slow.
But that doesn't matter, since that only happens
once or twice a day on one record, when an alumnus
logs in and edits his contact information.

But when the office requests the data we run through
the entire database - if we store the data as csv
then we don't have any conversion to do at that
point, we just write the raw data in the database
to a file. Should be much quicker than converting
something else to csv at that point.

(So why not just store the data in a csv file?
Random access.)
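
Schematically (keys and field names invented, just to show the
shape of the thing):

    import anydbm

    # one csvline() string per alumnus
    db = anydbm.open('alumni', 'c')
    db['ullrichd'] = csvline(['David', 'Ullrich', 'someone@example.com'])
    db.close()

    # the office dump: no parsing, just write out the stored lines
    db = anydbm.open('alumni', 'r')
    out = open('alumni.csv', 'w')
    for key in db.keys():
        out.write(db[key] + '\n')
    out.close()
    db.close()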

Since you asked, if you had any comments on
what's silly about the general plan there by
all means say so.

Hmm. Why not use one of the many Python
web tools out there?

(i) Doing it myself is more interesting. I'm
not getting paid for this.

(ii) If I do it myself it's going to be easier
for me to be certain I know exactly where user
input is at all times.

The boss wanted me to use php because Python
was going to be too hard for someone else to
read. That's nonsense, of course. Anyway, he
gave me a book on php security. The book
raised a lot of issues that I wouldn't have
thought of, but it also convinced me I
wouldn't want to use php - all through
the book we're warned that php will do this
or that bad thing if you're not careful.
Don't want to have to learn all the things
you need not to do with whatever tool I
use.

Here, the only write access to the database is
through an Alum object; Alum objects filter their
data on creation, and they're read-only (via
the magic of __setattr__), so a maintainer
would have to _try_ if he wanted to insert unfiltered
data - wouldn't be hard to do, but he can't do it
by accident.
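
The read-only part, in skeleton form (attribute names invented,
filtering omitted):

    class Alum:
        def __init__(self, name, email):
            # the real code filters the values first; __dict__ is
            # filled directly because __setattr__ refuses everything
            self.__dict__['name'] = name
            self.__dict__['email'] = email

        def __setattr__(self, attr, value):
            raise AttributeError('Alum objects are read-only')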

And the only html output is through PostHTML, which
filters everything through cgi.escape(). In particular
print statements raise exceptions (via
sys.stdout = PrintExploder().) Again, a maintainer
could easily write to sys.__stdout__ to get around
this, but that's not going to happen by accident.
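
Again in skeleton form -- the real PostHTML does more than this,
but the shape is:

    import sys, cgi

    class PrintExploder:
        def write(self, data):
            # any stray print blows up instead of emitting raw data
            raise RuntimeError('unescaped output attempted; use PostHTML')

    def PostHTML(text):
        # the one sanctioned output path: everything gets escaped
        sys.__stdout__.write(cgi.escape(text))

    sys.stdout = PrintExploder()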

Altogether seems much cleaner than the php stuff
I saw in that book - the way he does things you need
to be careful every time you do something, with
the current setup I only need to be careful twice,
in Alum.__init__ and in PostHTML.

Could be I'm being arrogant putting more trust in
a setup like that instead of some well-known
Python web thingie. But I don't see anyplace
things can leak out, and using someone else's
thing I'd either have to just believe them
or read a lot of code.

That'll teach you to express curiosity
about something I'm doing. Been thinking
about all this for a few weeks, you asked
a question and the fingers started typing.

Well I _said_ you wouldn't believe it...
[big snip]

Cheers,
John


************************

David C. Ullrich
 

skip

David> Is there a csvlib out there somewhere?
...
David> ...Believe it or not I'm _still_ using python 1.5.7....

You might see if someone has written a pure Python version of the csv module
for use in PyPy.

Skip
 

Neil Cerutti

Is there a csvlib out there somewhere?

And/or does anyone see any problems with
the code below?

[...]

(Um: Believe it or not I'm _still_ using python 1.5.7. So
comments about iterators, list comprehensions, string
methods, etc are irrelevant. Comments about errors in the
algorithm would be great. Thanks.)

Two member functions of indexedstring are not used: next and
lookahead. __len__ and __getitem__ appear to serve no real
purpose.

Hey, thanks! I didn't realize that using an object with methods
that were never called could cause an algorithm to fail...
shows how much I know.

Sorry I couldn't provide the help you wanted.
 
