csv.Sniffer: wrong detection of the end of line delimiter

Laurent Laporte

hello,

I'm using the standard csv module under Python 2.3 / 2.4 to read a CSV
file. The file is opened in binary mode, so I keep the end of line
terminator.

It appears that csv.Sniffer forces the line terminator to be
'\r\n'. That's fine under Windows but wrong under Linux or
Macintosh.

More about this line terminator: there is a potential bug in the
_guess_delimiter() method.
The first line of code splits the data incorrectly:
data = filter(None, data.split('\n'))
It doesn't take the real line terminator into account!
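
To illustrate the splitting problem, here is a minimal sketch. The helper name split_on_newline and the three sample strings are made up for the illustration; only the split-and-drop-empties step comes from the line quoted above.

```python
def split_on_newline(data):
    # Same idea as the quoted line: split on '\n' and drop empty pieces.
    return [line for line in data.split('\n') if line]


unix_data = 'a,b\n1,2\n'
windows_data = 'a,b\r\n1,2\r\n'
old_mac_data = 'a,b\r1,2\r'

print(split_on_newline(unix_data))     # ['a,b', '1,2']
print(split_on_newline(windows_data))  # ['a,b\r', '1,2\r']
print(split_on_newline(old_mac_data))  # ['a,b\r1,2\r']
```

With '\r\n' data every line keeps a trailing '\r', and with '\r'-only data the whole sample is seen as a single line.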

Here is a patch (not a perfect one):
# ------- begin of patch -------
import csv


class PatchedSniffer(csv.Sniffer):

    def __init__(self):
        csv.Sniffer.__init__(self)

    def sniff(self, p_data, p_delimiters=None):
        t_dialect = csv.Sniffer.sniff(self, p_data, p_delimiters)
        # Replace the hard-coded '\r\n' with the terminator found in the data.
        t_dialect.lineterminator = self._guessLineTerminator(p_data)
        return t_dialect

    def _guessLineTerminator(self, p_data):
        # '\r\n' is tested first so that it is not mistaken for '\n' or '\r'.
        for t_lineTerminator in ['\r\n', '\n', '\r']:
            if t_lineTerminator in p_data:
                return t_lineTerminator
        else:
            return '\r\n'  # Windows default (Excel)

    def _formatDataForGuess(self, p_data):
        # Normalize to '\n' so the base class splits the lines correctly.
        t_lineTerminator = self._guessLineTerminator(p_data)
        return '\n'.join(p_data.split(t_lineTerminator))

    def _guess_delimiter(self, p_data, p_delimiters):
        t_data = self._formatDataForGuess(p_data)

        (t_delimiter, t_skipInitialSpace) = \
            csv.Sniffer._guess_delimiter(self, t_data, p_delimiters)

        # Fall back to a tab when nothing was detected but tabs are present.
        if t_delimiter == '' and '\t' in p_data:
            t_delimiter = '\t'

        return (t_delimiter, t_skipInitialSpace)
# ------- end of patch -------
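
The idea behind the patch can also be tried without subclassing. The following is only a sketch: guess_line_terminator is a hypothetical standalone helper that mirrors the terminator check, the sample string is invented, and the run assumes a reasonably recent Python where csv.Sniffer still hard-codes '\r\n'.

```python
import csv


def guess_line_terminator(sample):
    # Check '\r\n' first so it is not mistaken for a bare '\n' or '\r'.
    for terminator in ('\r\n', '\n', '\r'):
        if terminator in sample:
            return terminator
    return '\r\n'  # fallback when the sample contains no line break


sample = 'a,b,c\n1,2,3\n4,5,6\n'
dialect = csv.Sniffer().sniff(sample)

sniffed_terminator = dialect.lineterminator      # what the Sniffer reports
dialect.lineterminator = guess_line_terminator(sample)

print(repr(dialect.delimiter))       # ','
print(repr(sniffed_terminator))      # '\r\n'
print(repr(dialect.lineterminator))  # '\n'
```

The sniffed terminator is '\r\n' even though the sample only contains '\n', which is the behaviour reported above.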

Bye.
------- Laurent.
 
Steve Holden

Laurent said:
> hello,
>
> I'm using the standard csv module under Python 2.3 / 2.4 to read a CSV
> file. The file is opened in binary mode, so I keep the end of line
> terminator.

It's not advisable to open a file like a CSV, intended for use as text,
in binary mode.

> It appears that csv.Sniffer forces the line terminator to be
> '\r\n'. It's fine under Windows but wrong under Linux or
> Macintosh.

Perhaps you should try opening the file in text mode, as this will
normally end up giving you a "\n" terminator on all platforms: that's
what text mode is intended to ensure, and that's probably why the csv
module assumes that splitting on "\n" is safe.

> More about this line terminator: Potential bug in the
> _guess_delimiter() method.
> The first line of code does a wrong splitting:
> data = filter(None, data.split('\n'))
> It doesn't take care of the real line terminator!
> [...]

I suspect it's not supposed to be trying to!
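
A small sketch of the newline translation that text mode is meant to provide. It uses Python 3's io module, which did not exist in Python 2.3 / 2.4 (there, as far as I know, universal newlines had to be requested with the 'U' open mode), so treat it as an illustration of the idea rather than of those versions. The helper read_text and the sample bytes are invented.

```python
import io


def read_text(raw_bytes, newline):
    # Wrap in-memory bytes the way open() wraps a file in text mode.
    wrapper = io.TextIOWrapper(io.BytesIO(raw_bytes),
                               encoding='ascii', newline=newline)
    return wrapper.read()


raw = b'a,b\r\n1,2\r\n'

translated = read_text(raw, None)  # universal newlines: endings become '\n'
untouched = read_text(raw, '')     # endings are passed through unchanged

print(repr(translated))  # 'a,b\n1,2\n'
print(repr(untouched))   # 'a,b\r\n1,2\r\n'
```

Once the endings are translated, splitting on '\n' is safe, which fits the assumption attributed to the csv module above.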

regards
Steve
 
Marc 'BlackJack' Rintsch

Steve Holden said:
> It's not advisable to open a file like a CSV, intended for use as text,
> in binary mode.

But the docs "demand" this explicitly and all examples in the docs fulfill
that demand.

From http://docs.python.org/lib/csv-contents.html :

> If csvfile is a file object, it must be opened with the 'b' flag on
> platforms where that makes a difference.

I guess the reason is the same as for the "text" pickle format: if you don't
use binary mode, the file is not platform independent anymore because some
OSes "manipulate" the data in text mode.
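
One way to see this kind of manipulation is a quoted field that contains a line break. The sketch below uses Python 3, where the csv documentation asks for newline='' instead of the 'b' flag; the helper read_rows and the sample data are made up for the example.

```python
import csv
import io


def read_rows(raw_bytes, newline):
    # Decode in-memory bytes as text with the given newline handling,
    # then parse them with the csv module.
    text = io.TextIOWrapper(io.BytesIO(raw_bytes),
                            encoding='ascii', newline=newline)
    return list(csv.reader(text))


raw = b'id,comment\r\n1,"line one\r\nline two"\r\n'

preserved = read_rows(raw, '')     # no translation: data arrives as stored
translated = read_rows(raw, None)  # newline translation rewrites the field

print(preserved)   # [['id', 'comment'], ['1', 'line one\r\nline two']]
print(translated)  # [['id', 'comment'], ['1', 'line one\nline two']]
```

The embedded '\r\n' survives only when the data reaches the csv module untranslated.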

Ciao,
Marc 'BlackJack' Rintsch
 
