Odd csv column-name truncation with only one column

Discussion in 'Python' started by Tim Chase, Jul 19, 2012.

  1. Tim Chase

    Tim Chase Guest

    tim@laptop:~/tmp$ python
    Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
    [GCC 4.4.5] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    >>> from cStringIO import StringIO
    >>> s = StringIO('Email\\\n')
    >>> s.seek(0)
    >>> d = csv.Sniffer().sniff(s.read())
    >>> s.seek(0)
    >>> r = csv.DictReader(s, dialect=d)
    >>> r.fieldnames
    ['Emai', '']

    I get the same results using Python 3.1.3 (also readily available on
    Debian Stable), as well as working directly on a file rather than a
    StringIO.

    Any reason I'm getting ['Emai', ''] (note the missing ell) instead
    of ['Email'] as my resulting fieldnames? Did I miss something in
    the docs?

    -tkc
     
    Tim Chase, Jul 19, 2012
    #1

  2. On Thu, 19 Jul 2012 06:21:58 -0500, Tim Chase wrote:

    > tim@laptop:~/tmp$ python
    > Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
    > [GCC 4.4.5] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import csv
    > >>> from cStringIO import StringIO
    > >>> s = StringIO('Email\\\n')
    > >>> s.seek(0)
    > >>> d = csv.Sniffer().sniff(s.read())
    > >>> s.seek(0)
    > >>> r = csv.DictReader(s, dialect=d)
    > >>> r.fieldnames
    > ['Emai', '']



    I get the same results for Python 2.6 and 2.7. Curiously, 2.5 returns
    fieldnames as None.

    I'm not entirely sure that a single column is legitimate for CSV -- if
    there's only one column, it is hardly comma-separated, or any other
    separated for that matter. But perhaps the csv module should raise an
    exception in that case.

    I think you've found a weird corner case where the sniffer goes nuts. You
    should probably report it as a bug:

    py> s = StringIO('Email\\\n')
    py> s.seek(0)
    py> d = csv.Sniffer().sniff(s.read())
    py> d.delimiter
    'l'

    py> s = StringIO('Spam\\\n')
    py> s.seek(0)
    py> d = csv.Sniffer().sniff(s.read())
    py> d.delimiter
    'p'

    py> s = StringIO('Spam\nham\ncheese\n')
    py> s.seek(0)
    py> d = csv.Sniffer().sniff(s.read())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/csv.py", line 184, in sniff
        raise Error, "Could not determine delimiter"
    _csv.Error: Could not determine delimiter


    --
    Steven
     
    Steven D'Aprano, Jul 19, 2012
    #2

  3. Hans Mulder

    Hans Mulder Guest

    On 19/07/12 13:21:58, Tim Chase wrote:
    > tim@laptop:~/tmp$ python
    > Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
    > [GCC 4.4.5] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import csv
    > >>> from cStringIO import StringIO
    > >>> s = StringIO('Email\\\n')
    > >>> s.seek(0)
    > >>> d = csv.Sniffer().sniff(s.read())
    > >>> s.seek(0)
    > >>> r = csv.DictReader(s, dialect=d)
    > >>> r.fieldnames
    > ['Emai', '']
    >
    > I get the same results using Python 3.1.3 (also readily available on
    > Debian Stable), as well as working directly on a file rather than a
    > StringIO.
    >
    > Any reason I'm getting ['Emai', ''] (note the missing ell) instead
    > of ['Email'] as my resulting fieldnames? Did I miss something in
    > the docs?


    The sniffer tries to guess the column separator. If none of the
    usual suspects seems to work, it tries to find a character that
    occurs with the same frequency in every row. In your sample,
    the letter 'l' occurs exactly once on each line, so it is the
    most plausible separator, or so the Sniffer thinks.

    Perhaps it should be documented that the Sniffer doesn't work
    on single-column data.

    If you really need to read a one-column csv file, you'll have
    to find some other way to produce a Dialect object. Perhaps the
    predefined 'csv.excel' dialect matches your data. If not, the
    easiest way might be to manually define a csv.Dialect subclass.
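
    A minimal sketch of both suggestions, written for Python 3 (so
    io.StringIO stands in for the cStringIO used earlier in the thread).
    The sample data and the PipeDialect name are made up for
    illustration; the point is only that naming the dialect explicitly
    avoids the Sniffer altogether.

```python
import csv
import io

# Hypothetical one-column sample; there is no delimiter to find.
one_column = io.StringIO("Email\nalice@example.com\nbob@example.com\n")

# Skip the Sniffer and name the dialect explicitly.
reader = csv.DictReader(one_column, dialect=csv.excel)
fieldnames = reader.fieldnames
emails = [row["Email"] for row in reader]


class PipeDialect(csv.excel):
    """Illustrative dialect: same as csv.excel, but pipe-delimited."""
    delimiter = "|"


piped = io.StringIO("Email|Name\nalice@example.com|Alice\n")
piped_rows = list(csv.DictReader(piped, dialect=PipeDialect))
```

    Subclassing csv.excel and overriding only the delimiter is the same
    pattern the module uses for its own csv.excel_tab dialect.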

    Hope this helps,

    -- HansM
     
    Hans Mulder, Jul 19, 2012
    #3
  4. Tim Chase

    Tim Chase Guest

    On 07/19/12 08:52, Hans Mulder wrote:
    > Perhaps it should be documented that the Sniffer doesn't work
    > on single-column data.


    I think this would involve the least change in existing code, and
    go a long way towards removing my surprise. :)

    > If you really need to read a one-column csv file, you'll have
    > to find some other way to produce a Dialect object. Perhaps the
    > predefined 'csv.excel' dialect matches your data. If not, the
    > easiest way might be to manually define a csv.Dialect subclass.


    The problem I'm trying to solve is "here's a filename that might be
    comma/pipe/tab delimited, it has an 'email' column at minimum, and
    perhaps a couple others of interest if they were included." It's
    improbable that it's ONLY an email column, but my tests happened to
    snag this edge case. I can likely do my own sniffing by reading the
    first line, checking for tabs then pipes then commas (perhaps
    biasing the order based on the file-extension of .csv vs. .txt), and
    then building my own dialect information to pass to csv.DictReader.
    It just seems unfortunate that the sniffer would ever consider
    [a-zA-Z0-9] as a valid delimiter.
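
    The do-it-yourself sniffing described above could look roughly like
    this in Python 3. It is only a sketch: guess_delimiter and
    open_dict_reader are hypothetical helper names, the comma fallback
    for a header with no candidate delimiter is an assumption, and the
    file-extension bias is left out.

```python
import csv
import io

# Candidates in the order described above: tab, then pipe, then comma.
CANDIDATES = ("\t", "|", ",")


def guess_delimiter(first_line, candidates=CANDIDATES):
    """Return the first candidate that occurs in the header line."""
    for delimiter in candidates:
        if delimiter in first_line:
            return delimiter
    # Assumption: no candidate in the header means single-column data,
    # so fall back to a comma.
    return ","


def open_dict_reader(f):
    """Build a DictReader for a seekable text file from its header line."""
    first_line = f.readline()
    f.seek(0)
    return csv.DictReader(f, delimiter=guess_delimiter(first_line))
```

    For example, open_dict_reader(io.StringIO("Email\nbob@example.com\n"))
    yields fieldnames of ['Email'], since letters are never considered.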

    -tkc
     
    Tim Chase, Jul 19, 2012
    #4
  5. On Thu, 19 Jul 2012 13:01:37 -0500, Tim Chase declaimed the
    following in gmane.comp.python.general:

    > It just seems unfortunate that the sniffer would ever consider
    > [a-zA-Z0-9] as a valid delimiter.
    >

    I'd suspect the sniffer logic does not do any special casing -- any
    /byte value/ is a candidate for the delimiter. This would allow for
    usage of some old ASCII control characters -- things like x1F (unit
    separator).

    {Next is to rig the sniffer to identify x1F for fields, and x1E for
    records <G>}
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Jul 19, 2012
    #5
  6. On Thu, 19 Jul 2012 15:52:12 +0200, Hans Mulder wrote:

    > Perhaps it should be documented that the Sniffer doesn't work on
    > single-column data.
    >
    > If you really need to read a one-column csv file, you'll have to find
    > some other way to produce a Dialect object. Perhaps the predefined
    > 'csv.excel' dialect matches your data. If not, the easiest way might be
    > to manually define a csv.Dialect subclass.


    Perhaps the csv module could do with a pre-defined "one column" dialect.
    If anyone comes up with one, do consider proposing it as a patch on the
    bug tracker.


    --
    Steven
     
    Steven D'Aprano, Jul 20, 2012
    #6
  7. Hans Mulder

    Hans Mulder Guest

    On 19/07/12 23:10:04, Dennis Lee Bieber wrote:
    > On Thu, 19 Jul 2012 13:01:37 -0500, Tim Chase declaimed the
    > following in gmane.comp.python.general:
    >
    >> It just seems unfortunate that the sniffer would ever consider
    >> [a-zA-Z0-9] as a valid delimiter.


    +1

    > I'd suspect the sniffer logic does not do any special casing
    > -- any /byte value/ is a candidate for the delimiter.


    The sniffer prefers [',', '\t', ';', ' ', ':'] (in that order).
    If none of those is found, it goes to the other extreme and considers
    all characters equally likely.
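
    The candidate set can also be narrowed by the caller, since
    Sniffer.sniff() accepts an optional delimiters argument. A hedged
    Python 3 sketch, assuming a failed guess should fall back to
    csv.excel; sniff_or_default is a hypothetical helper name and the
    sample data is invented:

```python
import csv
import io


def sniff_or_default(sample, delimiters=",|\t", default=csv.excel):
    """Sniff among the given delimiters only; fall back to `default`."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=delimiters)
    except csv.Error:
        # Raised when the Sniffer cannot settle on a delimiter.
        return default


sample = "Email\nbob@example.com\n"
reader = csv.DictReader(io.StringIO(sample), dialect=sniff_or_default(sample))
fieldnames = reader.fieldnames
```

    With letters excluded from the candidates, the single-column header
    is read as one field instead of being split on a letter.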

    > This would allow for usage of some old ASCII control characters --
    > things like x1F (unit separator)


    If the Sniffer excludes [a-zA-Z0-9] (or all alphanumerics) as
    potential delimiters, then control characters such as "\x1F" are
    still possible.

    > {Next is to rig the sniffer to identify x1F for fields, and x1E
    > for records <G>}


    The sniffer will always guess '\r\n' as the line terminator.

    That should not stop you from creating a dialect with '\x1E' as
    the line terminator. Just don't expect the sniffer to recognize
    that dialect.
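
    A rough Python 3 sketch of such a dialect, with the class name
    AsciiSeparated invented for illustration. One caveat: the csv
    documentation notes that the reader only recognises '\r' or '\n' as
    end-of-line and ignores lineterminator, so this sketch splits the
    records by hand before parsing the fields.

```python
import csv
import io


class AsciiSeparated(csv.Dialect):
    """Illustrative dialect: unit separator for fields, record separator for rows."""
    delimiter = "\x1f"
    lineterminator = "\x1e"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    quoting = csv.QUOTE_MINIMAL


# The writer honours lineterminator.
buf = io.StringIO()
csv.writer(buf, dialect=AsciiSeparated).writerows(
    [["Email", "Name"], ["bob@example.com", "Bob"]]
)
raw = buf.getvalue()

# The reader does not, so split on the record separator manually.
# This simple split assumes no field contains "\x1e" itself.
records = [record for record in raw.split("\x1e") if record]
rows = list(csv.reader(records, dialect=AsciiSeparated))
```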

    -- HansM
     
    Hans Mulder, Jul 20, 2012
    #7
  8. On Fri, 20 Jul 2012 18:59:24 +0200, Hans Mulder declaimed the
    following in gmane.comp.python.general:


    > The sniffer will always guess '\r\n' as the line terminator.
    >
    > That should not stop you from creating a dialect with '\x1E' as
    > the line terminator. Just don't expect the sniffer to recognize
    > that dialect.
    >

    {devil's advocate}: Maybe it's time to expand the CSV module... Of
    course, if we set it to recognize x1E as a record separator, we should
    be fair and also incorporate the other two ASCII "separator" codes.

    x1D (group separator) could be used to signal a "new table", i.e., a
    change in record structure (number of columns, header labels). And then
    x1C (file separator) could represent a new "worksheet" (in Excel terms).

    We'd need some sort of flag/query method to detect these changes, of
    course.

    while not csv.EndOfSheet():
        while not csv.EndOfTable():
            ...

    And then there is the potential of using <VT> and <FF> as
    equivalents for x1D and x1C (for those files using <TAB> and <CR><LF> as
    field/record separators).

    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Jul 20, 2012
    #8