detecting newline character

Discussion in 'Python' started by Daniel Geržo, Apr 23, 2011.

  1. Hello guys,

    I need to detect the newline characters used in the file I am reading.
    For this purpose I am using the following code:

    def _read_lines(self):
        with contextlib.closing(codecs.open(self.path, "rU")) as fobj:
            fobj.readlines()
            if isinstance(fobj.newlines, tuple):
                self.newline = fobj.newlines[0]
            else:
                self.newline = fobj.newlines

    This works fine if I call codecs.open() without the encoding argument; I am
    testing with an ASCII English text file, and in that case
    fobj.newlines is correctly detected as '\r\n'. However, when I
    call codecs.open() with the encoding='ascii' argument, fobj.newlines is
    None and I can't figure out why that is the case. Reading the PEP at
    http://www.python.org/dev/peps/pep-0278/ I don't see any reason why
    I would end up with newlines being None after I call readlines().

    Does anyone have an idea? You can fetch the file I am testing with from
    http://danger.rulez.sk/subrip_ascii.srt

    Thanks.
     
    Daniel Geržo, Apr 23, 2011
    #1

  2. I see nothing suspicious in your .srt *after* downloading it. file -i
    confirms that it only contains US-ASCII characters (but see below).

    The only reason I can think of for this not working ATM comes from the
    documentation, where it says that 'U' requires Python to be built with
    universal newline support; that it is *usually* so, but might not be so in
    your case (but then the question remains: How could it be not None without
    `encoding' argument?)

    <http://docs.python.org/library/codecs.html?highlight=codecs.open#codecs.open>
    <http://docs.python.org/library/functions.html#open>

    WFM with and without `encoding' argument in python-2.7.1-8 (CPython), Debian
    GNU/Linux 6.0.1, Linux 2.6.35.5-pe (custom) SMP i686.

    Which Python implementation and version are you using on which system?

    On which system has the "ASCII" file been created and how? Note that both
    uploading the file with FTP in ASCII mode and downloading over HTTP might
    have removed the problem Python has with it.
     
    Thomas 'PointedEars' Lahn, Apr 23, 2011
    #2

  3. That is indeed the case in my environment too.

    danger@[danger-mbp ~/devel/pysublib/pysublib/test/files]> file -i subrip_ascii.srt
    subrip_ascii.srt: regular file
    danger@[danger-mbp ~/devel/pysublib/pysublib/test/files]> file subrip_ascii.srt
    subrip_ascii.srt: ASCII English text, with CRLF line terminators

    Yes, this is what does not make sense. If I didn't have the universal
    newline support enabled, I wouldn't have the newlines attribute at all.
    This is a standard python installation from MacPorts. System is OS X
    10.6.7. I have now tried both python 2.7.1 and python 2.6.6 from
    MacPorts and also 2.6.6 on FreeBSD. All fail for me when I set encoding.
    Unfortunately I am not 100% sure where I created the file, it was quite
    some time ago, but it was either WinXP, or OS X Leopard. The source code
    can be found at https://bitbucket.org/danger/pysublib/src - I noticed
    the subtitle file tests (e.g. test/test_subripfile.py) are failing for
    me and I have identified the problem with newlines being None after
    calling read().
     
    Daniel Geržo, Apr 24, 2011
    #3
  4. True. But it is good to know to test with hasattr(fileobj, 'newlines')!
    I think this discussion, in particular <>, and finally
    <http://bugs.python.org/issue691291>, now provides a good explanation.

    To summarize:

    1. From Python 2.6.5-rc1 and Python 2.7-alpha4 forward, codecs.open()
    does not support universal newlines and will ignore any 'U' in its
    `mode' argument when the `encoding' argument is different from None.

    2. As a result, file.newlines will be None if it exists.

    3. This is by design, fixing a bug back from Python 2.3a.

    4. Use another approach.

    :)
    Well, you have two alternatives now (codecs.open() followed by a
    regular-expression search for the line terminators in the text read,
    roughly list(set(re.findall(newlines, text))), and io.open()), and you
    appear to have decided on `io', so there should not be a problem anymore.

    I wish you good luck with your project, it looks really interesting (I
    remember having written a DVD subtitle script based on gocr in bash a few
    years ago).
     
    Thomas 'PointedEars' Lahn, Apr 24, 2011
    #4