detecting newline character

Daniel Geržo · Apr 23, 2011

Hello guys,

I need to detect the newline characters used in the file I am reading.
For this purpose I am using the following code:

import codecs
import contextlib

def _read_lines(self):
    with contextlib.closing(codecs.open(self.path, "rU")) as fobj:
        # Read everything so that fobj.newlines gets populated.
        fobj.readlines()
        if isinstance(fobj.newlines, tuple):
            self.newline = fobj.newlines[0]
        else:
            self.newline = fobj.newlines

This works fine if I call codecs.open() without the encoding argument; I am
testing with an ASCII English text file, and in that case
fobj.newlines is correctly detected as '\r\n'. However, when I
call codecs.open() with the encoding='ascii' argument, fobj.newlines is
None and I can't figure out why that is the case. Reading the PEP at
http://www.python.org/dev/peps/pep-0278/ I don't see any reason why
I would end up with newlines being None after I call readlines().

Does anyone have an idea? You can fetch the file I am testing with from
http://danger.rulez.sk/subrip_ascii.srt

Thanks.

Thomas 'PointedEars' Lahn · Apr 23, 2011

I see nothing suspicious in your .srt *after* downloading it. file -i
confirms that it only contains US-ASCII characters (but see below).

The only reason I can think of for this not working ATM comes from the
documentation, where it says that 'U' requires Python to be built with
universal newline support; that it is *usually* so, but might not be so in
your case (but then the question remains: How could it be not None without
`encoding' argument?)

<http://docs.python.org/library/codecs.html?highlight=codecs.open#codecs.open>
<http://docs.python.org/library/functions.html#open>

WFM with and without `encoding' argument in python-2.7.1-8 (CPython), Debian
GNU/Linux 6.0.1, Linux 2.6.35.5-pe (custom) SMP i686.

Which Python implementation and version are you using on which system?

On which system has the "ASCII" file been created and how? Note that both
uploading the file with FTP in ASCII mode and downloading over HTTP might
have removed the problem Python has with it.

Daniel Geržo · Apr 24, 2011

Thomas said:
I see nothing suspicious in your .srt *after* downloading it. file -i
confirms that it only contains US-ASCII characters (but see below).

That is indeed the case in my environment too.

danger@[danger-mbp ~/devel/pysublib/pysublib/test/files]> file -i subrip_ascii.srt
subrip_ascii.srt: regular file
danger@[danger-mbp ~/devel/pysublib/pysublib/test/files]> file subrip_ascii.srt
subrip_ascii.srt: ASCII English text, with CRLF line terminators

The only reason I can think of for this not working ATM comes from the
documentation, where it says that 'U' requires Python to be built with
universal newline support; that it is *usually* so, but might not be so in
your case (but then the question remains: How could it be not None without
`encoding' argument?)

Yes, this is what does not make sense. If I didn't have the universal
newline support enabled, I wouldn't have the newlines attribute at all.

<http://docs.python.org/library/codecs.html?highlight=codecs.open#codecs.open>
<http://docs.python.org/library/functions.html#open>

WFM with and without `encoding' argument in python-2.7.1-8 (CPython), Debian
GNU/Linux 6.0.1, Linux 2.6.35.5-pe (custom) SMP i686.

Which Python implementation and version are you using on which system?

This is a standard python installation from MacPorts. System is OS X
10.6.7. I have now tried both python 2.7.1 and python 2.6.6 from
MacPorts and also 2.6.6 on FreeBSD. All fail for me when I set encoding.

On which system has the "ASCII" file been created and how? Note that both
uploading the file with FTP in ASCII mode and downloading over HTTP might
have removed the problem Python has with it.

Unfortunately I am not 100% sure where I created the file, it was quite
some time ago, but it was either WinXP, or OS X Leopard. The source code
can be found at https://bitbucket.org/danger/pysublib/src - I noticed
the subtitle file tests (e.g. test/test_subripfile.py) are failing for
me and I have identified the problem with newlines being None after
calling read().

Thomas 'PointedEars' Lahn · Apr 24, 2011

Daniel said:
Daniel said:

[f = codecs.open(…, mode='rU', encoding='ascii') and f.newlines]

[…]
The only reason I can think of for this not working ATM comes from the
documentation, where it says that 'U' requires Python to be built with
universal newline support; that it is *usually* so, but might not be so
in your case (but then the question remains: How could it be not None
without `encoding' argument?)

Yes, this is what does not make sense. If I didn't have the universal
newline support enabled, I wouldn't have the newlines attribute at all.

True. Still, it is good to know that a test with hasattr(fileobj, 'newlines')
is worth having!
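
A minimal sketch of such a check follows. The helper name recorded_newlines and the NoNewlines class are hypothetical and only serve the illustration. It uses getattr() with a default instead of hasattr(), so that a file-like object without the attribute is treated the same as one whose attribute is None.

```python
import io


def recorded_newlines(fobj):
    """Return the newline sequences recorded on fobj as a tuple."""
    # getattr() with a default covers file-like objects that have no
    # `newlines` attribute at all.
    seen = getattr(fobj, "newlines", None)
    if seen is None:
        return ()
    if isinstance(seen, tuple):
        return seen
    return (seen,)


class NoNewlines(object):
    """Hypothetical file-like object without a `newlines` attribute."""


if __name__ == "__main__":
    print(recorded_newlines(NoNewlines()))
    print(recorded_newlines(io.BytesIO(b"a\r\nb\r\n")))
```

Both calls return an empty tuple, because neither object records newline information.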

This is a standard python installation from MacPorts. System is OS X
10.6.7. I have now tried both python 2.7.1 and python 2.6.6 from
MacPorts and also 2.6.6 on FreeBSD. All fail for me when I set encoding.

I think this discussion, in particular <[email protected]> and finally
<http://bugs.python.org/issue691291>, provides a good explanation now.

To summarize:

1. From Python 2.6.5-rc1 and Python 2.7-alpha4 forward, codecs.open()
does not support universal newlines and will ignore any 'U' in its
`mode' argument when the `encoding' argument is different from None.

2. As a result, file.newlines will be None if it exists.

3. This is by design; it fixes a bug dating back to Python 2.3a.

4. Use another approach.
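
One possible "other approach" is io.open(), which is available from Python 2.6 on. It accepts an encoding and still records the line endings it translated. The sketch below is a hedged illustration and not the code from the project: detect_newline and write_sample are hypothetical names, and the sample bytes are made up. When a file mixes conventions, `newlines` is a tuple; the sketch simply takes its first entry, and the order of that tuple should be treated as an implementation detail.

```python
import io
import os
import tempfile


def detect_newline(path, encoding="ascii"):
    """Return a newline sequence recorded while reading path, or None."""
    # With the default newline=None, io.open() translates line endings and
    # records the kinds it encountered in the `newlines` attribute.
    with io.open(path, "r", encoding=encoding) as fobj:
        fobj.read()
        seen = fobj.newlines
    # `newlines` is None, a single string, or a tuple when endings are mixed.
    if isinstance(seen, tuple):
        return seen[0]
    return seen


def write_sample(data):
    """Hypothetical helper: write bytes to a temporary file, return its path."""
    fd, path = tempfile.mkstemp(suffix=".srt")
    with os.fdopen(fd, "wb") as out:
        out.write(data)
    return path


if __name__ == "__main__":
    sample = write_sample(b"1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n")
    try:
        print(repr(detect_newline(sample)))
    finally:
        os.remove(sample)
```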

Unfortunately I am not 100% sure where I created the file, it was quite
some time ago, but it was either WinXP, or OS X Leopard. The source code
can be found at https://bitbucket.org/danger/pysublib/src - I noticed
the subtitle file tests (e.g. test/test_subripfile.py) are failing for
me and I have identified the problem with newlines being None after
calling read().

Well, you have two alternatives now: codecs.open() combined with a
regular-expression search for the newline sequences in the data you read, or
io.open(). You appear to have decided on `io', so there should not be a
problem anymore.
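
The regular-expression alternative is only hinted at above, so here is a hedged sketch of one way it could look. It is not the exact expression from the post. It assumes the file was read in binary mode and that its encoding is ASCII-compatible (it would not work for UTF-16, for example). The names NEWLINE_RE, count_newlines and dominant_newline are hypothetical.

```python
import re
from collections import Counter

# Try "\r\n" first so a CRLF pair is not counted as a lone "\r" plus "\n".
NEWLINE_RE = re.compile(b"\r\n|\r|\n")


def count_newlines(data):
    """Count each newline convention found in a bytes object."""
    return Counter(NEWLINE_RE.findall(data))


def dominant_newline(data):
    """Return the most frequent newline sequence as bytes, or None."""
    counts = count_newlines(data)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    raw = b"1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n"
    print(dominant_newline(raw))
```

Counting the occurrences, rather than taking the first match, gives a sensible answer for files whose line endings are mixed.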

I wish you good luck with your project, it looks really interesting (I
remember having written a DVD subtitle script based on gocr in bash a few
years ago).
