detecting newline character

Daniel Geržo

Hello guys,

I need to detect the newline characters used in the file I am reading.
For this purpose I am using the following code:

def _read_lines(self):
    with contextlib.closing(codecs.open(self.path, "rU")) as fobj:
        fobj.readlines()
        if isinstance(fobj.newlines, tuple):
            self.newline = fobj.newlines[0]
        else:
            self.newline = fobj.newlines

This works fine if I call codecs.open() without the encoding argument; I am
testing with an ASCII English text file, and in that case
fobj.newlines is correctly detected as '\r\n'. However, when I
call codecs.open() with the encoding='ascii' argument, fobj.newlines is
None and I can't figure out why that is the case. Reading the PEP at
http://www.python.org/dev/peps/pep-0278/ I don't see any reason why
I would end up with newlines being None after I call readlines().

Does anyone have an idea? You can fetch the file I am testing with from
http://danger.rulez.sk/subrip_ascii.srt

Thanks.
 
Thomas 'PointedEars' Lahn

I see nothing suspicious in your .srt *after* downloading it. file -i
confirms that it only contains US-ASCII characters (but see below).

The only reason I can think of for this not working ATM comes from the
documentation, where it says that 'U' requires Python to be built with
universal newline support; that it is *usually* so, but might not be so in
your case (but then the question remains: How could it be not None without
`encoding' argument?)

<http://docs.python.org/library/codecs.html?highlight=codecs.open#codecs.open>
<http://docs.python.org/library/functions.html#open>

WFM with and without `encoding' argument in python-2.7.1-8 (CPython), Debian
GNU/Linux 6.0.1, Linux 2.6.35.5-pe (custom) SMP i686.

Which Python implementation and version are you using on which system?

On which system has the "ASCII" file been created and how? Note that both
uploading the file with FTP in ASCII mode and downloading over HTTP might
have removed the problem Python has with it.
 
Daniel Geržo

Thomas said:
I see nothing suspicious in your .srt *after* downloading it. file -i
confirms that it only contains US-ASCII characters (but see below).

That is indeed the case in my environment too.

danger@[danger-mbp ~/devel/pysublib/pysublib/test/files]> file -i subrip_ascii.srt
subrip_ascii.srt: regular file
danger@[danger-mbp ~/devel/pysublib/pysublib/test/files]> file subrip_ascii.srt
subrip_ascii.srt: ASCII English text, with CRLF line terminators

The only reason I can think of for this not working ATM comes from the
documentation, where it says that 'U' requires Python to be built with
universal newline support; that it is *usually* so, but might not be so in
your case (but then the question remains: How could it be not None without
`encoding' argument?)

Yes, this is what does not make sense. If I didn't have the universal
newline support enabled, I wouldn't have the newlines attribute at all.

<http://docs.python.org/library/codecs.html?highlight=codecs.open#codecs.open>
<http://docs.python.org/library/functions.html#open>

WFM with and without `encoding' argument in python-2.7.1-8 (CPython), Debian
GNU/Linux 6.0.1, Linux 2.6.35.5-pe (custom) SMP i686.

Which Python implementation and version are you using on which system?

This is a standard Python installation from MacPorts. The system is OS X
10.6.7. I have now tried both Python 2.7.1 and Python 2.6.6 from
MacPorts and also 2.6.6 on FreeBSD. All fail for me when I set encoding.

On which system has the "ASCII" file been created and how? Note that both
uploading the file with FTP in ASCII mode and downloading over HTTP might
have removed the problem Python has with it.

Unfortunately I am not 100% sure where I created the file, it was quite
some time ago, but it was either WinXP, or OS X Leopard. The source code
can be found at https://bitbucket.org/danger/pysublib/src - I noticed
the subtitle file tests (e.g. test/test_subripfile.py) are failing for
me and I have identified the problem with newlines being None after
calling read().
 
Thomas 'PointedEars' Lahn

Daniel said:
[f = codecs.open(…, mode='rU', encoding='ascii') and f.newlines]

[…]
The only reason I can think of for this not working ATM comes from the
documentation, where it says that 'U' requires Python to be built with
universal newline support; that it is *usually* so, but might not be so
in your case (but then the question remains: How could it be not None
without `encoding' argument?)

Yes, this is what does not make sense. If I didn't have the universal
newline support enabled, I wouldn't have the newlines attribute at all.

True. But good to know to have a test with hasattr(fileobj, 'newlines')!

This is a standard python installation from MacPorts. System is OS X
10.6.7. I have now tried both python 2.7.1 and python 2.6.6 from
MacPorts and also 2.6.6 on FreeBSD. All fail for me when I set encoding.

I think this discussion, in particular <[email protected]> and finally
<http://bugs.python.org/issue691291>, is providing a good explanation now.

To summarize:

1. From Python 2.6.5-rc1 and Python 2.7-alpha4 forward, codecs.open()
does not support universal newlines and will ignore any 'U' in its
`mode' argument when the `encoding' argument is different from None.

2. As a result, file.newlines will be None if it exists.

3. This is by design, fixing a bug back from Python 2.3a.

4. Use another approach.

:)
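Points 1 and 2 of the summary can be observed with a small probe. This is only a sketch under assumptions: the helper name `probe_codecs_open` and the sample subtitle data are made up for illustration, and the mode is plain "r" rather than "rU" because recent Python releases reject the 'U' flag. `getattr` with a default is used because, depending on the Python version, the wrapped file object may have a `newlines` attribute set to None or no such attribute at all.

```python
import codecs
import os
import tempfile
import warnings


def probe_codecs_open(path, encoding="ascii"):
    """Return (text, newlines) as seen through codecs.open() with an encoding."""
    with warnings.catch_warnings():
        # Newer Python releases may emit a DeprecationWarning for codecs.open().
        warnings.simplefilter("ignore", DeprecationWarning)
        fobj = codecs.open(path, "r", encoding=encoding)
    try:
        text = fobj.read()
        # With an encoding given, the underlying file is opened in binary
        # mode, so no newline bookkeeping takes place.
        newlines = getattr(fobj, "newlines", None)
    finally:
        fobj.close()
    return text, newlines


# Hypothetical sample data with CRLF line terminators.
fd, sample_path = tempfile.mkstemp(suffix=".srt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n")

text, newlines = probe_codecs_open(sample_path)
os.remove(sample_path)
```

In this sketch the decoded text still contains the untranslated "\r\n" sequences, while `newlines` carries no information about them.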
Unfortunately I am not 100% sure where I created the file, it was quite
some time ago, but it was either WinXP, or OS X Leopard. The source code
can be found at https://bitbucket.org/danger/pysublib/src - I noticed
the subtitle file tests (e.g. test/test_subripfile.py) are failing for
me and I have identified the problem with newlines being None after
calling read().

Well, you have two alternatives now (codecs.open() with
list(set(re.search(newlines, readlines()))) and io.open()), and you appear to
have decided for `io', so there should not be a problem anymore.
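The two alternatives mentioned above might look roughly like the following. This is a sketch, not the code used in the project: the function names `detect_newline_io` and `detect_newlines_regex` and the sample data are hypothetical, and the regex variant scans raw bytes, which assumes an ASCII-compatible encoding (it would not be reliable for UTF-16, for example).

```python
import io
import os
import re
import tempfile


def detect_newline_io(path, encoding="ascii"):
    """Let io.open() track the newline styles while decoding."""
    # newline=None (the default) enables universal newlines on reading.
    with io.open(path, "r", encoding=encoding, newline=None) as fobj:
        fobj.read()
        seen = fobj.newlines
    # `newlines` is None, a single string, or a tuple of strings.
    if isinstance(seen, tuple):
        return seen[0]
    return seen


def detect_newlines_regex(path):
    """Collect the distinct line terminators by scanning the raw bytes."""
    with io.open(path, "rb") as fobj:
        data = fobj.read()
    # Try "\r\n" first so a CRLF pair is not counted as CR plus LF.
    return set(re.findall(b"\r\n|\r|\n", data))


# Hypothetical sample data with CRLF line terminators.
fd, sample_path = tempfile.mkstemp(suffix=".srt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n")

io_result = detect_newline_io(sample_path)
regex_result = detect_newlines_regex(sample_path)
os.remove(sample_path)
```

The `io` variant decodes and tracks newlines in one pass; the regex variant returns every terminator style found, which can be useful when a file mixes conventions.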

I wish you good luck with your project, it looks really interesting (I
remember having written a DVD subtitle script based on gocr in bash a few
years ago).
 
