Unicode and Python - how often do you index strings?

Ryan Hiebert

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:
On 05.06.2014 20:16, Paul Rubin wrote:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.


How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?
>>> line = "Hello,\nworld!\n\n"
>>> line[:-1]
'Hello,\nworld!\n'
>>> line.rstrip('\n')
'Hello,\nworld!'

If it's guaranteed to end with exactly one newline, then and only then
will they be identical.

OK, that's not an issue for my case, and additionally I'm using the
open(_, 'U') file iterable, so I shouldn't see multiple trailing newlines
anyway.
 
Ian Kelly

Ryan Hiebert said:
How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.

Given the description that the input string is "a textfile line", if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)
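
As a minimal sketch of the "remove exactly one terminator" idea (the Perl chomp behaviour described above), the helper below is hypothetical and not from the thread. It assumes Python 3 and anchors with \Z instead of $, because $ can also match just before a final newline. Note that the signature is re.sub(pattern, repl, string, count=0), so the replacement comes before the input string.

```python
import re

# Hypothetical helper, not part of the original thread.
# \Z matches only at the very end of the string, whereas $ may also
# match just before a trailing newline.
_TERMINATOR = re.compile(r'(?:\r\n|\n\r|\r|\n)\Z')


def chomp(line):
    """Remove at most one trailing line terminator from line."""
    return _TERMINATOR.sub('', line, count=1)


print(repr(chomp('Hello,\nworld!\n\n')))   # one of the two newlines goes
print(repr(chomp('dos line\r\n')))         # \r\n counts as one terminator
print(repr(chomp('no terminator')))        # unchanged
```

Unlike rstrip('\r\n'), this leaves a second trailing newline in place, and unlike line[:-1] it never removes a character that is not a terminator.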
 
Albert-Jan Roskam

----- Original Message -----
From: Ian Kelly <[email protected]>
To: Python <[email protected]>
Cc:
Sent: Thursday, June 5, 2014 10:18 PM
Subject: Re: Unicode and Python - how often do you index strings?

Ryan Hiebert said:
How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.

Given the description that the input string is "a textfile line", if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

I tend to use: s.rstrip(os.linesep)

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub("[^ \S]+$", "", line)
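
One thing worth keeping in mind with rstrip(os.linesep): the argument to str.rstrip is treated as a set of characters, not as a suffix, and os.linesep differs by platform ('\n' on POSIX, '\r\n' on Windows). A small sketch, assuming Python 3 and using literal strings so the result does not depend on where it runs:

```python
import os

line = 'data\r\n'

# Equivalent to rstrip(os.linesep) on a POSIX system: the \r survives.
print(repr(line.rstrip('\n')))

# Equivalent to rstrip(os.linesep) on Windows: both characters are stripped.
print(repr(line.rstrip('\r\n')))

# The argument is a character set, so any trailing mix of \r and \n goes.
print(repr('data\n\r\n\n'.rstrip('\r\n')))

# Shows what this platform's separator actually is.
print(repr(os.linesep))
```

So a DOS-encoded line read in binary mode on Linux would keep its '\r' after rstrip(os.linesep).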
 
Roy Smith

----- Original Message -----
From: Ian Kelly <[email protected]>
Cc:
Sent: Thursday, June 5, 2014 10:18 PM
Subject: Re: Unicode and Python - how often do you index strings?

How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.

Given the description that the input string is "a textfile line", if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

I tend to use: s.rstrip(os.linesep)

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub("[^ \S]+$", "", line)

Just for fun, I took a screen-shot of what this looks like in my
newsreader. URL below. Looks like something chomped on unicode pretty
hard :)

http://www.panix.com/~roy/unicode.pdf
 
Ned Deily

Rustom Mody said:
Yiiiiiiiiiiiiiiiiii!!!!!!!!!!!!

Roy is using MT-NewsWatcher as a client. Because its codebase's origins
are back in classic MacOS (<= 9), it has its own *interesting* ways to
deal with encodings. BTW, don't upgrade to OS X 10.9 Mavericks if
you're dependent on MT-NW; it finally stops working there because what
was left of Open Transport support in OS X has finally been ripped out
of 10.9.
 
Ian Kelly

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub("[^ \S]+$", "", line)

That will remove more than one terminator, plus tabs. Points for
including \f and \v though.

I suppose if we want to be absolutely correct, we should follow the
Unicode standard:
re.sub(r'\r?\n$|[\r\v\f\x85\u2028\u2029]$', '', line, count=1)
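
A hedged sketch of the Unicode-aware variant, assuming Python 3. The name chomp_unicode is made up for illustration. The pattern is written as an ordinary (non-raw) string so the escapes become literal characters before the re module sees them, and it uses \Z so that only a terminator at the very end of the string is removed.

```python
import re

# Hypothetical helper, not from the thread. CRLF is tried first so that
# it is removed as a single terminator rather than as a lone \n.
_UNICODE_EOL = re.compile('(?:\r\n|[\n\r\v\f\x85\u2028\u2029])\\Z')


def chomp_unicode(line):
    """Remove at most one trailing line terminator, including the
    Unicode ones listed above (NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR)."""
    return _UNICODE_EOL.sub('', line, count=1)


print(repr(chomp_unicode('first\u2028')))
print(repr(chomp_unicode('second\r\n')))
print(repr(chomp_unicode('third\n\n')))

# str.splitlines() in Python 3 also treats these characters as boundaries.
print('one\u2028two'.splitlines())
```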
 
Roy Smith

Ned Deily said:
Roy is using MT-NewsWatcher as a client.

Yes. Except for the fact that it hasn't kept up with unicode, I find
the U/I pretty much perfect. I imagine at some point I'll be forced to
look elsewhere, but then again, netnews is pretty much dead.

BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on
MT-NW; it finally stops working there because what was left of Open
Transport support in OS X has finally been ripped out of 10.9.

Hmmm, good to know. I'm still on 10.7, and don't see any reason to
move. But, then again, you'd expect that from somebody who's still on
Python 2.x, wouldn't you?
 
Ned Deily

Yes. Except for the fact that it hasn't kept up with unicode, I find
the U/I pretty much perfect. I imagine at some point I'll be forced to
look elsewhere, but then again, netnews is pretty much dead.

I agree about the U/I, although I'm sure a lot of that has to do with
familiarity. However, netnews isn't dead, it has just morphed a bit. A
newsreader, like MT-NW, is great for following mailing lists like this
(and most other Python-related lists) via gmane.org's bi-directional
mailing list - NNTP gateways. And for this list it's usually better to
read the mailing list variant via gmane.org NNTP than the Usenet group
variant via a traditional USENET NNTP server because there's less spam
with the former.

Hmmm, good to know. I'm still on 10.7, and don't see any reason to
move. But, then again, you'd expect that from somebody who's still on
Python 2.x, wouldn't you?

Heh. Well, both 10.8 and 10.9 provided various improvements, both feature
and performance, over 10.7. Alas, Apple won't likely be supporting 10.7
with security updates for as long as the PSF will be supporting 2.7.x.
But, by then, you'll have had a chance to re-implement MT-NW in Python.
 
Johannes Bauer

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.

How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

Ah, I didn't know rstrip() accepted parameters and since you wrote
line.rstrip() this would also cut away whitespaces (which sadly are
relevant in odd cases).

Thanks for the clarification, I'll definitely introduce that.
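
For reference, a tiny sketch of the difference being discussed, assuming a line whose trailing whitespace matters: with no argument, rstrip() removes all trailing whitespace, while rstrip('\n') removes only trailing newline characters.

```python
# Illustrative input: trailing space and tab are significant here.
line = 'key=value \t\n'

stripped_all = line.rstrip()        # drops the space, the tab and the newline
stripped_nl = line.rstrip('\n')     # drops only the newline

print(repr(stripped_all))
print(repr(stripped_nl))
```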

Cheers,
Johannes

--
At least not publicly!
Ah, the latest and, to this day, most brilliant stunt of our great
cosmologists: the secret prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
Johannes Bauer

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

Hm, I was under the impression that Python already took care of removing
the \r at a line ending. Checking that right now:

(DOS encoded file "y")....
b'foo\n'
b'bar\n'
b'moo\n'
b'koo\n'

Yup, the \r was removed automatically. Are there cases when it isn't?

Cheers,
Johannes

 
Tim Chase

Hm, I was under the impression that Python already took care of
removing the \r at a line ending. Checking that right now:

(DOS encoded file "y")
...
b'foo\n'
b'bar\n'
b'moo\n'
b'koo\n'

Yup, the \r was removed automatically. Are there cases when it
isn't?

It's possible if the file is opened as binary:
....
'hello\r\n'
'world\r\n'
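
Since the session above is elided, here is a self-contained sketch of the same effect, assuming Python 3 (where binary reads give bytes rather than str). The temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small DOS-encoded file to read back (illustrative data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello\r\nworld\r\n')

# Binary mode: bytes come back exactly as stored, \r included.
with open(path, 'rb') as f:
    binary_lines = list(f)

# Text mode with default settings: line endings are translated to \n.
with open(path, 'r', encoding='ascii') as f:
    text_lines = list(f)

os.remove(path)

print(binary_lines)
print(text_lines)
```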


-tkc
 
Steven D'Aprano

Hm, I was under the impression that Python already took care of removing
the \r at a line ending. Checking that right now:
[snip example]

This is called "Universal Newlines". Technically it is a build-time
option which applies when you build the Python interpreter from source,
so, yes, some Pythons may not implement it at all. But I think that it
has been on by default for a long time, and the option to turn it off may
have been removed in Python 3.3 or 3.4. In practical terms, you should
normally expect it to be on.


Here's the PEP that introduced it:
http://legacy.python.org/dev/peps/pep-0278/


The idea is that when universal newlines support is enabled, Python by
default will convert any of \n, \r or \r\n into \n when reading from a
file in text mode, and convert back the other way when writing the file.

In binary mode, newlines are *never* changed.

In Python 3, you can return end-of-lines unchanged by passing newline=''
to the open() function.

https://docs.python.org/2/library/functions.html#open
https://docs.python.org/3/library/functions.html#open
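
A short sketch of the newline parameter in Python 3. To avoid touching the filesystem it wraps an in-memory buffer with io.TextIOWrapper, which accepts the same newline argument as open(); the sample bytes are made up for illustration.

```python
import io

# Illustrative data mixing all three line-ending conventions.
data = b'dos\r\nold mac\runix\n'


def read_lines(newline):
    """Decode data as text using the given newline setting."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding='ascii',
                              newline=newline)
    return list(stream)


# Default (newline=None): every ending is translated to \n.
translated = read_lines(None)

# newline='': lines are still split on \n, \r and \r\n, but the
# endings are returned unchanged.
untranslated = read_lines('')

print(translated)
print(untranslated)
```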
 
Grant Edwards

Yes. Except for the fact that it hasn't kept up with unicode, I find
the U/I pretty much perfect. I imagine at some point I'll be forced to
look elsewhere, but then again, netnews is pretty much dead.

There are still a few active groups, but reading e-mail lists via NNTP
(in my case using slrn) via gmane is a huge reason to have an
efficient, well-designed "news" client.

If usenet does really pack it in someday and I have to switch from
comp.lang.python to the mailing list, it will be done by pointing slrn
at news.gmane.org -- not by having all those e-mails sent to me so I
can try to sort through them...
 
