Unicode and Python - how often do you index strings?

Ryan Hiebert · Jun 5, 2014

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:

On 05.06.2014 20:16, Paul Rubin wrote:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.

How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

>>> line = "Hello,\nworld!\n\n"
>>> line[:-1]
'Hello,\nworld!\n'
>>> line.rstrip('\n')
'Hello,\nworld!'

If it's guaranteed to end with exactly one newline, then and only then
will they be identical.

OK, that's not an issue for my case, and additionally I'm using the
open(_, 'U') file iterable, so I shouldn't see multiple trailing newlines
anyway.

Ian Kelly · Jun 5, 2014

Ryan Hiebert said:

How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In Perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.
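
To make the difference concrete, here is a minimal sketch (assuming Python 3; the sample strings are made up for illustration) that compares the slice with rstrip('\n') on lines ending in zero, one, or two newlines:

```python
# Sample lines ending in zero, one, and two newlines (illustrative data only).
samples = ["no newline", "one newline\n", "two newlines\n\n"]

results = []
for s in samples:
    sliced = s[:-1]            # always drops exactly one character, newline or not
    stripped = s.rstrip("\n")  # drops every trailing "\n", possibly none
    results.append((sliced, stripped))
    print(repr(s), "->", repr(sliced), "vs", repr(stripped))
```

Only the middle sample, which ends in exactly one newline, gives the same result both ways.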

Given the description that the input string is "a textfile line", if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)
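
As a rough sketch of the "exactly one terminator" idea (assuming Python 3; the helper name chomp is hypothetical, borrowed from the Perl function mentioned earlier), the pattern can be wrapped in a small function. Note that re.sub takes the replacement before the string being searched:

```python
import re

# The pattern from the post, compiled once. It matches \n, \r\n, \r, or \n\r at the end.
_ONE_TERMINATOR = re.compile(r"\r?\n$|\n?\r$")

def chomp(line):
    """Remove at most one trailing line terminator from line."""
    # Pattern.sub takes (replacement, string); count=1 stops after the first match.
    return _ONE_TERMINATOR.sub("", line, count=1)

for sample in ["foo\n", "foo\r\n", "foo\r", "foo\n\n", "foo"]:
    print(repr(sample), "->", repr(chomp(sample)))
```

Unlike rstrip('\r\n'), this leaves "foo\n\n" with one newline still attached, and unlike line[:-1] it leaves "foo" untouched.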

Albert-Jan Roskam · Jun 5, 2014

On Thursday, June 5, 2014 10:18 PM, Ian Kelly wrote:

Ryan Hiebert said:

How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In Perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.

Given the description that the input string is "a textfile line", if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

I tend to use: s.rstrip(os.linesep)

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub("[^ \S]+$", "", line)

Roy Smith · Jun 5, 2014

Albert-Jan Roskam said:

How so? I was using line=line[:-1] for removing the trailing newline,
and replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In Perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.

Given the description that the input string is "a textfile line", if
it has multiple newlines then it's invalid.

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

I tend to use: s.rstrip(os.linesep)

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub("[^ \S]+$", "", line)

Just for fun, I took a screen-shot of what this looks like in my
newsreader. URL below. Looks like something chomped on unicode pretty
hard

http://www.panix.com/~roy/unicode.pdf

Rustom Mody · Jun 5, 2014

Just for fun, I took a screen-shot of what this looks like in my
newsreader. URL below. Looks like something chomped on unicode pretty
hard

http://www.panix.com/~roy/unicode.pdf

Yiiiiiiiiiiiiiiiiii!!!!!!!!!!!!

Ned Deily · Jun 5, 2014

Rustom Mody said:
Yiiiiiiiiiiiiiiiiii!!!!!!!!!!!!

Roy is using MT-NewsWatcher as a client. Because its codebase's origins
are back in classic MacOS (<= 9), it has its own *interesting* ways to
deal with encodings. BTW, don't upgrade to OS X 10.9 Mavericks if
you're dependent on MT-NW; it finally stops working there because what
was left of Open Transport support in OS X has finally been ripped out
of 10.9.

Ian Kelly · Jun 6, 2014

If you want to be really picky about removing exactly one line
terminator, then this captures all the relatively modern variations:
re.sub('\r?\n$|\n?\r$', '', line, count=1)

or perhaps: re.sub("[^ \S]+$", "", line)

That will remove more than one terminator, plus tabs. Points for
including \f and \v though.

I suppose if we want to be absolutely correct, we should follow the
Unicode standard:
re.sub(r'\r?\n$|[\r\v\f\x85\u2028\u2029]$', '', line, count=1)
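
The two regex approaches discussed here can be compared side by side. This is only a sketch, assuming Python 3 (where the re module understands \u escapes in str patterns); the helper names are hypothetical, and the patterns are the ones quoted in this post:

```python
import re

# Whitespace other than a plain space, as a trailing run (tabs included).
WS_EXCEPT_SPACE = re.compile(r"[^ \S]+$")
# One terminator: \n, \r\n, or a single one of \r, \v, \f, NEL, LS, PS.
ONE_UNICODE_TERMINATOR = re.compile(r"\r?\n$|[\r\v\f\x85\u2028\u2029]$")

def strip_ws_except_space(line):
    # Removes the whole trailing run, so several terminators and tabs all go.
    return WS_EXCEPT_SPACE.sub("", line)

def chomp_unicode(line):
    # Removes at most one trailing terminator; count=1 stops after the first match.
    return ONE_UNICODE_TERMINATOR.sub("", line, count=1)

sample = "data\t\n\n"
print(repr(strip_ws_except_space(sample)))
print(repr(chomp_unicode(sample)))
```

On the sample, the first helper eats the tab and both newlines, while the second removes a single newline and keeps the tab.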

Roy Smith · Jun 6, 2014

Ned Deily said:
Roy is using MT-NewsWatcher as a client.

Yes. Except for the fact that it hasn't kept up with unicode, I find
the U/I pretty much perfect. I imagine at some point I'll be forced to
look elsewhere, but then again, netnews is pretty much dead.

BTW, don't upgrade to OS X 10.9 Mavericks if you're dependent on
MT-NW; it finally stops working there because what was left of Open
Transport support in OS X has finally been ripped out of 10.9.

Hmmm, good to know. I'm still on 10.7, and don't see any reason to
move. But, then again, you'd expect that from somebody who's still on
Python 2.x, wouldn't you?

Ned Deily · Jun 6, 2014

Yes. Except for the fact that it hasn't kept up with unicode, I find
the U/I pretty much perfect. I imagine at some point I'll be forced to
look elsewhere, but then again, netnews is pretty much dead.

I agree about the U/I, although I'm sure a lot of that has to do with
familiarity. However, netnews isn't dead, it has just morphed a bit. A
newsreader, like MT-NW, is great for following mailing lists like this
(and most other Python-related lists) via gmane.org's bi-directional
mailing list - NNTP gateways. And for this list it's usually better to
read the mailing list variant via gmane.org NNTP than the Usenet group
variant via a traditional USENET NNTP server because there's less spam
with the former.

Hmmm, good to know. I'm still on 10.7, and don't see any reason to
move. But, then again, you'd expect that from somebody who's still on
Python 2.x, wouldn't you?

Heh. Well, both 10.8 and 10.9 provided various improvements, both feature
and performance, over 10.7. Alas, Apple won't likely be supporting 10.7
with security updates for as long as the PSF will be supporting 2.7.x.
But, by then, you'll have had a chance to re-implement MT-NW in Python.

Johannes Bauer · Jun 6, 2014

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:

line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.

How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

Ah, I didn't know rstrip() accepted parameters, and since you wrote
line.rstrip(), this would also cut away whitespace (which sadly is
relevant in odd cases).

Thanks for the clarification, I'll definitely introduce that.
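
For reference, a small sketch of that distinction (assuming Python 3; the sample strings are invented for illustration). The argument to rstrip is treated as a set of characters to remove, not as a suffix:

```python
line = "key:\tvalue \t\n"

bare = line.rstrip()              # no argument: strips all trailing whitespace
newline_only = line.rstrip("\n")  # strips only trailing "\n" characters
crlf = "payload\r\n".rstrip("\r\n")  # strips any trailing mix of "\r" and "\n"

print(repr(bare))
print(repr(newline_only))
print(repr(crlf))
```

With the explicit argument, the trailing space and tab before the newline survive.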

Cheers,
Johannes

--

At least not publicly!

Ah, the latest and, to this day, most brilliant stunt by our great
cosmologists: the secret prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>

Johannes Bauer · Jun 6, 2014

Personally I tend toward rstrip('\r\n') so that I don't have to worry
about files with alternative line terminators.

Hm, I was under the impression that Python already took care of removing
the \r at a line ending. Checking that right now:

(DOS encoded file "y")....
b'foo\n'
b'bar\n'
b'moo\n'
b'koo\n'

Yup, the \r was removed automatically. Are there cases when it isn't?

Cheers,
Johannes

Tim Chase · Jun 6, 2014

Hm, I was under the impression that Python already took care of
removing the \r at a line ending. Checking that right now:

(DOS encoded file "y")
...
b'foo\n'
b'bar\n'
b'moo\n'
b'koo\n'

Yup, the \r was removed automatically. Are there cases when it
isn't?

It's possible if the file is opened as binary:

....
'hello\r\n'
'world\r\n'
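
A self-contained way to reproduce this, as a sketch assuming Python 3 (the file name dos.txt and its contents are made up for the example), is to write a file with DOS line endings and read it back in both modes:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dos.txt")  # hypothetical file name

    # Write raw bytes so the \r\n terminators land on disk unchanged.
    with open(path, "wb") as f:
        f.write(b"hello\r\nworld\r\n")

    # Binary mode: bytes come back exactly as stored.
    with open(path, "rb") as f:
        binary_lines = list(f)

    # Text mode with the default newline=None: \r\n is translated to \n.
    with open(path, "r", encoding="ascii") as f:
        text_lines = list(f)

print(binary_lines)
print(text_lines)
```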

-tkc

Steven D'Aprano · Jun 6, 2014

Hm, I was under the impression that Python already took care of removing
the \r at a line ending. Checking that right now:

[snip example]

This is called "Universal Newlines". Technically it is a build-time
option which applies when you build the Python interpreter from source,
so, yes, some Pythons may not implement it at all. But I think that it
has been on by default for a long time, and the option to turn it off may
have been removed in Python 3.3 or 3.4. In practical terms, you should
normally expect it to be on.

Here's the PEP that introduced it:
http://legacy.python.org/dev/peps/pep-0278/

The idea is that when universal newlines support is enabled, Python by
default will convert any of \n, \r or \r\n into \n when reading from a
file in text mode, and convert back the other way when writing the file.

In binary mode, newlines are *never* changed.

In Python 3, you can return end-of-lines unchanged by passing newline=''
to the open() function.

https://docs.python.org/2/library/functions.html#open
https://docs.python.org/3/library/functions.html#open
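
The effect of newline='' can be seen with a file that mixes all three terminators. This is a sketch assuming Python 3; the file contents are invented for illustration:

```python
import os
import tempfile

data = b"unix\nmac\rdos\r\n"  # one line per terminator style

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # Default (newline=None): every terminator is translated to "\n".
    with open(path, "r", encoding="ascii") as f:
        translated = f.readlines()

    # newline='': lines are still split on \n, \r and \r\n,
    # but the terminators are returned untranslated.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.readlines()
finally:
    os.remove(path)

print(translated)
print(untranslated)
```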

Grant Edwards · Jun 6, 2014

Yes. Except for the fact that it hasn't kept up with unicode, I find
the U/I pretty much perfect. I imagine at some point I'll be forced to
look elsewhere, but then again, netnews is pretty much dead.

There are still a few active groups, but reading e-mail lists via NNTP
(in my case using slrn) via gmane is a huge reason to have an
efficient, well-designed "news" client.

If usenet does really pack it in someday and I have to switch from
comp.lang.python to the mailing list, it will be done by pointing slrn
at news.gmane.org -- not by having all those e-mails sent to me so I
can try to sort through them...
