"convert" string to bytes without changing data (encoding)


Steven D'Aprano

The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisable
to go for python 3.x. The biggest problem that I am facing is that I am
often dealing with data that is basically text, but can contain
8-bit bytes.

All bytes are 8-bit, at least on modern hardware. I think you have to go
back to the 1950s to find 10-bit or 12-bit machines.
In this case, I cannot safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.

Well you can't do that, because *by definition* you are changing a
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is,
*how* do you want to change them?

You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:

bytes = string.encode('ascii', 'replace')
bytes = string.encode('ascii', 'ignore')
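For instance (the sample string here is made up, but the handlers behave the same way on any text):

```python
# 'replace' substitutes a '?' for each untranslatable character,
# while 'ignore' silently drops them.
s = "naïve café"

print(s.encode('ascii', 'replace'))  # b'na?ve caf?'
print(s.encode('ascii', 'ignore'))   # b'nave caf'
```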

When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which essentially cannot fail:

string = bytes.decode('latin1')

although the non-ASCII chars that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you don't
care too much, this may be a simple approach.
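A quick sketch of why that decode "cannot fail" and is lossless:

```python
# Latin-1 maps every byte value 0x00-0xFF to the Unicode code point
# with the same number, so decoding never raises...
data = bytes(range(256))          # all 256 possible byte values
text = data.decode('latin1')

# ...and re-encoding reproduces the original bytes exactly.
assert text.encode('latin1') == data
```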

But in a nutshell, it is physically impossible to map the millions of
Unicode characters to just 256 possible bytes without either throwing
some characters away, or performing an encoding.


As it seems, this would be far easier with python 2.x.

It only seems that way until you try.
 

Prasad, Ramit

You can read as bytes and decode as ASCII but ignoring the troublesome
non-text characters:

Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
(Parittsbit) auf den Kommunikationsleitungen oder fr andere
Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
so dass alle im ASCII definierten Zeichen auch in den verschiedenen
Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.

The paragraph is from the German Wikipedia on ASCII, in UTF-8.

I see no non-ASCII characters, not sure if that is because the source
has none or something else. From this example I would not say that
the rest of the text is "unchanged". Decode converts to Unicode,
did you mean encode?

I think "ignore" will remove non-translatable characters and not
leave them in the returned string.

Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423

--

This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
 

Tim Chase

But it is in fact only stored in one particular way, as a series of bytes.


Nonsense. Play all the semantic games you want, it already is a series
of bytes.

Internally, they're a series of bytes, but they are MEANINGLESS
bytes unless you know how they are encoded internally. Those
bytes could be UTF-8, UTF-16, UTF-32, or any of a number of other
possible encodings[1]. If you get the internal byte stream,
there's no way to meaningfully operate on it unless you also know
how it's encoded (or you're willing to sacrifice the ability to
reliably get the string back).
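Tim's point is easy to see from the outside: the codecs in his footnote turn the very same one-character string into entirely different byte sequences, so a bag of bytes tells you nothing until you know which codec produced it.

```python
s = "π"   # U+03C0, GREEK SMALL LETTER PI

print(s.encode('utf-8'))      # b'\xcf\x80'
print(s.encode('utf-16-le'))  # b'\xc0\x03'
print(s.encode('utf-32-le'))  # b'\xc0\x03\x00\x00'
```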

-tkc

[1]
http://docs.python.org/library/codecs.html#standard-encodings
 

Ethan Furman

I see no non-ASCII characters, not sure if that is because the source
has none or something else.

The 'ignore' argument to .decode() caused all non-ascii characters to be
removed.

~Ethan~
 

Prasad, Ramit

The right way to convert bytes to strings, and vice versa, is via
If you want to dictate to the original poster the correct way to do
things then you don't need to do anything more than that. You don't need
to pretend, like Chris Angelico, that there isn't a direct mapping from
his Python 3 implementation's internal representation of strings
to bytes in order to label what he's asking for as being "silly".

It might be technically possible to recreate the internal implementation,
or get the byte data. That does not mean it will make any sense or
be understood in a meaningful manner. I think Ian summarized it
very well:

You can't generally just "deal with the ascii portions" without
knowing something about the encoding. Say you encounter a byte
greater than 127. Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character? If the next byte is less
than 128, is it an ASCII character, or a continuation of the previous
character? For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure. If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.
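Ian's ambiguity is easy to demonstrate: the same two bytes are one character under one plausible encoding and two under another.

```python
data = b'\xc3\xa9'   # two bytes, encoding unknown

print(data.decode('utf-8'))    # 'é'  -- one character
print(data.decode('latin-1'))  # 'Ã©' -- two characters
```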

Technically, ASCII goes up to 256 but they are not A-z letters.

Ramit


 

Ross Ridge

Tim Chase said:
Internally, they're a series of bytes, but they are MEANINGLESS
bytes unless you know how they are encoded internally. Those
bytes could be UTF-8, UTF-16, UTF-32, or any of a number of other
possible encodings[1]. If you get the internal byte stream,
there's no way to meaningfully operate on it unless you also know
how it's encoded (or you're willing to sacrifice the ability to
reliably get the string back).

In practice the number of ways that CPython (the only Python 3
implementation) represents strings is much more limited. Pretending
otherwise really isn't helpful.

Still, if Chris Angelico had used your much less misleading explanation,
then this could've been resolved much quicker. The original poster
didn't buy Chris's bullshit for a minute; instead he had to find out on
his own that the internal representation of strings wasn't what he
expected it to be.

Ross Ridge
 

Evan Driscoll

If you want to dictate to the original poster the correct way to do
things then you don't need to do anything more than that. You don't need
to pretend, like Chris Angelico, that there isn't a direct mapping from
his Python 3 implementation's internal representation of strings
to bytes in order to label what he's asking for as being "silly".

That mapping may as well be:

def get_bytes(some_string):
    import random
    # random length, random contents: no relation to the input at all
    length = random.randint(len(some_string), 5 * len(some_string))
    bytes = [0] * length
    for i in range(length):
        bytes[i] = random.randint(0, 255)
    return bytes

Of course this is hyperbole, but it offers essentially as much of a
guarantee about what the result is.

As many others have said, the encoding isn't defined, and I would guess
varies between implementations. (E.g. if Jython and IronPython use their
host platforms' native strings, both have 16-bit chars and thus probably
use UTF-16 encoding. I am not sure what CPython uses, but I bet it's
*not* that.)

It's not even guaranteed that the byte representation won't change! If
something is lazily evaluated or you have a COW string or something, the
bytes backing it will differ.


So yes, you can say that pretending there's not a mapping of strings to
internal representation is silly, because there is. However, there's
nothing you can say about that mapping.

Evan
 

John Nagle

The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisable
to go for python 3.x. The biggest problem that I am facing is, that I
am often dealing with data that is basically text, but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.

So why let the data get into a "str" type at all? Do everything
end to end with "bytes" or "bytearray" types.
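A sketch of that bytes-only style (the sample "log line" here is invented for illustration):

```python
# Stay in bytes end to end: bytes objects support the familiar
# text-ish methods, as long as the arguments are bytes too.
line = b'HOST: example.com \xc2\xa9 2012\n'

key, _, value = line.partition(b':')
if key.strip().lower() == b'host':
    host = value.strip().split()[0]   # b'example.com'

# The non-ASCII bytes (here \xc2\xa9) pass through completely untouched,
# because nothing ever decoded them.
```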

John Nagle
 

Grant Edwards

All bytes are 8-bit, at least on modern hardware. I think you have to
go back to the 1950s to find 10-bit or 12-bit machines.

Well, on anything likely to run Python that's true. There are modern
DSP-oriented CPUs where a byte is 16 or 32 bits (and so is an int and
a long, and a float and a double).
It only seems that way until you try.

It's easy as long as you deal with nothing but ASCII and Latin-1. ;)
 

Ross Ridge

Evan Driscoll said:
So yes, you can say that pretending there's not a mapping of strings to
internal representation is silly, because there is. However, there's
nothing you can say about that mapping.

I'm not the one labeling anything as being silly. I'm the one labeling
things as bullshit, and that's what you're doing here. I can in
fact say what the internal byte string representation of strings is in
any given build of Python 3. Just because I can't say what it would be
in an imaginary hypothetical implementation doesn't mean I can never say
anything about it.

Ross Ridge
 

Grant Edwards

Technically, ASCII goes up to 256

No, ASCII only defines 0-127. Values >=128 are not ASCII.

From https://en.wikipedia.org/wiki/ASCII:

ASCII includes definitions for 128 characters: 33 are non-printing
control characters (now mostly obsolete) that affect how text and
space is processed and 95 printable characters, including the space
(which is considered an invisible graphic).
 

MRAB

It might be technically possible to recreate the internal implementation,
or get the byte data. That does not mean it will make any sense or
be understood in a meaningful manner. I think Ian summarized it
very well:


Technically, ASCII goes up to 256 but they are not A-z letters.
Technically, ASCII is 7-bit, so it goes up to 127.
 

Mark Lawrence

I'm not the one labeling anything as being silly. I'm the one labeling
things as bullshit, and that's what you're doing here. I can in
fact say what the internal byte string representation of strings is in
any given build of Python 3. Just because I can't say what it would be
in an imaginary hypothetical implementation doesn't mean I can never say
anything about it.

Ross Ridge

Bytes is bytes and strings is strings
And the wrong one I have chose
Let's go where they keep on wearin'
Those frills and flowers and buttons and bows
Rings and things and buttons and bows.

No guessing the tune.
 

Neil Cerutti

I'm not the one labeling anything as being silly. I'm the one
labeling things as bullshit, and that's what you're doing
here. I can in fact say what the internal byte string
representation of strings is in any given build of Python 3.
Just because I can't say what it would be in an imaginary
hypothetical implementation doesn't mean I can never say
anything about it.

I am in a similar situation viz a viz my wife's undergarments.
 

Terry Reedy

The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisable
to go for python 3.x.

I strongly agree with that unless you have a reason to use 2.7. Python
3.3 (.0a1 is nearly out) has an improved unicode implementation, among
other things.

The biggest problem that I am facing is, that I
am often dealing with data that is basically text, but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.

You are assuming, or must assume, that the text is in an
ascii-compatible encoding, meaning that bytes 0-127 really represent
ascii chars. Otherwise, you cannot reliably interpret anything, let
alone change it.

This problem of knowing that much but not the specific encoding is
unfortunately common. It has been discussed among core developers and
others the last few months. Different people prefer one of the following
approaches.

1. Keep the bytes as bytes and use bytes literals and bytes functions as
needed. The danger, as you noticed, is forgetting the 'b' prefix.

2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
chars. When done, encode back to 'latin-1' and the non-ascii chars will
be as they originally were. The danger is forgetting the pretense, and
perhaps passing on the string (as a string, not bytes) to other
modules that will not know the pretense.

3. Decode using encoding='ascii', errors='surrogateescape'. This
reversibly encodes the unknown non-ascii bytes as 'illegal' non-chars
(using the low-surrogate code units, the second half of a surrogate
pair). This is probably the safest in that invalid operations on the
non-chars should raise an exception. Re-encoding with the same setting
will reproduce the original hi-bit bytes. The main danger is passing the
illegal strings out of your local sandbox.
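A minimal sketch of approach 3 (the sample bytes are made up):

```python
# 'surrogateescape' smuggles undecodable bytes through str and back out.
data = b'mostly ascii, one stray \xdd byte'
text = data.decode('ascii', errors='surrogateescape')

# The stray byte 0xDD is now the lone low surrogate U+DCDD...
assert '\udcdd' in text
# ...and re-encoding with the same handler restores the original bytes.
assert text.encode('ascii', errors='surrogateescape') == data
```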
 

Steven D'Aprano

I can in
fact say what the internal byte string representation of strings is in
any given build of Python 3.

Don't keep us in suspense! Given:

Python 3.2.2 (default, Mar 4 2012, 10:50:33)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2

what *is* the internal byte representation of the string "a∫©πz"?

(lowercase a, integral sign, copyright symbol, lowercase Greek pi,
lowercase z)


And more importantly, given that internal byte representation, what could
you do with it?
 

Evan Driscoll

I'm not the one labeling anything as being silly. I'm the one labeling
things as bullshit, and that's what you're doing here. I can in
fact say what the internal byte string representation of strings is in
any given build of Python 3. Just because I can't say what it would be
in an imaginary hypothetical implementation doesn't mean I can never say
anything about it.

People like you -- who write to assumptions which are not even remotely
guaranteed by the spec -- are part of the reason software sucks.

People like you hold back progress, because system implementers aren't
free to make changes without breaking backwards compatibility. Enormous
amounts of effort are expended to test programs and diagnose problems
which are caused by unwarranted assumptions like "the encoding of a
string is UTF-8". In the worst case, assumptions like that lead to
security fixes that don't go as far as they could, like the recent
discussion about hashing.

Python is definitely closer to the "willing to break backwards
compatibility to improve" end of the spectrum than some other projects
(*cough* Windows *cough*), but that still doesn't mean that you can make
assumptions like that.


This email is a bit harsher than it deserves -- but I feel not by much.

Evan
 

Ross Ridge

Evan Driscoll said:
People like you -- who write to assumptions which are not even remotely
guaranteed by the spec -- are part of the reason software sucks. ....
This email is a bit harsher than it deserves -- but I feel not by much.

I don't see how you could feel the least bit justified. Well meaning,
if unhelpful, lies about the nature of Python strings in order to try to
convince someone to follow what you think are good programming practices
is one thing. Maliciously lying about someone else's code that you've
never seen is another thing entirely.

Ross Ridge
 

Chris Angelico

I don't see how you could feel the least bit justified.  Well meaning,
if unhelpful, lies about the nature of Python strings in order to try to
convince someone to follow what you think are good programming practices
is one thing.  Maliciously lying about someone else's code that you've
never seen is another thing entirely.

Actually, he is justified. It's one thing to work in C or assembly and
write code that depends on certain bit-pattern representations of data
(although even that causes trouble - assuming that
sizeof(int)==sizeof(int*) isn't good for portability), but in a high
level language, you cannot assume any correlation between objects and
bytes. Any code that depends on implementation details is risky.

ChrisA
 
