"convert" string to bytes without changing data (encoding)

Peter Daum

Hi,

is there any way to convert a string to bytes without
interpreting the data in any way? Something like:

s='abcde'
b=bytes(s, "unchanged")

Regards,
Peter
 
Chris Angelico

Hi,

is there any way to convert a string to bytes without
interpreting the data in any way? Something like:

s='abcde'
b=bytes(s, "unchanged")

What is a string? It's not a series of bytes. You can't convert it
without encoding those characters into bytes in some way.

ChrisA
 
Stefan Behnel

Peter Daum, 28.03.2012 10:56:
is there any way to convert a string to bytes without
interpreting the data in any way? Something like:

s='abcde'
b=bytes(s, "unchanged")

If you can tell us what you actually want to achieve, i.e. why you want to
do this, we may be able to tell you how to do what you want.

Stefan
 
Peter Daum

What is a string? It's not a series of bytes. You can't convert it
without encoding those characters into bytes in some way.

... in my example, the variable s points to a "string", i.e. a series of
bytes (0x61, 0x62, ...) interpreted as ASCII/Unicode characters.

b=bytes(s,'ascii') # or ('utf-8', 'latin1', ...)

would of course work in this case, but in general, if s holds any
data with bytes > 127, the actual data will be changed according
to the provided encoding.

What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...

Regards,
Peter
 
Heiko Wundram

On 28.03.2012 11:43, Peter Daum wrote:
... in my example, the variable s points to a "string", i.e. a series of
bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No; a string contains a series of codepoints from the unicode plane,
representing natural language characters (at least in the simplistic
view, I'm not talking about surrogates). These can be encoded to
different binary storage representations, of which ascii is (a common)
one.

What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...

There is "logically" no raw data in the string, just a series of
codepoints, as stated above. You'll have to specify the encoding to use
to get at "raw" data, and from what I gather you're interested in the
latin-1 (or iso-8859-15) encoding, as you're specifically referencing
chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
speak).
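
To illustrate the point about encodings, here is a minimal sketch (the sample character is only an illustration) showing that one code point yields different bytes depending on the encoding chosen, and that latin-1 maps code points 0..255 to bytes of the same value:

```python
s = "é"  # a single code point, U+00E9

latin1_bytes = s.encode("latin-1")    # one byte
utf8_bytes = s.encode("utf-8")        # two bytes
utf16_bytes = s.encode("utf-16-le")   # two bytes, different ones

print(latin1_bytes, utf8_bytes, utf16_bytes)

# latin-1 maps each code point 0..255 to the byte with the same value
all_low = "".join(chr(i) for i in range(256))
latin1_is_identity = all_low.encode("latin-1") == bytes(range(256))
print(latin1_is_identity)
```

There is no encoding-free answer here: asking for "the bytes" of s only makes sense once one of these mappings has been picked.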
 
Stefan Behnel

Peter Daum, 28.03.2012 11:43:
What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...

That's why I asked about your use case - where does the data come from and
why is it contained in a character string in the first place? If you could
provide that information, we can help you further.

Stefan
 
Ross Ridge

Chris Angelico said:
What is a string? It's not a series of bytes.

Of course it is. Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

What he's asking for may not be very useful or practical, but if that's
your problem here, then that's what you should be addressing, not
pretending that it's fundamentally impossible.

Ross Ridge
 
Chris Angelico

Of course it is.  Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

Note that distinction. I said that a string "is not" a series of
bytes; you say that it "is stored" as bytes.

What he's asking for may not be very useful or practical, but if that's
your problem here, then that's what you should be addressing, not
pretending that it's fundamentally impossible.

That's equivalent to taking a 64-bit integer and trying to treat it as
a 64-bit floating point number. They're all just bits in memory, and
in C it's quite easy to cast a pointer to a different type and
dereference it. But a Python Unicode string might be stored in several
ways; for all you know, it might actually be stored as a sequence of
apples in a refrigerator, just as long as they can be referenced
correctly. There's no logical Python way to turn that into a series of
bytes.
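
The integer/float analogy can be tried out from Python with the struct module. This is only a sketch of the analogy, not a way to look inside str objects: it packs an integer into 8 bytes and then reads the same 8 bytes back as a double.

```python
import struct

n = 1
raw = struct.pack("<q", n)               # the integer as 8 little-endian bytes
as_float = struct.unpack("<d", raw)[0]   # the same 8 bytes read as an IEEE 754 double

print(n, raw, as_float)
```

The bits are identical, yet the resulting float has nothing to do with the number 1. Reinterpreting memory is not a conversion.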

ChrisA
 
Grant Edwards

for all you know, it might actually be stored as a sequence of
apples in a refrigerator
[...]

There's no logical Python way to turn that into a series of bytes.

There's got to be a joke there somewhere about how to eat an apple...
 
Peter Daum

On 28.03.2012 11:43, Peter Daum wrote:

No; a string contains a series of codepoints from the unicode plane,
representing natural language characters (at least in the simplistic
view, I'm not talking about surrogates). These can be encoded to
different binary storage representations, of which ascii is (a common) one.


There is "logically" no raw data in the string, just a series of
codepoints, as stated above. You'll have to specify the encoding to use
to get at "raw" data, and from what I gather you're interested in the
latin-1 (or iso-8859-15) encoding, as you're specifically referencing
chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
speak).

... I was under the illusion that Python (like e.g. Perl) stored
strings internally in UTF-8. In that case, the "conversion" would simply
mean re-labeling the data. Unfortunately, as I have since found out, this
is not the case (nor is the "apple encoding" ;-), so it would indeed be
pretty useless.

The longer story behind my question is: I am new to Python (obviously), and
since I am not familiar with either version, I thought it would be advisable
to go for Python 3.x. The biggest problem I am facing is that I
am often dealing with data that is basically text but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know: for my purposes, it would be
perfectly good enough to deal with the ASCII portions and keep anything
else unchanged.

As it seems, this would be far easier with Python 2.x. With Python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as innocent-looking
as "s=s+'/'" hidden in a rarely reached branch, and a
seemingly correct program will crash with a TypeError 2 years
later ...)
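
A minimal sketch of the failure described above, assuming s holds bytes (the path value is made up for illustration):

```python
s = b"/tmp/data"   # hypothetical bytes value, e.g. read from a binary file
error = None

try:
    s = s + "/"    # str literal added to bytes: raises TypeError in Python 3
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

s = s + b"/"       # bytes literal: works
print(s)
```

The error only appears when the branch actually runs, which is why it can stay hidden for a long time.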

Regards,
Peter
 
Steven D'Aprano

Of course it is. Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

You don't know that. They might be stored as a tree, or a rope, or some
even more complex data structure. In fact, in Python, they are stored as
an object.

But even if they were stored as a simple series of bytes, you don't know
what bytes they are. That is an implementation detail of the particular
Python build being used, and since Python doesn't give direct access to
memory (at least not in pure Python) there's no way to retrieve those
bytes using Python code.

Saying that strings are stored in memory as bytes is no more sensible
than saying that dicts are stored in memory as bytes. Yes, they are. So
what? Taken out of context in a running Python interpreter, those bytes
are pretty much meaningless.

What he's asking for may not be very useful or practical, but if that's
your problem here, then that's what you should be addressing, not
pretending that it's fundamentally impossible.

The right way to convert bytes to strings, and vice versa, is via
encoding and decoding operations. What the OP is asking for is as silly
as somebody asking to turn a float 1.3792 into a string without calling
str() or any equivalent float->string conversion. They're both made up of
bytes, right? Yeah, they are. So what?

Even if you do a hex dump of float 1.3792, the result will NOT be the
string "1.3792". And likewise, even if you somehow did a hex dump of the
memory representation of a string, the result will NOT be the equivalent
sequence of bytes except *maybe* for some small subset of possible
strings.
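
The float example can be checked directly. The sketch below uses struct to obtain the 8 bytes of the double (one possible memory layout, chosen here for illustration) and compares them with the bytes of the text form:

```python
import struct

x = 1.3792
memory_bytes = struct.pack("<d", x)    # 8 bytes of the IEEE 754 double
text_bytes = str(x).encode("ascii")    # bytes of the text produced by str()

print(memory_bytes.hex())
print(text_bytes)
```

The two byte sequences differ in both length and content; only the explicit float->string conversion produces the digits.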
 
Ross Ridge

Ross Ridge said:
Of course it is. Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

Chris Angelico said:
Note that distinction. I said that a string "is not" a series of
bytes; you say that it "is stored" as bytes.

The distinction is meaningless. I'm not going to argue with you about what
you or I meant by the word "is".

But a Python Unicode string might be stored in several
ways; for all you know, it might actually be stored as a sequence of
apples in a refrigerator, just as long as they can be referenced
correctly.

But it is in fact only stored in one particular way, as a series of bytes.

There's no logical Python way to turn that into a series of bytes.

Nonsense. Play all the semantic games you want, it already is a series
of bytes.

Ross Ridge
 
Terry Reedy

Of course it is. Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

*If* it is stored in byte memory. If you execute a 3.x program mentally
or on paper, then there are no bytes.

If you execute a 3.3 program on a byte-oriented computer, then the 'a'
in the string might be represented by 1, 2, or 4 bytes, depending on the
other characters in the string. The actual logical bit pattern will
depend on the big versus little endianness of the system.

My impression is that if you go down to the physical bit level, then
again there are, possibly, no 'bytes' as a physical construct as the
bits, possibly, are stored in parallel on multiple ram chips.

What he's asking for may not be very useful or practical, but if that's
your problem here, then that's what you should be addressing, not
pretending that it's fundamentally impossible.

The python-level way to get the bytes of an object that supports the
buffer interface is memoryview(). 3.x strings intentionally do not
support the buffer interface as there is not any particular
correspondence between characters (codepoints) and bytes.
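
A short sketch of that difference: memoryview() accepts a bytes object but rejects a str.

```python
data = b"abc"
view = memoryview(data)     # bytes support the buffer interface
byte_values = list(view)
print(byte_values)

str_error = None
try:
    memoryview("abc")       # str does not support it
except TypeError as exc:
    str_error = exc
    print("TypeError:", exc)
```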

The OP could get the ordinal for each character and decide how *he*
wants to convert them to bytes.

ba = bytearray()
for c in s:
    i = ord(c)
    # append the bytes corresponding to i to ba, using the mapping you choose

To get the particular bytes used for a particular string on a particular
system, OP should use the C API, possibly through ctypes.
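
One possible way to fill in the loop over ord(c), as a sketch only: the helper name str_to_bytes and the fixed width of 4 big-endian bytes per code point are assumptions made here, not something prescribed in the thread.

```python
def str_to_bytes(s, width=4):
    """Hypothetical helper: store each code point as `width` big-endian bytes."""
    ba = bytearray()
    for c in s:
        i = ord(c)
        # int.to_bytes raises OverflowError if i does not fit in `width` bytes
        ba += i.to_bytes(width, "big")
    return bytes(ba)


result = str_to_bytes("aé")
print(result)
```

With width=4 this mapping coincides with the UTF-32-BE encoding, which underlines the general point: any such mapping from code points to bytes is itself an encoding.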
 
Steven D'Aprano

... in my example, the variable s points to a "string", i.e. a series of
bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No. Strings are not sequences of bytes (except in the trivial sense that
everything in computer memory is made of bytes). They are sequences of
CODE POINTS. (Roughly speaking, code points are *almost* but not quite
the same as characters.)

I suggest that you need to reset your understanding of strings and bytes.
I suggest you start by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

Then come back and try to explain what actual problem you are trying to
solve.
 
Heiko Wundram

On 28.03.2012 19:43, Peter Daum wrote:
As it seems, this would be far easier with Python 2.x. With Python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as innocent-looking
as "s=s+'/'" hidden in a rarely reached branch, and a
seemingly correct program will crash with a TypeError 2 years
later ...)

It seems that you're mixing things up wrt. the string/bytes
distinction; it's not as "complicated" as it might seem.

1) Strings

s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...

all create/refer to string objects. How Python internally stores them
is none of your concern (actually, that's rather complicated anyway, at
least with the upcoming Python 3.3), and processing a string basically
means that you'll work on the natural language characters present in the
string. Python strings can store (pretty much) all characters and
surrogates that unicode allows, and when the python interpreter/compiler
reads strings from input (I'm talking about source files), a default
encoding defines how the bytes in your input file get interpreted as
unicode codepoint encodings (generally, it depends on your system locale
or file header indications) to construct the internal string object
you're using to access the data in the string.

There is no such thing as a type for a single character; single
characters are simply strings of length 1 (and so indexing also returns
a [new] string object).

Single/double quotes work no different.

The internal encoding used by the Python interpreter is of no concern
to you.

2) Bytes

s = b'this is a byte-string'
s = b'\x22\x33\x44'

The above define bytes. Think of the bytes type as arrays of 8-bit
integers, only representing a buffer which you can process as an array
of fixed-width integers. Reading from a file opened in binary mode (or from
sys.stdin.buffer) gets you bytes, and not a string, because Python cannot
automagically guess what format the input is in.

Indexing the bytes type returns an integer (which is the clearest
distinction between string and bytes).
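
The indexing difference between the two types can be seen in a few lines (a small sketch with made-up values):

```python
s = "abc"
b = b"abc"

print(s[0], type(s[0]))   # a str of length 1
print(b[0], type(b[0]))   # an int in range(256)
print(b[0:1])             # slicing bytes gives bytes again
```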

Being able to input "string-looking" data in source files as bytes is a
debatable "feature" (IMHO; see the first example), simply because it
breaks the semantic difference between the two types in the eye of the
programmer looking at source.

3) Conversions

To get from bytes to string, you have to decode the bytes buffer,
telling Python what kind of character data is contained in the array of
integers. After decoding, you'll get a string object which you can
process using the standard string methods. For decoding to succeed, you
have to tell Python how the natural language characters are encoded in
your array of bytes:

b'hello'.decode('iso-8859-15')

To get from string back to bytes (you want to write the natural
language character data you've processed to a file), you have to encode
the data in your string buffer, which gets you an array of 8-bit
integers to write to the output:

'hello'.encode('iso-8859-15')

Most output methods will happily do the encoding for you, using a
standard encoding, and if that happens to be ASCII, you're getting
UnicodeEncodeErrors which tell you that a character in your string
source is unsuited to be transmitted using the encoding you've
specified.
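
Put together, the decode/process/encode cycle might look like the sketch below. The sample bytes are an assumption chosen for illustration (the word 'Grüße' in iso-8859-15):

```python
raw = b"Gr\xfc\xdfe"                 # sample input bytes, assumed to be iso-8859-15
text = raw.decode("iso-8859-15")     # bytes -> str
text = text + "!"                    # process the data as text
out = text.encode("iso-8859-15")     # str -> bytes, same encoding as on input

encode_error = None
try:
    text.encode("ascii")             # ASCII cannot represent the non-ASCII characters
except UnicodeEncodeError as exc:
    encode_error = exc
    print("UnicodeEncodeError:", exc)

print(text, out)
```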

If the above doesn't make the string/bytes distinction and usage
clearer, and you have a C# background, check out the distinction between
byte[] (which the System.IO streams get you), and how you have to use a
System.Text.Encoding-derived class to get at actual System.String objects to
manipulate character data. Python's type system wrt. character data is
pretty much similar, except for missing the "single character" type
(char).

Anyway, back to what you wrote: how are you getting the input data? Why
are "high bytes" in there which you do not know the encoding for?
Generally, from what I gather, you'll decode data from some source,
process it, and write it back using the same encoding which you used for
decoding, which should do exactly what you want and not get you into any
trouble with encodings.
 
Jussi Piitulainen

Peter said:
... I was under the illusion that Python (like e.g. Perl) stored
strings internally in UTF-8. In that case, the "conversion" would simply
mean re-labeling the data. Unfortunately, as I have since found out, this
is not the case (nor is the "apple encoding" ;-), so it would indeed be
pretty useless.

The longer story behind my question is: I am new to Python (obviously), and
since I am not familiar with either version, I thought it would be advisable
to go for Python 3.x. The biggest problem I am facing is that I
am often dealing with data that is basically text but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know: for my purposes, it would be
perfectly good enough to deal with the ASCII portions and keep anything
else unchanged.

You can read the data as bytes and decode it as ASCII while ignoring the
troublesome non-ASCII bytes. For a German sample text, the result looks
like this (note the dropped umlauts):

Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
(Parittsbit) auf den Kommunikationsleitungen oder fr andere
Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
so dass alle im ASCII definierten Zeichen auch in den verschiedenen
Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.

The paragraph is from the German Wikipedia on ASCII, in UTF-8.
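
The code that produced the output above is not preserved in this thread. A minimal sketch of the technique, using the first words of the paragraph as sample input and the "ignore" error handler:

```python
raw = "Das für ASCII nicht benutzte Bit".encode("utf-8")  # sample UTF-8 bytes

# Decode as ASCII and silently drop every byte that is not valid ASCII
ascii_only = raw.decode("ascii", "ignore")
print(ascii_only)
```

Note that this discards data: the dropped bytes cannot be recovered when the text is written back, so it does not keep "anything else unchanged".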
 
Ethan Furman

Peter said:
The longer story behind my question is: I am new to Python (obviously), and
since I am not familiar with either version, I thought it would be advisable
to go for Python 3.x. The biggest problem I am facing is that I
am often dealing with data that is basically text but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know: for my purposes, it would be
perfectly good enough to deal with the ASCII portions and keep anything
else unchanged.

Where is the data coming from? Files? In that case, it sounds like you
will want to decode/encode using 'latin-1', as the bulk of your text is
plain ascii and you don't really care about the upper-ascii chars.
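
A sketch of the latin-1 round trip suggested here. The sample bytes and the replace() step are made up for illustration; the key property is that latin-1 decoding never fails and encoding gives back the original byte values.

```python
raw = b"key=J\xfcrgen\xfe\xff"       # mostly ASCII, plus some 8-bit bytes
text = raw.decode("latin-1")         # every byte becomes the code point of equal value
text = text.replace("key", "KEY")    # touch only the ASCII portion
back = text.encode("latin-1")        # the 8-bit bytes come back unchanged
print(back)

# The round trip is lossless for every possible byte value
all_bytes = bytes(range(256))
roundtrip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
print(roundtrip_ok)
```

The non-ASCII characters in text are only meaningful if the data really was latin-1; otherwise they are placeholders that should be passed through, not interpreted.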

~Ethan~
 
Prasad, Ramit

As it seems, this would be far easier with Python 2.x. With Python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as innocent-looking
as "s=s+'/'" hidden in a rarely reached branch, and a
seemingly correct program will crash with a TypeError 2 years
later ...)

Just a small note, as you are new to Python: repeated string concatenation
can be expensive (quadratic time). The Python (2.x and 3.x) idiom for
frequent string concatenation is to append to a list and then join
the pieces, as in the following (linear time).

lst = ['Hi,']
lst.append('how')
lst.append('are')
lst.append('you?')
sentence = ' '.join(lst)  # use a space to separate the elements
print(sentence)           # Hi, how are you?

You can use join on an empty string, but then they will not be
separated by spaces.

Hi,howareyou?

You can use any string as a separator, length does not matter.

Hi,@-Qhow@-Qare@-Qyou?
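
The calls that produce the two outputs above are not shown in the post; a sketch that reproduces them with the same list:

```python
lst = ["Hi,", "how", "are", "you?"]

no_sep = "".join(lst)      # empty separator: the pieces run together
custom = "@-Q".join(lst)   # any string can serve as the separator

print(no_sep)
print(custom)
```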


Ramit


 
Ian Kelly

... I was under the illusion that Python (like e.g. Perl) stored
strings internally in UTF-8. In that case, the "conversion" would simply
mean re-labeling the data. Unfortunately, as I have since found out, this
is not the case (nor is the "apple encoding" ;-), so it would indeed be
pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII. And those are all
implementation details that could easily change in future versions of
Python.

The longer story behind my question is: I am new to Python (obviously), and
since I am not familiar with either version, I thought it would be advisable
to go for Python 3.x. The biggest problem I am facing is that I
am often dealing with data that is basically text but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know: for my purposes, it would be
perfectly good enough to deal with the ASCII portions and keep anything
else unchanged.

You can't generally just "deal with the ascii portions" without
knowing something about the encoding. Say you encounter a byte
greater than 127. Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character? If the next byte is less
than 128, is it an ASCII character, or a continuation of the previous
character? For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure. If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.
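
A concrete case of this ambiguity, as a sketch. Shift_JIS is picked here as one example of a multi-byte encoding whose trailing bytes can fall in the ASCII range; the thread itself does not name it.

```python
ch = "ソ"  # katakana letter SO, U+30BD

sjis = ch.encode("shift_jis")   # second byte is 0x5C, the ASCII code for backslash
utf8 = ch.encode("utf-8")       # every byte of a non-ASCII character is >= 0x80

print(sjis, utf8)
print(0x5C in sjis)                    # an "ASCII-looking" byte inside one character
print(all(b >= 0x80 for b in utf8))    # no such byte in the UTF-8 form
```

Code that treats every 0x5C byte as a backslash would corrupt the Shift_JIS data, while the same assumption is safe for UTF-8.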

Cheers,
Ian
 
Ross Ridge

Steven D'Aprano said:
The right way to convert bytes to strings, and vice versa, is via
encoding and decoding operations.

If you want to dictate to the original poster the correct way to do
things, then you don't need to do anything more than that. You don't need to
pretend, like Chris Angelico, that there isn't a direct mapping from
his Python 3 implementation's internal representation of strings
to bytes in order to label what he's asking for as being "silly".

Ross Ridge
 
