hex dump w/ or w/out utf-8 chars

Dave Angel · Jul 9, 2013

One of the first Python projects I undertook was a program to dump
the ZSCII strings from Infocom game files. They are mostly packed
one character per 5 bits, with escapes to (I had to recheck the
Z-machine spec) latin-1. Oh, those clever implementors: thwarting
hexdumping cheaters and cramming their games onto microcomputers
with one blow.

In 1973 I played with encoding some data that came over the public
airwaves (I never learned the specific radio technology, probably used
sidebands of FM stations). The data was encoded, with most characters
taking 5 bits, and the decoded stream was like a ticker-tape. With some
hardware and the right software, you could track Wall Street in real
time. (Or maybe it had the usual 15 minute delay).

Obviously, they didn't publish the spec any place. But some others had
the beginnings of a decoder, and I expanded on that. We never did
anything with it, it was just an interesting challenge.

Neil Cerutti · Jul 9, 2013

In 1973 I played with encoding some data that came over the
public airwaves (I never learned the specific radio technology,
probably used sidebands of FM stations). The data was encoded,
with most characters taking 5 bits, and the decoded stream was
like a ticker-tape. With some hardware and the right software,
you could track Wall Street in real time. (Or maybe it had the
usual 15 minute delay).

Obviously, they didn't publish the spec any place. But some
others had the beginnings of a decoder, and I expanded on that.
We never did anything with it, it was just an interesting
challenge.

Interestingly similar scheme. I wonder if 5-bit chars were a
common compression scheme. The Z-machine spec was never
officially published either. I believe a "task force" reverse
engineered it sometime in the 90's.

Skip Montanaro · Jul 9, 2013

I wonder if 5-bit chars were a
common compression scheme.

http://en.wikipedia.org/wiki/List_of_binary_codes

Baudot was pretty common, as I recall, though ASCII and EBCDIC ruled
by the time I started punching cards.

Skip

Dave Angel · Jul 9, 2013

On 07/09/2013 09:00 AM, Neil Cerutti wrote:

Interestingly similar scheme. I wonder if 5-bit chars were a
common compression scheme. The Z-machine spec was never
officially published either. I believe a "task force" reverse
engineered it sometime in the 90's.

Baudot was 5 bits. It used shift-codes to get upper case and digits, if
I recall.

And ASCII was 7 bits so there could be one more for parity.
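
As a rough illustration of the 5-bit packing discussed in this thread, here is a minimal sketch. The alphabet is hypothetical (code 0 is a space, codes 1-26 are 'a'-'z') and the function names pack5 and unpack5 are made up for the example; real schemes such as Baudot or ZSCII use shift codes and differ in detail.

```python
# Hypothetical 5-bit alphabet for illustration only: code 0 is a space,
# codes 1-26 are 'a'-'z'. Real schemes (Baudot, ZSCII) add shift codes.
ALPHABET = " abcdefghijklmnopqrstuvwxyz"


def pack5(text):
    """Pack text into bytes, 5 bits per character, most significant bit first."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        bits = (bits << 5) | ALPHABET.index(ch)
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        # Pad the final partial byte with zero bits.
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack5(data, count):
    """Recover `count` characters from bytes produced by pack5."""
    bits = 0
    nbits = 0
    chars = []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5 and len(chars) < count:
            nbits -= 5
            chars.append(ALPHABET[(bits >> nbits) & 0x1F])
    return "".join(chars)


packed = pack5("hello world")
print(len(packed), unpack5(packed, 11))
```

Eleven characters take 55 bits, so they fit in 7 bytes instead of 11. The character count is passed separately because the padding bits in the last byte would otherwise be ambiguous.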

Steven D'Aprano · Jul 9, 2013

Note the difference between SS and ẞ 'FRANZ-JOSEF-STRAUSS-STRAẞE'


This is a capital Eszett. Which just happens not to exist in German.
Germans do not use this character, it is not available on German
keyboards, and the German spelling rules have you replace ß with SS.
And, surprise surprise, STRASSE is the example the Council for German
Orthography used ([0] page 29, §25 E3).

[0]: http://www.neue-rechtschreibung.de/regelwerk.pdf

Only half-right. Uppercase Eszett has been used in Germany going back at
least to 1879, and appears to be gaining popularity. In 2010 the use of
uppercase ß apparently became mandatory for geographical place names when
written in uppercase in official documentation.

http://opentype.info/blog/2011/01/24/capital-sharp-s/

http://en.wikipedia.org/wiki/Capital_ẞ

Font support is still quite poor, but at least half a dozen Windows 7
fonts provide it, and at least one Mac font.
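
For anyone who wants to poke at the two characters from Python 3, here is a small sketch using only str methods and the standard unicodedata module. The variable names are just for the example.

```python
import unicodedata

small = "\u00df"    # ß, LATIN SMALL LETTER SHARP S
capital = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

print(unicodedata.name(capital))

# Default upper-casing follows the traditional rule and expands ß to SS.
print("stra\u00dfe".upper())

# Lower-casing the capital form gives back the ordinary ß.
print(capital.lower() == small)

# Case folding maps both forms to "ss", which helps caseless comparison.
print("STRA\u1e9eE".casefold())
```

So str.upper() never produces the capital Eszett on its own; code that wants ẞ in uppercase place names has to put it there explicitly.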

wxjmfauth · Jul 10, 2013

For those who are interested. The official proposal request
for the encoding of the Latin uppercase letter Sharp S in
ISO/IEC 10646; DIN (The German Institute for Standardization)
proposal is available on the web. A pdf with the rationale.
I do not remember from where I got it, probably from a German
web site.

Fonts:
I have been watching the inclusion of this glyph for years. More
and more fonts support it. Although available in many fonts,
it is surprisingly not available in Cambria (at least the version
I'm using). STIX does not include it; it has been requested. Ditto
for Latin Modern, the default bundle of fonts for the Unicode
TeX engines.

Last but not least, Python.
Thanks to the Flexible String Representation, it is not
necessary to mention the disastrous, erratic behaviour of
Python, when processing text containing it. It's more than
clear, a serious user willing to process the contents of
'DER GROẞE DUDEN' (a reference German dictionary) will be
better served by using something else.

The irony is that this Flexible String Representation has
been created by a German.

jmf

wxjmfauth · Jul 11, 2013

On Monday, 8 July 2013 19:52:17 UTC+2, Chris Angelico wrote:

Well, there won't be a Python 2.8, so you really should consider
moving at some point. Python 3.3 is already way better than 2.7 in
many ways, 3.4 will improve on 3.3, and the future is pretty clear.
But nobody's forcing you, and 2.7.x will continue to get
bugfix/security releases for a while. (Personally, I'd be happy if
everyone moved off the 2.3/2.4 releases. It's not too hard supporting
2.6+ or 2.7+.)

The thing is, you're thinking about UTF-8, but you should be thinking
about Unicode. I recommend you read these articles:

http://www.joelonsoftware.com/articles/Unicode.html

http://unspecified.wordpress.com/20...e-of-language-level-abstract-unicode-strings/

So long as you are thinking about different groups of characters as
different, and wanting a solution that maps characters down into the
<256 range, you will never be able to cleanly internationalize. With
Python 3.3+, you can ignore the differences between ASCII, BMP, and
SMP characters; they're all just "characters". Everything works
perfectly with Unicode.

-----------

Just to stick with this funny character ẞ, a ucs-2 char
in the Flexible String Representation nomenclature.

It seems to me that, when one needs more than ten bytes
to encode it,
40

this is far away from the perfection.

BTW, for a modern language, is not ucs2 considered
as obsolete since many, many years?

jmf

Chris Angelico · Jul 11, 2013

Just to stick with this funny character ẞ, a ucs-2 char
in the Flexible String Representation nomenclature.

It seems to me that, when one needs more than ten bytes
to encode it,

40

this is far away from the perfection.

Better comparison is to see how much space is used by one copy of it,
and how much by two copies:
2

String objects have overhead. Big deal.
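
A minimal sketch of that comparison, assuming CPython 3.3 or later (the exact sizes are implementation details and differ between builds, so only the differences are meaningful). The helper name marginal_bytes is made up for this example.

```python
import sys


def marginal_bytes(ch):
    """Extra bytes needed for a second copy of ch in the same string."""
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


# An ASCII letter, the capital sharp S (BMP), and a musical symbol (SMP).
for ch in ("a", "\u1e9e", "\U0001d11e"):
    print(ascii(ch), sys.getsizeof(ch), marginal_bytes(ch))
```

The total size of a one-character string is dominated by the fixed object header; the per-character cost is what scales with the length of the text.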

BTW, for a modern language, is not ucs2 considered
as obsolete since many, many years?

Clearly. And similarly, the 16-bit integer has been completely
obsoleted, as there is no reason anyone should ever bother to use it.
Same with the float type - everyone uses double or better these days,
right?

http://www.postgresql.org/docs/current/static/datatype-numeric.html
http://www.cplusplus.com/doc/tutorial/variables/

Nope, nobody uses small integers any more, they're clearly completely obsolete.

ChrisA

wxjmfauth · Jul 11, 2013

On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:

Better comparison is to see how much space is used by one copy of it,
and how much by two copies:

2

String objects have overhead. Big deal.

Clearly. And similarly, the 16-bit integer has been completely
obsoleted, as there is no reason anyone should ever bother to use it.
Same with the float type - everyone uses double or better these days,
right?

http://www.postgresql.org/docs/current/static/datatype-numeric.html
http://www.cplusplus.com/doc/tutorial/variables/

Nope, nobody uses small integers any more, they're clearly completely obsolete.

Sure there is some overhead because a str is a class.
It still remains that a "ẞ" weighs 14 bytes more than
an "a".

In "aẞ", the ẞ weighs 6 bytes.
42

and in "aẞẞ", the ẞ weighs 2 bytes

sys.getsizeof('aẞẞ')

And what to say about this "ucs4" char/string '\U0001d11e' which
weighs 18 bytes more than an "a".
44

A total absurdity. How does this come about? Very simple: once you
split Unicode into subsets, not only do you have to handle these
subsets, you have to create "markers" to differentiate them.
Not only do you produce "markers", you have to handle the
mess generated by these "markers". Hiding these markers
in the overhead of the class does not mean that they should
not be counted as part of the coding scheme. BTW, since
when does a serious coding scheme need an external marker?

1

Shortly, if my algebra is still correct:

(overhead + marker + 2*'a') - (overhead + marker + 'a')
= overhead + marker + 2*'a' - overhead - marker - 'a'
= overhead - overhead + marker - marker + 2*'a' - 'a'
= 0 + 0 + 'a'
= 1

The "marker" has magically disappeared.

jmf

wxjmfauth · Jul 11, 2013

On Thursday, 11 July 2013 20:42:26 UTC+2, (e-mail address removed) wrote:

Chris Angelico · Jul 11, 2013

BTW, since
when does a serious coding scheme need an external marker?

All of them.

Content-type: text/plain; charset=UTF-8
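
A small sketch of why that header matters: the same two bytes turn into different text depending on which charset the reader is told to use. The variable names are only for illustration.

```python
# UTF-8 encoding of the small sharp s (U+00DF): two bytes, C3 9F.
data = "\u00df".encode("utf-8")

# Decoded with the charset the sender declared, the text round-trips.
as_utf8 = data.decode("utf-8")

# Decoded with a wrong guess (Windows-1252 here), each byte becomes its
# own character and the result is mojibake.
as_cp1252 = data.decode("cp1252")

print(data, ascii(as_utf8), ascii(as_cp1252))
```

Nothing in the bytes themselves says which reading is right; the charset label travels outside the encoded data.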

ChrisA

Steven D'Aprano · Jul 11, 2013

And what to say about this "ucs4" char/string '\U0001d11e' which
weighs 18 bytes more than an "a".

44

A total absurdity.

You should stick to Python 3.1 and 3.2 then:

py> import sys
py> print(sys.version)
3.1.3 (r313:86834, Nov 28 2010, 11:28:10)
[GCC 4.4.5]
py> sys.getsizeof('\U0001d11e')
36
py> sys.getsizeof('a')
36

Now all your strings will be just as heavy, every single variable name
and attribute name will use four times as much memory. Happy now?

How does this come about? Very simple: once you split Unicode
into subsets, not only do you have to handle these subsets, you have to
create "markers" to differentiate them. Not only do you produce "markers",
you have to handle the mess generated by these "markers". Hiding these
markers in the overhead of the class does not mean that they should not
be counted as part of the coding scheme. BTW, since when does a serious
coding scheme need an external marker?

Since always.

How do you think that (say) a C compiler can tell the difference between
the long 1199876496 and the float 67923.125? They both have exactly the
same four bytes:

py> import struct
py> struct.pack('f', 67923.125)
b'\x90\xa9\x84G'
py> struct.pack('l', 1199876496)
b'\x90\xa9\x84G'

*Everything* in a computer is bytes. The only way to tell them apart is
by external markers.
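
The native 'l' format is 8 bytes on many 64-bit platforms, so the session above may not reproduce exactly everywhere. A portable sketch of the same point, using explicit little-endian 4-byte formats from the struct module ('<f' and '<i'), might look like this:

```python
import struct

# Pack a float and an int using fixed-size, little-endian 4-byte formats.
as_float = struct.pack("<f", 67923.125)
as_int = struct.pack("<i", 1199876496)
print(as_float, as_int, as_float == as_int)

# Reading the same four bytes back with a different format code gives a
# different value; the format string plays the role of the external marker.
print(struct.unpack("<i", as_float)[0])
print(struct.unpack("<f", as_int)[0])
```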

wxjmfauth · Jul 12, 2013

On Friday, 12 July 2013 05:18:44 UTC+2, Steven D'Aprano wrote:

On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:

Now all your strings will be just as heavy, every single variable name
and attribute name will use four times as much memory. Happy now?

.... cœur = 'heart'
....

- Why always this magic number "four"?
- Are you able, for once, to think non-ASCII?
- Have you ever felt penalized
because you are using fonts with OpenType technology?
- Have you ever had a problem with PDF? I can tell you,
UTF-32 is peanuts compared to the CID fonts you
are using.
- Have you ever toyed with a Unicode TeX engine?
- Have you ever looked at the code of a rendering engine like HarfBuzz?

jmf

Joshua Landau · Jul 12, 2013

There is no symbol for radian because mathematically
radian is a pure number, a unitless number. You can
however specify a = ... in radian (rad).

Isn't a superscript "c" the symbol for radians?

Steven D'Aprano · Jul 13, 2013

Isn't a superscript "c" the symbol for radians?

Only in the sense that a superscript "o" is the symbol for degrees.

Semantically, both degree-sign and radian-sign are different "things"
than merely an o or c in superscript.

Nevertheless, in mathematics at least, it is normal to leave out the
radian sign when talking about angles. By default, "1.2" means "1.2
radians", not "1.2 degrees".

wxjmfauth · Jul 13, 2013

On Saturday, 13 July 2013 11:49:10 UTC+2, Steven D'Aprano wrote:

*May* have, in a *mandatory* way?

JMF, I know you are not a native English speaker, so you might not be
aware just how silly your statement is. If it *may* have, it is optional,
since it *may not* have instead. But if it is optional, it is not
mandatory.

You are making so much fuss over such a simple, obvious implementation
for strings. The language Pike has done the same thing for probably a
decade or so.

Ironically, Python has done the same thing for integers for many versions
too. They just didn't call it "Flexible Integer Representation", but
that's what it is. For integers smaller than 2**31, they are stored as C
longs (plus object overhead). For integers larger than 2**31, they are
promoted to a BigNum implementation that can handle unlimited digits.

Using Python 2.7, where it is more obvious because the BigNum has an L
appended to the display, and a different type:

py> for n in (1, 2**20, 2**30, 2**31, 2**65):
...     print repr(n), type(n), sys.getsizeof(n)
...
1 <type 'int'> 12
1048576 <type 'int'> 12
1073741824 <type 'int'> 12
2147483648L <type 'long'> 18
36893488147419103232L <type 'long'> 22

You have been using Flexible Integer Representation for *years*, and it
works great, and you've never noticed any problems.
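
For readers on Python 3, where int and long were merged into a single type, a rough equivalent of the session above might look like the sketch below. The exact byte counts are CPython implementation details and vary by version and platform; the point is only that storage grows with magnitude while the type stays the same.

```python
import sys

values = (1, 2**20, 2**30, 2**31, 2**65)
sizes = [sys.getsizeof(n) for n in values]

for n, size in zip(values, sizes):
    # In Python 3 every value here is a plain int; only the size changes.
    print(n, type(n).__name__, size)
```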

------

The FSR is naive and badly working. I can not force people
to understand the coding of the characters [*].

I'm the first to recognize that Python and/or Pike are
free to do what they wish.

Luckily, for the crowd, those who do not even know that the
coding of characters exists, all the serious actors active in
text processing are working properly.

jmf

* By nature characters and numbers are different.

Dave Angel · Jul 13, 2013

On 07/13/2013 10:37 AM, (e-mail address removed) wrote:

The FSR is naive and badly working. I can not force people
to understand the coding of the characters [*].

That would be very hard, since you certainly do not.

I'm the first to recognize that Python and/or Pike are
free to do what they wish.

Fortunately for us, Python (in version 3.3 and later) and Pike did it
right. Some day the others may decide to do similarly.

Luckily, for the crowd, those who do not even know that the
coding of characters exists, all the serious actors active in
text processing are working properly.

Here, I'm really glad you don't know English, because if you had a
decent grasp of the language, somebody might assume you knew what you
were talking about.

jmf

* By nature characters and numbers are different.

By nature Jmf has his own distorted reality.

wxjmfauth · Jul 14, 2013

On Saturday, 13 July 2013 21:02:24 UTC+2, Dave Angel wrote:

On 07/13/2013 10:37 AM, (e-mail address removed) wrote:

Fortunately for us, Python (in version 3.3 and later) and Pike did it
right. Some day the others may decide to do similarly.

-----------
Possible, but I doubt it.
For a very simple reason, the latin-1 block: considered
and accepted today as being a Unicode design mistake.

jmf

Steven D'Aprano · Jul 14, 2013

For a very simple reason, the latin-1 block: considered and accepted
today as being a Unicode design mistake.

Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
Character Set", which goes back to 1983. ISO-8859-1 was first published
in 1985, and was in use on Commodore computers the same year.

The concept of Unicode wasn't even started until 1987, and the first
draft wasn't published until the end of 1990. Unicode wasn't considered
ready for production use until 1991, six years after Latin-1 was already
in use in people's computers.

wxjmfauth · Jul 14, 2013

On Sunday, 14 July 2013 12:44:12 UTC+2, Steven D'Aprano wrote:

Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
Character Set", which goes back to 1983. ISO-8859-1 was first published
in 1985, and was in use on Commodore computers the same year.

The concept of Unicode wasn't even started until 1987, and the first
draft wasn't published until the end of 1990. Unicode wasn't considered
ready for production use until 1991, six years after Latin-1 was already
in use in people's computers.

------

"Unicode" (in fact iso-14xxx) was not created in one
night (Deus ex machina).

What counts today is this:

timeit.repeat("a = 'hundred'; 'x' in a")
[0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
timeit.repeat("a = 'hundreœ'; 'x' in a")
[0.23955250303158593, 0.2195812612416752, 0.22133896997401692]

sys.getsizeof('d')
26
sys.getsizeof('œ')
40
sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

jmf
