A few questions about encoding

Joel Goldstick · Jun 15, 2013

On 15/6/2013 5:59 PM, Roy Smith wrote:

And, yes, especially in networking, everybody talks about octets when

1 byte = 8 bits

In networking, though, since we do not use variable-length encoding
schemes like UTF-8, how do we tell where a byte value starts and where
it stops?

Do we need a start bit and a stop bit for that?

Dennis Lee Bieber · Jun 15, 2013

It depends on the context.

Maybe the OP should give up on Python and switch to Regina/Rexx...

-=-=-=-=-=-
/* */

numint = 123 /* a "number" */
numstr = def /* unknown variable? */
strstr = "abc" /* a string containing alphabetics */
strint = "456" /* a string containing decimal digits */

signal on syntax name next1
say "Adding strstr and numint"
say strstr + numint
next1:
signal on syntax name next2
say "Adding strstr and strint"
say strstr + strint
next2:
signal on syntax name next3
say "Adding numint and strint"
say numint + strint
next3:
signal on syntax name next4
say "Adding numstr and numint"
say numstr + numint
next4:
say "Concatenate numint and strstr"
say numint || strstr
say "Concatenate strint and numint"
say strint || numint
say "Concatenate numstr and strint"
say numstr || strint
say "Concatenate numstr and numint"
say numstr || numint

-=-=-=-=-=-

E:\UserData\Wulfraed\MYDOCU~1>rexx t.rx
Adding strstr and numint
Adding strstr and strint
Adding numint and strint
579
Adding numstr and numint
Concatenate numint and strstr
123abc
Concatenate strint and numint
456123
Concatenate numstr and strint
DEF456
Concatenate numstr and numint
DEF123

E:\UserData\Wulfraed\MYDOCU~1>

{Pity a SYNTAX error can't be trapped with a CALL; I'd have been able to
cleanly report results and return to the next statement}
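
For comparison, here is a rough Python 3 counterpart of the Rexx experiment above. It is only a sketch: the variable names are carried over from the Rexx script, and `try_add` is a hypothetical helper introduced for illustration. Python does not convert between str and int implicitly, so the mixed additions raise TypeError, and that exception can be caught and reported in place.

```python
numint = 123        # an int
strstr = "abc"      # a string containing alphabetics
strint = "456"      # a string containing decimal digits


def try_add(a, b):
    """Return a + b, or the name of the exception the addition raised."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


print(try_add(strstr, numint))        # str + int is rejected
print(try_add(numint, strint))        # int + str is rejected as well
print(try_add(numint, int(strint)))   # explicit conversion, then addition
print(str(numint) + strstr)           # explicit conversion, then concatenation
```

The conversions have to be spelled out with `int()` and `str()`, which is the main difference from the Rexx behaviour shown in the transcript.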

Nick the Gr33k · Jun 15, 2013

The only thing that I didn't understand is this line.
First, please tell me what a byte value is.

\x1b is a character(ESC) represented in hex format

b'\x1b' is a byte object that represents what?

'\x1b'

After decoding it gives the char ESC in hex format
Shouldn't it result in value 27 which is the ordinal of ESC ?

Why doesn't the Unicode charset just contain characters, instead of
a mapping of (characters <--> ordinals)?

I mean, what we do is encode a character, like chr(65).encode('utf-8')

Why does its corresponding ordinal value exist at all, since
it is not involved in the encoding process?

Thank you very much for taking the time to explain.

Can someone please explain these questions too?

Benjamin Schollnick · Jun 15, 2013

Nick,

I'm sorry are you not listening?

1b is a HEXADECIMAL Number. As a so-called programmer, did you seriously not consider that?

Try this:

1) Open a Web browser
2) Go to Google.com
3) Type in "Hex 1B"
4) Click on the first link
5) In the Hexadecimal column find 1B.

Or open your favorite calculator, and convert Hexadecimal 1B to Decimal (Base 10).
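
The same conversion can be checked in Python 3 itself. This is a small sketch rather than anything from the original posts; it shows that hex 1B, the ordinal of ESC, and the single value held in b'\x1b' are all the number 27 written in different ways.

```python
# Hex 1B and decimal 27 are the same number.
assert int("1B", 16) == 27
assert 0x1B == 27

# '\x1b' is a one-character str (ESC); ord() gives its code point.
assert ord("\x1b") == 27
assert chr(27) == "\x1b"

# b'\x1b' is a bytes object holding one byte whose value is 27.
data = b"\x1b"
assert data[0] == 27
assert list(data) == [27]

# Decoding turns bytes into a str; the repr merely displays ESC as '\x1b'.
assert data.decode("utf-8") == "\x1b"

# For ASCII characters, the UTF-8 byte value equals the code point.
assert chr(65).encode("utf-8") == b"A"
assert list(chr(65).encode("utf-8")) == [65]
print("all checks passed")
```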

- Benjamin

Antoon Pardon · Jun 17, 2013

Op 15-06-13 02:28, Cameron Simpson schreef:

| So, a numeral = a string representation of a number. Is this correct?

No, a numeral is an individual digit from the string representation of a number.
So: 65 requires two numerals: '6' and '5'.

Wrong context. A numeral as an individual digit is when you are talking about
individual characters in a font. In such a context the set of glyphs that
represent a digit are the numerals.

However in a context of programming, numerals in general refer to the set of
strings that represent a number.

Guy Scree · Jun 17, 2013

I recommend that all participants in this thread, especially Alex and
Anton, research the term "Pathological Altruism"

Chris Angelico · Jun 17, 2013

I recommend that all participants in this thread, especially Alex and
Anton, research the term "Pathological Altruism"

I don't intend to buy a book about it, but based on flipping through a
few Google results and snippets, I'm thinking that this is the
"Paladin fault" that I know from Dungeons & Dragons.

ChrisA

Rick Johnson · Jun 20, 2013

Gah! That's twice I've screwed that up.
Sorry about that!

Yeah, and your difficulty explaining the Unicode implementation reminds me of a passage from the Python zen:

"If the implementation is hard to explain, it's a bad idea."

Steven D'Aprano · Jun 20, 2013

Yeah, and your difficulty explaining the Unicode implementation reminds
me of a passage from the Python zen:

"If the implementation is hard to explain, it's a bad idea."

The *implementation* is easy to explain. It's the names of the encodings
which I get tangled up in.

ASCII: Supports exactly 127 code points, each of which takes up exactly 7
bits. Each code point represents a character.

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
about a gazillion other legacy charsets, all of which are mutually
incompatible: supports anything from 127 to 65535 different code points,
usually under 256.

UCS-2: Supports exactly 65535 code points, each of which takes up exactly
two bytes. That's fewer than required, so it is obsoleted by:

UTF-16: Supports all 1114111 code points in the Unicode charset, using a
variable-width system where the most popular characters use exactly two
bytes and the remaining ones use a pair of two-byte surrogates.

UCS-4: Supports exactly 4294967295 code points, each of which takes up
exactly four bytes. That is more than needed for the Unicode charset, so
this is obsoleted by:

UTF-32: Supports all 1114111 code points, using exactly four bytes each.
Code points outside of the range 0 through 1114111 inclusive are an error.

UTF-8: Supports all 1114111 code points, using a variable-width system
where popular ASCII characters require 1 byte, and others use 2, 3 or 4
bytes as needed.

Ignoring the legacy charsets, only UTF-16 is a terribly complicated
implementation, due to the surrogate pairs. But even that is not too bad.
The real complication comes from the interactions between systems which
use different encodings, and that's nothing to do with Unicode.
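
The widths described above can be observed with Python 3's built-in codecs. This is a minimal sketch; `encoded_widths` is a hypothetical helper name, and the four sample characters are arbitrary picks from the ASCII, Latin-1, BMP and non-BMP ranges.

```python
def encoded_widths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    # The -le codec variants are used so that no byte order mark is prepended.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


# ASCII letter, Latin-1 letter, euro sign (BMP), emoji (outside the BMP)
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8, utf16, utf32 = encoded_widths(ch)
    print(f"U+{ord(ch):06X}  utf-8={utf8}  utf-16={utf16}  utf-32={utf32}")
```

Only the last sample needs four bytes in UTF-16, because it is encoded as a surrogate pair; UTF-32 is four bytes throughout.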

MRAB · Jun 20, 2013

The *implementation* is easy to explain. It's the names of the encodings
which I get tangled up in.
You're off by one below!

ASCII: Supports exactly 127 code points, each of which takes up exactly 7
bits. Each code point represents a character.

128 codepoints.

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
about a gazillion other legacy charsets, all of which are mutually
incompatible: supports anything from 127 to 65535 different code points,
usually under 256.

128 to 65536 codepoints.

UCS-2: Supports exactly 65535 code points, each of which takes up exactly
two bytes. That's fewer than required, so it is obsoleted by:

65536 codepoints.

etc.
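
The off-by-one comes from counting code point 0. A short sketch of the arithmetic in Python 3 (the variable names are just for illustration), extended to the Unicode total that the "etc." covers:

```python
import sys

# n bits give 2**n distinct values, numbered 0 through 2**n - 1.
ascii_count = 2 ** 7              # 128 code points, the highest is 127
ucs2_count = 2 ** 16              # 65536 code points, the highest is 65535

# Unicode's highest code point is U+10FFFF; counting 0 gives one more.
unicode_max = 0x10FFFF            # 1114111
unicode_count = unicode_max + 1   # 1114112

print(ascii_count, ucs2_count, unicode_max, unicode_count)
print(sys.maxunicode == unicode_max)  # holds on Python 3.3 and later
```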

Rick Johnson · Jun 20, 2013

The *implementation* is easy to explain. It's the names of
the encodings which I get tangled up in.

Well, ignoring the fact that your last explanation is
still buggy, you have not actually described an
"implementation", no, you've merely generalized (and quite
vaguely, I might add) the technical specification of a few
encodings. Let's ask Wikipedia to enlighten us on the
subject of "implementation":

############################################################
# Define: Implementation #
############################################################
# In computer science, an implementation is a realization #
# of a technical specification or algorithm as a program, #
# software component, or other computer system through #
# computer programming and deployment. Many #
# implementations may exist for a given specification or #
# standard. For example, web browsers contain #
# implementations of World Wide Web Consortium-recommended #
# specifications, and software development tools contain #
# implementations of programming languages. #
############################################################

Do you think someone could reliably implement the alphabet of a new
language in Unicode by using the general outline you
provided? -- again, ignoring your continual fumbling when
explaining that simple generalization

Your generalization is analogous to explaining web browsers
as: "software that allows a user to view web pages in the
range www.*" Do you think someone could implement a web
browser from such limited specification? (if that was all
they knew?).

============================================================
Since we're on the subject of Unicode:
============================================================
One of the most humorous aspects of Unicode is that it has
encodings for Braille characters. Hmm, this presents a
conundrum of sorts. RIDDLE ME THIS?!

Since Braille is a type of "reading" for the blind by
utilizing the sense of touch (therefore DEMANDING 3
dimensions) and glyphs derived from Unicode are
restrictively two dimensional, because let's face it people,
Unicode exists in your computer, and computer screens are
two dimensional... but you already knew that -- i think?,
then what is the purpose of a Unicode Braille character set?

That should haunt your nightmares for some time.

Andrew Berg · Jun 20, 2013

One of the most humorous aspects of Unicode is that it has
encodings for Braille characters. Hmm, this presents a
conundrum of sorts. RIDDLE ME THIS?!

Since Braille is a type of "reading" for the blind by
utilizing the sense of touch (therefore DEMANDING 3
dimensions) and glyphs derived from Unicode are
restrictively two dimensional, because let's face it people,
Unicode exists in your computer, and computer screens are
two dimensional... but you already knew that -- i think?,
then what is the purpose of a Unicode Braille character set?

Two dimensional characters can be made into 3 dimensional shapes.
Building numbers are a good example of this.
We already have one Unicode troll; do we really need you too?

Rick Johnson · Jun 20, 2013

On 2013.06.20 08:40, Rick Johnson wrote:
Two dimensional characters can be made into 3 dimensional shapes.

Yes in the real world. But what about on your computer
screen? How do you plan on creating tactile representations of
braille glyphs on my monitor? Hey, if you can already do this,
please share, as it sure would make internet porn more
interesting!

Building numbers are a good example of this.

Either the matrix is reality or you must live inside your
computer as a virtual being. Is your name Tron? Are you a pawn
of Master Control? He's such a tyrant!

Chris Angelico · Jun 20, 2013

Yes in the real world. But what about on your computer
screen? How do you plan on creating tactile representations of
braille glyphs on my monitor? Hey, if you can already do this,
please share, as it sure would make internet porn more
interesting!

I had a device for creating embossed text. It predated Unicode by a
couple of years at least (not sure how many, because I was fairly
young at the time). It was made by a company called Epson, it plugged
into the computer via a 25-pin plug, and when it was properly
functioning, it had a ribbon of ink that it would bash through to
darken the underside of the embossed text. But sometimes that ribbon
slipped out of position, and we had beautifully-hammered ASCII text,
unsullied by ink. And since the device did graphics too, it could be
used for the entire Unicode character set if you wanted.

Not sure that it would improve your porn any, but I've no doubt you
could try if you wanted.

ChrisA

Chris Angelico · Jun 20, 2013

Your generalization is analogous to explaining web browsers
as: "software that allows a user to view web pages in the
range www.*" Do you think someone could implement a web
browser from such limited specification? (if that was all
they knew?).

Wow. That spec isn't limited, it's downright faulty. Or do you really
think that (a) there is such a thing as the "range www.*", and that
(b) that "range" has anything to do with web browsers?

ChrisA

wxjmfauth · Jun 20, 2013

Le jeudi 20 juin 2013 13:43:28 UTC+2, MRAB a écrit :

You're off by one below!

128 codepoints.

128 to 65536 codepoints.

65536 codepoints.

etc.

And all these coding schemes have something in common:
they all work with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (bytes)).

That is just what the flexible string representation does
not do: it artificially divides Unicode into subsets and tries
to handle each subset differently.

On the other hand, it is because it is impossible to
work properly with multiple sets of encoded code points
that all these coding schemes exist today. There is simply
no other way.

Even "exotic" schemes like "CID-fonts" used in pdf
are based on that scheme.

jmf

Chris Angelico · Jun 20, 2013

And all these coding schemes have something in common:
they all work with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (bytes)).

That is just what the flexible string representation does
not do: it artificially divides Unicode into subsets and tries
to handle each subset differently.

UTF-16 divides Unicode into two subsets: BMP characters (encoded using
one 16-bit unit) and astral characters (encoded using two 16-bit units
in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
builds are guilty of exactly the same crime as the hated 3.3.
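
The two-unit encoding of astral characters can be worked through by hand. This is a sketch of the standard UTF-16 surrogate arithmetic in Python 3; `to_surrogates` is a hypothetical helper name, and U+1F600 is an arbitrary example code point.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000          # leaves 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```

Both units fall in the range 0xD800 through 0xDFFF, whose top five bits are 11011; that is the "D800::/5" block referred to above.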

ChrisA

MRAB · Jun 20, 2013

UTF-16 divides Unicode into two subsets: BMP characters (encoded using
one 16-bit unit) and astral characters (encoded using two 16-bit units
in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
builds are guilty of exactly the same crime as the hated 3.3.

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!
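
The boundaries between those subsets can be probed with Python 3's UTF-8 codec. A minimal sketch, with `utf8_len` as a hypothetical helper name; the code points listed are the first and last of each length class.

```python
def utf8_len(code_point):
    """Number of bytes Python's UTF-8 encoder emits for one code point."""
    return len(chr(code_point).encode("utf-8"))


# First and last code point of each UTF-8 length class.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    print(f"U+{cp:04X} -> {utf8_len(cp)} byte(s)")
```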

Chris Angelico · Jun 20, 2013

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!

Yes, but there's never (AFAIK) been a Python implementation that
represents strings in UTF-8; UTF-16 was one of two options for Python
2.2 through 3.2, and is the one that jmf always seems to be measuring
against.
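
For the string representation that replaced those builds in 3.3, the per-character storage can be estimated from inside Python. This is a rough sketch that assumes CPython 3.3 or later; `sys.getsizeof` reports implementation-specific sizes, and `bytes_per_char` is a hypothetical helper name.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of one repeated char."""
    # Subtracting the size of the one-character string cancels the fixed header.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):06X}", bytes_per_char(ch))
```

On CPython 3.3 and later this is expected to report 1, 1, 2 and 4 bytes per character: the width is chosen per string from its widest code point, and indexing remains a fixed-width lookup within any one string.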

ChrisA

Jussi Piitulainen · Jun 20, 2013

Rick said:
Yes in the real world. But what about on your computer screen? How
do you plan on creating tactile representations of braille glyphs on
my monitor? Hey, if you can already do this, please share, as it
sure would make internet porn more interesting!

Search for braille display on the web. A wikipedia article also led me
to braille e-book. (Or search for braille porn, since you are so
inclined - the concept turns out to be already out there on the web.)
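
As for what the Braille block encodes: each character in the range U+2800 through U+28FF stands for one cell, with the raised dots stored as bits, which is the kind of text a braille display or embosser can be driven from. A minimal sketch in Python 3, with `braille_cell` as a hypothetical helper name:

```python
import unicodedata

BRAILLE_BASE = 0x2800  # U+2800 BRAILLE PATTERN BLANK


def braille_cell(*dots):
    """Return the Braille pattern character with the given dots (1-8) raised."""
    bits = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError("dot numbers run from 1 to 8")
        bits |= 1 << (dot - 1)   # dot n corresponds to bit n - 1
    return chr(BRAILLE_BASE + bits)


cell = braille_cell(1, 3)
print(f"U+{ord(cell):04X}", unicodedata.name(cell))
```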
