Unicode values

tamizhselvys · Feb 19, 2008

Hi,

Can anyone explain the difference between Unicode and the hexadecimal
entities used in XML?

Thanks,
Anu.

Andreas Prilop · Feb 19, 2008

> Can anyone explain the difference between Unicode and the hexadecimal
> entities used in XML?

For example, the Devanagari letter 'ka' has the position U+0915
in Unicode and can be referenced in both HTML and XML as &#2325;
(decimal) or as &#x915; (hexadecimal).
http://www.unics.uni-hannover.de/nhtcapri/sanskrit-alphabet
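
To make the decimal/hexadecimal point concrete, here is a minimal Python sketch (Python is only my choice for illustration; nothing in this thread depends on it). It builds both numeric references from the single code point U+0915 and uses the standard library's `html.unescape` to check that they resolve to the same character.

```python
import html

code_point = 0x0915  # DEVANAGARI LETTER KA, the example above

decimal_ref = f"&#{code_point};"   # the number written in base 10
hex_ref = f"&#x{code_point:X};"    # the same number written in base 16

# Both references name the same Unicode character.
ka = chr(code_point)
assert html.unescape(decimal_ref) == ka
assert html.unescape(hex_ref) == ka

print(decimal_ref, hex_ref, f"U+{code_point:04X}")
```

So the two forms are not different things: they are the same Unicode number written in two bases.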

Andy Dingley · Feb 19, 2008

> Can anyone explain the difference between Unicode and the hexadecimal
> entities used in XML?

Try searching for "Jukka Korpela" and Unicode. He has an O'Reilly book
and a very useful website on the topic. Wikipedia is worth reading
too.

"Unicode" defines a "character set". There are also "encodings" that
specify how computers interpret sequences of bytes or numbers to turn
them into characters. There may be many encodings that all specify the
same character in the same character set, which can get complicated.
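
As a rough illustration of "one character, many encodings", the Python sketch below encodes the single character U+00F8 in three encodings I picked as examples (UTF-8, ISO-8859-1 and big-endian UTF-16). The bytes differ each time, but decoding always gives back the same Unicode character.

```python
ch = "\u00f8"  # LATIN SMALL LETTER O WITH STROKE, one Unicode character

# The same character as three different byte sequences.
encoded = {
    "utf-8": ch.encode("utf-8"),
    "iso-8859-1": ch.encode("iso-8859-1"),
    "utf-16-be": ch.encode("utf-16-be"),
}

for name, raw in encoded.items():
    # Decoding with the matching encoding always recovers the same character.
    assert raw.decode(name) == ch
    print(f"{name:12} {raw.hex(' ')}")
```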

Character sets before Unicode tended to work for only one language at
a time. This made them manageably smaller, but also inconvenient for
multi-language work. Unicode takes a different approach: one single,
huge character set for everything.

When you use HTML or XML, there is only _one_ character set that is
ever used: Unicode.

There may be lots of different encodings for an HTML or XML document
(one at a time), but they all lead to Unicode characters. Most
commonly you will specify a character directly (e.g. by typing it),
which also requires you to make sure it's in a suitable encoding for
the document. Alternatively you can use a "numeric character
reference" to specify the Unicode character "ø" by its identifying
number, either in decimal (&#248;) or in hexadecimal (&#xF8;). No
matter what the document's encoding, these same numbers refer to these
same characters: it's skipping the encoding and going straight to
Unicode. This works equally in XML or HTML.
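
A small sketch of that independence, assuming Python's standard `xml.etree.ElementTree` parser as a stand-in for "an XML processor" (the element name `w` is made up for the example). The same character is supplied directly in a UTF-8 document, directly in an ISO-8859-1 document, and as numeric references in a pure US-ASCII document, which could not hold the character any other way.

```python
import xml.etree.ElementTree as ET

O_SLASH = "\u00f8"  # U+00F8, decimal 248

# The character typed directly, in two different document encodings.
doc_utf8 = f'<?xml version="1.0" encoding="UTF-8"?><w>{O_SLASH}</w>'.encode("utf-8")
doc_latin1 = f'<?xml version="1.0" encoding="ISO-8859-1"?><w>{O_SLASH}</w>'.encode("iso-8859-1")

# US-ASCII has no byte for this character, so it is written by number instead,
# once in decimal and once in hexadecimal.
doc_ascii = b'<?xml version="1.0" encoding="US-ASCII"?><w>&#248;&#xF8;</w>'

text_utf8 = ET.fromstring(doc_utf8).text
text_latin1 = ET.fromstring(doc_latin1).text
text_ascii = ET.fromstring(doc_ascii).text

# Python can also produce the decimal reference when a codec lacks the character.
fallback = O_SLASH.encode("ascii", "xmlcharrefreplace")
```

After parsing, all three documents contain the same Unicode character; only the bytes on disk differed.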

For a few of these characters, there are also "character entity
references" defined for HTML, such as &oslash; (meaning the same "o
with a slash" character as before). These are a bit more readable than
the raw numbers. However, remember that they're part of HTML only, not
XML! XML itself predefines just five (&lt;, &gt;, &amp;, &apos; and
&quot;). So you can use the HTML names in XHTML, but not in RSS.
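
To see the HTML-only nature of the named form, here is a hedged Python sketch using only the standard library. `parses_as_xml` is a hypothetical helper written for this example, and the bare `ElementTree` parser stands in for an XML processor that has no DTD declaring the HTML names (roughly the RSS situation).

```python
import html
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# HTML knows the name: it maps to the same number used in the numeric forms.
oslash_number = name2codepoint["oslash"]
named_as_html = html.unescape("&oslash;")


def parses_as_xml(source: str) -> bool:
    """Return True if the string is accepted by a plain XML parser (no DTD)."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


named_in_xml = parses_as_xml("<w>&oslash;</w>")   # HTML name, undefined in plain XML
numeric_in_xml = parses_as_xml("<w>&#xF8;</w>")   # numeric reference
predefined_in_xml = parses_as_xml("<w>&amp;</w>")  # one of XML's own predefined names
```

The numeric reference and XML's own predefined names are accepted, while the HTML name is rejected as an undefined entity.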

(I've confused some definitions here between bytes / octets,
characters / code points and Unicode / UCS / ISO 10646 in an attempt at
brevity, if not clarity. Jukka will probably accuse me of "worthless
babbling" again as a result.)
