unicode "table of character" implementation in python

Nicolas Pontoizeau · Aug 22, 2006

Hi,

I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of characters"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

Thanks in advance

Brian Beck · Aug 22, 2006

Nicolas said:
I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of characters"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
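
For what it's worth, here is a minimal sketch of what unicodedata reports for a single character, assuming Python 3. The describe() helper is a made-up name for illustration. The module gives a name and a general category, and the name often hints at the script, but it does not return a block or a language:

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character.

    describe() is a hypothetical helper, not part of unicodedata.
    """
    code_point = "U+%04X" % ord(ch)
    # name() raises ValueError for unnamed characters unless a default is given.
    name = unicodedata.name(ch, "<unnamed>")
    return (code_point, name, unicodedata.category(ch))

# e-acute, a CJK ideograph, a Hiragana letter, a Hangul syllable
for ch in "\u00e9\u4e2d\u3042\ud55c":
    print(describe(ch))
```

The script shows up only as the first words of the name (LATIN, CJK, HIRAGANA, HANGUL), so a range table is still the more dependable way to classify characters.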

Nicolas Pontoizeau · Aug 22, 2006

Brian Beck said:
Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt

As usual, Python has a solution that goes beyond my needs!
Thanks for the links; I will dive into them.

Nicolas

Martin v. Löwis · Aug 28, 2006

Nicolas said:
I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of characters"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

This is a bit unspecific, so it is likely that nothing that already
exists will be completely correct for your needs. If you need to escape
characters for LaTeX, I would expect that there is a more precise
specification of what you need to escape. I doubt that the fact that a
character is used primarily in Asia matters much to LaTeX.

In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
2D00..2D2F; Georgian Supplement
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs
FB50..FDFF; Arabic Presentation Forms-A
FE30..FE4F; CJK Compatibility Forms
FE70..FEFF; Arabic Presentation Forms-B
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement

Notice that some scripts are used both in Asia and elsewhere,
e.g. Latin and Cyrillic. Arabic probably doesn't belong in
this list, either, being used both in Asia and elsewhere
as the script of the official language.
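
As a rough sketch of how such ranges could be used for the original task (assuming Python 3): the code below groups consecutive characters that fall into a handful of the ranges above and wraps each run in a tag. The names ASIAN_RANGES, is_asian and tag_asian are made up for illustration, the range selection is only a subset of the list, and \asian{...} is a placeholder, not a real LaTeX macro.

```python
from itertools import groupby

# A subset of the ranges listed above, chosen only for illustration.
ASIAN_RANGES = [
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

def is_asian(ch):
    """True if the character's code point lies in one of ASIAN_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ASIAN_RANGES)

def tag_asian(text, open_tag="\\asian{", close_tag="}"):
    """Wrap each run of consecutive matching characters in the given tag.

    The default tag is a hypothetical placeholder; substitute whatever
    your LaTeX setup actually expects.
    """
    out = []
    for matched, run in groupby(text, key=is_asian):
        chunk = "".join(run)
        out.append(open_tag + chunk + close_tag if matched else chunk)
    return "".join(out)

print(tag_asian("Bonjour \u4e2d\u6587 and \u3042"))
```

Wrapping whole runs instead of single characters keeps the output smaller. Spaces and punctuation between two runs are not in the table, so they end a run.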

Regards,
Martin

Tim Roberts · Aug 30, 2006

Martin v. Löwis said:
In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
...

This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?

Martin v. Löwis · Sep 9, 2006

Tim said:
This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?

It's part of the Unicode Consortium's database (UCD, Unicode Character
Database). This specific table is called "code blocks":

http://www.unicode.org/Public/UNIDATA/Blocks.txt

Python currently does not have this table compiled in, but it should be
trivial to turn it into a pure-Python table (either as a
dictionary, or a list of triples).
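
A minimal sketch of the list-of-triples variant, assuming Python 3. To stay self-contained it parses a short excerpt in the Blocks.txt format instead of the full file, and parse_blocks and block_of are made-up names for illustration:

```python
import bisect

# Short excerpt in the "START..END; Name" format used by Blocks.txt.
SAMPLE = """\
# comment lines and blank lines are skipped
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
3040..309F; Hiragana
30A0..30FF; Katakana
4E00..9FFF; CJK Unified Ideographs
AC00..D7AF; Hangul Syllables
"""

def parse_blocks(text):
    """Parse Blocks.txt-style text into a sorted list of (start, end, name)."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks

def block_of(ch, blocks):
    """Return the block name for a character, or None if no block matches."""
    cp = ord(ch)
    # Rebuilt on every call to keep the sketch short; cache it for real use.
    starts = [start for start, _end, _name in blocks]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and cp <= blocks[i][1]:
        return blocks[i][2]
    return None

BLOCKS = parse_blocks(SAMPLE)
print(block_of("\u4e2d", BLOCKS))
print(block_of("\u00e9", BLOCKS))
```

With the real file, read its contents and pass them to parse_blocks() in place of SAMPLE. Because the blocks do not overlap, a binary search over the start values is enough to find the one candidate block.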

Regards,
Martin
