Unicode "table of character" implementation in Python

Nicolas Pontoizeau

Hi,

I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of character"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

Thanks in advance
 
Brian Beck

Nicolas said:
I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of character"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
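
To make the unicodedata suggestion concrete, here is a minimal sketch. It
assumes Python 3 (where str is already Unicode), and the helper name
describe is made up for illustration. unicodedata.name() returns the
character's Unicode name, which usually begins with the script name; the
module does not offer a block lookup, so this is only a rough hint.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name) for a single character."""
    # The second argument to name() is a fallback for unnamed code points.
    return "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")

for ch in "\u00e9漢あ한":
    print(describe(ch))
# ('U+00E9', 'LATIN SMALL LETTER E WITH ACUTE')
# ('U+6F22', 'CJK UNIFIED IDEOGRAPH-6F22')
# ('U+3042', 'HIRAGANA LETTER A')
# ('U+D55C', 'HANGUL SYLLABLE HAN')
```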
 
Nicolas Pontoizeau

On 2006/8/22, Brian Beck said:
Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt

As usual, Python has a solution that goes beyond my needs!
Thanks for the links; I will dive into them.

Nicolas
 
Martin v. Löwis

Nicolas said:
I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of character"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

This is a bit unspecific, so most likely nothing that already exists will
be completely correct for your needs. If you need to escape characters
for LaTeX, I would expect there to be a more precise specification
of what you need to escape. I doubt that the fact that a character is used
primarily in Asia matters much to LaTeX.

In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
2D00..2D2F; Georgian Supplement
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs
FB50..FDFF; Arabic Presentation Forms-A
FE30..FE4F; CJK Compatibility Forms
FE70..FEFF; Arabic Presentation Forms-B
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement

Notice that some scripts are used both in Asia and elsewhere,
e.g. Latin and Cyrillic. Arabic probably doesn't belong in
this list, either, being used both in Asia and elsewhere
as the script of the official language.
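
For the original tagging problem, ranges like these can be turned into a
regular expression. The following is only a sketch under assumptions:
Python 3.3 or later (for \U escapes in patterns), a hand-picked subset of
the ranges above, and a hypothetical LaTeX macro named \asian that the
document would have to define itself.

```python
import re

# A subset of the block ranges listed above; extend as needed.
RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x3000, 0x303F),    # CJK Symbols and Punctuation
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

# Build one character class such as [\U00003040-\U0000309F...]+ so that a
# run of consecutive matching characters is wrapped once, not per character.
ASIAN_RUN = re.compile(
    "[%s]+" % "".join("\\U%08X-\\U%08X" % r for r in RANGES)
)

def tag_asian(text, macro="asian"):
    """Wrap each run of characters from RANGES in \\macro{...}."""
    return ASIAN_RUN.sub(lambda m: "\\%s{%s}" % (macro, m.group(0)), text)

print(tag_asian("Bonjour 日本語 and hello"))
# Bonjour \asian{日本語} and hello
```

Wrapping whole runs keeps the LaTeX output readable; text without any
character from the chosen ranges passes through unchanged.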

Regards,
Martin
 
Tim Roberts

Martin v. Löwis said:
In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
...

This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?
 
Martin v. Löwis

Tim said:
This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?

It's part of the Unicode Consortium's database (UCD, Unicode Character
Database). This specific table is called "code blocks":

http://www.unicode.org/Public/UNIDATA/Blocks.txt

Python currently does not have this table compiled in, but it should be
trivial to compile it into a pure-Python table (either as a
dictionary, or as a list of triples).
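
As a rough sketch of the "list of triples" idea: the code below parses
text in the Blocks.txt format ("start..end; name", with # comments) and
looks a character up by binary search. It assumes Python 3; the names
parse_blocks, block_of and SAMPLE are made up for illustration, and SAMPLE
is only a short excerpt standing in for the real file.

```python
import bisect

SAMPLE = """\
# Excerpt in the format of Blocks.txt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
3040..309F; Hiragana
30A0..30FF; Katakana
4E00..9FFF; CJK Unified Ideographs
AC00..D7AF; Hangul Syllables
"""

def parse_blocks(text):
    """Return a sorted list of (start, end, name) triples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks

BLOCKS = parse_blocks(SAMPLE)
STARTS = [start for start, _end, _name in BLOCKS]

def block_of(ch):
    """Return the block name for a character, or None if not in the table."""
    cp = ord(ch)
    i = bisect.bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None

print(block_of("あ"))  # Hiragana
print(block_of("漢"))  # CJK Unified Ideographs
```

With the full Blocks.txt contents passed to parse_blocks instead of
SAMPLE, the same lookup would cover every block.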

Regards,
Martin
 
