Unicode-AGE of a character?

Ilya Zakharevich · Jan 10, 2012

I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

b) manually parsing $out = do 'unicore/To/Age.pl';.

Do I miss anything?

Thanks,
Ilya

Ilya Zakharevich · Jan 10, 2012

I don't think so. Note that before (I think) 5.14 unicore/To/Age.pl
doesn't exist, and before (I think) 5.12 unicore/DAge.txt doesn't exist
either. You may be better off just grabbing a copy of DerivedAge.txt
from the Unicode Consortium directly, and using that.
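
A minimal sketch of that approach, assuming the usual DerivedAge.txt layout of "range ; age # comment". The sample lines are illustrative rather than copied from the file, and the names parse_derived_age and age_from_ranges are made up for this example:

```perl
use strict;
use warnings;

# Illustrative lines in the DerivedAge.txt layout (not copied verbatim);
# in practice, read the real file downloaded from unicode.org instead.
my $sample = <<'END';
# range or single code point ; age
0041..005A    ; 1.1
20AC          ; 2.1
0514..0523    ; 5.1
0526..0527    ; 6.0
END

# Turn the text into a list of [first, last, age] ranges.
sub parse_derived_age {
    my ($text) = @_;
    my @ranges;
    for my $line (split /\n/, $text) {
        $line =~ s/#.*//;    # strip comments
        next unless $line =~ /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\S+)/;
        my ($first, $last, $age) = ($1, $2, $3);
        $last = $first unless defined $last;    # single code point
        push @ranges, [ hex($first), hex($last), $age ];
    }
    return \@ranges;
}

# Linear scan; fine for a sketch, a binary search would suit the full file.
sub age_from_ranges {
    my ($ranges, $code_point) = @_;
    for my $r (@$ranges) {
        return $r->[2] if $code_point >= $r->[0] && $code_point <= $r->[1];
    }
    return;    # not covered by any range: unassigned
}

my $ranges = parse_derived_age($sample);
printf "U+%04X: %s\n", $_, age_from_ranges($ranges, $_) // 'unassigned'
    for 0x0041, 0x20AC, 0x0514;
```

This does not depend on which files a given perl ships in unicore, at the cost of carrying your own copy of the data.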

What would be the best fix? (So far I do not use Perl's digested data
myself; I parse the Unicode Consortium files directly, so I am not
qualified to judge.) Put the stuff into Unicode::UCD::age?

BTW, why does Unicode::UCD have such a bizarre interface? Why not have
Unicode::UCD::Name, for example? (The most important piece of data of
those not available via Perl4 interfaces...)

Ilya

P.S. Is unicore/NamesList.txt included with latest distributions of
Perl? My module relies on parsing this file, and... Aha, found it on

http://cpansearch.perl.org/src/FLORA/perl-5.14.2/lib/unicore/

, good!

Ilya Zakharevich · Jan 11, 2012

Well, they're not strictly 'Perl4' interfaces, of course, since none of
this existed before 5.8...

As far as my memory serves me, lc, /\w/ etc were very well supported
in Perl4. ;-)

Thanks for the other [omitted] input,
Ilya

brian d foy · Jan 16, 2012

Ilya said:
I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

Tom C. and I talked about this for a bit. You're stuck with testing
each of the age properties until you find the earliest that matches.
Or, you can go through all characters, determine their age, and create
your own properties.
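
The "test each one until the earliest matches" approach can be sketched as a loop. The version list and the name age_of are assumptions for illustration; the list has to be extended by hand for newer Unicode releases, and a perl whose unicore lacks one of the listed versions will complain about an unknown property:

```perl
use strict;
use warnings;

# Versions to probe, oldest first. Extend the list to match the Unicode
# version your perl ships with.
my @versions = qw(1.1 2.0 2.1 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0);

# Precompile one pattern per version, e.g. \p{Present_In=5.1}.
my @probes = map {
    my $pattern = "\\p{Present_In=$_}";
    [ $_, qr/$pattern/ ];
} @versions;

# Return the earliest listed version in which the character is present.
sub age_of {
    my ($char) = @_;
    for my $probe (@probes) {
        my ($version, $re) = @$probe;
        return $version if $char =~ $re;
    }
    return;    # unassigned, or newer than the last version listed
}

printf "U+%04X first appears in Unicode %s\n", ord($_), age_of($_) // 'unknown'
    for "A", "\x{0514}", "\x{0526}";
```

Because Present_In matches a version and everything before it, the first hit in an oldest-first list is the age.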

I had to do this when we were tracking down some font issues for
Programming Perl. It turned out that all the problems were related to
Unicode 5 characters.

Helmut Wollmersdorfer · Jan 18, 2012

I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

If you want to know the AGE then you should match the Age property;-)

$ perl -E 'say "matches" if ("\x{0514}" =~ m/\p{Age=5.1}/)'
matches
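
A small sketch of how the two properties differ in Perl: Age matches only the version that introduced the code point, while Present_In matches that version and every later one. The hash layout is just for illustration:

```perl
use strict;
use warnings;

my $lha = "\x{0514}";    # CYRILLIC CAPITAL LETTER LHA, introduced in Unicode 5.1

# 1 if the property matches, 0 otherwise.
my %result = (
    'Age=5.1'        => ( $lha =~ /\p{Age=5.1}/        ? 1 : 0 ),
    'Age=6.0'        => ( $lha =~ /\p{Age=6.0}/        ? 1 : 0 ),
    'Present_In=5.0' => ( $lha =~ /\p{Present_In=5.0}/ ? 1 : 0 ),
    'Present_In=6.0' => ( $lha =~ /\p{Present_In=6.0}/ ? 1 : 0 ),
);

print "$_: ", ( $result{$_} ? "matches" : "no match" ), "\n"
    for sort keys %result;
```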

b) manually parsing $out = do 'unicore/To/Age.pl';.

Or write a better Unicode::UCD module.
I really wanted to do this, because Unicode::UCD does not use the
original UCD--it also uses unicore.

Do I miss anything?

You can install Unicode::Tussle from CPAN. It provides some scripts.

Examples:

$ perl /usr/local/bin/unichars '\p{Age=6.0}' '\p{Cyrillic}' | cat
Ԧ U+0526 CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER
ԧ U+0527 CYRILLIC SMALL LETTER SHHA WITH DESCENDER
Ꙡ U+A660 CYRILLIC CAPITAL LETTER REVERSED TSE
ꙡ U+A661 CYRILLIC SMALL LETTER REVERSED TSE

$ time perl /usr/local/bin/uniprops -au U+0526 | grep -P '(Age|Pre)'
Age=6.0 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Block=Cyrillic_Supplement Block=Cyrillic_Supplementary
Numeric_Value=NaN NV=NaN Present_In=6.0 IN=6.0 SC=Cyrl
Script=Cyrl Sentence_Break=UP Sentence_Break=Upper SB=UP

real 0m1.380s
user 0m1.352s
sys 0m0.040s

You see, that's very good information, but it's very slow.

Another disadvantage of uniprops is that it also uses the unicore files
and thus depends on perl-5.14 (more or less). 5.10 is missing many
properties in unicore.

IMHO you want something that I am also missing:

use Unicode::Properties;

my $u = Unicode::Properties->new();

my $age = $u->get_property($char, 'Age');
my $script = $u->get_property($char, 'Script');

Helmut Wollmersdorfer

tchrist · Feb 15, 2012

I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

b) manually parsing $out = do 'unicore/To/Age.pl';.

Do I miss anything?

I don’t think so. When preparing the 4th Edition of Programming Perl
for printing, we needed to run an analysis of code point use by age. I
ended up doing this:

$char_info->{Age} = do { given ( $char ) {

when( /\p{Age=1.1}/ ) { '1.1' }

when( /\p{Age=2.0}/ ) { '2.0' }
when( /\p{Age=2.1}/ ) { '2.1' }

when( /\p{Age=3.0}/ ) { '3.0' }
when( /\p{Age=3.1}/ ) { '3.1' }
when( /\p{Age=3.2}/ ) { '3.2' }

when( /\p{Age=4.0}/ ) { '4.0' }
when( /\p{Age=4.1}/ ) { '4.1' }

when( /\p{Age=5.0}/ ) { '5.0' }
when( /\p{Age=5.1}/ ) { '5.1' }
when( /\p{Age=5.2}/ ) { '5.2' }

when( /\p{Age=6.0}/ ) { '6.0' }

default { 'N/A' }
} };

Which of course is suboptimal to say the least. I can criticize
it in quite a few directions. But it's what we used anyway.

I believe that Karl has some new stuff in the current blead that
exposes some of the character maps so you don't have to parse
the .pl files yourself. You might check into that.
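
For reference, later releases of Unicode::UCD (perl 5.16 onward, as far as can be told) export prop_invmap and search_invlist. Assuming that is the kind of interface meant here, an Age lookup might look like the sketch below. The name ucd_age is made up, and the step that turns a "V5_1"-style value into "5.1" is an assumption about the returned format:

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invmap search_invlist);

# Inversion map for Age: $starts->[$i] is the first code point of range $i,
# and $ages->[$i] is the Age value that applies to that range.
my ($starts, $ages, $format, $default) = prop_invmap('Age');

sub ucd_age {
    my ($code_point) = @_;
    my $i = search_invlist($starts, $code_point);
    return unless defined $i;
    my $age = $ages->[$i];
    # Assumption: values may be spelled "V5_1"; normalise to "5.1".
    $age =~ s/^V//;
    $age =~ tr/_/./;
    return $age;
}

printf "U+%04X: %s\n", $_, ucd_age($_) // 'unknown' for 0x41, 0x0514, 0x0526;
```

The map is built once, so repeated lookups avoid running a regex per version.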

--tom
