Unicode-AGE of a character?

Ilya Zakharevich

I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

b) manually parsing $out = do 'unicore/To/Age.pl';.

Am I missing anything?

Thanks,
Ilya
 
Ilya Zakharevich

I don't think so. Note that before (I think) 5.14 unicore/To/Age.pl
doesn't exist, and before (I think) 5.12 unicore/DAge.txt doesn't exist
either. You may be better off just grabbing a copy of DerivedAge.txt
from the Unicode Consortium directly, and using that.
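For what it's worth, parsing DerivedAge.txt directly is not much code. Below is a minimal sketch, assuming the usual UCD line layout (`START..END ; AGE # comment`, or a single code point instead of a range). The sub names `parse_derived_age` and `age_of` are made up for illustration, and the sample data is a hand-written excerpt in the file's style, not the real file; a real run would open a downloaded copy instead.

```perl
use strict;
use warnings;

# Parse DerivedAge.txt-style lines into [start, end, age] triples.
# Takes a filehandle, so a downloaded file or an in-memory string both work.
sub parse_derived_age {
    my ($fh) = @_;
    my @ranges;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;    # strip comments
        next unless $line =~ /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\d+\.\d+)/;
        my ($first, $last, $age) = ($1, $2, $3);
        my $start = hex $first;
        my $end   = defined $last ? hex $last : $start;
        push @ranges, [ $start, $end, $age ];
    }
    return [ sort { $a->[0] <=> $b->[0] } @ranges ];
}

# Binary search over the sorted ranges; returns undef for code points
# that fall in no range (i.e. unassigned in the data that was loaded).
sub age_of {
    my ($ranges, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($start, $end, $age) = @{ $ranges->[$mid] };
        if    ($cp < $start) { $hi = $mid - 1 }
        elsif ($cp > $end)   { $lo = $mid + 1 }
        else                 { return $age }
    }
    return;
}

# Illustrative excerpt only; replace with open(my $fh, '<', 'DerivedAge.txt').
my $sample = <<'END';
# sample lines in DerivedAge.txt style
0514..0523    ; 5.1 #  [16] ...
0526..0527    ; 6.0 #   [2] CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER..
A660..A661    ; 6.0 #   [2] CYRILLIC CAPITAL LETTER REVERSED TSE..
END

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $ranges = parse_derived_age($fh);
for my $cp (0x0514, 0x0526, 0x0525) {
    printf "U+%04X => %s\n", $cp, age_of($ranges, $cp) // 'unassigned';
}
```

This does not depend on which files a given Perl release ships under unicore/, which is the point of fetching the file yourself.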

What would be the best fix? (So far I have not used Perl's digested
data myself; I parse the Unicode Consortium files directly, so I am
not qualified to judge.) Put the stuff into Unicode::UCD::age?

BTW, why does Unicode::UCD have such a bizarre interface? Why not have
Unicode::UCD::Name, for example? (The most important piece of data of
those not available via Perl4 interfaces...)

Ilya

P.S. Is unicore/NamesList.txt included with the latest distributions of
Perl? My module relies on parsing this file, and... Aha, found it at
http://cpansearch.perl.org/src/FLORA/perl-5.14.2/lib/unicore/ , good!
 
Ilya Zakharevich

Well, they're not strictly 'Perl4' interfaces, of course, since none of
this existed before 5.8...

As far as my memory serves me, lc, /\w/ etc were very well supported
in Perl4. ;-)

Thanks for the other [omitted] input,
Ilya
 
brian d foy

Ilya said:
I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

Tom C. and I talked about this for a bit. You're stuck with testing
each of the age properties until you find the earliest that matches.
Or, you can go through all characters, determine their age, and create
your own properties.
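The "test each one until the earliest matches" approach can be sketched as a loop. This is only a rough illustration, assuming a Perl whose Unicode tables cover the versions listed (5.14 or later for 6.0); the sub name `age_by_probing` and the version list are my own and would need extending for newer Unicode releases.

```perl
use strict;
use warnings;

# Unicode versions to probe, oldest first.
my @VERSIONS = qw(1.1 2.0 2.1 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0);

# Build one regex per version. The trial match inside eval forces the
# property lookup, so versions unknown to this Perl are skipped quietly.
my @PROBES;
for my $v (@VERSIONS) {
    my $re = eval {
        my $candidate = qr/\p{Present_In: $v}/;
        my $unused = "A" =~ $candidate;
        $candidate;
    } or next;
    push @PROBES, [ $v, $re ];
}

# Return the earliest listed version in which the character is present,
# or undef if none of the probes match.
sub age_by_probing {
    my ($char) = @_;
    for my $probe (@PROBES) {
        my ($v, $re) = @$probe;
        return $v if $char =~ $re;
    }
    return;
}

for my $char ("A", "\x{0514}", "\x{0526}") {
    printf "U+%04X => %s\n", ord($char), age_by_probing($char) // 'N/A';
}
```

Because Present_In is cumulative (present in that version or any earlier one), the first hit in oldest-first order is the age.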

I had to do this when we were tracking down some font issues for
Programming Perl. It turned out that all the problems were related to
Unicode 5 characters.
 
Helmut Wollmersdorfer

I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

If you want to know the AGE then you should match the Age property ;-)

$ perl -E 'say "matches" if ("\x{0514}" =~ m/\p{Age=5.1}/)'
matches

b) manually parsing $out = do 'unicore/To/Age.pl';.

Or write a better Unicode::UCD module.
I really wanted to do this, because Unicode::UCD does not use the
original UCD--it also uses unicore.

Am I missing anything?

You can install Unicode::Tussle from CPAN. It provides some scripts.

Examples:

$ perl /usr/local/bin/unichars '\p{Age=6.0}' '\p{Cyrillic}' | cat
Ԧ U+0526 CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER
ԧ U+0527 CYRILLIC SMALL LETTER SHHA WITH DESCENDER
Ꙡ U+A660 CYRILLIC CAPITAL LETTER REVERSED TSE
ꙡ U+A661 CYRILLIC SMALL LETTER REVERSED TSE

$ time perl /usr/local/bin/uniprops -au U+0526 | grep -P '(Age|Pre)'
Age=6.0 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Block=Cyrillic_Supplement Block=Cyrillic_Supplementary
Numeric_Value=NaN NV=NaN Present_In=6.0 IN=6.0 SC=Cyrl
Script=Cyrl Sentence_Break=UP Sentence_Break=Upper SB=UP

real 0m1.380s
user 0m1.352s
sys 0m0.040s

As you can see, that's very nice information, but it's very slow.

Another disadvantage of uniprops is that it also uses the unicore files
and thus depends on perl-5.14 (more or less). Perl 5.10 lacks many
properties in unicore.

IMHO you want something that I am also missing:

use Unicode::properties;

my $u = Unicode::properties->new();

my $age = $u->get_property($char, 'Age');
my $script = $u->get_property($char, 'Script');
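For readers on a newer Perl: something close to that interface can be approximated with Unicode::UCD's charprop() function, which I believe arrived with Perl 5.22 (well after this thread). A minimal sketch under that assumption follows; the package name My::UniProps is hypothetical, and the exact spelling of the returned Age value (for example "V6_0" versus "6.0") is not something I would rely on without checking.

```perl
use strict;
use warnings;
use Unicode::UCD ();    # charprop() assumed available (Perl 5.22 or later)

# Thin wrapper imitating the wished-for get_property($char, $name) call.
# My::UniProps is a made-up name, not a CPAN module.
package My::UniProps {
    sub new { return bless {}, shift }

    sub get_property {
        my ($self, $char, $prop) = @_;
        # charprop() takes a code point, so convert the character first.
        return Unicode::UCD::charprop(ord($char), $prop);
    }
}

my $u = My::UniProps->new;
for my $prop (qw(Age Script)) {
    print "$prop: ", $u->get_property("\x{0526}", $prop), "\n";
}
```

Like uniprops, this still reads Perl's own unicore data, so it only knows the Unicode version bundled with the running Perl.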

Helmut Wollmersdorfer
 
tchrist

I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

b) manually parsing $out = do 'unicore/To/Age.pl';.

Am I missing anything?

I don’t think so. When preparing the 4th Edition of Programming Perl
for printing, we needed to run an analysis of code point use by age. I
ended up doing this:

use v5.14;  # enables given/when; given yielding a value inside do {} needs 5.14
$char_info->{Age} = do { given ( $char ) {

when( /\p{Age=1.1}/ ) { '1.1' }

when( /\p{Age=2.0}/ ) { '2.0' }
when( /\p{Age=2.1}/ ) { '2.1' }

when( /\p{Age=3.0}/ ) { '3.0' }
when( /\p{Age=3.1}/ ) { '3.1' }
when( /\p{Age=3.2}/ ) { '3.2' }

when( /\p{Age=4.0}/ ) { '4.0' }
when( /\p{Age=4.1}/ ) { '4.1' }

when( /\p{Age=5.0}/ ) { '5.0' }
when( /\p{Age=5.1}/ ) { '5.1' }
when( /\p{Age=5.2}/ ) { '5.2' }

when( /\p{Age=6.0}/ ) { '6.0' }

default { 'N/A' }
} };

Which of course is suboptimal, to say the least. I can criticize
it in quite a few directions. But it's what we used anyway.

I believe that Karl has some new stuff in the current blead that
exposes some of the character maps so you don't have to parse
the .pl files yourself. You might check into that.
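As far as I can tell, that work surfaced as prop_invmap() and search_invlist() in Unicode::UCD. Here is a hedged sketch of an Age lookup built on them, assuming a Perl recent enough to export both; the sub name `age_from_invmap` is my own, and the normalization step is there because I am not certain whether the map spells values as "6.0" or "V6_0".

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invmap search_invlist);

# Fetch the inversion map for Age once. $list holds the first code point
# of each range; $map holds the property value for the matching range.
my ($list, $map, $format, $default) = prop_invmap('Age')
    or die "this Unicode::UCD has no inversion map for Age";

sub age_from_invmap {
    my ($cp) = @_;
    my $i = search_invlist($list, $cp);   # index of the range containing $cp
    return unless defined $i;
    my $age = $map->[$i];
    # Normalize a "V6_0"-style spelling to "6.0"; other values pass through.
    $age =~ s/^V(\d+)_(\d+)$/$1.$2/;
    return $age;
}

for my $cp (0x41, 0x0514, 0x0526) {
    printf "U+%04X => %s\n", $cp, age_from_invmap($cp);
}
```

The map is loaded once and each lookup is a binary search, so this avoids both the chain of regex tests and hand-parsing the .pl files.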

--tom
 
