Unicode-AGE of a character?

Discussion in 'Perl Misc' started by Ilya Zakharevich, Jan 10, 2012.

  1. I looked through the docs I could find, and can't find any way to
    determine the "Unicode AGE" of a particular codepoint except for:

    a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

    b) manually parsing $out = do 'unicore/To/Age.pl';.

    Am I missing anything?

    Thanks,
    Ilya
    Ilya Zakharevich, Jan 10, 2012
    #1

  2. Re: UCD, and Unicode-AGE of a character?

    On 2012-01-10, Ben Morrow wrote:
    >
    > Quoth Ilya Zakharevich:
    >> I looked through the docs I could find, and can't find any way to
    >> determine the "Unicode AGE" of a particular codepoint except for:
    >>
    >> a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;
    >>
    >> b) manually parsing $out = do 'unicore/To/Age.pl';.
    >>
    >> Am I missing anything?

    >
    > I don't think so. Note that before (I think) 5.14 unicore/To/Age.pl
    > doesn't exist, and before (I think) 5.12 unicore/DAge.txt doesn't exist
    > either. You may be better off just grabbing a copy of DerivedAge.txt
    > from the Unicode Consortium directly, and using that.


    What would be the best fix? (Myself, so far I do not use Perl's
    digested data, and parse Unicode Consortium files directly - so I
    do not qualify to judge.) Put the stuff into Unicode::UCD::age?

    BTW, why does Unicode::UCD have such a bizarre interface? Why not have
    Unicode::UCD::Name, for example? (The most important piece of data of
    those not available via Perl4 interfaces...)

    Ilya

    P.S. Is unicore/NamesList.txt included with latest distributions of
    Perl? My module relies on parsing this file, and... Aha, found it on

    http://cpansearch.perl.org/src/FLORA/perl-5.14.2/lib/unicore/

    , good!
    Ilya Zakharevich, Jan 10, 2012
    #2
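Ben Morrow's suggestion above (use DerivedAge.txt from the Unicode Consortium directly) could look roughly like the sketch below. The function names `parse_derived_age` and `age_from_ranges` are made up for illustration, and the embedded sample is only a three-line excerpt in the file's layout; a real run would read the full downloaded file instead.

```perl
use strict;
use warnings;

# A few lines in the DerivedAge.txt layout (illustrative excerpt only).
my $sample = <<'END';
# comment lines and blank lines are skipped
0514..0523    ; 5.1 #  [16] CYRILLIC CAPITAL LETTER LHA..CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK
0526..0527    ; 6.0 #   [2] CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER..CYRILLIC SMALL LETTER SHHA WITH DESCENDER
20AC          ; 2.1 #       EURO SIGN
END

# Parse "START[..END] ; AGE # comment" lines into sorted [start, end, age] triples.
sub parse_derived_age {
    my ($text) = @_;
    my @ranges;
    for my $line (split /\n/, $text) {
        $line =~ s/#.*//;                      # drop the trailing comment
        next unless $line =~ /\S/;             # skip blank lines
        my ($range, $age) = $line =~ /^\s*([0-9A-Fa-f.]+)\s*;\s*(\S+)/ or next;
        my ($start, $end) = split /\.\./, $range;
        $end = $start unless defined $end;     # single code point, not a range
        push @ranges, [ hex $start, hex $end, $age ];
    }
    return [ sort { $a->[0] <=> $b->[0] } @ranges ];
}

# Binary search over the sorted, non-overlapping ranges.
sub age_from_ranges {
    my ($ranges, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($start, $end, $age) = @{ $ranges->[$mid] };
        if    ($cp < $start) { $hi = $mid - 1 }
        elsif ($cp > $end)   { $lo = $mid + 1 }
        else                 { return $age }
    }
    return undef;   # not listed in the data that was parsed
}

my $ranges = parse_derived_age($sample);
printf "U+%04X => %s\n", $_, age_from_ranges($ranges, $_) // 'not in this excerpt'
    for 0x0514, 0x0527, 0x20AC, 0x0041;
```

This has no dependency on which unicore files a given Perl ships, which is the point of the suggestion; the cost is that you have to keep your copy of DerivedAge.txt current yourself.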

  3. Re: UCD, and Unicode-AGE of a character?

    On 2012-01-10, Ben Morrow wrote:
    >> BTW, why does Unicode::UCD have such a bizarre interface? Why not have
    >> Unicode::UCD::Name, for example?
    >> (The most important piece of data of those not available via Perl4
    >> interfaces...)

    >
    > Well, they're not strictly 'Perl4' interfaces, of course, since none of
    > this existed before 5.8...


    As far as my memory serves me, lc, /\w/ etc were very well supported
    in Perl4. ;-)

    Thanks for the other [omitted] input,
    Ilya
    Ilya Zakharevich, Jan 11, 2012
    #3
  4. brian d foy

    Ilya Zakharevich wrote:

    > I looked through the docs I could find, and can't find any way to
    > determine the "Unicode AGE" of a particular codepoint except for:
    >
    > a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;


    Tom C. and I talked about this for a bit. You're stuck with testing
    each of the age properties until you find the earliest that matches.
    Or, you can go through all characters, determine their age, and create
    your own properties.

    I had to do this when we were tracking down some font issues for
    Programming Perl. It turned out that all the problems were related to
    Unicode 5 characters.
    brian d foy, Jan 16, 2012
    #4
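A minimal sketch of "test each property until you find the earliest that matches", using Present_In. Note that \p{Present_In=X} matches anything assigned in version X or earlier, whereas \p{Age=X} matches only what was first assigned in X, so with Present_In the versions must be tried oldest first. The version list and the name `earliest_version` are assumptions for illustration; the list stops at 6.0, the newest version discussed in this thread.

```perl
use strict;
use warnings;

# Unicode versions up to 6.0, oldest first; extend for newer Perls.
my @VERSIONS = qw(1.1 2.0 2.1 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0);

# Compile each pattern once. The pattern text is built by concatenation,
# so nothing depends on variable interpolation inside \p{...}.
my @PRESENT_IN = map {
    my $pat = '\p{Present_In=' . $_ . '}';
    [ $_, qr/\A$pat\z/ ];
} @VERSIONS;

# Return the first version in which the single character $char is present,
# or undef if it is newer than every version in the list (or unassigned).
sub earliest_version {
    my ($char) = @_;
    for my $entry (@PRESENT_IN) {
        my ($version, $re) = @$entry;
        return $version if $char =~ $re;
    }
    return undef;
}

printf "U+%04X => %s\n", ord($_), earliest_version($_) // 'unknown'
    for "A", "\x{20AC}", "\x{0514}";
```

This is still a linear probe per character, so it shares the speed problem of the original approach; it only avoids hand-writing one test per version.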
  5. On 01/10/2012 07:47 AM, Ilya Zakharevich wrote:
    > I looked through the docs I could find, and can't find any way to
    > determine the "Unicode AGE" of a particular codepoint except for:
    >
    > a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;


    If you want to know the AGE then you should match the Age property ;-)

    $ perl -E 'say "matches" if ("\x{0514}" =~ m/\p{Age=5.1}/)'
    matches

    > b) manually parsing $out = do 'unicore/To/Age.pl';.


    Or write a better Unicode::UCD module.
    I really wanted to do this, because Unicode::UCD does not use the
    original UCD--it also uses unicore.

    > Am I missing anything?


    You can install Unicode::Tussle from CPAN. It provides some scripts.

    Examples:

    $ perl /usr/local/bin/unichars '\p{Age=6.0}' '\p{Cyrillic}' | cat
    Ԧ U+0526 CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER
    ԧ U+0527 CYRILLIC SMALL LETTER SHHA WITH DESCENDER
    Ꙡ U+A660 CYRILLIC CAPITAL LETTER REVERSED TSE
    ꙡ U+A661 CYRILLIC SMALL LETTER REVERSED TSE

    $ time perl /usr/local/bin/uniprops -au U+0526 | grep -P '(Age|Pre)'
    Age=6.0 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
    Block=Cyrillic_Supplement Block=Cyrillic_Supplementary
    Numeric_Value=NaN NV=NaN Present_In=6.0 IN=6.0 SC=Cyrl
    Script=Cyrl Sentence_Break=UP Sentence_Break=Upper SB=UP

    real 0m1.380s
    user 0m1.352s
    sys 0m0.040s

    You see, that's great information, but it's very slow.

    Another disadvantage of uniprops is that it also uses the unicore files
    and thus depends on perl-5.14 (more or less). Perl 5.10 lacks many
    properties in unicore.

    IMHO you want something that I am also missing:

    use Unicode::properties;

    my $u = Unicode::properties->new();

    my $age = $u->get_property($char, 'Age');
    my $script = $u->get_property($char, 'Script');

    Helmut Wollmersdorfer
    Helmut Wollmersdorfer, Jan 18, 2012
    #5
  6. Guest

    On Monday, January 9, 2012 11:47:56 PM UTC-7, Ilya Zakharevich wrote:
    > I looked through the docs I could find, and can't find any way to
    > determine the "Unicode AGE" of a particular codepoint except for:
    >
    > a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;
    >
    > b) manually parsing $out = do 'unicore/To/Age.pl';.
    >
    > Am I missing anything?


    I don’t think so. When preparing the 4th Edition of Programming Perl
    for printing, we needed to run an analysis of code point use by age. I
    ended up doing this:

    # Needs "use feature 'switch';" and, as far as I know, Perl 5.14 or
    # later for given to yield a value inside do { }.
    $char_info->{Age} = do { given ( $char ) {
        when( /\p{Age=1.1}/ ) { '1.1' }

        when( /\p{Age=2.0}/ ) { '2.0' }
        when( /\p{Age=2.1}/ ) { '2.1' }

        when( /\p{Age=3.0}/ ) { '3.0' }
        when( /\p{Age=3.1}/ ) { '3.1' }
        when( /\p{Age=3.2}/ ) { '3.2' }

        when( /\p{Age=4.0}/ ) { '4.0' }
        when( /\p{Age=4.1}/ ) { '4.1' }

        when( /\p{Age=5.0}/ ) { '5.0' }
        when( /\p{Age=5.1}/ ) { '5.1' }
        when( /\p{Age=5.2}/ ) { '5.2' }

        when( /\p{Age=6.0}/ ) { '6.0' }

        default               { 'N/A' }
    } };

    Which of course is suboptimal to say the least. I can criticize
    it in quite a few directions. But it's what we used anyway.
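For comparison, here is a table-driven sketch of the same lookup that avoids given/when (which has stayed experimental and triggers warnings on newer Perls), extended to tally the characters of a string by age, roughly the kind of analysis described above. The names `age_of_char` and `tally_by_age` are hypothetical, and the age list again stops at 6.0.

```perl
use strict;
use warnings;

# One precompiled pattern per Age value, in the same order as the
# given/when chain.
my @AGES   = qw(1.1 2.0 2.1 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0);
my @AGE_RE = map {
    my $pat = '\p{Age=' . $_ . '}';
    [ $_, qr/\A$pat\z/ ];
} @AGES;

# Age of a single character, or 'N/A' if none of the listed ages match.
sub age_of_char {
    my ($char) = @_;
    for my $entry (@AGE_RE) {
        return $entry->[0] if $char =~ $entry->[1];
    }
    return 'N/A';
}

# Count how many characters of $text fall under each age.
sub tally_by_age {
    my ($text) = @_;
    my %count;
    $count{ age_of_char($_) }++ for split //, $text;
    return \%count;
}

my $tally = tally_by_age("A\x{20AC}\x{0514}\x{0526}");
printf "%-4s %d\n", $_, $tally->{$_} for sort keys %$tally;
```

Because \p{Age=X} matches only characters first assigned in X, the order of the list does not affect correctness here, only how quickly common characters are found.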

    I believe that Karl has some new stuff in the current blead that
    exposes some of the character maps so you don't have to parse
    the .pl files yourself. You might check into that.

    --tom
    , Feb 15, 2012
    #6
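The "new stuff" mentioned above may be what later shipped as Unicode::UCD::prop_invmap (in Perl 5.16 and later, as far as I know). A hedged sketch of an Age lookup built on it follows; `age_of` is a made-up name, and the exact spelling of the returned values ("5.1" versus "V5_1") is an assumption that seems to vary between Perl releases, so the code normalises it.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invmap);

# $starts->[$i] is the first code point of range $i and $values->[$i]
# is the Age value for that whole range; the list begins at code point 0.
my ($starts, $values, $format, $default) = prop_invmap('Age');

sub age_of {
    my ($cp) = @_;
    # Binary search for the last range start that is <= $cp.
    my ($lo, $hi) = (0, $#$starts);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($starts->[$mid] <= $cp) { $lo = $mid } else { $hi = $mid - 1 }
    }
    my $age = $values->[$lo];
    # Assumption: values may be spelled "V5_1" rather than "5.1",
    # so convert that form to the dotted one.
    $age = "$1.$2" if $age =~ /\AV(\d+)_(\d+)\z/;
    return $age;
}

printf "U+%04X => %s\n", $_, age_of($_) for 0x0041, 0x0514, 0x0526;
```

Building the map once and searching it per code point avoids both the regex probing and hand-parsing the .pl files, at the price of requiring a Perl new enough to provide prop_invmap.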
