FAQ 6.23 How can I match strings with multibyte characters?

Discussion in 'Perl Misc' started by PerlFAQ Server, Feb 23, 2011.

  1. This is an excerpt from the latest version of perlfaq6.pod, which
    comes with the standard Perl distribution. These postings aim to
    reduce the number of repeated questions as well as allow the community
    to review and update the answers. The latest version of the complete
    perlfaq is at http://faq.perl.org .

    --------------------------------------------------------------------

    6.23: How can I match strings with multibyte characters?

    Perl has had some level of multibyte character support since Perl 5.6,
    but Perl 5.8 or later is recommended. Supported multibyte character
    repertoires include Unicode, and legacy encodings through the Encode
    module. See perluniintro, perlunicode, and Encode.
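
    As a minimal sketch of the Encode route on Perl 5.8 or later, the
    snippet below decodes UTF-8 bytes into characters before matching.
    The sample bytes and variable names are illustrative assumptions,
    not part of the FAQ.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Six raw bytes: the UTF-8 encoding of U+3042 (HIRAGANA LETTER A)
# followed by U+3044 (HIRAGANA LETTER I).
my $bytes = "\xE3\x81\x82\xE3\x81\x84";

# Decode the bytes into a string of characters before matching.
my $text = decode('UTF-8', $bytes);

my $byte_count = length $bytes;   # length of the undecoded byte string
my $char_count = length $text;    # length of the decoded character string

# Match one whole character by its code point.
my $found = $text =~ /\x{3044}/ ? 1 : 0;

print "bytes: $byte_count, characters: $char_count, found: $found\n";
```

    Once the data has been decoded, the regex engine works on characters,
    so a pattern cannot match across the boundary between two of them.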

    If you are stuck with older Perls, you can do Unicode with the
    "Unicode::String" module, and character conversions using the
    "Unicode::Map8" and "Unicode::Map" modules. If you are using Japanese
    encodings, you might try jperl 5.005_03.

    Finally, the following set of approaches was offered by Jeffrey Friedl,
    whose article in issue #5 of The Perl Journal talks about this very
    matter.

    Let's suppose you have some weird Martian encoding where pairs of ASCII
    uppercase letters encode single Martian letters (i.e. the two bytes "CV"
    make a single Martian letter, as do the two bytes "SG", "VS", "XX",
    etc.). Other bytes represent single characters, just like ASCII.

    So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
    nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
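
    To make the byte and character counts concrete, here is a small
    sketch that splits the sample string into Martian characters. The
    pattern tries an uppercase pair first and otherwise takes a single
    byte; the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $martian = "I am CVSGXX!";

# Tokenize: try an uppercase pair first, otherwise take any single byte.
# \G keeps each match glued to the end of the previous one.
my @chars = $martian =~ m/\G([A-Z][A-Z]|.)/gs;

my $byte_count = length $martian;
my $char_count = scalar @chars;

print "$byte_count bytes, $char_count Martian characters\n";
print join('|', @chars), "\n";
```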

    Now, say you want to search for the single character "GX" with the
    pattern /GX/. Perl doesn't know about Martian, so it'll find the two
    bytes "GX" in the "I am CVSGXX!" string, even though that character
    isn't there: it just looks like it is because "SG" is next to "XX",
    but there's no real "GX". This is a big problem.
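
    The false positive is easy to reproduce. The sketch below (variable
    names are illustrative) runs the naive match and reports where the
    two bytes were found.

```perl
use strict;
use warnings;

my $martian = "I am CVSGXX!";

# A plain match works byte by byte and knows nothing about letter pairs.
my $naive_hit = $martian =~ /GX/ ? 1 : 0;

# Byte offset of the bogus "GX": the second byte of "SG" plus the
# first byte of "XX".
my $offset = index($martian, 'GX');

print "naive match: $naive_hit, at byte offset $offset\n";
```

    The uppercase run starts at offset 5, so a real Martian character can
    only begin at offsets 5, 7, or 9. The hit at offset 8 straddles two
    characters.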

    Here are a few ways, all painful, to deal with it:

    # Make sure adjacent "martian" bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;

    print "found GX!\n" if $martian =~ /GX/;

    Or like this:

    @chars = $martian =~ m/([A-Z][A-Z]|.)/gs;
    # above is conceptually similar to: @chars = $text =~ m/(.)/gs;
    # (the "." fallback keeps a lone capital such as the "I" in "I am")
    foreach $char (@chars) {
        if ($char eq 'GX') {
            print "found GX!\n";
            last;   # "print ..., last" would leave the loop before printing
        }
    }

    Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
        if ($1 eq 'GX') {
            print "found GX!\n";
            last;   # "print ..., last" would leave the loop before printing
        }
    }

    Here's another, slightly less painful, way to do it from Benjamin
    Goldberg, who uses a zero-width negative look-behind assertion.

    print "found GX!\n" if $martian =~ m/
        (?<![A-Z])
        (?:[A-Z][A-Z])*?
        GX
    /x;

    This succeeds if the "martian" character GX is in the string, and fails
    otherwise. If you don't like using (?<!), a zero-width negative
    look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

    This approach does have the drawback of putting the wrong thing in
    $-[0] and $+[0], because the overall match can begin at letter pairs
    skipped before GX, but this usually can be worked around.
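
    One possible workaround, offered as a sketch and not necessarily the
    one the answer has in mind, is to wrap GX in a capture group and read
    its offsets from $-[1] and $+[1]. The sample string and variable
    names below are illustrative assumptions.

```perl
use strict;
use warnings;

# Characters: 'I', ' ', 'a', 'm', ' ', 'CV', 'GX', 'XX', '!'
my $martian = "I am CVGXXX!";

my ($match_start, $gx_start, $gx_end);
if ($martian =~ m/
        (?<![A-Z])          # start at the beginning of an uppercase run
        (?:[A-Z][A-Z])*?    # skip whole Martian letters
        (GX)                # capture only the letter we are looking for
    /x) {
    $match_start = $-[0];   # where the whole match began (the skipped "CV")
    $gx_start    = $-[1];   # where the captured GX begins
    $gx_end      = $+[1];   # one past the end of the captured GX
    print "match starts at $match_start, GX spans $gx_start..$gx_end\n";
}
```

    Here $-[0] points at the skipped "CV", while $-[1] points at GX
    itself.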



    --------------------------------------------------------------------

    The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
    are not necessarily experts in every domain where Perl might show up,
    so please include as much information as possible and relevant in any
    corrections. The perlfaq-workers also don't have access to every
    operating system or platform, so please include relevant details for
    corrections to examples that do not work on particular platforms.
    Working code is greatly appreciated.

    If you'd like to help maintain the perlfaq, see the details in
    perlfaq.pod.
     