polymorphic regex -- encoding issue

Discussion in 'Perl Misc' started by Dale, Oct 18, 2007.

  1. Dale

    Dale Guest

    Consider the following:

    my $html_string = get "http://stock.narod.ru/fibo.htm";
    my $russian_page = decode("cp1251", $html_string);
    while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
    print "$1 $2\n";
    }

    I get a CP1251-encoded page from a Russian site and search for words
    that might follow the word Фибоначчи (Fibonacci). But isn't this bit
    of code inefficient? I start right off by decoding the whole page,
    when I really only need to decode those portions of the page
    that match. So wouldn't it be better to encode the regex in CP1251 to
    do the matching, and then convert any matched strings to the encoding
    I want before printing them out? Something like the following:

    $russian_page = get "http://stock.narod.ru/fibo.htm";
    my $search_word = encode("cp1251", "Фибоначчи");
    while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
    print decode("cp1251", "$1 $2\n");
    }

    This doesn't obviously fail, but it doesn't give the expected result
    either. Presumably, the problem is that I've only encoded part of my
    regex in CP1251. So the question is: Is there a way to change the
    encoding of a regular expression?
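
    A minimal sketch of one likely reason the second version comes up
    empty. It assumes no 'use locale' and no 'unicode_strings' feature:
    under those conditions '\w' applied to an undecoded byte string only
    matches ASCII word characters, so the Cyrillic bytes that follow the
    search word are never matched. The byte 0xD4 is the CP1251 code for
    Ф; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0xD4 is the CP1251 byte for U+0424 (CYRILLIC CAPITAL LETTER EF).
my $byte = "\xD4";
my $char = decode('cp1251', $byte);

# The raw byte is tested with byte semantics, the decoded character
# with Unicode semantics.
my $byte_is_word = ($byte =~ /\w/) ? 1 : 0;
my $char_is_word = ($char =~ /\w/) ? 1 : 0;

printf "raw byte matches \\w:          %d\n", $byte_is_word;
printf "decoded character matches \\w: %d\n", $char_is_word;
```

    The same asymmetry applies to every CP1251 byte above 0x7F, which is
    why encoding only the search word is not enough.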

    A couple of details:

    Perl version:
    5.8.8

    Pragmas and modules used:
    use LWP::Simple;
    use utf8;
    use Encode;
    binmode(STDOUT, ":utf8");
    Dale, Oct 18, 2007
    #1

  2. Ben Morrow

    Ben Morrow Guest

    Quoth Dale:
    > Consider the following:
    >
    > my $html_string = get "http://stock.narod.ru/fibo.htm";
    > my $russian_page = decode("cp1251", $html_string);
    > while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
    > print "$1 $2\n";
    > }
    >
    > I get a CP1251-encoded page from a Russian site and search for words
    > that might follow the word Фибоначчи (Fibonacci). But isn't this bit
    > of code inefficient? I start right off by decoding the whole page,
    > when I really only need to decode those portions of the page
    > that match. So wouldn't it be better to encode the regex in CP1251 to
    > do the matching, and then convert any matched strings to the encoding
    > I want before printing them out? Something like the following:
    >
    > $russian_page = get "http://stock.narod.ru/fibo.htm";
    > my $search_word = encode("cp1251", "Фибоначчи");
    > while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
    > print decode("cp1251", "$1 $2\n");
    > }
    >
    > This doesn't obviously fail, but it doesn't give the expected result
    > either. Presumably, the problem is that I've only encoded part of my
    > regex in CP1251. So the question is: Is there a way to change the
    > encoding of a regular expression?


    Nope, there isn't. All you can do is encode all the separate parts into
    bytes, and then ask for a regex that matches by bytes.

    At the very least you want a 'use bytes' around that regex and match.
    You also need to be aware that perl will be doing a byte-by-byte match,
    so if it's possible for part of a character to match, you will get
    false positives. Whether that can happen depends on the encoding: it is
    possible with UTF-16, but not with UTF-8, for instance. I'm afraid I
    don't know about cp1251. You also need to be sure that LWP is returning
    you the page as bytes, and not trying to be clever and decoding it to
    UTF-8 already. I presume you already know that.

    Unless you have an awful lot of these matches to do (and you know this
    is what's slowing you down), it's not worth the bother.
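
    The false-positive risk mentioned above can be sketched as follows.
    The example assumes UTF-16BE and a made-up two-letter text; neither
    comes from the page under discussion. It shows a byte-wise search
    finding a character that is not in the text, because the match
    straddles two neighbouring characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The text "AB" in UTF-16BE is the four bytes 00 41 00 42.
my $text_bytes = encode('UTF-16BE', 'AB');

# The single character U+4100 in UTF-16BE is the two bytes 41 00.
my $needle_bytes = encode('UTF-16BE', "\x{4100}");

# A byte-wise search finds the needle at an odd offset, across the
# boundary between "A" and "B", although U+4100 is not in the text.
my $offset = index($text_bytes, $needle_bytes);
print "byte-level match at offset $offset\n";
```

    In a single-byte encoding such as CP1251 every byte is a whole
    character, so this kind of misaligned match cannot occur there.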

    Ben
    Ben Morrow, Oct 18, 2007
    #2

  3. Dale

    Dale Guest

    Thanks Ben. The problem is, of course, consistency. I want to make
    sure that I also translate '\w' and '\s' so that they match the same
    things that they would have matched in the original regex. The perldoc
    says one can influence what '\w' matches by using locales. But I
    managed to find a consistent translation without using locales (now
    I'm answering my own question):


    # As before, I search for the word Fibonacci, in CP1251-encoded
    # Cyrillic
    my $search_word = encode("cp1251", "Фибоначчи");

    # CP1251 is an extended ASCII charset in the range 00-FF. Here we
    # get this set of characters and decode them into Unicode.
    my @cp1251_charset =
    split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

    # Find out which of these characters are matched by '\w' (in Unicode).
    my @cp1251_wordchars =
    grep(/\w/, @cp1251_charset);

    # The matched word characters are put back into CP1251
    my $w = encode("CP1251", join("", @cp1251_wordchars));

    # We follow the same idea as above for the space characters.
    my @cp1251_spacechars =
    grep(/\s/, @cp1251_charset);
    my $s = encode("CP1251", join("", @cp1251_spacechars));

    # Now we just put the pieces together
    my $russian_page = get "http://stock.narod.ru/fibo.htm";
    while ($russian_page =~ m/($search_word)[$s]([$w]+)/g) {
    print decode("cp1251", "$1 $2\n");
    }
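
    The steps above can be tried without a network fetch. The sketch
    below is an assumption-laden stand-in: $page_bytes is a made-up
    CP1251 string built locally instead of the result of 'get', and
    @found is a hypothetical helper array. It also assumes the script
    itself is saved as UTF-8, as 'use utf8' requires.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

# Made-up stand-in for the fetched page, as raw CP1251 bytes.
my $page_bytes = encode('cp1251',
    'Числа Фибоначчи образуют ряд. Фибоначчи жил в Пизе.');

# Decode all 256 byte values once, then reuse the list for both classes.
my @charset = split //, decode('cp1251', join '', map { chr($_) } 0x00 .. 0xFF);
my $w = encode('cp1251', join '', grep { /\w/ } @charset);
my $s = encode('cp1251', join '', grep { /\s/ } @charset);

my $search_word = encode('cp1251', 'Фибоначчи');

# Match on bytes; decode only the captured pieces.
my @found;
while ($page_bytes =~ m/($search_word)[$s]([$w]+)/g) {
    push @found, decode('cp1251', "$1 $2");
}
print "$_\n" for @found;
```

    Only the two short captures are decoded here; the page itself stays
    as bytes throughout, which was the point of the exercise.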


    Details (same as in previous version):

    Perl version:
    5.8.8

    Modules used:
    use Encode;
    use LWP::Simple qw(get);
    use utf8;
    binmode(STDOUT, ":utf8");

    Note: Why didn't I use setlocale, as the Perldoc suggests? First
    reason: Our computers are somehow set up with a very limited range of
    possible locales. Second reason: locales are confusing for me. I
    prefer to avoid them. I set my environment to en_US.utf8 and I don't
    want to think about locales any more after that.
    Dale, Oct 19, 2007
    #3
  4. [A complimentary Cc of this posting was sent to
    Dale], who wrote:
    > # CP1251 is an extended ASCII charset in the range 00-FF. Here we
    > # get this set of characters and decode them into Unicode.
    > my @cp1251_charset =
    > split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));
    >
    > # Find out which of these characters are matched by '\w' (in Unicode).
    > my @cp1251_wordchars =
    > grep(/\w/, @cp1251_charset);
    >
    > # The matched word characters are put back into CP1251
    > my $w = encode("CP1251", join("", @cp1251_wordchars));


    Too baroque, IMO. I would use something like

    my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr, 0x00..0xFF;

    Your approach has a chance of being quicker, though; but since this
    should only run once... [I did not benchmark them.]

    Ilya
    Ilya Zakharevich, Oct 19, 2007
    #4
  5. Dr.Ruud

    Dr.Ruud Guest

    Ilya Zakharevich schreef:

    > my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr,
    > 0x00..0xFF;


    Alternative:

    my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Oct 20, 2007
    #5
  6. Thanks Ilya and Affijn for your "improvements" but I still like my own
    code better, because at least I break it down into commented steps. I
    know my comments are minimal, but at least I tried. The reader of my
    code is bound to find several things confusing:

    > my @cp1251_charset =
    > split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));


    Here are some questions that are bound to arise:

    Why "decode CP1251"? How can you see that the input was ever encoded
    as CP1251 to begin with? We must be assuming that 'chr' returns
    something that can at least be thought of as CP1251-encoded. But
    consider the small test program:

    print chr(0xFF);

    This may print out ÿ (LATIN SMALL LETTER Y WITH DIAERESIS), a
    character that doesn't even exist in CP1251. Of course, it only prints
    out this character if you're using 'binmode(STDOUT, ":utf8");' or
    'use encoding "utf8";', but you can see that there is plenty of room
    for confusion.
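
    A small sketch of the two readings of chr(0xFF) described above. The
    variable names are illustrative. The value itself is just a character
    with code point 255; which letter it turns into depends on whether it
    is treated as a CP1251 byte or written out as U+00FF.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# chr(0xFF) is one character with code point 255; it carries no encoding.
my $raw = chr(0xFF);

# Read as a CP1251 byte, 0xFF decodes to U+044F (CYRILLIC SMALL LETTER YA).
my $as_cp1251 = decode('cp1251', $raw);

# Read as code point U+00FF and written as UTF-8, it becomes the two
# bytes C3 BF, which a UTF-8 terminal displays as y with diaeresis.
my $as_utf8_bytes = encode('UTF-8', $raw);

printf "decoded as CP1251:     U+%04X\n", ord($as_cp1251);
printf "UTF-8 bytes of U+00FF: %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split //, $as_utf8_bytes;
```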

    Then there is the issue of what is stored in "@cp1251_charset". Since
    it's the output of 'decode', then it must be decoded, right? Whatever
    "decoded" means. You see my point. A comment would be helpful, and
    this won't be possible if you pack everything into one line.

    But what the "improvers" of my code also missed is that I had a second
    reason for the intermediate step. I wanted the complete CP1251 charset
    stored in a variable so that I could make several passes through it.
    As you see in the small example I made two passes. Once for '\w' and
    once for '\s'.

    I'm sure there are legitimate improvements that could be made to my
    code, but it baffles me that people should see packing into a oneliner
    as something virtuous.

    Dale Gerdemann
    Dale Gerdemann, Oct 21, 2007
    #6
  7. Dr.Ruud

    Dr.Ruud Guest

    Dale Gerdemann schreef:

    > Thanks Ilya and Affijn for your "improvements" but I still like my own
    > code better, because at least I break it down into commented steps.


    Ahem, you are replying to the wrong message. I reply to the part that I
    quote. So the relation to your code was broken by me on purpose.


    > But what the "improvers" of my code also missed is that I had a second
    > reason for the intermediate step. I wanted the complete CP1251 charset
    > stored in a variable so that I could make several passes through it.
    > As you see in the small example I made two passes. Once for '\w' and
    > once for '\s'.


    What you are missing is that the $w in

    my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

    contains exactly what is in your $w.
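
    That claim can be checked with a short sketch that builds the word
    character set three ways and compares the results. The variable
    names $w_stepwise, $w_join and $w_pack are made up for this
    comparison, and the one-liners are written with explicit blocks
    rather than exactly as posted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stepwise version: decode all 256 bytes, filter, encode back to CP1251.
my @charset = split //, decode('cp1251', join '', map { chr($_) } 0x00 .. 0xFF);
my $w_stepwise = encode('cp1251', join '', grep { /\w/ } @charset);

# Per-byte versions: test each byte's decoded form, keep the raw byte.
my $w_join = join '',
    grep { decode('cp1251', $_) =~ /\w/ } map { chr($_) } 0x00 .. 0xFF;
my $w_pack = pack 'C*',
    grep { decode('cp1251', chr($_)) =~ /\w/ } 0 .. 255;

print "stepwise eq join: ", ($w_stepwise eq $w_join ? "yes" : "no"), "\n";
print "stepwise eq pack: ", ($w_stepwise eq $w_pack ? "yes" : "no"), "\n";
```

    The per-byte versions never encode anything: they keep the original
    byte whenever its decoded form is a word character.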

    So for $s you can just do:

    my $s = pack "C*", grep decode('cp1251', chr) =~ /\s/, 0..255;


    Perhaps you like it more like this:

    my $cp1251_word_chars =
    pack("C*", grep decode('cp1251', chr) =~ /\w/, 0..255);
    my $cp1251_whitespace_chars =
    pack("C*", grep decode('cp1251', chr) =~ /\s/, 0..255);

    so that your

    m/($search_word)[$s]([$w]+)/g

    becomes

    m/($search_word)[$cp1251_whitespace_chars]([$cp1251_word_chars]+)/g


    And maybe you should allow more than 1 whitespace character there:

    m/($search_word)[$cp1251_whitespace_chars]+([$cp1251_word_chars]+)/g


    And if your $search_word can ever contain regex metacharacters, look
    into quotemeta.
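
    A minimal sketch of why that matters, using a hypothetical search
    term with a dot in it (not taken from the thread). '\Q...\E' is the
    inline form of quotemeta.

```perl
use strict;
use warnings;

# Hypothetical search term containing a regex metacharacter (the dot).
my $search_word = 'fib.1';
my $text        = 'fibo1 is not the same as fib.1';

# Unquoted: the dot matches any character, so "fibo1" is found first.
my ($loose) = $text =~ /($search_word)/;

# Quoted with \Q...\E: the dot is taken literally.
my ($strict) = $text =~ /(\Q$search_word\E)/;

print "unquoted match: $loose\n";
print "quoted match:   $strict\n";
```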

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Oct 21, 2007
    #7
  8. [A complimentary Cc of this posting was sent to
    Dale Gerdemann], who wrote:
    > But what the "improvers" of my code also missed is that I had a second
    > reason for the intermediate step. I wanted the complete CP1251 charset
    > stored in a variable so that I could make several passes through it.
    > As you see in the small example I made two passes. Once for '\w' and
    > once for '\s'.


    What makes you think that the "improvers of your code" missed this? At
    least, I explicitly said that your solution might be quicker.

    > I'm sure there are legitimate improvements that could be made to my
    > code, but it baffles me that people should see packing into a oneliner
    > as something virtuous.


    It was not "your code packed into a oneliner". It was absolutely
    different code; and if you do not like oneliners, just unpack it using
    dummy variables.

    Your code used a decode/encode cycle, while your intent
    was, obviously, to do only a decode. I corrected your code to match
    your intent.

    Hope this helps,
    Ilya
    Ilya Zakharevich, Oct 24, 2007
    #8