DBD::Oracle, Unicode, non-UTF8-non-ASCII strings

Discussion in 'Perl Misc' started by Jochen Lehmeier, Jul 23, 2009.

  1. Hello,

    I am using DBD::Oracle with a client and database characterset of AL32UTF8.

    My problem is that DBD::Oracle assumes utf8-encoded strings in prepare().
    If the input strings are not utf8, and contain non-ASCII characters, then
    they are *not* automatically converted to utf8 by DBD::Oracle. They are
    offered to Oracle as if they were utf8, and Oracle processes them even if
    they are not valid, maybe even returning them to Perl as such; Perl is
    then left with invalid utf8-flagged strings simply actually latin1
    bytes/characters.

    To be very clear: this is *not* an issue of wrong character encodings
    during any time before the calls to DBI/DBD::Oracle (though it is,
    afterwards).

    I seem to remember, although I cannot find it now, that this behaviour
    also was well documented in DBD::Oracle or at some other place, so I am
    not so much complaining about DBD::Oracle (though I'm a bit miffed - this
    would seem to be trivially fixable in there...).

    Here is an example:

    my $sql="select 'a \xe4 a' from dual"; # \xe4 is the german "a
    umlaut"
    diag(DBI::data_string_desc($sql)); # UTF8 off, non-ASCII, 24
    characters 24 bytes
    my $sth=$dbh->prepare($sql);
    $sth->execute();
    my ($rc)=$sth->fetchrow();
    diag("is_utf8: ".utf8::is_utf8($rc)." valid:
    '".utf8::valid($rc)."'"); # is_utf8: 1 valid: ''
    diag(DBI::data_string_desc($rc)); # UTF8 on but INVALID encoding,
    ASCII, 3 characters 5 bytes

    The output:

    # UTF8 off, non-ASCII, 24 characters 24 bytes
    # is_utf8: 1 valid: ''
    Malformed UTF-8 character (unexpected non-continuation byte 0x20,
    immediately after start byte 0xe4) in pattern match (m//) at
    /usr/lib/perl5/site_perl/5.8.3/x86_64-linux-thread-multi/DBI.pm line 1117.
    # UTF8 on but INVALID encoding, ASCII, 3 characters 5 bytes

    This also highlights one nastiness of the situation: no error or warning
    whatsoever about the situation is displayed by the usual
    prepare/execute/fetch calls. The "Malformed ..."-line above is from inside
    DBI::data_string_desc, where the broken string is used in a regexp
    (regexp's seem to be one place where this is noticed). So it's purely by
    chance that there is a notice about this problem in this test script.

    What to do? I have dozens, maybe 100 scripts that would be perfectly
    utf8-capable except for this issue. They use proper :encoding() on their
    file handles etc. . Is there some flag I can activate in DBD::Oracle or
    DBI which automatically recodes "UTF8 off, non-ASCII" strings to utf8? A
    simple Encode::decode("utf8",$sql) fixes the issue of course, but I am not
    really looking forward to changing all those scripts, maybe missing lots
    of stuff along the way or introducing errors.

    We have a module that handles low level DB stuff like opening the DB
    handle. So if there was a fix which could take place *once* after opening
    the DB, that would be great.

    Oh,

    $DBD::Oracle::VERSION = '1.23';
    $DBI::VERSION = "1.52";
    NLS_LANG=GERMAN_GERMANY.AL32UTF8
    Oracle 10

    Thanks in advance!
    Jochen Lehmeier, Jul 23, 2009
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. gry
    Replies:
    2
    Views:
    719
    Alf P. Steinbach
    Mar 13, 2012
  2. Feyruz
    Replies:
    4
    Views:
    2,158
    Sherm Pendley
    Oct 14, 2005
  3. Replies:
    0
    Views:
    112
  4. Jim Cochrane
    Replies:
    0
    Views:
    117
    Jim Cochrane
    Aug 25, 2007
  5. bruce
    Replies:
    38
    Views:
    271
    Mark Lawrence
    Nov 1, 2013
Loading...

Share This Page