DBD::Oracle, Unicode, non-UTF8-non-ASCII strings


J

Jochen Lehmeier

Hello,

I am using DBD::Oracle with a client and database characterset of AL32UTF8.

My problem is that DBD::Oracle assumes utf8-encoded strings in prepare().
If the input strings are not utf8, and contain non-ASCII characters, then
they are *not* automatically converted to utf8 by DBD::Oracle. They are
offered to Oracle as if they were utf8, and Oracle processes them even if
they are not valid, maybe even returning them to Perl as such; Perl is
then left with invalid utf8-flagged strings simply actually latin1
bytes/characters.

To be very clear: this is *not* an issue of wrong character encodings
during any time before the calls to DBI/DBD::Oracle (though it is,
afterwards).

I seem to remember, although I cannot find it now, that this behaviour
also was well documented in DBD::Oracle or at some other place, so I am
not so much complaining about DBD::Oracle (though I'm a bit miffed - this
would seem to be trivially fixable in there...).

Here is an example:

my $sql="select 'a \xe4 a' from dual"; # \xe4 is the german "a
umlaut"
diag(DBI::data_string_desc($sql)); # UTF8 off, non-ASCII, 24
characters 24 bytes
my $sth=$dbh->prepare($sql);
$sth->execute();
my ($rc)=$sth->fetchrow();
diag("is_utf8: ".utf8::is_utf8($rc)." valid:
'".utf8::valid($rc)."'"); # is_utf8: 1 valid: ''
diag(DBI::data_string_desc($rc)); # UTF8 on but INVALID encoding,
ASCII, 3 characters 5 bytes

The output:

# UTF8 off, non-ASCII, 24 characters 24 bytes
# is_utf8: 1 valid: ''
Malformed UTF-8 character (unexpected non-continuation byte 0x20,
immediately after start byte 0xe4) in pattern match (m//) at
/usr/lib/perl5/site_perl/5.8.3/x86_64-linux-thread-multi/DBI.pm line 1117.
# UTF8 on but INVALID encoding, ASCII, 3 characters 5 bytes

This also highlights one nastiness of the situation: no error or warning
whatsoever about the situation is displayed by the usual
prepare/execute/fetch calls. The "Malformed ..."-line above is from inside
DBI::data_string_desc, where the broken string is used in a regexp
(regexp's seem to be one place where this is noticed). So it's purely by
chance that there is a notice about this problem in this test script.

What to do? I have dozens, maybe 100 scripts that would be perfectly
utf8-capable except for this issue. They use proper :encoding() on their
file handles etc. . Is there some flag I can activate in DBD::Oracle or
DBI which automatically recodes "UTF8 off, non-ASCII" strings to utf8? A
simple Encode::decode("utf8",$sql) fixes the issue of course, but I am not
really looking forward to changing all those scripts, maybe missing lots
of stuff along the way or introducing errors.

We have a module that handles low level DB stuff like opening the DB
handle. So if there was a fix which could take place *once* after opening
the DB, that would be great.

Oh,

$DBD::Oracle::VERSION = '1.23';
$DBI::VERSION = "1.52";
NLS_LANG=GERMAN_GERMANY.AL32UTF8
Oracle 10

Thanks in advance!
 
Ad

Advertisements


Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top