DBD::mysql used to take octets into the utf8 texts but no more inmariadb

P

Peter Vereshagin

Hello,

I had used DBD::mysql for ages with utf8 in mysql 4.1-5.1.
Now I go by with mariadb-5.2.
I notice the strange disappearing of the feature I implicitly used for years, and authors of very well established applications, e. g., bugzilla.org, webgui.org, etc. did, too.
In the past, the text columns ( including char columns ) allowed to insert the data from octet scalars, those without utf8 flag set.
Now it must be a blob column. Or, the variable should be utf8::decode'd, perhaps additionally checked before if it is utf8::valid.
Or any national character will be the question marks instead stored in a database.
There is a test, 55-utf8.t in a DBD::mysql distribution, that inserts and selects back, seperately the octet scalar in blob field, and utf8 scalar in text and varchar field.
This is why the lack of the feature is not noticed on make test.
But the applications I mentioned afore certainly use it. They insert octet perl variables into text and char type fields careless about utf8 flag should be set.

There are three kind authorities that have influence on this: DBD::mysql, mariadb, and authors of those applications.
Question is: whom I shall ask for a response for this?
- mariadb revealed this bug, but it may be not of their's in the C API. It can be because of DBD::mysql's XS.
- DBD::mysql was hiding this bug, but they stated clear in their test that octets should use blob-fields only. Despite I remember the times when mysql did no difference between text and blob fields.
- applications writers may need to rewrite their code taking care on utf8 flag before their scalar to fit the DBI placeholders. But it just may not sound real as there is too much of their too long spaghetti code.

As a workaround, I use the mysql_enable_utf8 attribute included into the dsn whet connect but authors of those apps don't do that in their code supposing that setting this as a dbh property later by the code runs is enough. But it doesn't dwim.

Thank you for any ideas about whom shall I ask about this.

73! Peter pgp: A0E26627 (4A42 6841 2871 5EA7 52AB 12F8 0CE1 4AAC A0E2 6627)
 
D

Dr.Ruud

I had used DBD::mysql for ages with utf8 in mysql 4.1-5.1.
Now I go by with mariadb-5.2.
I notice the strange disappearing of the feature I implicitly used for years, and authors of very well established applications, e. g., bugzilla.org, webgui.org, etc. did, too.
In the past, the text columns ( including char columns ) allowed to insert the data from octet scalars, those without utf8 flag set.
Now it must be a blob column. Or, the variable should be utf8::decode'd, perhaps additionally checked before if it is utf8::valid.
Or any national character will be the question marks instead stored in a database.
There is a test, 55-utf8.t in a DBD::mysql distribution, that inserts and selects back, seperately the octet scalar in blob field, and utf8 scalar in text and varchar field.
This is why the lack of the feature is not noticed on make test.
But the applications I mentioned afore certainly use it. They insert octet perl variables into text and char type fields careless about utf8 flag should be set.

There are three kind authorities that have influence on this: DBD::mysql, mariadb, and authors of those applications.
Question is: whom I shall ask for a response for this?
- mariadb revealed this bug, but it may be not of their's in the C API. It can be because of DBD::mysql's XS.
- DBD::mysql was hiding this bug, but they stated clear in their test that octets should use blob-fields only. Despite I remember the times when mysql did no difference between text and blob fields.
- applications writers may need to rewrite their code taking care on utf8 flag before their scalar to fit the DBI placeholders. But it just may not sound real as there is too much of their too long spaghetti code.

As a workaround, I use the mysql_enable_utf8 attribute included into the dsn whet connect but authors of those apps don't do that in their code supposing that setting this as a dbh property later by the code runs is enough. But it doesn't dwim.

It could also just be that you had a C<SET NAME 'utf8'> done initially
for every session.
This is configurable on the host.
 
P

Peter Vereshagin

Oh Dr.Ruud want you buy me a mersedes benz?
2011/03/13 17:51:56 +0100 "Dr.Ruud" <[email protected]> => comp.lang.perl.misc:
DR> > - mariadb revealed this bug, but it may be not of their's in the C API. It can be because of DBD::mysql's XS.
DR> > - DBD::mysql was hiding this bug, but they stated clear in their test that octets should use blob-fields only. Despite I remember the times when mysql did no difference between text and blob fields.
DR> > - applications writers may need to rewrite their code taking care on utf8 flag before their scalar to fit the DBI placeholders. But it just may not sound real as there is too much of their too long spaghetti code.
DR> >
DR> > As a workaround, I use the mysql_enable_utf8 attribute included into the dsn whet connect but authors of those apps don't do that in their code supposing that setting this as a dbh property later by the code runs is enough. But it doesn't dwim.
DR>
DR> It could also just be that you had a C<SET NAME 'utf8'> done initially
DR> for every session.
DR> This is configurable on the host.

It's not a silver bullet to me.
There are circumstances:
- There are several Cyrillic encodings while utf8 is a one preferable at the application level the cp1251 improves mysql speed as it's an 8-bit one. So it's still a typical use: you send and receive data to mysql as utf8 but mysql converts them to/from cp1251 for storage.
- While mariadb was converting my databases, the default character set in the database definition was reset by conversion process ( myisam to aria or so ). Any tables created anew had that strange behavior as I said on topic start.
- Typically application's authors never care about utf flag on a variable before to put it on a DBI placeholder.
- I did some tricks that made me a thing after some deeper learning about. But I'm still in confusion ( see below ).

Here is the test to try-it-yourself:
Environment is:
show create database test\G

Create Database: CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */

Code is:

#!/usr/bin/perl
use strict;
use warnings;


use DBI;
use Devel::peek;

my( $dsn, $db, ) = ( "dbi:mysql:host=127.0.0.1;database=test", "test", );

my $blob = "\x{d1}\x{84}dam";
my $pwd = 'OLBPVCtEirwJMwE0';

sub getPushPop{
my $dbh = shift;
foreach(
[ "drop table if exists test02" ],
[ "create table test02 ( test02 mediumtext character set cp1251 ) " ],
[ "insert into test02 values ( ? ) ", {}, $blob ],
){
$dbh->do( @$_ ) or die;
}
}

foreach my $db_sub (
sub{ DBI->connect( $dsn, $db, $pwd ) or die $!; },
sub{ my $dbh = DBI->connect( $dsn, $db, $pwd ) or die $!;
$dbh->{mysql_enable_utf8}=1; $dbh;
},
sub{ DBI->connect( $dsn.';mysql_enable_utf8=1', $db, $pwd ) or die $! },
sub{ my $dbh = DBI->connect( $dsn, $db, $pwd ) or die $!;
foreach(
'set character set utf8',
){
$dbh->do( $_ ) or die $!;
}
$dbh;
},
){
my $dbh = &$db_sub;
getPushPop $dbh;
print Devel::peek::Dump $dbh->selectall_arrayref( "select * from test02" )->[0]->[0];
$dbh -> disconnect;
}

The output is:

SV = PV(0x2870194c) at 0x287fb37c
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fc2c "?\204dam"\0
CUR = 5
LEN = 8
SV = PV(0x2870194c) at 0x287fb368
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fcec "?\204dam"\0
CUR = 5
LEN = 8
SV = PV(0x2881e554) at 0x287fb368
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fccc "\321\204dam"\0
CUR = 5
LEN = 8
SV = PV(0x2881e554) at 0x287fb32c
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fcac "?dam"\0
CUR = 4
LEN = 8


Output coming 3rd is my solution and the 4th is yours ( and not only yours ).

but it's enough to put a magic spell on it:
alter database test character set utf8;

and voila:

SV = PV(0x2870194c) at 0x287fb37c
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fc2c "?\204dam"\0
CUR = 5
LEN = 8
SV = PV(0x2870194c) at 0x287fb368
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fcec "?\204dam"\0
CUR = 5
LEN = 8
SV = PV(0x2881e554) at 0x287fb368
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fccc "\321\204dam"\0
CUR = 5
LEN = 8
SV = PV(0x2881e554) at 0x287fb32c
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2865fcac "\321\204dam"\0
CUR = 5
LEN = 8

we can see the cyrillic letter with 'set names utf8'.

What I see is: cyrillic data can be pushed and popped from the mysql storage despite default set of database is latin1.
But this depends very implicitly on a feature that they shouldn't be dependent of.
Actually, not on a feature itself but on a way this feature is used.

Yes, I found a solution for myself, but the questions aforequoted remain: whom of software authors should I ask about that strange behavior?

73! Peter pgp: A0E26627 (4A42 6841 2871 5EA7 52AB 12F8 0CE1 4AAC A0E2 6627)
 
D

Dr.Ruud

Output coming 3rd is my solution and the 4th is yours ( and not only yours ).

I didn't give you any solution, just a pointer to 'SET NAMES'.

whom of software authors should I ask about that strange behavior?

There is nothing strange about that behavior, it all works as documented.

You need to check (and compare) the character set settings on all levels
(connection, database, table, column), to find out where the differences
come from.

Good reads are:
http://dev.mysql.com/doc/refman/5.1/en/charset.html
http://code.openark.org/blog/mysql/mysqls-character-sets-and-collations-demystified

Another confused person:
http://www.bluebox.net/news/2009/07/mysql_encoding
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,902
Latest member
Elena68X5

Latest Threads

Top