Unicode from Web to MySQL

B

Bill Eldridge

I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (that aren't
a problem for me with GB2312, BIG 5, etc.)

What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8));

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()

data2 = ???
....
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

where data2 is somehow the UTF-8 converted version of the original Web page.

Additionally, I'd like to be able to do:

body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

data = urllib.urlopen('http://localhost/test.html').read()

main_body = body_expr.search(data).group(1)

and insert that into the database, and most likely I need to

I'm sitting with a dozen explanations from the Web explaining
how to do this,
0) decode('utf-8','ignore') or 'strict', or 'replace'...
1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
2) Convert to unicode before UTF-8
3) replace quotation marks within the SQL statement:
data2.replace(u'"',u'\\"')

etc., etc., but after numerous tries in the end I still keep getting
either SQL errors or
the dreaded 'ascii' codec can't decode byte ... in position ...' errors.

Can someone give me any explanation of how to do this easily?

Thanks,
Bill
 
F

Francis Avila

Bill Eldridge wrote in message ...
etc., etc., but after numerous tries in the end I still keep getting
either SQL errors or
the dreaded 'ascii' codec can't decode byte ... in position ...' errors.

Here's your clue: your string contains a byte which is not representable by
ascii.
Can someone give me any explanation of how to do this easily?

Quickstart guide:
You first need to decode your string to unicode. You do this by
'stringfromweb'.decode('encoding-of-the-string'). So if you grab a web page
that's in latin-1, you do 'stringfromweb'.decode('latin-1') and get unicode.
If you later want utf-8 (to plunk into SQL), take that unicode and
..encode('utf8').

Path-to-understanding:
You need to understand how unicode plays in to this first.

Unicode is not an encoding. Unicode is an abstract mapping of numbers
(called code points) to letters. Pure, undistilled "Unicode" is what you
see in those huge charts which show a number on the left and a long
uppercase letter/symbol description on the right. Unicode itself has nothing
to do with bytes, or even with computers.

A Python Unicode object is just that: an ordered sequence of unicode code
points. It has no natural byte representation. If you want that, you need
to encode it.

Note that unicode objects have no "decode" method. This is because unicode
is a LACK of encoding! Encoding maps symbols to byte representations, and a
unicode object is the explicit lack of a byte representation. So there are
no bytes to decode from. (Now of course the computer needs *some*
representation, becuase all it knows is bytes, but that could be anything,
and is entirely an implementation detail that you don't need to know about.
But you can see it with the 'unicode-internal' codec.)

A Python str object is an ordered sequence of 8-bit bytes. It is not really
a string--that's a holdover from the bygone days of pre-unicode Python.
When you encode a unicode object, you get raw bytes in some representation
of unicode characters, which are held by a str. When you want a unicode
object, you give it a str and a *known encoding*.

Now, what is the encoding of a str? You see this is like a strange Koan,
because bytes is bytes. Bytes have no intrinsic meaning until we give them
some. So whenever you decode a string to get unicode, you MUST supply the
encoding of the string!

There are ways to specify a default encoding for strings in Python (see your
site.py and sys.get/setdefaultencoding), but the default default is ascii.
Hence if byte '\xef' is found in a str, any attempt to encode it will choke,
because that byte is not in the 'ascii' encoding and thus the claim that
this str is encoded in ascii is false. (str.encode(codec) is really
shorthand for str.decode(default-encoding) -> unicode.encode(codec) )

Now, lets examine:
u'abc'

"Take three bytes 'abc', and decode it as if it were a unicode string
encoded as utf8."

Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)

What you really need to do, then, is:

Or:

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()

data2 = data.decode(<the-encoding-of-this-string>).encode('utf8')
....
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

Finding the encoding of that string from the web is where the tears come in.
If you're lucky, urllib.urlopen('http://....').info().getencoding() will
give it to you. However, this gets its info from the http headers, and if
they don't specify encoding, it defaults to '7bit'. But the html page
itself *might* have a different idea about its own encoding in the <meta>
element, 'content' attribute, which may be of the form "text/html;
charset=ISO-8859-1". Or it might not, who knows?

In other words, there is no standard, 100% reliable method of getting the
encoding of a web page. In an ideal world, the http header would have it,
and that's that. In the real world, you have to juggle various combinations
of information, missing information, and disinformation from the http
protocol header's info, the html file's meta info, and charset guessing
algorithms (look for Enca).

There might be a way to get urllib to request an encoding (as browsers do),
so that the http header will at least give some slightly more useful
information back, but I don't know how. As it is, it will almost always not
specify the charset if urllib is used, forcing you to look in the html file
itself.

But once you get the encoding, everything is fine....
 
S

Serge Orlov

Bill Eldridge said:
I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (that aren't
a problem for me with GB2312, BIG 5, etc.)

What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8));

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()

You've got 8-bit data here. urllib doesn't currently handle encoding issues,
maybe submitting sf feature request will change that. But right now you have
to do it yourself. Scan (or parse) the html header for encoding, if it's absent
grad the encoding from the http header. If it's absent too, then the encoding
ASAIR is latin1. So you code should look like:
connection = urllib.urlopen('http://localhost/test.html')
encoding = 'latin-1'
header_encoding = get_http_header_encoding(connection)
data = connection.read()
content_encoding = get_http_content_encoding(data)
if header_encoding:
encoding = header_encoding
if content_encoding:
encoding = content_encoding
data2 = ???

data2 = data.decode(encoding,'replace')
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

Quick scanning of mysql-python docs reveals that you should also
call connect with unicode='utf-8' parameter. Have you done that?
where data2 is somehow the UTF-8 converted version of the original Web page.

Additionally, I'd like to be able to do:

body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

data = urllib.urlopen('http://localhost/test.html').read()

main_body = body_expr.search(data).group(1)

Don't do that to data var because data is an 8bit string which
contains bytes not characters, use data2 instead.

As a rule of thumb you should decode to unicode as soon as you
can and leave unicode world as late as you can. And use unicode
aware APIs when they are available, this way you won't even
need to encode unicode objects.

and insert that into the database, and most likely I need to

I'm sitting with a dozen explanations from the Web explaining
how to do this,
0) decode('utf-8','ignore') or 'strict', or 'replace'...
1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)

You don't need re.UNICODE if you don't use \w, \W, \b, or \B
2) Convert to unicode before UTF-8

Not sure what that means.
3) replace quotation marks within the SQL statement:
data2.replace(u'"',u'\\"')

It's not a unicode problem, is it?

-- Serge.
 
S

Serge Orlov

Francis Avila said:
In other words, there is no standard, 100% reliable method of getting the
encoding of a web page.

There is a standard way. But you're right, it's not 100% reliable.
In an ideal world, the http header would have it, and that's that.
That's actually is not a good idea, because it will force the http server
writers to parse html header for the encoding. They will get away
with configuration parameter forcing all server files to be in one
encoding. But one day somebody will store a file in the wrong
encoding *for sure*. http header encoding is a bad idea.
In the real world, you have to juggle various combinations
of information, missing information, and disinformation from the http
protocol header's info, the html file's meta info, and charset guessing
algorithms (look for Enca).

It's not so bad. Web server and content editor writers are slowing
getting a clue. It used to be very bad, I remember it. I think the peak of
problems was in 98-99 years. But nowadays more than 99% of
web documents get encoding right. So having a simple read_unicode()
method of urlopener class would be very useful.
 
B

Bill Eldridge

Serge said:
You've got 8-bit data here. urllib doesn't currently handle encoding issues,
maybe submitting sf feature request will change that. But right now you have
to do it yourself. Scan (or parse) the html header for encoding, if it's absent
grad the encoding from the http header.
This part is fairly known - I'm setting up feeds that I'll actually look
at to scrape content, so identifying the encoding will be part of that.
.. What's driving me crazy is knowing the encoding
but still not getting the data all the way through the chain to MySQL.
If it's absent too, then the encoding
ASAIR is latin1. So you code should look like:
connection = urllib.urlopen('http://localhost/test.html')
encoding = 'latin-1'
header_encoding = get_http_header_encoding(connection)
data = connection.read()
content_encoding = get_http_content_encoding(data)
if header_encoding:
encoding = header_encoding
if content_encoding:
encoding = content_encoding
The latin-1 stuff isn't giving me problems, it's the Asian languages,
but I'll look at the connection end.
data2 = data.decode(encoding,'replace')




Quick scanning of mysql-python docs reveals that you should also
call connect with unicode='utf-8' parameter. Have you done that?
No, I haven't, I'll try it.
Don't do that to data var because data is an 8bit string which
contains bytes not characters, use data2 instead.
Alright, I've tried it both ways, but this makes it clearer why.
As a rule of thumb you should decode to unicode as soon as you
can and leave unicode world as late as you can. And use unicode
aware APIs when they are available, this way you won't even
need to encode unicode objects.





You don't need re.UNICODE if you don't use \w, \W, \b, or \B
Thanks, I don't.
Not sure what that means.


data.decode(None,'strict')
or
unicode(data,'unicode','strict')


It's not a unicode problem, is it?

Occasionally instead of getting the encoding error I get a SQL syntax error,
and figured somewhere it was misinterpreting something like the end
delimiter.
No proof though, just a guess, so I tried the replaces.

Thanks much,
Bill
 
B

Bill Eldridge

data2 = data.decode(encoding,'replace')
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )



Quick scanning of mysql-python docs reveals that you should also
call connect with unicode='utf-8' parameter. Have you done that?
I added that now, but it doesn't seem to make much difference
(I think it's more for returning data from MySQL, not storing it,
but that will still be useful)

I did a test where I grabbed the URL using the same routines and
dumped the thing to a file, and then edited out all the English and
various HTML, and the SQL insert works at that point.

It seems the mixed language is throwing stuff off, which wouldn't
bother me if my re.search for only the Vietnamese text were working
properly, but it isn't.
 
S

Serge Orlov

. What's driving me crazy is knowing the encoding
but still not getting the data all the way through the chain to MySQL.

That's because you're trying to create SQL statements manually.
data.decode(None,'strict')
or
unicode(data,'unicode','strict')

See Francis' excellent post why both of these calls do not make sence at all.
Occasionally instead of getting the encoding error I get a SQL syntax error,
and figured somewhere it was misinterpreting something like the end
delimiter.
No proof though, just a guess, so I tried the replaces.

Uh, I see. If you're creating SQL statement yourself, you have to take care
of escaping. I guess after data2.encode('utf-8') you've got illegal (in SQL)
characters and you have to escape them. Another quick scan of mysql-python
docs reveals that .execute method can be called as
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''', args=(data2,) )
where data2 is a unicode string. Then the python wrapper will take care of
unicode convertions and escaping. Don't forget that you need to .connect with
unicode='utf-8' parameter.

-- Serge.
 
S

Serge Orlov

Quick scanning of mysql-python docs reveals that you should also
I added that now, but it doesn't seem to make much difference
(I think it's more for returning data from MySQL, not storing it,
but that will still be useful)

It's for storing too. If you pass str to .execute it will take the string
and just pass it to MySQL as you created it. But if you use args
parameter of .execute method it will take care of convertions and
escaping. In fact you should pass even ascii strings via args parameter,
if you don't want to deal with escaping yourself.
I did a test where I grabbed the URL using the same routines and
dumped the thing to a file, and then edited out all the English and
various HTML, and the SQL insert works at that point.

It probably contained illegal characters from SQL point of view.
It seems the mixed language is throwing stuff off, which wouldn't
bother me if my re.search for only the Vietnamese text were working
properly, but it isn't.

Why? re.search works fine for my Russian unicode characters.

-- Serge.
 
B

Bill Eldridge

Serge said:
That's because you're trying to create SQL statements manually.

As opposed to what?
See Francis' excellent post why both of these calls do not make sence at all.
Understand, I tried doing absolutely nothing
for the supposedly Unicode strings, just
taking them as-is and putting them into the
database, and I get these ASCII decoding errors.
I try decoding them as UTF-8, which supposedly
they are, and I get these errors. So then I
try stupid stuff just to see if something stupid
works.
Uh, I see. If you're creating SQL statement yourself, you have to take care
of escaping. I guess after data2.encode('utf-8') you've got illegal (in SQL)
characters and you have to escape them. Another quick scan of mysql-python
docs reveals that .execute method can be called as
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''', args=(data2,) )
where data2 is a unicode string. Then the python wrapper will take care of
unicode convertions and escaping. Don't forget that you need to .connect with
unicode='utf-8' parameter.
Again, what do you mean "creating SQL statement yourself"? and what is the
alternative?

I'll try out this version of the execute statement.
 
B

Bill Eldridge

Serge said:
Why? re.search works fine for my Russian unicode characters.
Yes, it seems and should be very easy and straight-forward,
do not know why, will have to spend more time on it.
 
S

Skip Montanaro

That's because you're trying to create SQL statements manually.

Bill> As opposed to what?

Let MySQLdb do the data escaping for you:

conn = MySQLdb.Connection(host=..., etc)
curs = conn.cursor()
curs.execte("insert into sometable"
" (field1, field2)"
" values"
" (%s, %s)",
(val1, val2))

Val1 and val2 can be utf-8-encoded strings. MySQLdb will do the right thing
for you...

Skip
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top