Unicode from Web to MySQL

Discussion in 'Python' started by Bill Eldridge, Dec 20, 2003.

  1. I'm trying to grab a document off the Web and toss it
    into a MySQL database, but I keep running into the
    various encoding problems with Unicode (that aren't
    a problem for me with GB2312, BIG 5, etc.)

    What I'd like is something as simple as:

    CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8);

    import MySQLdb, re,urllib

    data = urllib.urlopen('http://localhost/test.html').read()

    data2 = ???
    ....
    c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

    where data2 is somehow the UTF-8 converted version of the original Web page.

    Additionally, I'd like to be able to do:

    body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

    data = urllib.urlopen('http://localhost/test.html').read()

    main_body = body_expr.search(data).group(1)

    and insert that into the database, and most likely I need to

    I'm sitting with a dozen explanations from the Web explaining
    how to do this,
    0) decode('utf-8','ignore') or 'strict', or 'replace'...
    1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
    re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
    2) Convert to unicode before UTF-8
    3) replace quotation marks within the SQL statement:
    data2.replace(u'"',u'\\"')

    etc., etc., but after numerous tries in the end I still keep getting
    either SQL errors or
    the dreaded 'ascii' codec can't decode byte ... in position ...' errors.

    Can someone give me any explanation of how to do this easily?

    Thanks,
    Bill
    Bill Eldridge, Dec 20, 2003
    #1

  2. Bill Eldridge wrote in message ...
    >etc., etc., but after numerous tries in the end I still keep getting
    >either SQL errors or
    >the dreaded 'ascii' codec can't decode byte ... in position ...' errors.


    Here's your clue: your string contains a byte which is not representable by
    ascii.

    >Can someone give me any explanation of how to do this easily?


    Quickstart guide:
    You first need to decode your string to unicode. You do this by
    'stringfromweb'.decode('encoding-of-the-string'). So if you grab a web page
    that's in latin-1, you do 'stringfromweb'.decode('latin-1') and get unicode.
    If you later want utf-8 (to plunk into SQL), take that unicode and
    .encode('utf8').
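    For instance, a minimal round trip (Python 2 str/unicode assumed; the byte
    value here is just an illustration):

    raw = '\xe9'                  # a byte string from a latin-1 page ('e' with acute)
    text = raw.decode('latin-1')  # bytes -> unicode: u'\xe9'
    utf8 = text.encode('utf8')    # unicode -> utf-8 bytes: '\xc3\xa9', ready for MySQL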

    Path-to-understanding:
    You need to understand how unicode plays in to this first.

    Unicode is not an encoding. Unicode is an abstract mapping of numbers
    (called code points) to letters. Pure, undistilled "Unicode" is what you
    see in those huge charts which show a number on the left and a long
    uppercase letter/symbol description on the right. Unicode itself has nothing
    to do with bytes, or even with computers.

    A Python Unicode object is just that: an ordered sequence of unicode code
    points. It has no natural byte representation. If you want that, you need
    to encode it.

    Note that unicode objects have no "decode" method. This is because unicode
    is a LACK of encoding! Encoding maps symbols to byte representations, and a
    unicode object is the explicit lack of a byte representation. So there are
    no bytes to decode from. (Now of course the computer needs *some*
    representation, because all it knows is bytes, but that could be anything,
    and is entirely an implementation detail that you don't need to know about.
    But you can see it with the 'unicode-internal' codec.)

    A Python str object is an ordered sequence of 8-bit bytes. It is not really
    a string--that's a holdover from the bygone days of pre-unicode Python.
    When you encode a unicode object, you get raw bytes in some representation
    of unicode characters, which are held by a str. When you want a unicode
    object, you give it a str and a *known encoding*.

    Now, what is the encoding of a str? You see this is like a strange Koan,
    because bytes is bytes. Bytes have no intrinsic meaning until we give them
    some. So whenever you decode a string to get unicode, you MUST supply the
    encoding of the string!

    There are ways to specify a default encoding for strings in Python (see your
    site.py and sys.get/setdefaultencoding), but the default default is ascii.
    Hence if byte '\xef' is found in a str, any attempt to encode it will choke,
    because that byte is not in the 'ascii' encoding and thus the claim that
    this str is encoded in ascii is false. (str.encode(codec) is really
    shorthand for str.decode(default-encoding) -> unicode.encode(codec) )

    Now, let's examine:

    >>> 'abc'.decode('utf8')

    u'abc'

    "Take three bytes 'abc', and decode it as if it were a unicode string
    encoded as utf8."


    >>> '\xef'.encode('utf8')

    Traceback (most recent call last):
    ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
    ordinal not in range(128)

    What you really need to do, then, is:

    >>> PureUnicodeUnsulliedByBits = stringfromtheBADBADweb.decode('latin-1')


    Or:

    import MySQLdb, re,urllib

    data = urllib.urlopen('http://localhost/test.html').read()

    data2 = data.decode(<the-encoding-of-this-string>).encode('utf8')
    ....
    c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

    Finding the encoding of that string from the web is where the tears come in.
    If you're lucky, urllib.urlopen('http://....').info().getencoding() will
    give it to you. However, this gets its info from the http headers, and if
    they don't specify encoding, it defaults to '7bit'. But the html page
    itself *might* have a different idea about its own encoding in the <meta>
    element, 'content' attribute, which may be of the form "text/html;
    charset=ISO-8859-1". Or it might not, who knows?

    In other words, there is no standard, 100% reliable method of getting the
    encoding of a web page. In an ideal world, the http header would have it,
    and that's that. In the real world, you have to juggle various combinations
    of information, missing information, and disinformation from the http
    protocol header's info, the html file's meta info, and charset guessing
    algorithms (look for Enca).

    There might be a way to get urllib to request an encoding (as browsers do),
    so that the http header will at least give some slightly more useful
    information back, but I don't know how. As it is, it will almost always not
    specify the charset if urllib is used, forcing you to look in the html file
    itself.
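    For example, with urllib you can at least poke at what the headers claim
    (getparam is a method of the mimetools.Message object that .info() returns;
    whether a charset actually comes back depends entirely on the server):

    f = urllib.urlopen('http://localhost/test.html')
    print f.info().gettype()             # e.g. 'text/html'
    print f.info().getparam('charset')   # e.g. 'iso-8859-1', or None if unspecified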

    But once you get the encoding, everything is fine....
    --
    Francis Avila
    Francis Avila, Dec 20, 2003
    #2

  3. Serge Orlov (Guest)

    "Bill Eldridge" <> wrote in message news:...
    >
    > I'm trying to grab a document off the Web and toss it
    > into a MySQL database, but I keep running into the
    > various encoding problems with Unicode (that aren't
    > a problem for me with GB2312, BIG 5, etc.)
    >
    > What I'd like is something as simple as:
    >
    > CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8);
    >
    > import MySQLdb, re,urllib
    >
    > data = urllib.urlopen('http://localhost/test.html').read()


    You've got 8-bit data here. urllib doesn't currently handle encoding issues;
    maybe submitting an SF feature request will change that. But right now you have
    to do it yourself. Scan (or parse) the html header for the encoding; if it's absent,
    grab the encoding from the http header. If that's absent too, then the encoding
    AFAIR is latin-1. So your code should look like:
    connection = urllib.urlopen('http://localhost/test.html')
    encoding = 'latin-1'
    header_encoding = get_http_header_encoding(connection)
    data = connection.read()
    content_encoding = get_http_content_encoding(data)
    if header_encoding:
        encoding = header_encoding
    if content_encoding:
        encoding = content_encoding
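    Note that get_http_header_encoding and get_http_content_encoding aren't library
    functions; a rough sketch of what they might look like (re and urllib are already
    imported in the original snippet), using the charset parameter of the Content-Type
    header and a naive scan of the html itself:

    def get_http_header_encoding(connection):
        # charset parameter of the Content-Type header, e.g. "text/html; charset=utf-8"
        return connection.info().getparam('charset')

    def get_http_content_encoding(data):
        # crude scan for a <meta ... charset=...> declaration in the raw bytes
        m = re.search(r'charset\s*=\s*["\']?([-\w]+)', data, re.IGNORECASE)
        return m and m.group(1)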

    >
    > data2 = ???


    data2 = data.decode(encoding,'replace')

    > ...
    > c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )


    Quick scanning of mysql-python docs reveals that you should also
    call connect with unicode='utf-8' parameter. Have you done that?

    >
    > where data2 is somehow the UTF-8 converted version of the original Web page.
    >
    > Additionally, I'd like to be able to do:
    >
    > body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')
    >
    > data = urllib.urlopen('http://localhost/test.html').read()
    >
    > main_body = body_expr.search(data).group(1)


    Don't do that to the data var because data is an 8-bit string which
    contains bytes, not characters; use data2 instead.

    As a rule of thumb you should decode to unicode as soon as you
    can and leave the unicode world as late as you can. And use unicode
    aware APIs when they are available; this way you won't even
    need to encode unicode objects.
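    For instance, with the names used earlier in the thread (data being the raw
    bytes, encoding the detected page encoding, body_expr the compiled pattern),
    the idea is roughly:

    data2 = data.decode(encoding, 'replace')        # into unicode right at the edge
    main_body = body_expr.search(data2).group(1)    # regex over characters, not bytes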


    >
    > and insert that into the database, and most likely I need to
    >
    > I'm sitting with a dozen explanations from the Web explaining
    > how to do this,
    > 0) decode('utf-8','ignore') or 'strict', or 'replace'...
    > 1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
    > re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)


    You don't need re.UNICODE if you don't use \w, \W, \b, or \B

    > 2) Convert to unicode before UTF-8


    Not sure what that means.

    > 3) replace quotation marks within the SQL statement:
    > data2.replace(u'"',u'\\"')


    It's not a unicode problem, is it?

    -- Serge.
    Serge Orlov, Dec 20, 2003
    #3
  4. Serge Orlov (Guest)

    "Francis Avila" <> wrote in message news:...

    > In other words, there is no standard, 100% reliable method of getting the
    > encoding of a web page.


    There is a standard way. But you're right, it's not 100% reliable.

    > In an ideal world, the http header would have it, and that's that.

    That's actually not a good idea, because it would force the http server
    writers to parse the html header for the encoding. They would get away
    with a configuration parameter forcing all server files to be in one
    encoding. But one day somebody will store a file in the wrong
    encoding *for sure*. Http header encoding is a bad idea.

    > In the real world, you have to juggle various combinations
    > of information, missing information, and disinformation from the http
    > protocol header's info, the html file's meta info, and charset guessing
    > algorithms (look for Enca).


    It's not so bad. Web server and content editor writers are slowly
    getting a clue. It used to be very bad, I remember it. I think the peak of
    the problems was in 1998-99. But nowadays more than 99% of
    web documents get the encoding right. So having a simple read_unicode()
    method on the urlopener class would be very useful.
    Serge Orlov, Dec 20, 2003
    #4
  5. Serge Orlov wrote:

    >"Bill Eldridge" <> wrote in message news:...
    >>I'm trying to grab a document off the Web and toss it
    >>into a MySQL database, but I keep running into the
    >>various encoding problems with Unicode (that aren't
    >>a problem for me with GB2312, BIG 5, etc.)
    >>
    >>What I'd like is something as simple as:
    >>
    >>CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8);
    >>
    >>import MySQLdb, re,urllib
    >>
    >>data = urllib.urlopen('http://localhost/test.html').read()
    >
    >You've got 8-bit data here. urllib doesn't currently handle encoding issues;
    >maybe submitting an SF feature request will change that. But right now you have
    >to do it yourself. Scan (or parse) the html header for the encoding; if it's absent,
    >grab the encoding from the http header.

    This part is fairly well known - I'm setting up feeds that I'll actually look
    at to scrape content, so identifying the encoding will be part of that.
    What's driving me crazy is knowing the encoding
    but still not getting the data all the way through the chain to MySQL.

    > If that's absent too, then the encoding
    >AFAIR is latin-1. So your code should look like:
    >connection = urllib.urlopen('http://localhost/test.html')
    >encoding = 'latin-1'
    >header_encoding = get_http_header_encoding(connection)
    >data = connection.read()
    >content_encoding = get_http_content_encoding(data)
    >if header_encoding:
    >    encoding = header_encoding
    >if content_encoding:
    >    encoding = content_encoding

    The latin-1 stuff isn't giving me problems, it's the Asian languages,
    but I'll look at the connection end.

    >>data2 = ???
    >
    >data2 = data.decode(encoding,'replace')
    >
    >>...
    >>c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )
    >
    >Quick scanning of mysql-python docs reveals that you should also
    >call connect with unicode='utf-8' parameter. Have you done that?

    No, I haven't, I'll try it.

    >>where data2 is somehow the UTF-8 converted version of the original Web page.
    >>
    >>Additionally, I'd like to be able to do:
    >>
    >>body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')
    >>
    >>data = urllib.urlopen('http://localhost/test.html').read()
    >>
    >>main_body = body_expr.search(data).group(1)
    >
    >Don't do that to the data var because data is an 8-bit string which
    >contains bytes, not characters; use data2 instead.

    Alright, I've tried it both ways, but this makes it clearer why.

    >As a rule of thumb you should decode to unicode as soon as you
    >can and leave the unicode world as late as you can. And use unicode
    >aware APIs when they are available; this way you won't even
    >need to encode unicode objects.
    >
    >>and insert that into the database, and most likely I need to
    >>
    >>I'm sitting with a dozen explanations from the Web explaining
    >>how to do this,
    >>0) decode('utf-8','ignore') or 'strict', or 'replace'...
    >>1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
    >> re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
    >
    >You don't need re.UNICODE if you don't use \w, \W, \b, or \B

    Thanks, I don't.

    >>2) Convert to unicode before UTF-8
    >
    >Not sure what that means.

    data.decode(None,'strict')
    or
    unicode(data,'unicode','strict')

    >>3) replace quotation marks within the SQL statement:
    >>data2.replace(u'"',u'\\"')
    >
    >It's not a unicode problem, is it?

    Occasionally instead of getting the encoding error I get a SQL syntax error,
    and figured somewhere it was misinterpreting something like the end
    delimiter.
    No proof though, just a guess, so I tried the replaces.

    Thanks much,
    Bill
    Bill Eldridge, Dec 21, 2003
    #5

  6. >>What I'd like is something as simple as:
    >>
    >>CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8);
    >>
    >>import MySQLdb, re,urllib
    >>
    >>data = urllib.urlopen('http://localhost/test.html').read()
    >
    >data2 = data.decode(encoding,'replace')
    >...
    >c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )
    >
    >Quick scanning of mysql-python docs reveals that you should also
    >call connect with unicode='utf-8' parameter. Have you done that?

    I added that now, but it doesn't seem to make much difference
    (I think it's more for returning data from MySQL, not storing it,
    but that will still be useful)

    I did a test where I grabbed the URL using the same routines and
    dumped the thing to a file, and then edited out all the English and
    various HTML, and the SQL insert works at that point.

    It seems the mixed language is throwing stuff off, which wouldn't
    bother me if my re.search for only the Vietnamese text were working
    properly, but it isn't.

    Bill Eldridge, Dec 21, 2003
    #6
  7. Serge Orlov (Guest)

    > What's driving me crazy is knowing the encoding
    > but still not getting the data all the way through the chain to MySQL.


    That's because you're trying to create SQL statements manually.

    >>>2) Convert to unicode before UTF-8

    >>Not sure what that means.

    > data.decode(None,'strict')
    > or
    > unicode(data,'unicode','strict')


    See Francis' excellent post on why neither of these calls makes sense.

    >>>3) replace quotation marks within the SQL statement:
    >>>data2.replace(u'"',u'\\"')

    >>It's not a unicode problem, is it?

    > Occasionally instead of getting the encoding error I get a SQL syntax error,
    > and figured somewhere it was misinterpreting something like the end
    > delimiter.
    > No proof though, just a guess, so I tried the replaces.


    Uh, I see. If you're creating the SQL statement yourself, you have to take care
    of escaping. I guess after data2.encode('utf-8') you've got illegal (in SQL)
    characters and you have to escape them. Another quick scan of the mysql-python
    docs reveals that the .execute method can be called as
    c.execute(''' INSERT INTO junk ( junklet) VALUES ( %s ) ''', args=(data2,) )
    where data2 is a unicode string. Then the python wrapper will take care of
    unicode conversions and escaping (note there are no quotes around the %s
    placeholder; the driver adds them itself). Don't forget that you need to .connect
    with the unicode='utf-8' parameter.

    -- Serge.
    Serge Orlov, Dec 21, 2003
    #7
  8. Serge Orlov (Guest)

    >>Quick scanning of mysql-python docs reveals that you should also
    >>call connect with unicode='utf-8' parameter. Have you done that?
    >>

    > I added that now, but it doesn't seem to make much difference
    > (I think it's more for returning data from MySQL, not storing it,
    > but that will still be useful)


    It's for storing too. If you pass a str to .execute it will take the string
    and just pass it to MySQL as you created it. But if you use the args
    parameter of the .execute method it will take care of conversions and
    escaping. In fact you should pass even ascii strings via the args parameter,
    if you don't want to deal with escaping yourself.
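    For example (table and value are just placeholders to show the pattern; the
    driver quotes and escapes the value for you):

    c.execute("INSERT INTO junk (junklet) VALUES (%s)", ("plain ascii text",))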

    > I did a test where I grabbed the URL using the same routines and
    > dumped the thing to a file, and then edited out all the English and
    > various HTML, and the SQL insert works at that point.


    It probably contained illegal characters from the SQL point of view.

    > It seems the mixed language is throwing stuff off, which wouldn't
    > bother me if my re.search for only the Vietnamese text were working
    > properly, but it isn't.


    Why? re.search works fine for my Russian unicode characters.

    -- Serge.
    Serge Orlov, Dec 21, 2003
    #8
  9. Serge Orlov wrote:

    >> What's driving me crazy is knowing the encoding
    >>but still not getting the data all the way through the chain to MySQL.
    >
    >That's because you're trying to create SQL statements manually.

    As opposed to what?

    >>>>2) Convert to unicode before UTF-8
    >>>Not sure what that means.
    >>data.decode(None,'strict')
    >>or
    >>unicode(data,'unicode','strict')
    >
    >See Francis' excellent post on why neither of these calls makes sense.

    Understand, I tried doing absolutely nothing
    for the supposedly Unicode strings, just
    taking them as-is and putting them into the
    database, and I get these ASCII decoding errors.
    I try decoding them as UTF-8, which supposedly
    they are, and I get these errors. So then I
    try stupid stuff just to see if something stupid
    works.

    >>>>3) replace quotation marks within the SQL statement:
    >>>>data2.replace(u'"',u'\\"')
    >>>It's not a unicode problem, is it?
    >>Occasionally instead of getting the encoding error I get a SQL syntax error,
    >>and figured somewhere it was misinterpreting something like the end
    >>delimiter.
    >>No proof though, just a guess, so I tried the replaces.
    >
    >Uh, I see. If you're creating the SQL statement yourself, you have to take care
    >of escaping. I guess after data2.encode('utf-8') you've got illegal (in SQL)
    >characters and you have to escape them. Another quick scan of the mysql-python
    >docs reveals that the .execute method can be called as
    >c.execute(''' INSERT INTO junk ( junklet) VALUES ( %s ) ''', args=(data2,) )
    >where data2 is a unicode string. Then the python wrapper will take care of
    >unicode conversions and escaping (note there are no quotes around the %s
    >placeholder; the driver adds them itself). Don't forget that you need to .connect
    >with the unicode='utf-8' parameter.

    Again, what do you mean by "creating the SQL statement yourself"? And what is
    the alternative?

    I'll try out this version of the execute statement.
    Bill Eldridge, Dec 28, 2003
    #9
  10. Serge Orlov wrote:

    >>It seems the mixed language is throwing stuff off, which wouldn't
    >>bother me if my re.search for only the Vietnamese text were working
    >>properly, but it isn't.
    >
    >Why? re.search works fine for my Russian unicode characters.

    Yes, it seems like it should be very easy and straightforward;
    I don't know why it isn't, and will have to spend more time on it.
    Bill Eldridge, Dec 28, 2003
    #10
  11. >> That's because you're trying to create SQL statements manually.

    Bill> As opposed to what?

    Let MySQLdb do the data escaping for you:

    conn = MySQLdb.Connection(host=..., etc)
    curs = conn.cursor()
    curs.execute("insert into sometable"
                 " (field1, field2)"
                 " values"
                 " (%s, %s)",
                 (val1, val2))

    Val1 and val2 can be utf-8-encoded strings. MySQLdb will do the right thing
    for you...
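    Putting the thread together, a rough end-to-end sketch (Python 2 era; the
    host/user/passwd/db values are placeholders, and the use_unicode/charset
    connect arguments come from later MySQLdb releases, so check the docs for
    the version you actually have):

    import re, urllib, MySQLdb

    # fetch the raw bytes and work out the page's encoding
    f = urllib.urlopen('http://localhost/test.html')
    encoding = f.info().getparam('charset') or 'latin-1'   # fall back if the header is silent
    data = f.read()

    # decode to unicode as early as possible; do the regex work on characters
    body_expr = re.compile('<!-- MAIN -->(.*)<!-- /MAIN -->', re.DOTALL)
    main_body = body_expr.search(data.decode(encoding, 'replace')).group(1)

    # let MySQLdb quote and escape the value; no manual quoting or .replace hacks
    conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='test',
                           use_unicode=True, charset='utf8')
    curs = conn.cursor()
    curs.execute("INSERT INTO junk (junklet) VALUES (%s)", (main_body,))
    conn.commit()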

    Skip
    Skip Montanaro, Dec 29, 2003
    #11