How to store ASCII encoded python string?

micahc

I currently have a Python program that reads in emails from a POP3
server. As soon as the message is read in it is fed directly into a
PostgreSQL database for storage. Later, it is broken down into its
parts and displayed to the user.

My problem is that when I try to pass "\tsome text\xa7some more text\n"
into the database it gives me a unicode decode error. At this point in
the program I don't know what codec it is (I won't know that until I
break apart the message later) and it may even be binary data so I just
want to store it in the database with the escape characters, not as a
decoded/encoded string.

So, how do I store "\tsome text\xa7 some more text\n" as that instead
of:
" some text§ some more text
"

I don't have a problem escaping it so the above would look like
"\\tsome text\\xa7 some more text\\n" as long as I have a way to later
unescape it when I want to actually do something with the data.
 
Fredrik Lundh

micahc wrote:
I currently have a Python program that reads in emails from a POP3
server. As soon as the message is read in it is fed directly into a
PostgreSQL database for storage. Later, it is broken down into its
parts and displayed to the user.

My problem is that when I try to pass "\tsome text\xa7some more text\n"
into the database it gives me a unicode decode error.

"\xa7" is not a valid ASCII character, so that's not really an "ASCII
encoded" string.

looks like your database expects Unicode strings, but you're passing in
binary data. to solve this, you can:

1) change the database table to use a "blob" field instead of a text field

or

2) configure the database interface to pass 8-bit strings through to the
database engine (if possible; see the database interface documentation
for details)

or

3) convert the data to Unicode before passing it to the database
interface, and leave it to the interface to convert it to whatever
encoding your database uses:

data = ... get encoded string from email ...
text = data.decode("iso-8859-1")
... write text to database ...

</F>
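
A note on option 3, as a minimal sketch in Python 3 (the thread predates it, so treat the syntax as an assumption about your setup). ISO-8859-1 maps each of the 256 byte values to exactly one code point, so decoding with it never fails, and re-encoding returns the original bytes. The text may display wrongly if the real charset is something else, but nothing is lost. The variable names are illustrative only.

```python
# Raw bytes as they might arrive from the POP3 server (sample from the thread).
raw = b"\tsome text\xa7some more text\n"

# Decoding as ISO-8859-1 cannot fail: every byte value has a code point.
text = raw.decode("iso-8859-1")

# Later, once the real charset is known, recover the exact original bytes.
restored = text.encode("iso-8859-1")

# The same holds for every possible byte value, not just this sample.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("iso-8859-1").encode("iso-8859-1")
```

One caveat: PostgreSQL text columns reject NUL characters, so data that may contain \x00 bytes would still need a binary column.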
 
Marc 'BlackJack' Rintsch

micahc wrote:
So, how do I store "\tsome text\xa7 some more text\n" as that instead
of:
" some text§ some more text
"

I don't have a problem escaping it so the above would look like
"\\tsome text\\xa7 some more text\\n" as long as I have a way to later
unescape it when I want to actually do something with the data.

In [6]: '\tsome text\xa7some more text\n'.encode('string_escape')
Out[6]: '\\tsome text\\xa7some more text\\n'

Ciao,
Marc 'BlackJack' Rintsch
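
The 'string_escape' codec shown above exists only in Python 2. As a rough Python 3 equivalent, the sketch below goes through ISO-8859-1 and the standard 'unicode_escape' codec to get a pure-ASCII escaped string and back. The helper names escape_bytes and unescape_text are hypothetical, chosen for illustration.

```python
def escape_bytes(raw: bytes) -> str:
    """Return a pure-ASCII, backslash-escaped form of arbitrary bytes."""
    # Map each byte to the code point of the same value, then escape
    # everything that is not printable ASCII (and backslashes themselves).
    return raw.decode("latin-1").encode("unicode_escape").decode("ascii")


def unescape_text(escaped: str) -> bytes:
    """Reverse escape_bytes and return the original bytes."""
    return escaped.encode("ascii").decode("unicode_escape").encode("latin-1")


raw = b"\tsome text\xa7some more text\n"
escaped = escape_bytes(raw)
restored = unescape_text(escaped)
```

The escaped form is safe to put in a text column whatever the database encoding is, at the cost of some extra size for non-ASCII bytes.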
 
micahc

Fredrik said:
3) convert the data to Unicode before passing it to the database
interface, and leave it to the interface to convert it to whatever
encoding your database uses:

data = ... get encoded string from email ...
text = data.decode("iso-8859-1")
... write text to database ...

Wouldn't that have to assume that all incoming data is in iso-8859-1?
If someone sends me an email with chinese characters would that still
work (I don't know the character set at data insert time)?

Marc said:
In [6]: '\tsome text\xa7some more text\n'.encode('string_escape')
Out[6]: '\\tsome text\\xa7some more text\\n'

Thanks, I think this is what I will end up doing just for simplicity,
though I'm still curious about the above question.
 
Fredrik Lundh

micahc wrote:
Wouldn't that have to assume that all incoming data is in iso-8859-1?
If someone sends me an email with chinese characters would that still
work (I don't know the character set at data insert time)?

if you're reading mail, chances are that you know the encoding (it's
specified in the message headers).

or are you saying that you're treating the mail as binary data? if so,
why are you trying to store that in a *text* field in the database?

</F>
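
To illustrate reading the encoding from the message headers, here is a minimal sketch using the standard library email package in Python 3. The sample message, the sender address, and the ISO-8859-1 fallback are assumptions made for the example; a real multipart message would need each part handled separately.

```python
import email

# A small single-part message; headers and address are made up for the example.
raw_message = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"\tsome text\xa7some more text\n"
)

msg = email.message_from_bytes(raw_message)

# Charset declared in the Content-Type header; fall back if it is missing.
charset = msg.get_content_charset() or "iso-8859-1"

# decode=True undoes the transfer encoding and returns the body as bytes.
payload = msg.get_payload(decode=True)

# Only now is it safe to turn the bytes into text for a text column.
text = payload.decode(charset, errors="replace")
```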
 
Dennis Lee Bieber

micahc wrote:
I currently have a Python program that reads in emails from a POP3
server. As soon as the message is read in it is fed directly into a
PostgreSQL database for storage. Later, it is broken down into its
parts and displayed to the user.

IMO -- you have just explained all problems at that point...

micahc wrote:
My problem is that when I try to pass "\tsome text\xa7some more text\n"
into the database it gives me a unicode decode error. At this point in
the program I don't know what codec it is (I won't know that until I
break apart the message later) and it may even be binary data so I just
want to store it in the database with the escape characters, not as a
decoded/encoded string.

If the message may contain text in some non-ASCII encoding (or
non-ISO-Latin-1, if that is native to your system), and you do NOT take
the time to first check the headers for the encoding used IN the
message, then the only way to handle the message is to store it as a
binary blob type, and not as some sort of "text" format.
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
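
The binary-column approach suggested in this thread can be sketched as follows. To keep the example runnable without a server it uses the standard library sqlite3 module and a BLOB column as a stand-in; in PostgreSQL the corresponding column type is bytea, and the exact driver calls depend on the database interface in use. The table and column names are hypothetical.

```python
import sqlite3

# Undecoded message bytes (sample from the thread).
raw = b"\tsome text\xa7some more text\n"

# In-memory SQLite database used only as a stand-in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_mail (id INTEGER PRIMARY KEY, body BLOB)")

# Passing a bytes object as a parameter stores it without any decoding.
conn.execute("INSERT INTO raw_mail (body) VALUES (?)", (raw,))
conn.commit()

# Reading it back yields the same bytes, ready to be parsed later.
stored = conn.execute("SELECT body FROM raw_mail").fetchone()[0]
conn.close()
```

Because the column holds bytes rather than text, no charset has to be known at insert time.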
 
