From Python to LaTeX in Emacs on Windows

Brian Elmegaard

Hi group

I hope this is not a faq...

I am trying to understand how to use the new way of specifying a
file's encoding, but no matter what I do I get strange characters in
the output.

I have a text file which I have generated in Python by parsing some
HTML.

The file contains international characters like é and ó.
In Emacs I can see that the file is encoded as
mule-utf-8-dos

I read the file into Python as a string, and suddenly the characters
look strange when printed: each one consists of two characters.
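
For example, something like this (a minimal reproduction; 'data.txt' is
just a placeholder name for my file):

    # Python 2; the file on disk is UTF-8 encoded
    data = open('data.txt', 'rb').read()
    print repr(data)   # an 'é' shows up as the two bytes '\xc3\xa9'
    print data         # and is printed as two strange characters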

First problem: How do I avoid this?

The second problem is that I make some string replacements and other
changes in the string in order to write a LaTeX output file. When I
open this file in Emacs, the characters are not the same anymore.

Second problem: How do I avoid this?

tia,

Benjamin Niemann

Brian said:
> Hi group
>
> I hope this is not a faq...
>
> I am trying to understand how to use the new way of specifying a
> file's encoding, but no matter what I do I get strange characters in
> the output.
>
> I have a text file which I have generated in Python by parsing some
> HTML.
>
> The file contains international characters like é and ó.
> In Emacs I can see that the file is encoded as
> mule-utf-8-dos
>
> I read the file into Python as a string, and suddenly the characters
> look strange when printed: each one consists of two characters.
>
> First problem: How do I avoid this?
>
> The second problem is that I make some string replacements and other
> changes in the string in order to write a LaTeX output file. When I
> open this file in Emacs, the characters are not the same anymore.
>
> Second problem: How do I avoid this?

When you read the file contents in Python, you get the "raw" byte
sequence; in this case it is the UTF-8 encoding of Unicode text. But you
probably want a unicode string. Use "text = unicode(data, 'utf-8')",
where "data" is the file content you read. After processing you probably
want to write it back to a file. Before you do this, you have to convert
the unicode string back to a byte sequence. Use "data =
text.encode('utf-8')".
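
Put together, something like this sketch (the file names and the
replacement are just examples):

    # -*- coding: utf-8 -*-
    # Python 2 sketch
    data = open('input.txt', 'rb').read()   # raw UTF-8 bytes from disk
    text = unicode(data, 'utf-8')           # decode to a unicode string
    text = text.replace(u'é', u"\\'e")      # e.g. turn é into the LaTeX escape \'e
    out = open('output.tex', 'wb')
    out.write(text.encode('utf-8'))         # encode back to bytes before writing
    out.close()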

Handling character encodings correctly *is* difficult. It's no shame if
you don't get it right on the first attempt.

Brian Elmegaard

Thanks for the help. I solved the problem by specifying the cp1252
encoding for the Python file with a magic comment, and for the input
data file.
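
Roughly like this (a reconstruction; 'data.txt' stands in for the real
file name):

    # -*- coding: cp1252 -*-
    # the magic comment above (PEP 263) declares the script's own encoding
    import codecs

    # codecs.open decodes the input file from cp1252 while reading,
    # so 'text' is already a unicode string
    f = codecs.open('data.txt', 'r', 'cp1252')
    text = f.read()
    f.close()
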
Benjamin said:
> When you read the file contents in Python, you get the "raw" byte
> sequence; in this case it is the UTF-8 encoding of Unicode text. But
> you probably want a unicode string. Use "text = unicode(data,
> 'utf-8')", where "data" is the file content you read. After processing
> you probably want to write it back to a file. Before you do this, you
> have to convert the unicode string back to a byte sequence. Use
> "data = text.encode('utf-8')".

This worked, but when I try to "print text" I get:

    UnicodeEncodeError: 'ascii' codec can't encode characters in
    position 9-10: ordinal not in range(128)

Why is that?
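
It happens even with something as simple as this (assuming the default
encoding is still ascii):

    >>> text = unicode('\xc3\xa9', 'utf-8')   # the UTF-8 bytes for 'é'
    >>> print text                            # raises the UnicodeEncodeError
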
Benjamin said:
> Handling character encodings correctly *is* difficult.

What makes it difficult? The OS, the editor, Python, LaTeX?

Benjamin Niemann

Brian said:
> Thanks for the help. I solved the problem by specifying the cp1252
> encoding for the Python file with a magic comment, and for the input
> data file.

> This worked, but when I try to "print text" I get:
>
>     UnicodeEncodeError: 'ascii' codec can't encode characters in
>     position 9-10: ordinal not in range(128)
>
> Why is that?
The console only understands "byte streams". To print a unicode string,
Python tries to encode it using the default encoding, which is 'ascii'
in your case. That encoding cannot represent characters like 'ü' or
'ä', which causes the exception. What I usually do is something like:

    print text.encode("cp1252", "ignore")

The 'ignore' argument causes all characters that cannot be represented
in cp1252 to be silently dropped - which is OK if the output is only
used, e.g., to track progress.
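
You can also use 'replace' instead of 'ignore' if you'd rather see a '?'
where a character was lost. With an arbitrary test string:

    text = u'S\u00f8ren \u2642'              # 'ø' is in cp1252, '♂' is not
    print text.encode('cp1252', 'ignore')    # the '♂' is silently dropped
    print text.encode('cp1252', 'replace')   # prints a '?' in its place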

Don't know if there's a way to tell Python to do this automagically for
all unicode strings passed to stdout...
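
Maybe wrapping sys.stdout would work - something like this (untested
sketch):

    import sys, codecs
    # from here on, unicode strings printed to stdout are encoded as
    # cp1252, with unencodable characters replaced by '?'
    sys.stdout = codecs.getwriter('cp1252')(sys.stdout, 'replace')
    print u'caf\u00e9'
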
> What makes it difficult? The OS, the editor, Python, LaTeX?
At least for me it is difficult, because I'm used to thinking "1 byte =
1 character", and when I read/write files I can simply handle the data
as strings. Unless you start parsing arbitrary data from the internet,
there is little chance that you encounter text encodings different from
your operating system's default, and you start to believe that e.g.
"ord('ü') == 252" is a universal rule sent by the gods...
If you do it right, you should convert all data that 'enters' your
application as early as possible to unicode, and encode it back when you
print/save/send it - this way you only have to deal with unicode strings
in your application code. The most difficult part is probably changing
old habits ;)