Encoding for Devanagari Script.

Atul.

Hello All,

I wanted to know what encoding I should use to open files containing
Devanagari characters. I was thinking of UTF-8 but was not sure. Any
leads on this? Has anyone used it before?

Thanks in Advance.

Regards,
Atul.
 
Fredrik Lundh

Atul. wrote:
I wanted to know what encoding I should use to open files containing
Devanagari characters. I was thinking of UTF-8 but was not sure. Any
leads on this? Has anyone used it before?

Are we talking about existing files? If you don't know what encoding
the files use, you could always try using the UTF-8 codec; it's very
likely to complain if you're attempting to decode something that isn't
UTF-8.

If that doesn't work, it's a bit trickier -- there are several ways to
encode Unicode, and then there's ISCII as well. If you cannot sort it
out, try opening one of your files in binary mode, reading the first
32 bytes or so, and posting the result; chances are that someone
will be able to identify the encoding.

</F>
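
A minimal sketch of the check described above, written in Python 3 syntax (the thread itself predates it). The helper name `sniff` and the sample text are illustrative assumptions, not part of the original post. It reads a file in binary mode, keeps the first 32 bytes for posting, and tries a strict UTF-8 decode, which raises UnicodeDecodeError when the data is not valid UTF-8.

```python
import os
import tempfile


def sniff(path, count=32):
    """Return (first_bytes, is_utf8) for the file at `path` (hypothetical helper)."""
    with open(path, "rb") as f:       # binary mode: no decoding happens here
        data = f.read()
    try:
        data.decode("utf-8")          # strict by default: raises on invalid input
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return data[:count], is_utf8


def write_temp(payload):
    """Write bytes to a temporary file and return its path (demo only)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return path


# Sample Devanagari text, written with escapes so the source stays ASCII.
sample = "\u0928\u0935\u0940 \u0926\u093f\u0932\u094d\u0932\u0940"

utf8_path = write_temp(sample.encode("utf-8"))
utf16_path = write_temp(sample.encode("utf-16"))

utf8_head, utf8_ok = sniff(utf8_path)
utf16_head, utf16_ok = sniff(utf16_path)

print(repr(utf8_head), utf8_ok)
print(repr(utf16_head), utf16_ok)

os.remove(utf8_path)
os.remove(utf16_path)
```

The repr of the first bytes is the kind of output worth pasting into a post; a successful decode is strong evidence for UTF-8, though not proof.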
 
Terry Reedy

Atul. said:
Hello All,

I wanted to know what encoding I should use to open files containing
Devanagari characters. I was thinking of UTF-8 but was not sure. Any
leads on this? Has anyone used it before?

You cannot hurt your machine by giving that a try.

This is a general comment for all beginners. Before posting, open the
interactive interpreter (or IDLE) and try something(s). If the result
puzzles you, copy and paste into a post. Or if more appropriate, open
the Python manuals and search a bit, or try a search engine.
 
Atul.

Hi Fredrik and Terry,

Well, I got this in IDLE; I think I have done something wrong.

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    f = open("C:\Documents and Settings\admin\My Documents\corpus
\dainaikAikya collected by sushant.txt","r", "utf_8")
TypeError: an integer is required

After that I tried binary read mode, read the first 32
bytes, and this is what I got:
'\xef\xbb\xbf\xe0\xa4\xa8\xe0\xa4\xb5\xe0\xa5\x80
\xe0\xa4\xa6\xe0\xa4\xbf\xe0\xa4\xb2\xe0\xa5\x8d
\xe0\xa4\xb2\xe0\xa5\x80,'

Based on my knowledge of Unicode, I think this is a UTF-8 file (the
first 3 bytes, \xef\xbb\xbf, are the UTF-8 byte order mark); please
correct me if I am wrong. How do I read this?

Atul.

PS: I wrote the above code using the information in the Library
Reference PDF, section 4.8 "Codecs". Am I doing something wrong? Please
do let me know.
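
For reference, a small sketch that decodes the 32 bytes quoted in this post, in Python 3 syntax. One assumption is labeled in the code: the line break after \xa5\x80 in the post is taken to stand for a single space, which is what makes the total come to 32 bytes. The "utf-8-sig" codec is a standard-library codec that drops a leading byte order mark.

```python
import codecs

# The bytes from the post as a Python 3 bytes literal.
# Assumption: the wrap after \xa5\x80 stands for one space (32 bytes in total).
head = (b"\xef\xbb\xbf\xe0\xa4\xa8\xe0\xa4\xb5\xe0\xa5\x80 "
        b"\xe0\xa4\xa6\xe0\xa4\xbf\xe0\xa4\xb2\xe0\xa5\x8d"
        b"\xe0\xa4\xb2\xe0\xa5\x80,")

has_bom = head.startswith(codecs.BOM_UTF8)   # BOM_UTF8 is b"\xef\xbb\xbf"
with_bom = head.decode("utf-8")              # BOM survives as U+FEFF
without_bom = head.decode("utf-8-sig")       # leading BOM is dropped

print(len(head), has_bom)
print(ascii(with_bom))
print(ascii(without_bom))
```

Decoding a fixed-size prefix only works cleanly here because the 32 bytes happen to end on a character boundary; in general a cut can land in the middle of a multi-byte character.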
 
Tim Golden

Atul. said:
Hi Fredrik and Terry,

Well, I got this in IDLE; I think I have done something wrong.


Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    f = open("C:\Documents and Settings\admin\My Documents\corpus
\dainaikAikya collected by sushant.txt","r", "utf_8")
TypeError: an integer is required

PS: I wrote the above code using the information in the Library
Reference PDF, section 4.8 "Codecs". Am I doing something wrong? Please
do let me know.


Only slightly. You're importing the codecs module
but you're not using it. So you're *actually* using
the built-in open function, which doesn't have an
encoding parameter. It does have a third param
which is to do with the buffer size.
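
To illustrate the point about the third parameter, here is a small sketch in Python 3 syntax, using a throwaway temporary file as a stand-in for the real path. The wording of the error differs from the Python 2 message in the traceback above, but the cause is the same: the third positional argument of the built-in open is the buffer size. In Python 3 the built-in open also accepts an encoding, but only as a keyword argument.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()   # empty temporary file, demo only
os.close(fd)

try:
    open(path, "r", "utf_8")    # third positional argument is `buffering`
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# Python 3 only: the encoding has to be passed by keyword.
with open(path, "r", encoding="utf_8") as f:
    text = f.read()

os.remove(path)
```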

Just change your code to use codecs.open ("...")
and, I suggest, either use raw strings for your
filename (r"c:\docume...") or use the other kind
of slash ("c:/documen..."). Otherwise you might
run into some problems.

TJG
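
A short sketch of both suggestions, in Python 3 syntax. The path `C:\admin` and the temporary file are made-up stand-ins, not the original poster's file. The first part shows why backslashes in a normal string literal are risky: "\a" is the escape for the BEL character, so a folder name such as "admin" gets mangled. The second part reads a UTF-8 file with codecs.open, whose third positional argument is the encoding.

```python
import codecs
import os
import tempfile

# "\a" in a normal string literal is the BEL control character.
plain = "C:\admin"     # 7 characters: C, :, BEL, d, m, i, n
raw = r"C:\admin"      # 8 characters: the backslash is kept
forward = "C:/admin"   # forward slashes need no escaping

# Write a small UTF-8 file with a byte order mark (demo data only).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "\u0928\u0935\u0940".encode("utf-8"))

# codecs.open(filename, mode, encoding): decoding happens on read.
with codecs.open(path, "r", "utf_8") as f:
    text = f.read()

os.remove(path)

print(len(plain), len(raw), forward)
print(ascii(text))   # the BOM shows up as \ufeff with the plain utf_8 codec
```

With the plain "utf_8" codec the byte order mark stays in the text as a leading U+FEFF; passing "utf-8-sig" instead drops it. In Python 3 the built-in open with an encoding keyword does the same job as codecs.open.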
 
Atul.

Thanks, Tim, that did work. I will proceed with my playing around now.

Thanks a ton.

Atul.
 
