Unicode support in Python

sonald

Hi,
I am using Python 2.4.1.

I need to pass Russian text into Python and validate it.
Can you please guide me on how to make my existing code support
Russian text?

Is there any module that can be used for Unicode support in Python?

In the case of decimal numbers, how do I handle a "comma as a decimal point"
within a number?

Currently the existing code is working fine for English text.
Please help.

Thanks in advance.

regards
sonal
 
Fredrik Lundh

sonald said:
I need to pass Russian text into Python and validate it.
Can you please guide me on how to make my existing code support
Russian text?

Is there any module that can be used for Unicode support in Python?

Python has built-in Unicode support (which you would probably have noticed
if you'd looked "Unicode" up in the documentation index). for a list of tutorials
and other documentation, see

http://www.google.com/search?q=python+unicode

</F>
 
sonald

Fredrik said:
(and before anyone starts screaming about how they hate RTFM replies, look
at the search result)

</F>
Thanks! But I have already tried this,
and let me tell you what I am trying now.

I have added the following line to the script:

# -*- coding: utf-8 -*-

I have also modified the site.py in ./Python24/Lib as
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

Now when I try to validate the data in a text file,
say abc.txt (saved with UTF-8 encoding), containing either English or
Russian text,

some junk character (box-like) is added as the first character.
What could be the reason for this,
and how do I handle it?
 
Diez B. Roggisch

sonald said:
Thanks!! but i have already tried this...

Tried - might be. But you certainly didn't understand it. So I suggest
that you read it again.
and let me tell you what i am trying now...

I have added the following line in the script

# -*- coding: utf-8 -*-

This will _only_ affect unicode literals inside the script itself -
nothing else! No files read, no files written, and additionally the path
of sun, earth and moon are unaffected as well - just in case you wondered.

This is an example of what is affected now:


--------
# -*- coding: utf-8 -*-
# this string is a byte string. it is created as such,
# regardless of the above encoding. instead, only
# what is in the bytes of the file itself is taken into account
some_string = "büchsenböller"

# this is a unicode literal (note the leading u).
# it will be _decoded_ using the above
# mentioned encoding. So make sure, your file is written in the
# proper encoding
some_unicode_object = u"büchsenböller"
---------



I have also modified the site.py in ./Python24/Lib as
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

Now when I try to validate the data in a text file,
say abc.txt (saved with UTF-8 encoding), containing either English or
Russian text,

some junk character (box-like) is added as the first character.
What could be the reason for this,
and how do I handle it?

You shouldn't tamper with the site-wide encoding: at best it will mask
errors you made, and it can easily introduce new ones.

And how do you think it would help you anyway? Python's Unicode support
would be stupid, to say the least, if it required changing the installation
before dealing with files of different encodings - don't you think?

As you don't show us the code you actually use to read that file, I'm
down to guessing here, but if you just read it as a byte string with

content = open("test.txt").read()

there won't be any magic decoding happening.

What you need to do instead is this (if you happen to know that test.txt
is encoded in utf-8):

content = open("test.txt").read().decode("utf-8")
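
A minimal sketch of this explicit decode step, written in Python 3 syntax for brevity (on the Python 2.4 discussed here, the one-liner above does the same job). The helper name `read_utf8` and the demo file are made up for illustration. The sketch also drops a leading U+FEFF byte order mark, which some Windows editors write at the start of UTF-8 files; that is only one possible, unconfirmed explanation for a stray first character.

```python
import codecs
import os
import tempfile

def read_utf8(path):
    """Read raw bytes, decode them explicitly, and drop a leading BOM if present."""
    with open(path, "rb") as handle:
        raw = handle.read()            # bytes: no decoding has happened yet
    text = raw.decode("utf-8")         # explicit decode to a Unicode string
    if text.startswith("\ufeff"):      # a BOM survives a plain utf-8 decode
        text = text[1:]
    return text

# Demo: write a file the way a BOM-writing editor might, then read it back.
sample = "A001, \u0442\u0435\u0441\u0442, 100000\n"   # Cyrillic word "test"
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(demo_path, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

content = read_utf8(demo_path)
print(repr(content[:4]))               # 'A001'
```

Printing `repr()` of the first few characters is a quick way to see whether an invisible character is really in the data or only in the viewer.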


Then you have a unicode object. Now if you need that to be written to a
terminal (or wherever your "boxes" appear - guessing here too, no code,
you remember?), you need to make sure that

- you know the terminal's encoding

- you properly encode the unicode content to that encoding before
printing, as otherwise the default encoding will be used


So, in case your terminal uses utf-8, you do

print content.encode("utf-8")
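
The encode-before-output step can be sketched as follows, again in Python 3 syntax and only as an illustration. The helper `write_text` is hypothetical, and the two target encodings are assumptions chosen to show that the same text becomes different bytes depending on what the receiving side expects.

```python
import io

def write_text(stream, text, encoding):
    """Encode text explicitly before handing it to a byte-oriented stream."""
    # "replace" substitutes "?" for characters the target encoding lacks
    stream.write(text.encode(encoding, "replace"))

content = "\u0441\u0447\u0435\u0442 A001"   # four Cyrillic letters, a space, "A001"

utf8_out = io.BytesIO()      # stands in for a UTF-8 terminal or file
cp1251_out = io.BytesIO()    # stands in for a Windows-1251 terminal or file
write_text(utf8_out, content, "utf-8")
write_text(cp1251_out, content, "cp1251")

# Same text, different byte lengths: 13 bytes in UTF-8, 9 in cp1251.
print(len(utf8_out.getvalue()), len(cp1251_out.getvalue()))
```

If the bytes are written in one encoding and displayed in another, the viewer shows junk even though the data itself is fine.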


Diez
 
Fredrik Lundh

sonald said:
I have added the following line in the script

# -*- coding: utf-8 -*-

that's good.
I have also modified the site.py

that's bad, because this means that your code won't work on standard
Python installations.
Now when I try to validate the data in a text file,
say abc.txt (saved with UTF-8 encoding), containing either English or
Russian text,

what does the word "validate" mean here?
some junk character (box like) is added as the first character
what must be the reason for this?

what did you do to determine that there's a box-like character at the start
of the file?

can you post some code?

</F>
 
John Roth

sonald said:
Hi,
I am using Python 2.4.1.

I need to pass Russian text into Python and validate it.
Can you please guide me on how to make my existing code support
Russian text?

Is there any module that can be used for Unicode support in Python?

In the case of decimal numbers, how do I handle a "comma as a decimal point"
within a number?

Currently the existing code is working fine for English text.
Please help.

Thanks in advance.

regards
sonal

As both of the other responders have said, the
coding comment at the front only affects source
text; it has absolutely no effect at run time. In
particular, it's not even necessary to use it to
handle non-English languages as long as you
don't want to write literals in those languages.

What seems to be missing is the notion that
external files are _always_ byte files, and have to
be _explicitly_ decoded into unicode strings,
and then encoded back to whatever the external
encoding needs to be, each and every time you
read or write a file, or copy string data from
byte strings to unicode strings and back.
There is no good way of handling this implicitly:
you can't simply say "utf-8" or "iso-8859-whatever"
in one place and expect it to work.

You've got to specify the encoding on each and
every open, or else use the encode and decode
string methods. This is a great motivation for
eliminating duplication and centralizing your
code!
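
One way to centralize the encoding, sketched in Python 3 syntax (on Python 2.4, `codecs.open` takes an encoding argument in a similar way). The names `FILE_ENCODING` and `open_text`, and the sample record, are invented for this illustration.

```python
import io
import os
import tempfile

FILE_ENCODING = "utf-8"   # assumption: the encoding the incoming files use

def open_text(path, mode="r"):
    """The single place where the external encoding is named."""
    return io.open(path, mode, encoding=FILE_ENCODING)

# Demo: write one record and read it back through the same helper.
path = os.path.join(tempfile.mkdtemp(), "abc.acc")
with open_text(path, "w") as out:
    out.write("A001, \u0442\u0435\u0441\u0442, 100000\n")

with open_text(path) as src:
    records = [line.rstrip("\n").split(", ") for line in src]

print(len(records), len(records[0]))   # 1 3
```

Every read and write then goes through `open_text`, so a change of encoding is a one-line change.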

For your other question: the general words
are localization and locale. Look up locale in
the index. It's a strange subject which I don't
know much about, but that should get you
started.
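
For the decimal-comma question, the `locale` module offers `locale.atof`, but it depends on the right locale being installed and selected with `locale.setlocale`. A locale-independent sketch (Python 3 syntax; the helper `parse_decimal_comma` is hypothetical, and the accepted format is an assumption about the data) could look like this:

```python
import re
from decimal import Decimal

# Optional sign, digits, then at most one comma followed by more digits.
_DECIMAL_COMMA = re.compile(r"^[+-]?\d+(?:,\d+)?$")

def parse_decimal_comma(field):
    """Parse '1234,56' or '1 234,56'; return None if the field is malformed."""
    # Drop ordinary and non-breaking spaces used as thousands separators.
    cleaned = field.strip().replace("\u00a0", "").replace(" ", "")
    if not _DECIMAL_COMMA.match(cleaned):
        return None
    return Decimal(cleaned.replace(",", "."))

print(parse_decimal_comma("100000,50"))   # 100000.50
print(parse_decimal_comma("12,3,4"))      # None
```

Note that in a comma-separated file a decimal comma also looks like a field separator, so such values have to be quoted or the file needs a different delimiter.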

John Roth
 
sonald

Fredrik said:
what does the word "validate" mean here?
Let me explain our module.
We receive text files (with comma-separated values, as per some
predefined format) from a third party.
For example, an account file comes as "abc.acc" (.acc is the extension for
account files as per our code).
It must contain account_code, account_description, account_balance in
that order.

So a text file ("abc.acc") with two records
will look like
A001, test account1, 100000
A002, test account2, 500000

We may have multiple .acc files.

Our job is to validate the incoming data on the basis of its datatype,
field number, etc. and copy all the error-free records into acc.txt.

For this, we use a schema as follows:
----------------------------------------------------------------------------------------------------------
if account_flg == 1:
    start = time()

    # the input fields
    acct_schema = {
        0: Text('AccountCode', 50),
        1: Text('AccountDescription', 100),
        2: Text('AccountBalance', 50)
    }

    validate( schema = acct_schema,
              primary_keys = [acct_pk],
              infile = '../data/ACC/*.acc',
              outfile = '../data/acc.txt',
              update_freq = 10000)
----------------------------------------------------------------------------------------------------------
In core.py we have defined a function validate, which checks the
datatypes and performs other validations.
All the erroneous records are copied to an error log file, and the
correct records are copied to a clean acc.txt file.

The validate function is given below:
---------------------------------------------------------------------------------------------------------------------------
def validate(infile, outfile, schema, primary_keys=[], foreign_keys=[],
             record_checks=[], buffer_size=0, update_freq=0):

    show("initializing ... ")

    # find matching input files
    all_files = glob.glob(infile)
    if not all_files:
        raise ValueError('No input files were found.')

    # initialize data structures
    freq = update_freq or DEFAULT_UPDATE
    input = fileinput.FileInput(all_files,
                                bufsize = buffer_size or DEFAULT_BUFFER)
    output = open(outfile, 'wb+')
    logs = {}
    for name in all_files:
        logs[name] = open(name + DEFAULT_SUFFIX, 'wb+')
        #logs[name] = open(name + DEFAULT_SUFFIX, 'a+')

    errors = []
    num_fields = len(schema)
    pk_length = range(len(primary_keys))
    fk_length = range(len(foreign_keys))
    rc_length = range(len(record_checks))

    # initialize the PKs and FKs with the given schema
    for idx in primary_keys:
        idx.setup(schema)
    for idx in foreign_keys:
        idx.setup(schema)

    # start processing: collect all lines which have errors
    for line in input:
        rec_num = input.lineno()
        if rec_num % freq == 0:
            show("processed %d records ... " % (rec_num))
            for idx in primary_keys:
                idx.flush()
            for idx in foreign_keys:
                idx.flush()

        if BLANK_LINE.match(line):
            continue

        try:
            data = csv.parse(line)

            # check number of fields
            if len(data) != num_fields:
                errors.append( (rec_num, LINE_ERROR, 'incorrect number of fields') )
                continue

            # check for well-formed fields
            fields_ok = True
            for i in range(num_fields):
                if not schema[i].validate(data[i]):
                    errors.append( (rec_num, FIELD_ERROR, i) )
                    fields_ok = False
                    break

            # check the PKs
            for i in pk_length:
                if fields_ok and not primary_keys[i].valid(rec_num, data):
                    errors.append( (rec_num, PK_ERROR, i) )
                    break

            # check the FKs
            for i in fk_length:
                if fields_ok and not foreign_keys[i].valid(rec_num, data):
                    #print 'here ---> %s, rec_num : %d'%(data,rec_num)
                    errors.append( (rec_num, FK_ERROR, i) )
                    break

            # perform record-level checks
            for i in rc_length:
                if fields_ok and not record_checks[i](schema, data):
                    errors.append( (rec_num, REC_ERROR, i) )
                    break

        except fastcsv.Error, err:
            errors.append( (rec_num, LINE_ERROR, err.__str__()) )

    # finalize the indexes to check for any more errors
    for i in pk_length:
        error_list = primary_keys[i].finalize()
        primary_keys[i].save()
        if error_list:
            errors.extend( [ (rec_num, PK_ERROR, i) for rec_num in error_list ] )

    for i in fk_length:
        error_list = foreign_keys[i].finalize()
        if error_list:
            errors.extend( [ (rec_num, FK_ERROR, i) for rec_num in error_list ] )

    # sort the list of errors by the cumulative line number
    errors.sort( lambda l, r: cmp(l[0], r[0]) )

    show("saving output ... ")

    # reopen input and sort it into either the output file or error log file
    input = fileinput.FileInput(all_files,
                                bufsize = buffer_size or DEFAULT_BUFFER)
    error_list = iter(errors)
    count = input.lineno
    filename = input.filename
    line_no = input.filelineno

    try:
        line_num, reason, i = error_list.next()
    except StopIteration:
        line_num = -1
    for line in input:
        line = line + '\r\n'
        #print '%d,%d'%(line_num,count())
        if line_num == count():

            if reason == FIELD_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_FIELD % (schema[i].name), line))
            elif reason == LINE_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(), i, line))
            elif reason == PK_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_PK % (primary_keys[i].name), line))
            elif reason == FK_ERROR:
                #print 'Test FK %s, rec_num : %d, line : %s'%(foreign_keys[i].name,line_no(),line)
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_FK % (foreign_keys[i].name), line))
            elif reason == REC_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_REC % (record_checks[i].__doc__), line))
            else:
                raise RuntimeError("shouldn't reach here")

            try:
                #print 'CURRENT ITERATION, line_num : %d, line : %s'%(line_num,line)
                line_num1 = line_num
                line_num, reason, i = error_list.next()
                if line_num1 == line_num :
                    line_num, reason, i = error_list.next()

                #print 'FOR NEXT ITERATION, line_num : %d, line : %s'%(line_num,line)

            except StopIteration:
                line_num = -1
            continue

        if not BLANK_LINE.match(line):
            output.write(line)

    output.close()
    for f in logs.values():
        f.close()
-----------------------------------------------------------------------------------------------------------------------------

Now when I open the error log file, it contains the error message for
each erroneous record, along with the original record copied from the
*.acc file.
This record is preceded by a box-like character.

Do you want me to post the complete code, just in case?
It might help you understand my problem better.
Please let me know soon.
 
sonald

Hi,
Can you please tell me if there is any package or class that I can import
for internationalization or Unicode support?

This module is just a small part of our application, and we are not
really supposed to alter the code.
We do not have anybody here to help us with Python, and we are
supposed to just try and understand the program. Today I am in a
position where I can fix the bugs arising from the code, but cannot
really attempt something like internationalization on my own. Can you help?
Do you want me to post the complete code for your reference?
Please let me know as soon as possible.
 
Dennis Lee Bieber

Now when I open the error log file, it contains the error message for
each erroneous record, along with the original record copied from the
*.acc file.
This record is preceded by a box-like character.
A "box like character" doesn't need Unicode to generate, especially
if one is viewing the file with Notepad on Windows.

The output file shows a box in front of "last line" when viewed with
Notepad, though the rest of the contents "looks" normal (single-spaced
lines, yet).

"""
A test file
second line
third line
[]last line
"""
( [] is supposed to be the "box" that Notepad displays)

Viewing the same output file in Wordpad has a blank line before
"last line", and has "third line" appended to "second line"!

"""
A test file
second line third line

last line
"""

while Scite shows it as

"""
A test file
second line

third line

last line
"""
(which is also what a cut&paste from Notepad turns into)
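
A small sketch of how such mixed line endings can arise, in Python 3 syntax. It assumes, as in the posted validate function, that a line read from a file (which still carries its own "\n") has '\r\n' appended before being written out; the variable names are illustrative only.

```python
line_from_file = "third line\n"        # lines read from a file keep their "\n"

# What `line = line + '\r\n'` produces: a lone "\n" followed by "\r\n".
written = line_from_file + "\r\n"
print(repr(written))                   # 'third line\n\r\n'

# Stripping the old ending first leaves exactly one Windows line ending.
fixed = line_from_file.rstrip("\r\n") + "\r\n"
print(repr(fixed))                     # 'third line\r\n'
```

Viewers disagree about how to show a lone "\n" or "\r", which would be consistent with the differing Notepad, Wordpad and Scite displays above. Inspecting the data with `repr()` shows the actual characters regardless of the viewer.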
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 
