unicode and dbf files

Ethan Furman

Greetings, all!

I would like to add unicode support to my dbf project. The dbf header
has a one-byte field to hold the encoding of the file. For example,
\x03 is code-page 437 MS-DOS.

My google-fu is apparently not up to the task of locating a complete
resource that has a list of the 256 possible values and their
corresponding code pages.

So far I have found this, plus variations:
http://support.microsoft.com/kb/129631

Does anyone know of anything more complete?

~Ethan~

John Machin

Greetings, all!

I would like to add unicode support to my dbf project.  The dbf header
has a one-byte field to hold the encoding of the file.  For example,
\x03 is code-page 437 MS-DOS.

My google-fu is apparently not up to the task of locating a complete
resource that has a list of the 256 possible values and their
corresponding code pages.

What makes you imagine that all 256 possible values are mapped to code
pages?
So far I have found this, plus variations:
http://support.microsoft.com/kb/129631

Does anyone know of anything more complete?

That is for VFP3. Try the VFP9 equivalent.

dBase 5,5,6,7 use others which are not defined in publicly available
dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
source: ESRI support site.

Ethan Furman

John said:
What makes you imagine that all 256 possible values are mapped to code
pages?

I'm just wanting to make sure I have whatever is available, and
preferably standard. :D

That is for VFP3. Try the VFP9 equivalent.

dBase 5,5,6,7 use others which are not defined in publicly available
dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
source: ESRI support site.

Well, a couple hours later and still not more than I started with.
Thanks for trying, though!

~Ethan~

John Machin

I'm just wanting to make sure I have whatever is available, and
preferably standard.  :D

Well, a couple hours later and still not more than I started with.
Thanks for trying, though!

Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
keywords and you couldn't come up with anything??

Ethan Furman

John said:
Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
keywords and you couldn't come up with anything??

Perhaps "nothing new" would have been a better description. I'd already
seen the clicketyclick site (good info there), and all I found at ESRI
were folks trying to figure it out, plus one link to a list that was no
different from the vfp3 list (or was it that the list did not give the
hex values? Either way, of no use to me.)

I looked at dbase.com, but came up empty-handed there (not surprising,
since they are a commercial company).

I searched some more on Microsoft's site in the VFP9 section, and was
able to find the code page section this time. Sadly, it only added
about seven codes.

At any rate, here is what I have come up with so far. Any corrections
and/or additions greatly appreciated.

code_pages = {
    '\x01' : ('ascii', 'U.S. MS-DOS'),
    '\x02' : ('cp850', 'International MS-DOS'),
    '\x03' : ('cp1252', 'Windows ANSI'),
    '\x04' : ('mac_roman', 'Standard Macintosh'),
    '\x64' : ('cp852', 'Eastern European MS-DOS'),
    '\x65' : ('cp866', 'Russian MS-DOS'),
    '\x66' : ('cp865', 'Nordic MS-DOS'),
    '\x67' : ('cp861', 'Icelandic MS-DOS'),
    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy
    '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),   # iffy
    '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
    '\x6b' : ('cp857', 'Turkish MS-DOS'),
    '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),  # wag
    '\x79' : ('iso2022_kr', 'Korean Windows'),       # wag
    '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'),  # wag
    '\x7b' : ('iso2022_jp', 'Japanese Windows'),     # wag
    '\x7c' : ('cp874', 'Thai Windows'),              # wag
    '\x7d' : ('cp1255', 'Hebrew Windows'),
    '\x7e' : ('cp1256', 'Arabic Windows'),
    '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
    '\x97' : ('mac_latin2', 'Macintosh EE'),
    '\x98' : ('mac_greek', 'Greek Macintosh'),
    '\xc8' : ('cp1250', 'Eastern European Windows'),
    '\xc9' : ('cp1251', 'Russian Windows'),
    '\xca' : ('cp1254', 'Turkish Windows'),
    '\xcb' : ('cp1253', 'Greek Windows'),
    }
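
To show how I picture this map being used, here is a minimal sketch of decoding a raw field, written Python 3 style (so the one-byte keys become bytes objects, and only a couple of entries are reproduced); the helper name decode_field is just made up for illustration:

```python
# Sketch only: look up the codec for the header's LDID byte and decode
# raw field data with it.  Small subset of the table above; bytes keys
# because this assumes Python 3.
code_pages = {
    b'\x03': ('cp1252', 'Windows ANSI'),
    b'\x65': ('cp866', 'Russian MS-DOS'),
}

def decode_field(ldid, raw):
    """Decode raw field bytes using the codec mapped to the LDID byte."""
    try:
        codec, description = code_pages[ldid]
    except KeyError:
        raise ValueError('unknown LDID: %r' % (ldid,))
    return raw.decode(codec)

# A Russian MS-DOS (cp866) field comes out as Cyrillic text:
print(decode_field(b'\x65', b'\xaf\xe0\xa8\xa2\xa5\xe2'))
```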

~Ethan~

John Machin

Perhaps "nothing new" would have been a better description.  I'd already
seen the clicketyclick site (good info there)

Do you think so? My take is that it leaves out most of the codepage
numbers, and these two lines are wrong:
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866

and all I found at ESRI
were folks trying to figure it out, plus one link to a list that was no
different from the vfp3 list (or was it that the list did not give the
hex values?  Either way, of no use to me.)

Try this:
http://webhelp.esri.com/arcpad/8.0/referenceguide/

I looked at dbase.com, but came up empty-handed there (not surprising,
since they are a commercial company).

MS and ESRI have docs ... does that mean that they are non-commercial
companies?
I searched some more on Microsoft's site in the VFP9 section, and was
able to find the code page section this time.  Sadly, it only added
about seven codes.

At any rate, here is what I have come up with so far.  Any corrections
and/or additions greatly appreciated.

code_pages = {
     '\x01' : ('ascii', 'U.S. MS-DOS'),

All of the sources say codepage 437, so why ascii instead of cp437?
     '\x02' : ('cp850', 'International MS-DOS'),
     '\x03' : ('cp1252', 'Windows ANSI'),
     '\x04' : ('mac_roman', 'Standard Macintosh'),
     '\x64' : ('cp852', 'Eastern European MS-DOS'),
     '\x65' : ('cp866', 'Russian MS-DOS'),
     '\x66' : ('cp865', 'Nordic MS-DOS'),
     '\x67' : ('cp861', 'Icelandic MS-DOS'),
     '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy

Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
not alone. I suggest that you omit Kamenicky until someone actually
wants it.
     '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),      # iffy

Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
predates and is not the same as cp852. In any case, I suggest that you
omit Mazovia until someone wants it. Interesting reading:

http://www.jastra.com.pl/klub/ogonki.htm
     '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
     '\x6b' : ('cp857', 'Turkish MS-DOS'),
     '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\
                Windows'),       # wag

big5 is *not* the same as cp950. The products that create DBF files
were designed for Windows. So when your source says that LDID 0xXX
maps to Windows codepage YYY, I would suggest that all you should do
is translate that without thinking to python encoding cpYYY.

What does "wag" mean?
     '\x79' : ('iso2022_kr', 'Korean Windows'),          # wag

Try cp949.

     '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'),  # wag

Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
(1980) Chinese (GB2312) and a basic Korean kit. However to quote from
"CJKV Information Processing" by Ken Lunde, "... from a practical
point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
encoding." i.e. no Chinese support at all. Try cp936.
     '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag

Try cp936.
     '\x7c' : ('cp874', 'Thai Windows'),                 # wag
     '\x7d' : ('cp1255', 'Hebrew Windows'),
     '\x7e' : ('cp1256', 'Arabic Windows'),
     '\xc8' : ('cp1250', 'Eastern European Windows'),
     '\xc9' : ('cp1251', 'Russian Windows'),
     '\xca' : ('cp1254', 'Turkish Windows'),
     '\xcb' : ('cp1253', 'Greek Windows'),
     '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
     '\x97' : ('mac_latin2', 'Macintosh EE'),
     '\x98' : ('mac_greek', 'Greek Macintosh') }

HTH,
John

Ethan Furman

John said:
Do you think so? My take is that it leaves out most of the codepage
numbers, and these two lines are wrong:
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866

That was the site I used to get my whole project going, so ignoring the
unicode aspect, it has been very helpful to me.


Wow. Question, though: all those codepages mapping to 437 and 850 --
are they really all the same?

MS and ESRI have docs ... does that mean that they are non-commercial
companies?

I don't know enough about ESRI to make an informed comment, so I'll just
say I'm grateful they have them! MS is a complete mystery... perhaps
they are finally seeing the light? Hard to believe, though, from a
company that has consistently changed their file formats with every release.

All of the sources say codepage 437, so why ascii instead of cp437?

Hard to say, really. Adjusted.

Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
not alone. I suggest that you omit Kamenicky until someone actually
wants it.

Yeah, I noticed that. Tentative plan was to implement it myself (more
for practice than anything else), and also to be able to raise a more
specific error ("Kamenicky not currently supported" or some such).

Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
predates and is not the same as cp852. In any case, I suggest that you
omit Mazovia until someone wants it. Interesting reading:

http://www.jastra.com.pl/klub/ogonki.htm

Very interesting reading.

big5 is *not* the same as cp950. The products that create DBF files
were designed for Windows. So when your source says that LDID 0xXX
maps to Windows codepage YYY, I would suggest that all you should do
is translate that without thinking to python encoding cpYYY.

Ack. Not sure how I missed 'Windows' at the end of that description.

What does "wag" mean?

wag == 'wild ass guess'

'\x79' : ('iso2022_kr', 'Korean Windows'), # wag

Try cp949.
Done.

'\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'), # wag


Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
(1980) Chinese (GB2312) and a basic Korean kit. However to quote from
"CJKV Information Processing" by Ken Lunde, "... from a practical
point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
encoding." i.e. no Chinese support at all. Try cp936.
Done.

'\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag


Try cp936.

You mean 932?

HTH,
John


Very helpful indeed. Many thanks for reviewing and correcting.
Learning to deal with unicode is proving more difficult for me than
learning Python was to begin with! ;D

~Ethan~

John Machin

Wow.  Question, though:  all those codepages mapping to 437 and 850 --
are they really all the same?

437 and 850 *are* codepages. You mean "all those language driver IDs
mapping to codepages 437 and 850". A codepage merely gives an
encoding. An LDID is like a locale; it includes other things besides
the encoding. That's why many Western European languages map to the
same codepage, first 437 then later 850 then 1252 when Windows came
along.
Yeah, I noticed that.  Tentative plan was to implement it myself (more
for practice than anything else), and also to be able to raise a more
specific error ("Kamenicky not currently supported" or some such).

The error idea is fine, but I don't get the "implement it yourself for
practice" bit ... practice what? You plan a long and fruitful career
implementing codecs for YAGNI codepages?
You mean 932?
Yes.

Very helpful indeed.  Many thanks for reviewing and correcting.

You're welcome.
Learning to deal with unicode is proving more difficult for me than
learning Python was to begin with!  ;D

?? As far as I can tell, the topic has been about mapping from
something like a locale to the name of an encoding, i.e. all about the
pre-Unicode mishmash and nothing to do with dealing with unicode ...

BTW, what are you planning to do with an LDID of 0x00?

Cheers,

John

Ethan Furman

John said:
437 and 850 *are* codepages. You mean "all those language driver IDs
mapping to codepages 437 and 850". A codepage merely gives an
encoding. An LDID is like a locale; it includes other things besides
the encoding. That's why many Western European languages map to the
same codepage, first 437 then later 850 then 1252 when Windows came
along.

Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
to cp437, and the file came from a German OEM machine... could that
file have upper-ascii codes that will not map to anything reasonable on
my \x01 cp437 machine? If so, is there anything I can do about it?

The error idea is fine, but I don't get the "implement it yourself for
practice" bit ... practice what? You plan a long and fruitful career
implementing codecs for YAGNI codepages?

ROFL. Playing with code; the unicode/code page interactions. Possibly
looking at constructs I might not otherwise. Since this would almost
certainly (I don't like saying "absolutely" and "never" -- been
troubleshooting for too many years for that!-) be a YAGNI, implementing
it is very low priority.

You're welcome.

?? As far as I can tell, the topic has been about mapping from
something like a locale to the name of an encoding, i.e. all about the
pre-Unicode mishmash and nothing to do with dealing with unicode ...

You are, of course, correct. Once it's all unicode life will be easier
(he says, all innocent-like). And dbf files even bigger, lol.

BTW, what are you planning to do with an LDID of 0x00?

Hmmm. Well, logical choices seem to be either treating it as plain
ascii, and barfing when high-ascii shows up; defaulting to \x01; or
forcing the user to choose one on initial access.

I am definitely open to ideas!

John Machin

Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
to cp437, and the file came from a German OEM machine... could that
file have upper-ascii codes that will not map to anything reasonable on
my \x01 cp437 machine?  If so, is there anything I can do about it?

ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
meaningless. As for the rest of your question, if the file's encoded
in cpXXX, it's encoded in cpXXX. If either the creator or the reader
or both are lying, then all bets are off.
Hmmm.  Well, logical choices seem to be either treating it as plain
ascii, and barfing when high-ascii shows up; defaulting to \x01; or
forcing the user to choose one on initial access.

It would be more useful to allow the user to specify an encoding than
an LDID.

You need to be able to read files created not only by software like
VFP or dBase but also scripts using third-party libraries. It would be
useful to allow an encoding to override an LDID that is incorrect e.g.
the LDID implies cp1251 but the data is actually encoded in koi8[ru]

Read this: http://en.wikipedia.org/wiki/Code_page_437
With no LDID in the file and no encoding supplied, I'd be inclined to
make it barf if any codepoint not in range(32, 128) showed up.
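
That fallback could be sketched like so (the function name bytes_to_ascii is invented for illustration, not anything in Ethan's library):

```python
def bytes_to_ascii(raw):
    """Fallback for LDID 0x00 with no caller-supplied encoding:
    accept only codepoints in range(32, 128), otherwise barf."""
    for offset, byte in enumerate(raw):        # Python 3: bytes yield ints
        if byte not in range(32, 128):
            raise ValueError(
                'byte 0x%02x at offset %d: no LDID and no encoding supplied'
                % (byte, offset))
    return raw.decode('ascii')

print(bytes_to_ascii(b'plain text'))   # fine
# bytes_to_ascii(b'caf\xe9') would raise ValueError
```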

Cheers,
John

Ethan Furman

John said:
ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
meaningless. As for the rest of your question, if the file's encoded
in cpXXX, it's encoded in cpXXX. If either the creator or the reader
or both are lying, then all bets are off.

My confusion is this -- is there a difference between any of the various
cp437s? Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f,
0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437, and they have names
such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish,
English (Britain & US)... are these all the same?

It would be more useful to allow the user to specify an encoding than
an LDID.

I plan on using the same technique used in xlrd and xlwt, and allowing
an encoding to be specified when the table is opened. If not specified,
it will use whatever the table has in the LDID field.
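
That precedence rule (explicit encoding beats the stored LDID) might look something like this; LDID_MAP and choose_encoding are hypothetical names for this sketch, not the actual xlrd/xlwt API:

```python
# Sketch: a caller-supplied encoding wins; otherwise fall back to
# whatever codec the table's LDID byte maps to.
LDID_MAP = {0x01: 'cp437', 0x03: 'cp1252', 0xc9: 'cp1251'}

def choose_encoding(ldid, encoding=None):
    if encoding is not None:        # explicit override, e.g. koi8_r
        return encoding
    try:
        return LDID_MAP[ldid]
    except KeyError:
        raise ValueError('unknown LDID 0x%02x and no encoding supplied' % ldid)

print(choose_encoding(0xc9))                     # cp1251, from the LDID
print(choose_encoding(0xc9, encoding='koi8_r'))  # koi8_r, override wins
```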

You need to be able to read files created not only by software like
VFP or dBase but also scripts using third-party libraries. It would be
useful to allow an encoding to override an LDID that is incorrect e.g.
the LDID implies cp1251 but the data is actually encoded in koi8[ru]

Read this: http://en.wikipedia.org/wiki/Code_page_437
With no LDID in the file and no encoding supplied, I'd be inclined to
make it barf if any codepoint not in range(32, 128) showed up.

Sounds reasonable -- especially when the encoding can be overridden.

~Ethan~

John Machin

My confusion is this -- is there a difference between any of the various
cp437s?

What various cp437s???
 Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f,
0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437,

Yes, this is called a "many-to-*one*" relationship.
and they have names

"they" being the Language Drivers, not the codepages.
such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish,
English (Britain & US)... are these all the same?

When you read the Wikipedia page on cp437, did you see any reference
to different versions for French, German, Finnish, etc? I saw only one
mapping table; how many did you see? If there are multiple language
versions of a codepage, how do you expect to handle this given Python
has only one codec per codepage?

Trying again: *ONE* attribute of a Language Driver ID (LDID) is the
character set (codepage) that it uses. Other attributes may be things
like the collating (sorting) sequence, whether they use a dot or a
comma as the decimal point, etc. Many different languages in Western
Europe can use the same codepage. Initially the common one was cp 437,
then 850, then 1252.

There may possibly be different interpretations of a codepage out there
somewhere, but they are all *intended* to be the same, and I advise
you to cross the different-cp437s bridge *if* it exists and you ever
come to it.
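
The many-to-one point is easy to demonstrate in an interpreter: Python has exactly one cp437 codec, so the same bytes decode identically no matter which of those Western European LDIDs the file carried:

```python
# Several of the ESRI-listed LDIDs (0x01 US, 0x0f German, 0x11 Italian, ...)
# all name codepage 437, and Python has a single cp437 codec, so the same
# raw bytes decode identically whichever language driver wrote the file.
raw = b'\x81\x9a'                       # cp437 for 'üÜ'
results = {ldid: raw.decode('cp437') for ldid in (0x01, 0x0f, 0x11)}
assert len(set(results.values())) == 1  # one codec, one answer
print(results[0x01])
```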

Have you got access to files with LDID not in (0, 1) that you can try
out?

Cheers,
John

Ethan Furman

John said:
There may possibly be different interpretations of a codepage out there
somewhere, but they are all *intended* to be the same, and I advise
you to cross the different-cp437s bridge *if* it exists and you ever
come to it.

Have you got access to files with LDID not in (0, 1) that you can try
out?

Alas, I do not. And I probably never will, making the whole thing academic.

Speaking of tables I do not have access to, and documentation for that
matter, I would love to get information on db4, 5, 7, etc.

Many thanks for your time and knowledge, and my apologies for seeming so
dense. :)

Cheers!

~Ethan~