Converting EBCDIC to Unicode

Saeed Amrollahi

Dear all,

I wrote a program to convert an EBCDIC text file from an OS/400 environment
to Unicode (UTF-16) on Windows XP.
Because the text file contains shareholder information in Persian (Farsi),
I had to find the mapping table for the Persian characters. As you may know,
unlike English, some Persian characters have one form, some have two, and
some have more than two: there are initial, medial and final forms.
I found them using Character Map (one of the system programs in Windows XP).
I would really like to know your general and specific opinions. If someone
has already worked on this subject, even for other languages (like Arabic),
his or her advice would help a lot.
1. Because EBCDIC is an 8-bit encoding and Unicode (UTF-16) uses 16-bit
code units (code points need up to 21 bits), I use an ifstream object
(character file) for the input file and a wofstream object (wide character
file) for the output file.
2. I use the int() function to get the ordinal number behind each
character. I use this convention: if the returned number is positive, it is
an English letter or a digit, in other words it isn't Persian; if it is
negative, it is Persian and I use my Mapping:
// mapping.h
struct Mapping {
        std::map<int, int> Map;

        Mapping();
        void FillMap();
        int operator[](const int k) { return Map[k]; }
};

// mapping.cpp
Mapping::Mapping()
{
        FillMap();
}

void Mapping::FillMap()
{
        // fill map
        Map[-14]  = 0xFEF4; // ARABIC LETTER YEH MEDIAL FORM
        Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
        Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE
        // other map entries
}

LineConvertor is a class that reads one line and converts it to Unicode:

//line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
        wstring ws;
        for (string::size_type i = 0; i < s.size(); i++) {
                wchar_t w = s[i];
                if (int(s[i]) >= 0) ws.push_back(w);
                else { // so it should be a Persian character in the EBCDIC character set
                        if (CP[int(s[i])] != 0) { // if the character is in the lookup table
                                ws.push_back(wchar_t(CP[int(s[i])]));
                        }
                        else {
                                // there is no entry in the Mapping data structure.
                                // throw exception
                        }
                }
        }
        return ws;
}

Is this a good way to find the mapping for all Persian characters?
What is the reverse of int()? I mean a function chr(int) that returns the
character corresponding to an integer.
3. I traced my program in the debugger and it appears to work fine.
My main problem is that when I write the Persian characters to the
wofstream (the output file), the file is empty; nothing is written.
In the following code, FileConvertor is a class whose Convert member
function converts the whole file; for each line, the LineConvertor member
converts that line:
// file_convertor.h
class FileConvertor {
        std::ifstream In; // original file
        std::wofstream Out; // a file containing the converted records (Unicode)
        LineConvertor LC;
        // ...
public:
        void Convert();
};

// file_convertor.cpp
void FileConvertor::Convert()
{
        for (string s; getline(In, s); ++RecCount) {
                try {
                        std::vector<std::wstring> V = LC.Convert();
                        for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
                                Out << V[i] << L'\t'; // <-- no character is written to file
                        }
                        Out << L'\n';
                }
                // ... (catch handlers not shown)
        }
}

4. Should I consider std::locale and std::locale::facet when programming
such applications (file conversion)? I want to extend my program to
convert Unicode to EBCDIC, EBCDIC to XML, and so on; I mean a generic
converter. How would I apply policy class design?

5. How can I write a general program that takes minimum effort to port to
a Linux environment? I need some general guidelines.

Please shed some light.
Regards,
-- Saeed Amrollahi
 
James Kanze

I wrote a program to convert a EBCDIC text file in OS/400
environment to Unicode (UTF-16) in Windows XP. Because, the
text file contains information of Shareholders in Persian
(Farsi), I had to find the mapping table of Persian
characters. You may be know, Unlike English, in Persian some
characters has one form, some of them two forms and for some
characters, there are more than two forms. I mean there are
Initial, Medial and Final forms.

And isolated, no?

But that's usually a problem for the rendering machine, not for
your program.
I found them using Character Map (One of System Programs in
Windows XP). I really like to know your general and special
opinion. If someone already worked on the subject even in
other languages (like Arabic) h(is/er) advice may be help so
much.
1. Because the EBCDIC is 8-bits encoding and Unicode (UTF-16)
is 16 bits (or more precisely 21 bits) encoding, I use for
input file an ifstream object (character files) and for output
file wofstream object (Wide character file)

That's the way it was designed to work. (Actually, it was
designed so that you imbue a Persian EBCDIC locale in a wifstream
when reading. If you can find such.)
2. I use the int() function to know the ordinal number behind the
characters.

In C++, all you have *is* the ordinal number. What you probably
do have to do is convert the input char to unsigned char.
I use the convention:
If the returned number is positive, it should be English
letter or numeric, in other words it isn't Persian and If it
is negative, it is Persian

You can't count on that. The type char may be signed or
unsigned. Convert to unsigned char, then compare to 128.

Except that that doesn't work at all for EBCDIC, where 'a' is
0x81, and the Persian characters are probably scattered about in
the unused spaces. Or it uses some sort of shift-in/shift-out
scheme with two different encodings. Or IBM has given up on
EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
Windows CP-1256 (although I'm not sure that either of these has
the extra characters needed for Persian).
and I use my Mapping:
// mapping.h
struct Mapping {
std::map<int, int> Map;
Mapping();
void FillMap();
int operator[](const int k) { return Map[k]; }
};
// mapping.cpp
Mapping::Mapping()
{
FillMap();
}
void Mapping::FillMap()
{
// fill map
Map[-14] = 0xFEF4; // ARABIC LETTER YEH MEDIAL FORM
Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE
// other map entries
}

Why do things the hard way?

I'd use something like:

static wchar_t const map[] =
{
0x0000, 0x0001, 0x0002, 0x0003, // 0x00-0x03
// ...
0x0061, 0x0062, 0x0063, 0x0064, // 0x80-0x83
// ...
};

This should be indexed with the input char, converted to
unsigned char. (I'd also write some quick program to generate
this table from some table you already have at hand.)
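
A minimal sketch of that table lookup, under stated assumptions: the helper names `makeTable` and `convertLine` are hypothetical, the three filled entries are simply the ones from the original post with the negative keys reinterpreted as unsigned bytes (-14 becomes 0xF2, -111 becomes 0x91, -122 becomes 0x86), and 0 is used to mean "no mapping known". A real table would have all 256 entries generated from the code page definition.

```cpp
#include <array>
#include <stdexcept>
#include <string>

// Hypothetical 256-entry table: index = input byte, value = UTF-16 code unit.
// Zero means "no mapping known" in this sketch.
std::array<char16_t, 256> makeTable()
{
    std::array<char16_t, 256> t{};   // value-initialised, so all entries start at 0
    t[0xF2] = 0xFEF4;                // was Map[-14]
    t[0x91] = 0xFE8B;                // was Map[-111]
    t[0x86] = 0xFE81;                // was Map[-122]
    return t;
}

// Convert one line byte by byte; the index is always the byte as unsigned char,
// so the result does not depend on whether plain char is signed.
std::u16string convertLine(const std::string& in,
                           const std::array<char16_t, 256>& table)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in) {
        const unsigned char index = static_cast<unsigned char>(c);
        const char16_t u = table[index];
        if (u == 0)
            throw std::runtime_error("no mapping for input byte");
        out.push_back(u);
    }
    return out;
}
```

Because the index is an unsigned char, the table needs exactly 256 entries and no range check.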
LineConvertor is a class that read one line and convert it to Unicode
standard:
//line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
wstring ws;
for (string::size_type i = 0; i < s.size(); i++) {

wchar_t w = s[i];
if (int(s[i]) >= 0) ws.push_back(w);
else { // so it should be a Persian character in the EBCDIC character set
if (CP[int(s[i])] != 0) { // if the character is in the lookup table
ws.push_back(wchar_t(CP[int(s[i])]));

}
else {
// there is no entry in the Mapping data structure.
// throw exception
}
}
}
return ws;
}


A better solution would be to create a codecvt facet, and use it
directly in the istream.
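
A rough sketch of such a facet, for reading only. The class name `SingleByteCodecvt`, the helper `openWithTable` and the caller-supplied 256-entry table are all hypothetical; the sketch assumes a C++11 compiler and a strictly one-byte-per-character input encoding. Writing (`do_out`) is left to the base class, so it is not suitable for output as it stands.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>

// Hypothetical input-only facet: each external byte maps to exactly one
// wchar_t through a 256-entry table supplied by the caller.
class SingleByteCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit SingleByteCodecvt(const std::array<wchar_t, 256>& table,
                               std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs), table_(table) {}

protected:
    // Called by the file buffer when reading: bytes -> wide characters.
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const override
    {
        while (from != from_end && to != to_end)
            *to++ = table_[static_cast<unsigned char>(*from++)];
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    int do_encoding() const noexcept override { return 1; }   // fixed width: 1 byte
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override
    {
        const std::size_t avail = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(std::min(max, avail));
    }

private:
    std::array<wchar_t, 256> table_;
};

// Imbue first, then open, so the file buffer uses the facet from the start.
// The locale takes ownership of the facet (refs == 0).
void openWithTable(std::wifstream& in, const char* path,
                   const std::array<wchar_t, 256>& table)
{
    in.imbue(std::locale(in.getloc(), new SingleByteCodecvt(table)));
    in.open(path, std::ios::in | std::ios::binary);
}
```

After `openWithTable`, an ordinary `std::getline` on the `std::wifstream` should yield already-converted wide strings.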
Is this a good way to find mapping for all Persian characters?
What is the reverse function of int()? I mean a function
chr(int) that returns the corresponding character of an
integer?

There is no "function" int(). Using int() this way is the same
as a static_cast<int>.
3. I trace my program using debugger, and I see my program
works fine. My main problem is: When I write the Persian
character to wostream file (output file) The file is empty.
There is nothing in output file:

That sounds like a completely different problem. Without
complete, compilable code, and information concerning the
system you've compiled and run on, it's impossible to say. One
possible explanation, however, is that the locale imbued in the
output stream doesn't understand the Persian characters. The
first character which cannot be correctly transcoded will result
in an error (bad() returning true on the wostream).
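
One way to make such a failure visible early is to check the stream state after every write. The helper below is a hypothetical sketch (`writeChecked` is not part of the posted code); it flushes before checking because a file buffer may only attempt the transcoding when it flushes.

```cpp
#include <ostream>
#include <sstream>    // std::wostringstream, handy for trying the helper in memory
#include <stdexcept>
#include <string>

// Hypothetical helper: write one field, flush, and report a failed stream
// immediately instead of silently ending up with an empty file.
void writeChecked(std::wostream& out, const std::wstring& text)
{
    out << text;
    out.flush();      // force any pending conversion to happen now
    if (!out)         // true if failbit or badbit is set
        throw std::runtime_error("wide output stream is in an error state");
}
```

Calling this in place of the plain `Out << ...` statements would point at the first field that cannot be written.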

Note that even a wofstream only writes bytes (char's). The
trick here is to imbue it with a locale which converts each
wchar_t into two bytes.
In the following code, FileConvertor is a class with Convert
member function that converts all the file. for each line the
member LineConvertor, converts a line.:
// file_convertor.h
class FileConvertor {
std::ifstream In; // original file
std::wofstream Out; // a file containing of converted records (unicode)
LineConvertor LC;
// ...
public:
void Convert();
};
// file_convertor.cpp
void FileConvertor::Convert()
{
for (string s; getline(In, s); ++RecCount) {
try {
std::vector<std::wstring> V = LC.Convert();
for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
Out << V[i] << L'\t'; // <-- no character is written to file

}
Out << L'\n';
}

4. I don't know. Do I should consider std::locale and
std::facet in programming such applications (file conversion)?

You don't have a choice. If nothing else, you can use only
single byte streams, opened in binary mode, and imbued with the
"C" locale---these are transparent: the bytes you read are what
is on the disk, and the bytes you write are the bytes that end
up on the disk. In all other cases, the locale imbued in the
stream will get involved, or some other code translation will
take place in the stream.
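
A minimal sketch of the transparent byte-stream route, assuming the converted text is held as UTF-16 code units in a `std::u16string` and the output should be little-endian with a byte order mark. The function names are hypothetical. For a real file, the caller would pass a `std::ofstream` opened with `std::ios::binary`.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Write UTF-16 code units as little-endian byte pairs, preceded by a BOM.
// The stream only ever sees plain bytes, so no locale conversion is involved.
void writeUtf16le(std::ostream& out, const std::u16string& text)
{
    auto put16 = [&out](char16_t u) {
        out.put(static_cast<char>(u & 0xFF));          // low byte first
        out.put(static_cast<char>((u >> 8) & 0xFF));   // then high byte
    };
    put16(0xFEFF);                                     // byte order mark
    for (char16_t u : text)
        put16(u);
}

// Convenience wrapper for inspecting the exact bytes in memory.
std::string toUtf16leBytes(const std::u16string& text)
{
    std::ostringstream os;
    writeUtf16le(os, text);
    return os.str();
}
```

The byte order is fixed by the code rather than by the platform, which also helps with the portability question raised later in the thread.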
I want to extend my program to convert Unicode to EBCDIC,
EBCDIC to XML, ... I mean Generic converter.

You mean iconv. It already exists.
How to apply Policy class design?

Generally, I've heard policy used to refer to some sort of
template metaprogramming technique. Perhaps you mean the
strategy pattern.
5. How to write a general program with minimum effort to port
it to Linux environment?

Well, if portability is a concern, avoid any locale but "C", and
avoid wchar_t.
 
Saeed Amrollahi

Hi James,
Thank you for your detailed answers. I'm sorry for the delay; I was out
of the office.

And isolated, no?

Yes. That's right. You are clever.
But that's usually a problem for the rendering machine, not for
your program.

I don't understand. What do you mean by rendering machine? Do you mean
my local computer?
That's the way it was designed to work.  (Actually, it was
designed so that you imbue a Persian EBCDIC locale in a wifstream
when reading.  If you can find such.)


In C++, all you have *is* the ordinal number.  What you probably
do have to do is convert the input char to unsigned char.

OK. I'll try it.
You can't count on that.  The type char may be signed or
unsigned.  Convert to unsigned char, then compare to 128.

OK.

Except that that doesn't work at all for EBCDIC, where 'a' is
0x81, and the Persian characters are probably scattered about in
the unused spaces.  Or it uses some sort of shift-in/shift-out
scheme with two different encodings.  Or IBM has given up on
EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
Windows CP-1256 (although I'm not sure that either of these has
the extra characters needed for Persian).

<Nod> You are right. The Persian characters are scattered in no
particular order through the unused space. An analogy: 'b' does not
necessarily come after 'a'. Would you mind explaining the
shift-in/shift-out scheme?
About Windows code page 1256, there is a problem in my current project.
As you know, there is just one form of each Persian character (the
initial/medial one); to get the isolated/final form, a space has to be
added to the word. That is the problem.
In the current application, there is another problem with CP-1256. We have
a field made of 3 Persian characters (the first 3 characters of the
shareholder's family name) concatenated with 5 digits. As an analogy:
'Amr00023'. Unfortunately, in CP-1256, when the first digit is met, the
last character changes from medial form to final form, and that is wrong.
and I use my Mapping:
// mapping.h
struct Mapping {
                std::map<int, int> Map;
                Mapping();
                void FillMap();
                int operator[](const int k) { return Map[k]; }
};
// mapping.cpp
Mapping::Mapping()
{
        FillMap();
}
void Mapping::FillMap()
{
        // fill map
        Map[-14]  = 0xFEF4;  // ARABIC LETTER YEH MEDIAL FORM
        Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
        Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE
        // other map entries
}

Why do things the hard way?

I'd use something like:

        static wchar_t const map[] =
        {
                0x0000, 0x0001, 0x0002, 0x0003,  // 0x00-0x03
                //  ...
                0x0061, 0x0062, 0x0063, 0x0064,  // 0x80-0x83
                //  ...
        };

This should be indexed with the input char, converted to
unsigned char.  (I'd also write some quick program to generate
this table from some table you already have at hand.)

OK. I'll consider it.
LineConvertor is a class that read one line and convert it to Unicode
standard:
//line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
        wstring ws;
        for (string::size_type i = 0; i < s.size(); i++) {
                wchar_t w = s[i];
                if (int(s[i]) >= 0) ws.push_back(w);
                else { // so it should be a Persian character in the EBCDIC character set
                        if (CP[int(s[i])] != 0) { // if the character is in the lookup table
                                ws.push_back(wchar_t(CP[int(s[i])]));

                        }
                        else {
                              // there is no entry in the Mapping data structure.
                              // throw exception
                        }
                }
        }
        return ws;
}

A better solution would be to create a codecvt facet, and use it
directly in the istream.


OK. I'll try it.
There is no "function" int().  Using int() this way is the same
as a static_cast<int>.

Indeed, I didn't mean an ordinary C/C++ function.
That sounds like a completely different problem.  Without
complete, compilable code, and information concerning the
system you've compiled and run on, it's impossible to say.  One
possible explanation, however, is that the locale imbued in the
output stream doesn't understand the Persian characters.  The
first character which cannot be correctly transcoded will result
in an error (bad() returning true on the wostream).

Note that even a wofstream only writes bytes (char's).  The
trick here is to imbue it with a locale which converts each
wchar_t into two bytes.

OK. I'll consider it.
In the following code, FileConvertor is a class with Convert
member function that converts all the file. for each line the
member LineConvertor, converts a line.:
// file_convertor.h
class FileConvertor {
        std::ifstream In; // original file
        std::wofstream Out; // a file containing of converted records (unicode)
        LineConvertor LC;
        // ...
public:
       void Convert();
};
// file_convertor.cpp
void FileConvertor::Convert()
{
        for (string s; getline(In, s); ++RecCount) {
                try {
                        std::vector<std::wstring> V = LC.Convert();
                        for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
                                Out << V[i] << L'\t';   // <-- no character is written to file

                        }
                        Out << L'\n';
}
4. I don't know. Do I should consider std::locale and
std::facet in programming such applications (file conversion)?

You don't have a choice.  If nothing else, you can use only
single byte streams, opened in binary mode, and imbued with the
"C" locale---these are transparent: the bytes you read are what
is on the disk, and the bytes you write are the bytes that end
up on the disk.  In all other cases, the locale imbued in the
stream will get involved, or some other code translation will
take place in the stream.
I want to extend my program to convert Unicode to EBCDIC,
EBCDIC to XML, ... I mean Generic converter.

You mean iconv.  It already exists.


I don't know iconv. Is it a product of the Dinkumware company?
Generally, I've heard policy used to refer to some sort of
template metaprogramming technique.  Perhaps you mean the
strategy pattern.
By policy class I mean something like this (pseudo-code):
template<class ConversionPolicy>
class Convertor {
// ...
public:
void convert();
};
Convertor<EBCDIC2Unicode> c;
c.convert();

Well, if portability is a concern, avoid any locale but "C", and
avoid wchar_t.

Again, thanks for your answer. It covers several points,
and I will try to make as much use of them as I can.

Regards,
-- Saeed Amrollahi
 
James Kanze

Yes. That's right. You are clever.
I can't understand. By rendering machine, what do you mean? You mean
my local computer?

Rendering machine or rendering engine. The mechanism which
converts the internal code to human readable format. In
other words, the encoding should just store the letters,
without regards to the form. The engine which actually
generates the display or the graphic format should choose
the appropriate form depending on context.
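
As an illustration of storing letters rather than forms, here is a small sketch that maps the three presentation-form code points quoted earlier in this thread back to their base letters. The function names are hypothetical, and only those three code points are handled; everything else passes through unchanged. Whether these are the right base letters for a given Persian data set is an assumption to be checked against the real code page.

```cpp
#include <string>

// Map a presentation form to its base letter. Only the three code points
// mentioned in this thread are covered; all others are returned unchanged.
char16_t toBaseLetter(char16_t c)
{
    switch (c) {
    case 0xFEF4: return 0x064A;  // YEH MEDIAL FORM -> ARABIC LETTER YEH
    case 0xFE8B: return 0x0626;  // YEH WITH HAMZA ABOVE INITIAL FORM -> YEH WITH HAMZA ABOVE
    case 0xFE81: return 0x0622;  // ALEF WITH MADDA ABOVE ISOLATED FORM -> ALEF WITH MADDA ABOVE
    default:     return c;
    }
}

// Apply the mapping to a whole string of UTF-16 code units.
std::u16string toBaseLetters(std::u16string s)
{
    for (char16_t& c : s)
        c = toBaseLetter(c);
    return s;
}
```

With the base letters stored, choosing initial, medial, final or isolated shapes is left to whatever displays the text.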
OK. I try it.
<Nod> You are right. The Persian characters are scattered
in unordered way in unused space. An analogy: 'b' is not after 'a'
necessarily. Would you mind explain the Shift-in/Shift-out scheme?

A shift-in/shift-out scheme is a solution which basically
uses two different encodings, with special characters to
shift from one to the other. With 7 bit characters, for
example, one might have one encoding for Persian characters,
another for Latin (with some common characters like space in
both), and two reserved codes, one which says that what
follows is Latin, the other that what follows is Persian.

Such schemes were common many years ago. They have serious
disadvantages (like, lose one of the shift characters in
translation, and everything is off, or that you can't just
skip ahead n characters without looking at every character).
From what you say above, I don't think this is your case.
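
For the record, a decoder for such a scheme could look roughly like the sketch below. Everything in it is illustrative: the two shift codes (ASCII SO and SI are borrowed here), the 7-bit tables passed in by the caller, and the assumption that the stream starts in Latin mode.

```cpp
#include <array>
#include <string>

// Hypothetical shift codes, borrowed from ASCII SO/SI purely for illustration.
constexpr unsigned char kShiftToPersian = 0x0E;
constexpr unsigned char kShiftToLatin   = 0x0F;

// Decode a 7-bit stream that switches between two tables via shift codes.
// The decoder has to look at every byte in order, because the meaning of a
// byte depends on the most recent shift code seen.
std::u16string decodeShifted(const std::string& in,
                             const std::array<char16_t, 128>& latin,
                             const std::array<char16_t, 128>& persian)
{
    const std::array<char16_t, 128>* current = &latin;  // assumed initial state
    std::u16string out;
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c) & 0x7F;
        if (b == kShiftToPersian)
            current = &persian;
        else if (b == kShiftToLatin)
            current = &latin;
        else
            out.push_back((*current)[b]);
    }
    return out;
}
```

The single `current` pointer is exactly the state that gets out of step when a shift code is lost in transmission.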
About Windows Code Page 1256, there is a problem in my
current project. As you know, there is just one form of
each Persian character (the Initial/Medial one), for the
isolated/Final, a space should be added to the word. It is
the problem.

I'm not at all familiar with the Windows code pages. I do
know that in general, Arabic (and certainly also Persian)
normally only encode the character, not its form. It's only
when rendering that the correct form is chosen, according to
context.
In current application, there is another problem with
CP-1256. We have a field with 3 Persian characters (The
first 3 characters of shareholder family name) and
5 digits and They are concatenated. take an Analogy:
'Amr00023' Unfortunately, in CP-1256, after meet the first
digit, the last character will change from medial form to
final form and it is wrong.

That sounds like a bug in the rendering engine. Or maybe in
your expectations: I would expect a final form before
a sequence of digits, see section 3.5 of
http://www.unicode.org/reports/tr9/#Shaping. (If I'm not
mistaken, digits are written left to right in Persian, which means
that there is a change in the direction when you switch from
letters to digits.)

[...]
OK. I try it.

Just be warned that it is more work. The codecvt has
a somewhat perverted interface (probably because it was
designed before there was an std::string).

[...]
I don't know iconv. Is it the product by Dinkumware company?

No. It's GPL (I think---a free to use license, anyway).
It's both a library, for use within your code, and
a stand-alone command line program. It's generally part of
Unix distributions, but you can get it for Windows as well.
By policy class I mean something like this (Pseudo-code):
template<class ConversionPolicy>
class Convertor {
// ...
public:
convert();
};
Convertor<EBCDIC2Unicode> c;
c.convert();

OK. In my experience, using the strategy pattern is
preferable. Sooner or later, you'll end up wanting the
decision to be made at run-time.
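
A small sketch of the strategy pattern for this case, with the conversion chosen by name at run time. All class and function names are hypothetical, and the two strategies are stand-ins (a pass-through and a Latin-1 to UTF-8 conversion) rather than the EBCDIC conversion discussed above.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical strategy interface: one conversion per derived class.
class ConversionStrategy {
public:
    virtual ~ConversionStrategy() = default;
    virtual std::string convert(const std::string& input) const = 0;
};

class PassThrough : public ConversionStrategy {
public:
    std::string convert(const std::string& input) const override { return input; }
};

// Stand-in for a real conversion: ISO 8859-1 bytes to UTF-8.
class Latin1ToUtf8 : public ConversionStrategy {
public:
    std::string convert(const std::string& input) const override
    {
        std::string out;
        for (char c : input) {
            const unsigned char b = static_cast<unsigned char>(c);
            if (b < 0x80) {
                out.push_back(c);
            } else {
                out.push_back(static_cast<char>(0xC0 | (b >> 6)));
                out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
            }
        }
        return out;
    }
};

// The decision is made at run time, e.g. from a command line argument.
std::unique_ptr<ConversionStrategy> makeStrategy(const std::string& name)
{
    if (name == "pass-through")
        return std::unique_ptr<ConversionStrategy>(new PassThrough);
    if (name == "latin1-to-utf8")
        return std::unique_ptr<ConversionStrategy>(new Latin1ToUtf8);
    throw std::invalid_argument("unknown conversion: " + name);
}

class Converter {
public:
    explicit Converter(std::unique_ptr<ConversionStrategy> s)
        : strategy_(std::move(s)) {}
    std::string convert(const std::string& input) const
    {
        return strategy_->convert(input);
    }
private:
    std::unique_ptr<ConversionStrategy> strategy_;
};
```

Compared with the template version, the conversion type no longer appears in the type of `Converter`, so one object type can serve any conversion selected by the user.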
 
