Converting EBCDIC to Unicode

Saeed Amrollahi · Sep 28, 2010

Dear all
Hi

I wrote a program that converts an EBCDIC text file from an OS/400
environment to Unicode (UTF-16) on Windows XP.
Because the text file contains shareholder information in Persian
(Farsi), I had to find the mapping table for the Persian characters.
As you may know, unlike English, some Persian characters have one
form, some have two, and some have more than two: there are Initial,
Medial and Final forms.
I found them using Character Map (one of the system programs in
Windows XP).
I would really like to hear your general and specific opinions. If
someone has already worked on this subject, even in another language
(like Arabic), his or her advice would help a lot.
1. Because EBCDIC is an 8-bit encoding and Unicode (UTF-16) is a
16-bit (or more precisely 21-bit) encoding, I use an ifstream object
(character file) for the input file and a wofstream object (wide
character file) for the output file.
2. I use the int() function to get the ordinal number behind each
character, with this convention: if the returned number is
non-negative, the character is an English letter or a digit, in other
words it isn't Persian; if it is negative, the character is Persian
and I use my Mapping:
// mapping.h
#include <map>

struct Mapping {
    std::map<int, int> Map;

    Mapping();
    void FillMap();
    int operator[](const int k) { return Map[k]; }
};

// mapping.cpp
Mapping::Mapping()
{
    FillMap();
}

void Mapping::FillMap()
{
    // fill map
    Map[-14] = 0xFEF4;  // ARABIC LETTER YEH MEDIAL FORM
    Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
    Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE ISOLATED FORM
    // other map entries
}

LineConvertor is a class that reads one line and converts it to
Unicode:

//line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
    wstring ws;
    for (string::size_type i = 0; i < s.size(); i++) {
        wchar_t w = s[i];
        if (int(s[i]) >= 0) ws.push_back(w);
        else { // so it should be a Persian character in the EBCDIC character set
            if (CP[int(s[i])] != 0) { // if the character is in the lookup table
                ws.push_back(wchar_t(CP[int(s[i])]));
            }
            else {
                // there is no entry in the Mapping data structure.
                // throw exception
            }
        }
    }
    return ws;
}

Is this a good way to map all the Persian characters?
What is the reverse function of int()? I mean a function chr(int) that
returns the character corresponding to an integer.
3. I traced my program in the debugger, and it appears to work fine.
My main problem is that when I write the Persian characters to the
wofstream (the output file), the file is empty. There is nothing in
the output file.
In the following code, FileConvertor is a class whose Convert member
function converts the whole file; for each line, the LineConvertor
member converts that line:
// file_convertor.h
class FileConvertor {
    std::ifstream In;   // original file
    std::wofstream Out; // a file containing the converted records (Unicode)
    LineConvertor LC;
    // ...
public:
    void Convert();
};

// file_convertor.cpp
void FileConvertor::Convert()
{
    for (string s; getline(In, s); ++RecCount) {
        try {
            std::vector<std::wstring> V = LC.Convert(s);
            for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
                Out << V[i] << L'\t'; // <-- no character is written to file
            }
            Out << L'\n';
        }
        // ... (catch clauses not shown)
    }
}

4. Should I consider std::locale and std::locale::facet when
programming such applications (file conversion)? I want to extend my
program to convert Unicode to EBCDIC, EBCDIC to XML, and so on: a
generic converter. How would I apply policy-class design?

5. How can I write a general program that can be ported to Linux with
minimum effort? I need some general guidelines.

Please shed some light.
Regards,
-- Saeed Amrollahi

James Kanze · Sep 28, 2010

I wrote a program that converts an EBCDIC text file from an
OS/400 environment to Unicode (UTF-16) on Windows XP. Because
the text file contains shareholder information in Persian
(Farsi), I had to find the mapping table for the Persian
characters. As you may know, unlike English, some Persian
characters have one form, some have two, and some have more
than two: there are Initial, Medial and Final forms.

And isolated, no?

But that's usually a problem for the rendering machine, not for
your program.

I found them using Character Map (one of the system programs in
Windows XP). I would really like to hear your general and
specific opinions. If someone has already worked on this
subject, even in another language (like Arabic), his or her
advice would help a lot.
1. Because EBCDIC is an 8-bit encoding and Unicode (UTF-16)
is a 16-bit (or more precisely 21-bit) encoding, I use an
ifstream object (character file) for the input file and a
wofstream object (wide character file) for the output file.

That's the way it was designed to work. (Actually, it was
designed so that you imbue a Persian EBCDIC locale in a wifstream
when reading. If you can find one.)

2. I use the int() function to know the ordinal number behind the
characters.

In C++, all you have *is* the ordinal number. What you probably
do have to do is convert the input char to unsigned char.

I use the convention:
if the returned number is non-negative, it should be an English
letter or a digit, in other words it isn't Persian; if it
is negative, it is Persian

You can't count on that. The type char may be signed or
unsigned. Convert to unsigned char, then compare to 128.

Except that that doesn't work at all for EBCDIC, where 'a' is
0x81, and the Persian characters are probably scattered about in
the unused spaces. Or it uses some sort of shift-in/shift-out
scheme with two different encodings. Or IBM has given up on
EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
Windows CP-1256 (although I'm not sure that either of these has
the extra characters needed for Persian).

and I use my Mapping:
// mapping.h
struct Mapping {
std::map<int, int> Map;

Mapping();
void FillMap();
int operator[](const int k) { return Map[k]; }
};

// mapping.cpp
Mapping::Mapping()
{
FillMap();
}

void Mapping::FillMap()
{
// fill map
Map[-14] = 0xFEF4; // ARABIC LETTER YEH MEDIAL FORM
Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE ISOLATED FORM
// other map entries
}

Why do things the hard way?

I'd use something like:

static wchar_t const map[] =
{
0x0000, 0x0001, 0x0002, 0x0003, // 0x00-0x03
// ...
0x0061, 0x0062, 0x0063, 0x0064, // 0x80-0x83
// ...
};

This should be indexed with the input char, converted to
unsigned char. (I'd also write a quick program to generate
this table from some table you already have at hand.)
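
A minimal sketch of this table-driven approach, assuming a 256-entry
table indexed by the byte value. The names make_table and convert_line
are made up for illustration. The only real entries are the three from
the original post, rewritten as unsigned bytes (-14 becomes 0xF2, -111
becomes 0x91, -122 becomes 0x86, assuming an 8-bit signed char);
everything else maps to U+FFFD as a placeholder:

```cpp
#include <array>
#include <cstddef>
#include <string>

typedef std::array<wchar_t, 256> Table;

// Builds the lookup table once. Unmapped bytes become U+FFFD
// (REPLACEMENT CHARACTER) so that gaps are visible in the output.
Table make_table()
{
    Table t;
    t.fill(static_cast<wchar_t>(0xFFFD));
    t[0xF2] = static_cast<wchar_t>(0xFEF4); // was Map[-14]
    t[0x91] = static_cast<wchar_t>(0xFE8B); // was Map[-111]
    t[0x86] = static_cast<wchar_t>(0xFE81); // was Map[-122]
    // ... the remaining entries would come from the real code page
    return t;
}

// Converts one line. No sign test is needed: every byte is an index.
std::wstring convert_line(const std::string& s, const Table& table)
{
    std::wstring ws;
    ws.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i)
        ws.push_back(table[static_cast<unsigned char>(s[i])]);
    return ws;
}
```

Because the cast to unsigned char happens at the point of lookup, the
result does not depend on whether plain char is signed on a given
compiler.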

LineConvertor is a class that reads one line and converts it to
Unicode:

//line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
    wstring ws;
    for (string::size_type i = 0; i < s.size(); i++) {
        wchar_t w = s[i];
        if (int(s[i]) >= 0) ws.push_back(w);
        else { // so it should be a Persian character in the EBCDIC character set
            if (CP[int(s[i])] != 0) { // if the character is in the lookup table
                ws.push_back(wchar_t(CP[int(s[i])]));
            }
            else {
                // there is no entry in the Mapping data structure.
                // throw exception
            }
        }
    }
    return ws;
}

A better solution would be to create a codecvt facet, and use it
directly in the istream.
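
A rough sketch of what such a facet could look like, assuming C++11 and
a one-byte-to-one-wchar_t mapping. TableCodecvt and open_ebcdic are
hypothetical names, the table contents are left to the caller, and only
the reading direction is implemented:

```cpp
#include <algorithm>
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>

// Hypothetical facet: decodes one input byte to one wchar_t via a table.
class TableCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    // `table` must point at 256 entries; they are copied.
    explicit TableCodecvt(const wchar_t* table, std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs)
    {
        std::copy(table, table + 256, table_);
    }

protected:
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end)
            *to_next++ = table_[static_cast<unsigned char>(*from_next++)];
        return from_next == from_end ? ok : partial;
    }

    // Writing is not supported in this sketch.
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t*, const wchar_t*& from_next,
                  char* to, char*, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        return error;
    }

    result do_unshift(std::mbstate_t&, char* to, char*, char*& to_next) const override
    {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; } // fixed width: 1 byte
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override
    {
        return static_cast<int>(std::min<std::size_t>(max, from_end - from));
    }

private:
    wchar_t table_[256];
};

// Imbue before opening, so the file buffer uses the facet from the first byte.
inline void open_ebcdic(std::wifstream& in, const char* path, const wchar_t* table)
{
    in.imbue(std::locale(in.getloc(), new TableCodecvt(table)));
    in.open(path, std::ios::in | std::ios::binary);
}
```

With the facet installed, getline on the wifstream yields already
converted wide characters, so the per-line Replace step is no longer
needed.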

Is this a good way to find mapping for all Persian characters?

What is the reverse function of int()? I mean a function
chr(int) that returns the corresponding character of an
integer?

There is no "function" int(). Using int() this way is the same
as a static_cast<int>.

3. I traced my program in the debugger, and it appears to work
fine. My main problem is that when I write the Persian
characters to the wofstream (the output file), the file is empty.
There is nothing in the output file.

That sounds like a completely different problem. Without
complete, compilable code, and information concerning the
system you've compiled and run on, it's impossible to say. One
possible explanation, however, is that the locale imbued in the
output stream doesn't understand the Persian characters. The
first character which cannot be correctly transcoded will result
in an error (bad() returning true on the wostream).

Note that even a wofstream only writes bytes (char's). The
trick here is to imbue it with a locale which converts each
wchar_t into two bytes.
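
One way to narrow down the empty-file symptom is to check the stream
state right after each write. A minimal sketch: write_checked is a
made-up helper, and it only detects the failure, it does not supply the
missing conversion:

```cpp
#include <ostream>
#include <sstream> // std::wostringstream, handy for trying the helper in memory
#include <string>

// Writes `text`, flushes so that any pending conversion actually runs,
// and reports whether the stream is still usable afterwards.
bool write_checked(std::wostream& out, const std::wstring& text)
{
    if (!out)           // already failed (for example, the file did not open)
        return false;
    out << text;
    out.flush();
    return !out.fail(); // fail() is true if failbit or badbit is set
}
```

Once a stream has entered a failed state, every later insertion is
silently ignored until the state is cleared, which would be consistent
with nothing arriving in the file.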

In the following code, FileConvertor is a class whose Convert
member function converts the whole file; for each line, the
LineConvertor member converts that line:
// file_convertor.h
class FileConvertor {
    std::ifstream In;   // original file
    std::wofstream Out; // a file containing the converted records (Unicode)
    LineConvertor LC;
    // ...
public:
    void Convert();
};

// file_convertor.cpp
void FileConvertor::Convert()
{
    for (string s; getline(In, s); ++RecCount) {
        try {
            std::vector<std::wstring> V = LC.Convert(s);
            for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
                Out << V[i] << L'\t'; // <-- no character is written to file
            }
            Out << L'\n';
        }
        // ... (catch clauses not shown)
    }
}

4. Should I consider std::locale and std::locale::facet when
programming such applications (file conversion)?

You don't have a choice. If nothing else, you can use only
single byte streams, opened in binary mode, and imbued with the
"C" locale---these are transparent: the bytes you read are what
is on the disk, and the bytes you write are the bytes that end
up on the disk. In all other cases, the locale imbued in the
stream will get involved, or some other code translation will
take place in the stream.
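
A sketch of that transparent setup, under the assumption that the
conversion itself is done by hand: both files are opened in binary mode
with the "C" locale, and each UTF-16 code unit is written as two bytes,
low byte first. The names to_utf16le and convert_file, and the choice
of little-endian output with a byte order mark, are illustrative
assumptions:

```cpp
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Turns each input byte into one UTF-16 code unit taken from `table`
// (256 entries), stored least significant byte first.
std::string to_utf16le(const std::string& bytes, const unsigned short* table)
{
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned short u = table[static_cast<unsigned char>(bytes[i])];
        out.push_back(static_cast<char>(u & 0xFF));
        out.push_back(static_cast<char>((u >> 8) & 0xFF));
    }
    return out;
}

// Reads and writes raw bytes only: binary mode plus the "C" locale.
bool convert_file(const char* in_path, const char* out_path,
                  const unsigned short* table)
{
    std::ifstream in;
    std::ofstream out;
    in.imbue(std::locale::classic());
    out.imbue(std::locale::classic());
    in.open(in_path, std::ios::in | std::ios::binary);
    out.open(out_path, std::ios::out | std::ios::binary);
    if (!in || !out)
        return false;

    const std::string raw((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    const std::string encoded = to_utf16le(raw, table);
    out.write("\xFF\xFE", 2); // UTF-16LE byte order mark
    out.write(encoded.data(), static_cast<std::streamsize>(encoded.size()));
    out.flush();
    return out.good();
}
```

Nothing here depends on wchar_t or on a platform locale, so the same
code should behave identically on Windows and Linux.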

I want to extend my program to convert Unicode to EBCDIC,
EBCDIC to XML, and so on: a generic converter.

You mean iconv. It already exists.

How would I apply policy-class design?

Generally, I've heard policy used to refer to some sort of
template metaprogramming technique. Perhaps you mean the
strategy pattern.

5. How can I write a general program that can be ported to
Linux with minimum effort?

Well, if portability is a concern, avoid any locale but "C", and
avoid wchar_t.

Saeed Amrollahi · Sep 30, 2010

Hi James
Thank you for your detailed answers. I'm sorry for my delay, I was out
of office.

And isolated, no?

Yes. That's right. You are clever.

But that's usually a problem for the rendering machine, not for
your program.

I don't understand. What do you mean by rendering machine? Do you
mean my local computer?

That's the way it was designed to work. (Actually, it was
designed so that you imbue a Persian EBCDIC locale in a wifstream
when reading. If you can find one.)

In C++, all you have *is* the ordinal number. What you probably
do have to do is convert the input char to unsigned char.

OK. I'll try it.

You can't count on that. The type char may be signed or
unsigned. Convert to unsigned char, then compare to 128.

OK.

Except that that doesn't work at all for EBCDIC, where 'a' is
0x81, and the Persian characters are probably scattered about in
the unused spaces. Or it uses some sort of shift-in/shift-out
scheme with two different encodings. Or IBM has given up on
EBCDIC for non Latin scripts, and is using ISO 8859-6 or MS
Windows CP-1256 (although I'm not sure that either of these has
the extra characters needed for Persian).

<Nod> You are right. The Persian characters are scattered
in no particular order in the unused space. By analogy: 'b' does not
necessarily come after 'a'. Would you mind explaining the
shift-in/shift-out scheme?
About Windows code page 1256, there is a problem in my current
project.
As you know, there is just one form of each Persian character (the
Initial/Medial one);
for the Isolated/Final form, a space has to be added to the word. That
is the problem.
In the current application there is another problem with CP-1256. We
have a field made of 3 Persian characters (the first 3 characters of
the shareholder's family name) and 5 digits, concatenated. By analogy:
'Amr00023'.
Unfortunately, in CP-1256, on meeting the first digit the last
character changes from medial form to final form, which is wrong.

Why do things the hard way?

I'd use something like:

static wchar_t const map[] =
{
0x0000, 0x0001, 0x0002, 0x0003, // 0x00-0x03
// ...
0x0061, 0x0062, 0x0063, 0x0064, // 0x80-0x83
// ...
};

This should be indexed with the input char, converted to
unsigned char. (I'd also write a quick program to generate
this table from some table you already have at hand.)

OK. I'll consider it.

A better solution would be to create a codecvt facet, and use it
directly in the istream.

OK. I'll try it.

There is no "function" int(). Using int() this way is the same
as a static_cast<int>.

Indeed, I didn't mean an ordinary C/C++ function.

That sounds like a completely different problem. Without
complete, compilable code, and information concerning the
system you've compiled and run on, it's impossible to say. One
possible explanation, however, is that the locale imbued in the
output stream doesn't understand the Persian characters. The
first character which cannot be correctly transcoded will result
in an error (bad() returning true on the wostream).

Note that even a wofstream only writes bytes (char's). The
trick here is to imbue it with a locale which converts each
wchar_t into two bytes.

OK. I'll consider it.

I want to extend my program to convert Unicode to EBCDIC,
EBCDIC to XML, ... I mean Generic converter.

You mean iconv. It already exists.

I don't know iconv. Is it a product of the Dinkumware company?

Generally, I've heard policy used to refer to some sort of
template metaprogramming technique. Perhaps you mean the
strategy pattern.

By policy class I mean something like this (Pseudo-code):
template<class ConversionPolicy>
class Convertor {
// ...
public:
convert();
};

Well, if portability is a concern, avoid any locale but "C", and
avoid wchar_t.

Again, thanks for your answer. It covers several points,
and I will try to make full use of them.

Regards,
-- Saeed Amrollahi

James Kanze · Sep 30, 2010

Yes. That's right. You are clever.

I don't understand. What do you mean by rendering machine? Do you
mean my local computer?

Rendering machine or rendering engine. The mechanism which
converts the internal code to human readable format. In
other words, the encoding should just store the letters,
without regard to the form. The engine which actually
generates the display or the graphic format should choose
the appropriate form depending on context.
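
As an illustration of storing letters rather than forms, the sketch
below folds the presentation-form code points from the original
mapping table back to their base letters. base_letter is a made-up
helper covering only those three letters. Persian text often uses
U+06CC (ARABIC LETTER FARSI YEH) rather than U+064A, so the right
target for a given EBCDIC code is an assumption to check against the
real data:

```cpp
// Maps a few Arabic Presentation Forms-B code points back to the base
// letters they are contextual shapes of; anything else passes through.
inline wchar_t base_letter(wchar_t c)
{
    switch (static_cast<unsigned long>(c)) {
    case 0xFE81: case 0xFE82:                           // ALEF WITH MADDA ABOVE forms
        return static_cast<wchar_t>(0x0622);
    case 0xFE89: case 0xFE8A: case 0xFE8B: case 0xFE8C: // YEH WITH HAMZA ABOVE forms
        return static_cast<wchar_t>(0x0626);
    case 0xFEF1: case 0xFEF2: case 0xFEF3: case 0xFEF4: // YEH forms
        return static_cast<wchar_t>(0x064A);
    default:
        return c;
    }
}
```

With base letters in the output file, choosing the isolated, initial,
medial or final shape is left to whatever displays the text.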

OK. I'll try it.

<Nod> You are right. The Persian characters are scattered
in no particular order in the unused space. By analogy: 'b' does not
necessarily come after 'a'. Would you mind explaining the
shift-in/shift-out scheme?

A shift-in/shift-out scheme is a solution which basically
uses two different encodings, with special characters to
shift from one to the other. With 7 bit characters, for
example, one might have one encoding for Persian characters,
another for Latin (with some common characters like space in
both), and two reserved codes, one which says that what
follows is Latin, the other that what follows is Persian.

Such schemes were common many years ago. They have serious
disadvantages (like, lose one of the shift characters in
translation, and everything is off, or that you can't just
skip ahead n characters without looking at every character).
From what you say above, I don't think this is your case.
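
For illustration only, a decoder for such a scheme might look like the
sketch below. The two shift codes (0x0E and 0x0F, the traditional SO
and SI control codes), the 128-entry tables and the name
decode_shifted are all assumptions, not a description of any
particular code page:

```cpp
#include <string>

// Hypothetical shift codes; a real scheme defines its own.
const unsigned char SHIFT_TO_PERSIAN = 0x0E;
const unsigned char SHIFT_TO_LATIN = 0x0F;

// `latin` and `persian` each point at 128 entries.
std::wstring decode_shifted(const std::string& bytes,
                            const wchar_t* latin,
                            const wchar_t* persian)
{
    const wchar_t* current = latin; // assume the text starts in Latin mode
    std::wstring out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b == SHIFT_TO_PERSIAN)
            current = persian;      // what follows is Persian
        else if (b == SHIFT_TO_LATIN)
            current = latin;        // what follows is Latin
        else
            out.push_back(current[b & 0x7F]);
    }
    return out;
}
```

The decoder has to carry state (current) across every byte, which is
why a lost shift code corrupts everything after it and why random
access into the text is not possible.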

About Windows code page 1256, there is a problem in my
current project. As you know, there is just one form of
each Persian character (the Initial/Medial one); for the
Isolated/Final form, a space has to be added to the word.
That is the problem.

I'm not at all familiar with the Windows code pages. I do
know that in general, encodings for Arabic (and certainly
also for Persian) normally encode only the character, not
its form. It's only when rendering that the correct form is
chosen, according to context.

In the current application there is another problem with
CP-1256. We have a field made of 3 Persian characters (the
first 3 characters of the shareholder's family name) and
5 digits, concatenated. By analogy: 'Amr00023'.
Unfortunately, in CP-1256, on meeting the first digit the
last character changes from medial form to final form,
which is wrong.

That sounds like a bug in the rendering engine. Or maybe in
your expectations: I would expect a final form before
a sequence of digits, see section 3.5 of
http://www.unicode.org/reports/tr9/#Shaping. (If I'm not
mistaken, digits run left to right in Persian while letters
run right to left, which means that there is a change in
direction when you switch from letters to digits.)

[...]

OK. I'll try it.

Just be warned that it is more work. The codecvt has
a somewhat perverted interface (probably because it was
designed before there was an std::string).

[...]

I don't know iconv. Is it a product of the Dinkumware company?

No. It's GPL (I think---a free to use license, anyway).
It's both a library, for use within your code, and
a stand-alone command line program. It's generally part of
Unix distributions, but you can get it for Windows as well.
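
Used as a library, it comes down to three calls declared in
<iconv.h>: iconv_open, iconv and iconv_close. The sketch below wraps
them in a made-up helper, convert_with_iconv. Which encoding names are
accepted depends on the installed iconv, and no name for a Persian
EBCDIC code page is assumed here. Some implementations declare the
input pointer as const char**, and some need an extra link flag such
as -liconv:

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Converts `input` from encoding `from` to encoding `to`.
// Returns false if the pair is unsupported or the input is invalid.
bool convert_with_iconv(const char* to, const char* from,
                        const std::string& input, std::string& output)
{
    output.clear();
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)(-1))
        return false;

    std::string in_copy(input); // iconv() wants a non-const char**
    char* in_ptr = in_copy.empty() ? 0 : &in_copy[0];
    std::size_t in_left = in_copy.size();
    char buf[256];
    bool ok = true;

    while (ok && in_left > 0) {
        char* out_ptr = buf;
        std::size_t out_left = sizeof buf;
        const std::size_t r = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;
        output.append(buf, sizeof buf - out_left);
        if (r == (std::size_t)(-1) && err != E2BIG)
            ok = false; // EILSEQ or EINVAL: the input cannot be converted
    }
    if (ok) { // flush any pending shift state
        char* out_ptr = buf;
        std::size_t out_left = sizeof buf;
        if (iconv(cd, 0, 0, &out_ptr, &out_left) == (std::size_t)(-1))
            ok = false;
        else
            output.append(buf, sizeof buf - out_left);
    }
    iconv_close(cd);
    return ok;
}
```

The stand-alone iconv program can list the encodings it supports,
which is a quick way to find out whether the code page in question is
already covered before writing any table by hand.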

By policy class I mean something like this (Pseudo-code):
template<class ConversionPolicy>
class Convertor {
// ...
public:
convert();
};

Convertor<EBCDIC2Unicode> c;
c.convert();

OK. In my experience, using the strategy pattern is
preferable. Sooner or later, you'll end up wanting the
decision to be made at run-time.
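
A small sketch of the run-time variant, for comparison with the
template version quoted above. All class names are hypothetical, and
the two concrete strategies are placeholders that only show the
dispatch; real ones would wrap a lookup table or a library call:

```cpp
#include <memory>
#include <string>
#include <utility>

// Run-time strategy: every conversion implements the same interface.
class ConversionStrategy {
public:
    virtual ~ConversionStrategy() {}
    virtual std::string convert(const std::string& input) const = 0;
};

// Placeholder strategy: returns the input unchanged.
class PassThrough : public ConversionStrategy {
public:
    std::string convert(const std::string& input) const override { return input; }
};

// Placeholder strategy: reverses the bytes of the input.
class Reversed : public ConversionStrategy {
public:
    std::string convert(const std::string& input) const override
    {
        return std::string(input.rbegin(), input.rend());
    }
};

// The converter owns whichever strategy it was given.
class Convertor {
public:
    explicit Convertor(std::unique_ptr<ConversionStrategy> s)
        : strategy_(std::move(s)) {}
    std::string convert(const std::string& line) const
    {
        return strategy_->convert(line);
    }
private:
    std::unique_ptr<ConversionStrategy> strategy_;
};

// Chooses a strategy by name, for example from a command-line argument.
// Returns a null pointer for an unknown name.
std::unique_ptr<ConversionStrategy> make_strategy(const std::string& name)
{
    if (name == "pass")
        return std::unique_ptr<ConversionStrategy>(new PassThrough);
    if (name == "reverse")
        return std::unique_ptr<ConversionStrategy>(new Reversed);
    return std::unique_ptr<ConversionStrategy>();
}
```

The template form fixes the conversion when the program is compiled,
whereas here the name can come from user input or a configuration
file.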
