Will standard C++ allow me to replace a string in a unicode-encoded text file?

Eric Lilja

Hello, I had what I thought was a normal text file and I needed to locate a
string matching a certain pattern in that file and, if found, replace that
string. I thought this would be simple, but I had problems getting my
algorithm to work, so to help me find the problem I decided to print each
line to the screen as I read it.
Then, to my surprise, I noticed that there was a space between every
character as I output the lines to the screen. I opened the file in a
more competent text editor and it informed me the file was "encoded" in
U-DOS. What's that, unicode? Anyway, my question is, can I read and write
unicode text files using standard C++ or will I have to resort to
platform-specific tools to accomplish what I want?

Thanks for reading and replying

/ Eric
 
Jerry Coffin

Eric said:
[snip] Anyway, my question is, can I read and write unicode text
files using standard C++ or will I have to resort to
platform-specific tools to accomplish what I want?

You should be able to read nearly any sort of file in standard C++. The
major question is how much work it'll be -- i.e. whether your library
already has code to handle the encoding used or not.

"U-DOS" doesn't mean much to me -- to get very far, you'll probably
want to look at something like a hex-dump of the file to figure out
what it really contains. Based on your description, it sounds as if it
_may_ have been written as UCS-2 or UTF-16 Unicode, but it's hard to
guess. If (nearly) every other byte is 00, one of those is a strong
possibility. Then again, if quite a few of the odd bytes aren't 00's,
it might still be UCS-2 (for example) but it's harder to say for sure.

If the file's truly properly written Unicode, then it's supposed to
start with a byte-order mark, and based on how that's been written, you
can pretty much figure out how the rest of the file should be decoded
as well. Unfortunately, an awful lot of files use (one of the several
forms of) Unicode encoding elsewhere, but leave out the byte-order
mark. In that case, you'll have to figure out the encoding on your own
-- there are heuristics to use to try to figure it out (much as I've
outlined above) but none of them is perfect by any means.
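
For illustration, a minimal sketch of such a check: it hex-dumps the first
few bytes so you can eyeball a possible byte-order mark (the file name is
made up):

#include <cstdio>
#include <fstream>

int main()
{
    /* Binary mode so no newline translation disturbs the bytes. */
    std::ifstream file("input.txt", std::ios_base::binary);
    if (!file)
        return 1;

    char buf[16];
    file.read(buf, sizeof buf);
    std::streamsize n = file.gcount();

    /* FF FE suggests little-endian UTF-16/UCS-2, FE FF big-endian,
       EF BB BF UTF-8. */
    for (std::streamsize i = 0; i < n; ++i)
        std::printf("%02X ", (unsigned)(unsigned char)buf[i]);
    std::printf("\n");
}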
 
Eric Lilja

Jerry Coffin said:
You should be able to read nearly any sort of file in standard C++. [snip]

If the file's truly properly written Unicode, then it's supposed to
start with a byte-order mark, and based on how that's been written, you
can pretty much figure out how the rest of the file should be decoded
as well. [snip]

Thanks for your reply, Jerry. The file starts with 0xFF 0xFE, so that means
utf-16? I was thinking of opening it in binary mode, read the first two
bytes then start a loop that reads from the file byte by byte and adds the
first, the third, the fifth byte etc to a std::string (or a std::vector of
chars maybe). When the loop is done I should have the actual text of the
file. Then I can look for the pattern I want and replace it as needed. Then
I will open the file for writing (still in binary of course) and write out
as utf-16. Sounds like this should work?

/ Eric
 
Dietmar Kuehl

Eric said:
Thanks for your reply, Jerry. The file starts with 0xFF 0xFE, so that means
utf-16?

Not necessarily: it indicates UTF-16 or UCS-2. However, the difference
only matters if you want to access the whole set of Unicode characters:
UTF-16 provides the possibility to access characters whose code
requires more than 16 bits, while UCS-2 does not (and is thus not an
encoding covering all Unicode characters).
I was thinking of opening it in binary mode, read the first two
bytes then start a loop that reads from the file byte by byte and adds the
first, the third, the fifth byte etc to a std::string (or a std::vector of
chars maybe).

It depends on what you want to do: if your goal is only to process the
given file, this may work, but you are probably better off using a
Unicode-enabled editor for that task. If you need to process more files
of a similar nature, you should use a rather different approach: the
first two bytes are conventionally considered to be a byte order mark
if they consist of either FF FE or FE FF. Otherwise, the file does not
have a byte order mark and you have to figure out the details of the
encoding differently: for example, XML specifies that a certain string
should appear early in the file, and this can be used to find out the
byte ordering. Often files have some form of "magic" code to indicate
their contents.

Since you are apparently handling Unicode, you should not use a
'std::string' but at least a 'std::wstring', to cope with non-ASCII
characters too: the Unicode encoding does not waste that space for
nothing. The zero bytes you are seeing just indicate that you got
essentially ASCII characters, but there are also many other characters
which require more than seven bits of encoding. The usual Unicode
characters take just two bytes (which happens to be the size of
'wchar_t' on some platforms), but full coverage of Unicode requires 21
bits (there was a time when 16 bits were sufficient for Unicode, too).
You should probably at least check the size of 'wchar_t' and assume a
UCS-2 encoding (you might want to bail out if you detect UTF-16
surrogates; I don't remember the details off-hand but this is pretty
easy: just look at the documentation of UTF-16).
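
A sketch of that bail-out, assuming 16-bit little-endian units (the
function name is made up): UTF-16 reserves 0xD800-0xDFFF for surrogate
pairs, so a unit in that range means the file is real UTF-16 rather
than plain UCS-2.

#include <istream>

bool looks_like_plain_ucs2(std::istream& in)
{
    char bytes[2];
    while (in.read(bytes, 2))
    {
        unsigned int unit = (unsigned char)bytes[0]
                          | ((unsigned char)bytes[1] << 8);
        if (unit >= 0xD800 && unit <= 0xDFFF)
            return false;   /* surrogate: real UTF-16, not just UCS-2 */
    }
    return true;
}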

Effectively, you might get away without touching any of this, though:
if you are using 'std::wifstream' you might get the right thing
immediately. If not, you can probably set up the locale to do the right
thing. Unfortunately, the details of the locale setup are not part of
the standard and depend on the platform, i.e. you have to check your
documentation.
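
For what it's worth, with a newer standard library the setup can even be
spelled portably: C++11 added std::codecvt_utf16 in <codecvt> (deprecated
since C++17). A sketch, assuming that facet is available:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream file("testfile.txt", std::ios_base::binary);

    /* Imbue a facet that decodes little-endian UTF-16 and eats the BOM. */
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::wstring line;
    while (std::getline(file, line))
    {
        /* line now holds decoded wide characters. */
    }
}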
When the loop is done I should have the actual text of the
file. Then I can look for the pattern I want and replace it as needed. Then
I will open the file for writing (still in binary of course) and write out
as utf-16. Sounds like this should work?

I'd consider it unlikely. You might want to get a text containing e.g.
Japanese characters to test on...
 
Heinz Ozwirk

Eric Lilja said:
Thanks for your reply, Jerry. The file starts with 0xFF 0xFE, so that means
utf-16? I was thinking of opening it in binary mode, read the first two
bytes then start a loop that reads from the file byte by byte and adds the
first, the third, the fifth byte etc to a std::string (or a std::vector of
chars maybe). When the loop is done I should have the actual text of the
file. Then I can look for the pattern I want and replace it as needed. Then
I will open the file for writing (still in binary of course) and write out
as utf-16. Sounds like this should work?

0xFF, 0xFE looks like the little-endian byte-order mark, so it is a good
guess to assume the file contains 16-bit unicode text, created on (or for)
a little-endian machine. If your program runs on such a machine, you could
use wchar_t/wstring to read and process your file. If your program does not
run on a little-endian machine, you can still use wstring, but you have to
swap bytes after reading (and before writing). [Actually, wchar_t is not
guaranteed to be unicode, but it is very likely to be unicode. If you are
very suspicious, you could typedef your own unicode and ustring types as
wchar_t and wstring.]

Of course, you can also read and process it as a binary file, but simply
discarding every other byte is not a good idea.
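
A sketch of the byte swap mentioned above, for a buffer of 16-bit units
whose byte order doesn't match the machine's:

#include <vector>

/* Swap the two bytes of each 16-bit unit in place. */
void swap_units(std::vector<unsigned short>& units)
{
    for (std::vector<unsigned short>::size_type i = 0; i < units.size(); ++i)
        units[i] = (unsigned short)((units[i] << 8) | (units[i] >> 8));
}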

HTH
Heinz
 
Dietmar Kuehl

Heinz said:
0xFF, 0xFE looks like the little-endian byte-order-mark. So it is a
good guess to assume it to contain 16-bit unicode text, created on (or
for) a little-endian machine. If your program runs on such a machine,
you could use wchar_t/wstring to read and process your file. If your
program does not run on a little-endian machine, you can still use
wstring, but you have to swap bytes after reading (and before writing).
[Actually, wchar_t is not guaranteed to be unicode, but it is very
likely to be unicode. If you are very suspicious, you could typedef
your own unicode and ustring types as wchar_t and wstring.]

Actually, the details of the encoding should be entirely independent of
the architecture and should be handled by the 'std::codecvt<>' facet!
If you are reading a 'std::wstring' from a 'std::wistream' there should
be no need to tinker with the bytes at all.
 
Chris Croughton

Eric Lilja said:
Thanks for your reply, Jerry. The file starts with 0xFF 0xFE, so that means
utf-16? I was thinking of opening it in binary mode, read the first two
bytes then start a loop that reads from the file byte by byte and adds the
first, the third, the fifth byte etc to a std::string (or a std::vector of
chars maybe). When the loop is done I should have the actual text of the
file. Then I can look for the pattern I want and replace it as needed. Then
I will open the file for writing (still in binary of course) and write out
as utf-16. Sounds like this should work?

It's more likely to be UCS-2 (UTF-16 is an extension to UCS-2 which
allows UCS-4 characters to be embedded in a UCS-2 stream). The Byte
Order Mark is defined to be 0xFEFF, with the character 0xFFFE defined as
invalid, so that the byte order (big/little endian) can be determined.
In your case the order must be LSB MSB, so you want all even-numbered
bytes (assuming standard C array indices starting at zero), but you
ought to check the byte order for a portable implementation.

You really should check that the other bytes are zero, as well, and give
some sort of error if not (it's a character not representable in a
normal string, unless you're on an implementation with 16 bit or more
bytes); at minimum I would either ignore such a character or convert it
to an error character ('?' for instance, like my mailer does).
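
A sketch of that narrowing step, assuming little-endian 16-bit units read
two bytes at a time (the function name is made up):

#include <istream>
#include <string>

/* Narrow LE 16-bit units to chars, substituting '?' for anything whose
   high byte is non-zero (not representable in a single byte). */
std::string narrow_ucs2le(std::istream& in)
{
    std::string out;
    char bytes[2];
    while (in.read(bytes, 2))
        out += (bytes[1] == '\0') ? bytes[0] : '?';
    return out;
}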

Or you can do all of your work in UCS-2 (or UCS-4), and thus preserve
any non-ASCII characters. This will be a bit slower as an
implementation, but on modern machines still faster than the I/O.

If you really want portability, look at interpreting UTF-32, UTF-8 and
UTF-16 as well as UCS-2 (and plain old text), with both big- and
little-endian representations, and write a generic routine which
converts any of them to a string (note that a C++ string type can take
wide characters or longs as its element type). But for your case you
may only need to do one or two of the formats.
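
The BOM-sniffing part of such a generic routine could look like this
sketch (the enum and function names are made up). Note that the UTF-32
patterns must be tested before the UTF-16 ones, since the UTF-32LE BOM
FF FE 00 00 begins with FF FE:

#include <cstddef>

enum encoding { enc_unknown, enc_utf8, enc_utf16_le, enc_utf16_be,
                enc_utf32_le, enc_utf32_be };

/* Classify a buffer's leading bytes by BOM; falls back to enc_unknown,
   in which case the caller has to guess (e.g. assume plain 8-bit text). */
encoding detect_bom(const unsigned char* p, std::size_t n)
{
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0 && p[3] == 0)
        return enc_utf32_le;
    if (n >= 4 && p[0] == 0 && p[1] == 0 && p[2] == 0xFE && p[3] == 0xFF)
        return enc_utf32_be;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return enc_utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return enc_utf16_le;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return enc_utf16_be;
    return enc_unknown;
}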

For further reading, see:

http://www.unicode.org/faq/

(and its parent if you want to get into the spec.). Warning: if you're
like me, you can waste (er, spend) many happy hours reading the spec.
and forget to do the work <g>...

Chris C
 
Eric Lilja

Chris Croughton said:
It's more likely to be UCS-2 (UTF-16 is an extension to UCS-2 which
allows UCS-4 characters to be embedded in a UCS-2 stream). [snip]

Thanks for your replies everyone. I wrote the following little test program
that I hope to get working for UCS-2 encoded files where all characters are
representable using ASCII (i.e., the second byte of each character pair
after the byte-order mark is \0). The program doesn't work as expected,
however, because if you look at the function read_file it reads the byte
order mark into the contents variable, so when I write the new file (where I
have replaced some strings), I get the byte-order mark twice, although the
second one has padding. If you look at the file in a hex editor you see: FF
FE FF 00 FE 00. I can easily work around it but I want to know why
read_file() is doing what it's doing.

Here's the complete code:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

using std::cerr;
using std::cout;
using std::endl;
using std::exit;
using std::ifstream;
using std::ios_base;
using std::ofstream;
using std::string;

static string read_file(const char *);
static void find_and_replace(string& s, const string&, const string&);
static void write_file(const char *, const string&);

static const char padding = '\0';

int
main()
{
    const string find_what = "foobar";
    const string replace_with = "abcdef";

    string contents = read_file("testfile.txt");

    find_and_replace(contents, find_what, replace_with);

    write_file("outfile.txt", contents);

    return EXIT_SUCCESS;
}

static string
read_file(const char *filename)
{
    ifstream file(filename, ios_base::binary);

    if(!file)
    {
        cerr << "Error: Failed to open " << filename << endl;

        exit(EXIT_FAILURE);
    }

    char c = '\0';
    string contents;

    file.read(&c, sizeof(c));
    contents += c;
    file.read(&c, sizeof(c));
    contents += c;

    if((unsigned char)contents[0] != 0xFF ||
       (unsigned char)contents[1] != 0xFE)
    {
        cerr << "Error: The file doesn't appear to be a unicode-file."
             << endl;

        /* std::ifstream's destructor will close the file. */
        exit(EXIT_FAILURE);
    }

    int count = 0;

    while(file.read(&c, sizeof(c)))
    {
        if(!(count++ % 2))
            contents.push_back(c);
        else if(c != padding) /* padding is a static global that equals \0 */
        {
            cerr << "Error: Found a character that is too "
                 << "big to fit into a single byte." << endl;

            /* std::ifstream's destructor will close the file. */
            exit(EXIT_FAILURE);
        }
    }

    /* std::ifstream's destructor will close the file. */
    return contents;
}

static void
find_and_replace(string& s, const string& find_what,
                 const string& replace_with)
{
    string::size_type start = 0;
    string::size_type offset = 0;
    std::size_t occurrences = 0;

    while((start = s.find(find_what, offset)) != string::npos)
    {
        s.replace(start, find_what.length(), replace_with);

        /* Very important that we set offset to start + 1, or we could
           go into an infinite loop if replace_with itself contains
           find_what: we would find the same match over and over. */
        offset = start + 1;

        ++occurrences;
    }

    cout << "Replaced " << occurrences << " occurrences." << endl;
}

static void
write_file(const char *filename, const string& contents)
{
    ofstream file(filename, ios_base::binary);

    const char byte_order_mark[2] = { '\xFF', '\xFE' };

    file.write(&byte_order_mark[0], sizeof(char));
    file.write(&byte_order_mark[1], sizeof(char));

    for(string::size_type i = 0; i < contents.length(); ++i)
    {
        file.write(&contents[i], sizeof(char));
        file.write(&padding, sizeof(char));
    }
}

Thanks for any replies

/ Eric
 
Eric Lilja

Eric Lilja said:
[snip: the test program quoted in full above]

Lol, nevermind! I saw that I was using the contents variable for reading the
byte-order mark. I thought the reading position was being rewound somehow.
Anyway, if you have any other comments on the code, please share them.
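
For the record, here is a sketch of a read_file that keeps the byte-order
mark out of the returned string (it reuses the using-declarations and the
padding global from the program above):

static string
read_file(const char *filename)
{
    ifstream file(filename, ios_base::binary);

    if(!file)
    {
        cerr << "Error: Failed to open " << filename << endl;
        exit(EXIT_FAILURE);
    }

    /* Read the BOM into a separate buffer so it never ends up in
       contents. */
    char bom[2] = { '\0', '\0' };
    file.read(bom, 2);

    if((unsigned char)bom[0] != 0xFF || (unsigned char)bom[1] != 0xFE)
    {
        cerr << "Error: The file doesn't appear to be a unicode-file."
             << endl;
        exit(EXIT_FAILURE);
    }

    string contents;
    char c = '\0';
    int count = 0;

    while(file.read(&c, sizeof(c)))
    {
        if(!(count++ % 2))
            contents.push_back(c);
        else if(c != padding)
        {
            cerr << "Error: Found a character that is too "
                 << "big to fit into a single byte." << endl;
            exit(EXIT_FAILURE);
        }
    }

    return contents;
}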

/ Eric
 
