Chris Croughton said:
It's more likely to be UCS-2 (UTF-16 is an extension to UCS-2 which
allows UCS-4 characters to be embedded in a UCS-2 stream). The Byte
Order Mark is defined to be 0xFEFF, with the character 0xFFFE defined as
invalid, so that the byte order (big/little endian) can be determined.
In your case the order must be LSB MSB (little-endian), so you want all
even-numbered bytes (assuming standard C array indices starting at zero),
but a portable implementation ought to check the byte order, not assume it.
You really should check that the other bytes are zero, as well, and give
some sort of error if not (it's a character not representable in a
normal string, unless you're on an implementation with 16 bit or more
bytes); at minimum I would either ignore such a character or convert it
to an error character ('?' for instance, like my mailer does).
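A minimal sketch of that check, assuming little-endian input with the
byte-order mark already removed. The name narrow_ucs2le is hypothetical
and is not taken from any library:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: narrows little-endian UCS-2 bytes (BOM already
// stripped) to a plain string. Any code unit whose high byte is not
// zero cannot fit in one char and is replaced with '?'.
std::string narrow_ucs2le(const std::string& bytes)
{
    std::string out;
    // Step two bytes at a time; a trailing odd byte is ignored.
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned char lo = static_cast<unsigned char>(bytes[i]);
        const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
        out += (hi == 0) ? static_cast<char>(lo) : '?';
    }
    return out;
}
```

Substituting '?' keeps the output the same length in characters as the
input; a stricter caller could report an error instead.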
Or you can do all of your work in UCS-2 (or UCS-4), and thus preserve
any non-ASCII characters. This will be a bit slower as an
implementation, but on modern machines still faster than the I/O.
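As a rough illustration of working in 16-bit units instead, the sketch
below converts little-endian bytes to a 16-bit string and back. It
assumes a C++11 compiler (for std::u16string and char16_t) and input with
the byte-order mark already removed; decode_ucs2le and encode_ucs2le are
hypothetical names:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: combines each little-endian byte pair into one
// 16-bit code unit. A trailing odd byte is ignored.
std::u16string decode_ucs2le(const std::string& bytes)
{
    std::u16string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned char lo = static_cast<unsigned char>(bytes[i]);
        const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<char16_t>(lo | (hi << 8));
    }
    return out;
}

// Hypothetical helper: writes each code unit back as low byte, high byte.
std::string encode_ucs2le(const std::u16string& text)
{
    std::string out;
    for (char16_t unit : text) {
        out += static_cast<char>(unit & 0xFF);
        out += static_cast<char>(unit >> 8);
    }
    return out;
}
```

Searching and replacing can then be done with the usual find() and
replace() members of the 16-bit string, so no character is lost.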
If you really want portability, look at interpreting UTF-32, UTF-8 and
UTF-16 as well as UCS-2 (and plain old text), with both big- and
little-endian representations, and write a generic routine which
converts any of them to a string (note that a C++ string type can take
wide characters or longs as its element type). But for your case you
may only need to do one or two of the formats.
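The first step of such a generic routine could be a byte-order-mark
check along these lines. This is only a sketch: the Encoding enum and the
function names are made up for illustration, and a file without a BOM is
simply reported as unknown:

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type. UCS-2 and UTF-16 share the same BOMs.
enum Encoding {
    ENC_UNKNOWN, ENC_UTF8,
    ENC_UTF16_BE, ENC_UTF16_LE,
    ENC_UTF32_BE, ENC_UTF32_LE
};

// True if data begins with the n bytes at bom.
static bool has_prefix(const std::string& data, const char* bom, std::size_t n)
{
    return data.size() >= n && data.compare(0, n, bom, n) == 0;
}

Encoding detect_bom(const std::string& data)
{
    // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
    // A UTF-16 LE file whose first character is U+0000 would be misread.
    if (has_prefix(data, "\xFF\xFE\x00\x00", 4)) return ENC_UTF32_LE;
    if (has_prefix(data, "\x00\x00\xFE\xFF", 4)) return ENC_UTF32_BE;
    if (has_prefix(data, "\xEF\xBB\xBF", 3))     return ENC_UTF8;
    if (has_prefix(data, "\xFF\xFE", 2))         return ENC_UTF16_LE;
    if (has_prefix(data, "\xFE\xFF", 2))         return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```

The caller would then skip the BOM bytes and hand the rest of the data to
the decoder for the detected format.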
For further reading, see:
http://www.unicode.org/faq/
(and its parent if you want to get into the spec.). Warning: if you're
like me, you can waste (er, spend) many happy hours reading the spec.
and forget to do the work <g>...
Chris C
Thanks for your replies everyone. I wrote the following little test
program that I hope to get working for UCS-2 encoded files where all
characters are representable in ASCII (i.e., after the byte-order mark,
the second byte of every character is \0). The program doesn't work
as expected, however: read_file() reads the byte-order mark into the
contents variable, so when I write the new file (where I have replaced
some strings), I get the byte-order mark twice, the second time with
padding. In a hex editor the output starts with: FF FE FF 00 FE 00. I can
easily work around it, but I want to know why read_file() is doing what
it's doing.
Here's the complete code:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using std::cerr;
using std::cout;
using std::endl;
using std::exit;
using std::ifstream;
using std::ios_base;
using std::ofstream;
using std::string;
static string read_file(const char *);
static void find_and_replace(string& s, const string&, const string&);
static void write_file(const char *, const string&);
static const char padding = '\0';
int
main()
{
    const string find_what = "foobar";
    const string replace_with = "abcdef";
    string contents = read_file("testfile.txt");
    find_and_replace(contents, find_what, replace_with);
    write_file("outfile.txt", contents);
    return EXIT_SUCCESS;
}
static string
read_file(const char *filename)
{
    ifstream file(filename, ios_base::binary);
    if(!file)
    {
        cerr << "Error: Failed to open " << filename << endl;
        exit(EXIT_FAILURE);
    }
    char c = '\0';
    string contents;
    file.read(&c, sizeof(c));
    contents += c;
    file.read(&c, sizeof(c));
    contents += c;
    if((unsigned char)contents[0] != 0xFF ||
       (unsigned char)contents[1] != 0xFE)
    {
        cerr << "Error: The file doesn't appear to be a Unicode file." << endl;
        /* exit() skips destructors of local objects; the OS closes the file. */
        exit(EXIT_FAILURE);
    }
    int count = 0;
    while(file.read(&c, sizeof(c)))
    {
        if(!(count++ % 2))
            contents.push_back(c);
        else
            if(c != padding) /* padding is a static global that equals \0 */
            {
                cerr << "Error: Found a character that is too "
                     << "big to fit into a single byte." << endl;
                /* exit() skips destructors of local objects; the OS closes the file. */
                exit(EXIT_FAILURE);
            }
    }
    /* std::ifstream's destructor will close the file. */
    return contents;
}
static void
find_and_replace(string& s, const string& find_what,
                 const string& replace_with)
{
    string::size_type start = 0;
    string::size_type offset = 0;
    string::size_type occurrences = 0;
    while((start = s.find(find_what, offset)) != string::npos)
    {
        s.replace(start, find_what.length(), replace_with);
        /* Resume searching after the replacement text. Restarting at or
           inside it would loop forever whenever replace_with itself
           contains find_what. */
        offset = start + replace_with.length();
        ++occurrences;
    }
    cout << "Replaced " << occurrences << " occurrences." << endl;
}
static void
write_file(const char *filename, const string& contents)
{
    ofstream file(filename, ios_base::binary);
    /* Character literals avoid a narrowing conversion from int to char. */
    const char byte_order_mark[2] = { '\xFF', '\xFE' };
    file.write(&byte_order_mark[0], sizeof(char));
    file.write(&byte_order_mark[1], sizeof(char));
    for(string::size_type i = 0; i < contents.length(); ++i)
    {
        file.write(&contents[i], sizeof(char));
        file.write(&padding, sizeof(char));
    }
}
Thanks for any replies
/ Eric