Reading text file containing accented vowels

Phil Slater

I'm trying to process a collection of text files, reading word by word. The
program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various textbooks,
but can't see how to work around this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil
 
Howard

It would help greatly if we could see the code that's doing the reading now.
There's nothing special about "accented vowels" that would prevent reading
them, except that their values are not in the range of 0..127. If you read
unsigned char values (instead of char), then you can read anything in the
range 0..255. (Maybe that's the problem?) Using a stream and reading into
a string should work, and then you can parse each line word-by-word. But
again, I can't tell where your code *might* be stuck without seeing the
code.

-Howard
 
Victor Bazarov

It's possible that your accented characters are from the "extended ASCII"
part of the code table. Try reading the words using _unsigned_char_ type.

V
 
John Harrison

I would guess that somewhere you are assuming that char values are
positive, for instance by using a char variable as the index of an array.
This is not necessarily true of non-ASCII characters which can have negative
values (depending on your implementation). Casts to unsigned char at
appropriate places in your code might solve this.

I would like to be more specific but you forgot to include any code at all
in your post.

john
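
The cast John describes can be illustrated with a minimal sketch. The helper name count_bytes and the 256-entry table are assumptions made for illustration and do not come from the thread; the sketch also assumes a single-byte encoding such as Latin-1, 8-bit bytes, and a C++11 compiler.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): count how often each byte
// value occurs in a word, using the byte itself as the table index.
std::array<std::size_t, 256> count_bytes(const std::string& word)
{
    std::array<std::size_t, 256> counts{};  // value-initialised to zero
    for (char c : word) {
        // Where plain char is signed, a byte such as 0xE1 holds a negative
        // value, so indexing with c directly would read outside the table.
        // Converting to unsigned char first maps it into 0..255.
        counts[static_cast<unsigned char>(c)]++;
    }
    return counts;
}
```

The same conversion is needed before passing a char to the <cctype> functions such as std::isalpha, which require an argument representable as unsigned char (or EOF).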
 
Phil Slater

Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

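#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())   // loop condition as originally posted
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}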

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
.... ad infinitum

It looks like the stream goes into a fail state when it hits the á.

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

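#include <fstream>
#include <iostream>
#include <string>
using namespace std;

typedef basic_string<unsigned char> ustring;

int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;   // extracts from a char stream into an unsigned char string
        cout << '<' << word << '>' << endl;
    }
}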

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?

Thanks for your help.

Phil
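
The endless run of <> lines in the first program is consistent with the stream failing without reaching end of file: eof() then stays false, every further extraction fails, and while(!f.eof()) never exits. A small sketch for inspecting the stream state follows. The helper names describe_state and eof_loop_would_spin are hypothetical and are not part of the standard library or of the thread.

```cpp
#include <istream>
#include <sstream>   // only needed to try the helpers on in-memory text
#include <string>

// Hypothetical helper (not from the thread): name the state of a stream.
std::string describe_state(const std::istream& in)
{
    if (in.bad())               return "bad";       // unrecoverable I/O error
    if (in.eof() && in.fail())  return "eof+fail";  // a read was attempted at the end
    if (in.fail())              return "fail";      // failed without hitting the end
    if (in.eof())               return "eof";       // end reached, last read succeeded
    return "good";
}

// True when a `while (!in.eof())` loop would not terminate on its own:
// the stream has failed, but the end-of-file flag is not set.
bool eof_loop_would_spin(const std::istream& in)
{
    return in.fail() && !in.eof();
}
```

Calling describe_state(f) right after each f >> word in the loop above would show whether the stream has reached "fail" rather than "eof" at the accented letter, which is what the observed output suggests.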
 
John Harrison

Phil Slater said:
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
    f >> word;
    if (f.eof())
        break;
    cout << ...
}

or more simply

while (f >> word)
    cout << ...

Phil Slater said:
Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL; that works fine for me.
Which compiler are you using?

Phil Slater said:
typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

That should simply not compile; the char of the string type is not the same
as the char of the fstream type.

Time to upgrade your compiler I think.

john
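
The while (f >> word) form that John recommends can be written out as a complete function. This is a minimal sketch: the names read_words and split_words are hypothetical, and the function takes any std::istream so that it works with an ifstream or with in-memory text. Note that the for (;;) variant that tests f.eof() after the read can skip a final word that is not followed by whitespace, and never exits if the stream fails for a reason other than end of file; testing the stream itself avoids both.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect every
// whitespace-delimited word from a stream.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    // The extraction is the loop condition, so the body only runs when a
    // word was actually read. The loop ends on end of file and on errors.
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// Convenience wrapper for trying the loop on text held in memory.
std::vector<std::string> split_words(const std::string& text)
{
    std::istringstream in(text);
    return read_words(in);
}
```

With a file, the call would look like std::ifstream f("test.txt"); followed by read_words(f). Bytes above 127 are stored in the std::string like any other char value.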
 
Phil Slater

John said:
It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
    f >> word;
    if (f.eof())
        break;
    cout << ...
}

or more simply

while (f >> word)
    cout << ...

Thanks for that.

John said:
I think you have a broken implementation of the STL; that works fine for me.

Which compiler are you using?

g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)

John said:
That should simply not compile; the char of the string type is not the same
as the char of the fstream type.

Time to upgrade your compiler I think.

Which compiler are you using?
 
John Harrison

Phil Slater said:
Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Yes.

Phil Slater said:
g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)

I've heard that gcc 2.95 has a poor implementation of the standard template
library (STL), and your experience seems to bear that out. For my last post I
tested with VC++ 7.1; I've just tried gcc 3.3.1 and got the same result. Your
first program runs correctly, and your second doesn't compile. I really think
you are going to have to upgrade.

john
 
Howard

Phil Slater said:
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
... ad infinitum

Looks like the stream goes into a fail state when it hits the á

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?

That's correct. You're reading in a string, and nothing says that the read
has to be delimited by anything.

So what do you do? I'd probably read in a line at a time into a string (and
then parse the string into words, if there can be more than one word on a
line). I think std::getline is the function for reading a line.

Also, don't use while (!f.eof()), use while (getline(whatever)). The eof()
function is not valid to check until *after* attempting a read.

-Howard
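
Howard's line-at-a-time approach can be sketched as follows. The function name words_per_line is hypothetical, and std::istringstream is one common way to split a line into words; the thread does not prescribe it. For an ordinary std::string, operator>> stops at whitespace, which is what the inner loop relies on.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): read a stream line by line
// and split each line into whitespace-delimited words.
std::vector<std::vector<std::string>> words_per_line(std::istream& in)
{
    std::vector<std::vector<std::string>> lines;
    std::string line;
    // std::getline is the loop condition, so eof() never needs to be
    // tested before a read has been attempted.
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<std::string> words;
        std::string word;
        while (fields >> word) {
            words.push_back(word);
        }
        lines.push_back(words);  // an empty line yields an empty entry
    }
    return lines;
}
```

Keeping one entry per line preserves line boundaries; if only the words matter, extracting directly with while (f >> word) is simpler.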
 
