printf and UTF-8 in Linux

sas

I have a problem printing cyrillic text to stdout in Linux. I know
that it has to be UTF-8. I'm trying to read a symbol, guess that it is
cyrillic encoded as CP1251, and if so output it as cyrillic in UTF-8,
but my code so far doesn't work.

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main()
{
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                        "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }

    while (!feof(stdin)) {
        wchar_t c = fgetc(stdin);

        // 'А'-'Я'
        if (c >= 0xc0 && c <= 0xdf)
        {
            c -= 0xc0;
            c += 0x410;
        }

        // 'а'-'я'
        if (c >= 0xe0 && c <= 0xff)
        {
            c -= 0xe0;
            c += 0x430;
        }

        printf("%lc", c);
    }

    return 0;
}
 
Morris Keesan

I don't know enough about Cyrillic character sets, and international
character sets in C in general, to be able to help with the question
you're asking, but this:

    while (!feof(stdin)) {
        wchar_t c = fgetc(stdin);

doesn't do what you expect. feof(file) only returns true if the
end-of-file indicator has been set for the file, and that indicator
only gets set after you've tried to read one char past the end of
the file. So you'll always try to process c one extra time, when
its value is equal to (wchar_t)EOF.

A common idiom is

    int c;
    while ((c = fgetc(stdin)) != EOF)
    {
        ...
 
Ben Bacarisse

sas said:
I have a problem printing cyrillic text to stdout in Linux. I know
that it has to be UTF-8. I'm trying to read a symbol, guess that it is
cyrillic encoded as CP1251, and if so output it as cyrillic in UTF-8,
but my code so far doesn't work.

The problem is quite likely to be with the input. If the locale is
set for UTF-8 output, how are you going to enter a CP1251 character?

The program actually works fine (despite one oddity) if the input is
as expected. To test it I had to generate a CP1251 file first and use
redirection to get the program to read it.

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main()
{
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                        "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
    }

    while (!feof(stdin)) {
        wchar_t c = fgetc(stdin);

fgetc returns an int so this is a little odd. CP1251 is a single-byte
character set so you don't need a wchar_t to hold it.

Also it is better to test for EOF after trying to read a character or
you will process the EOF:

    int ch;
    while ((ch = fgetc(stdin)) != EOF) { ... }

is the usual pattern.

Now you do need a wchar_t to hold the wide character:

wchar_t c = ch;
 
sas

The problem is quite likely to be with the input. If the locale is
set for UTF-8 output, how are you going to enter a CP1251 character?

That is my problem! I have no idea how locales work, or even which
locale I should use. Can you give me some information?

The program actually works fine (despite one oddity) if the input is
as expected. To test it I had to generate a CP1251 file first and use
redirection to get the program to read it.

Yes, it works fine in the console, showing the correct cyrillic
letters, but mplayer (the player I use in linux) still shows garbled
text. I want to convert *.srt files that I have in CP1251 to something
that's usable under Linux. Does mplayer use a different locale?
 
Ben Bacarisse

sas said:
The problem is quite likely to be with the input.  If the locale is
set for UTF-8 output, how are you going to enter a CP1251
character?
[some typos corrected]

That is my problem! I have no idea how locales work, or even which
locale I should use, can you give me some information?

Someone (who actually knows) could write a book on that. The locale
setting determines various things about your C program. C itself says
very little about exactly what happens, so most of it is off topic
here. However, you did the C bits pretty much correctly.

Yes, it works fine in the console, showing the correct cyrillic
letters, but mplayer (the player I use in linux) still shows garbled
text. I want to convert *.srt files that I have in CP1251 to something
that's usable under Linux. Does mplayer use a different locale?

That's nothing to do with C. I think you need to ask what encodings
mplayer understands but I can't suggest the best place for that.

Why are you writing this? My first thought would have been that someone
must have written this already. See man iconv.
 
sas

Why are you writing this? My first thought would have been that someone
must have written this already. See man iconv.

I didn't know about this program, thanks. I googled about displaying
cyrillic subtitles in Linux, and when I couldn't find anything that
works, thought it would be faster to try and make a small program
myself. But iconv works great, so I don't need that anymore. Thanks
for letting me know about this program.
 
Nobody

I didn't know about this program, thanks. I googled about displaying
cyrillic subtitles in Linux, and when I couldn't find anything that
works, thought it would be faster to try and make a small program
myself. But iconv works great, so I don't need that anymore. Thanks
for letting me know about this program.

iconv is primarily a library function, although many implementations also
provide a program by that name (on Linux, both the function and the
program are provided by GNU libc).

An alternative is the ANSI C functions mbstowcs() and wcstombs() ("wcs"
stands for "wide character string", mbs for "multi-byte string"). These
convert to and from the encoding of the current locale (as set by
setlocale(LC_CTYPE, ...)). This can make them easier to use than iconv()
(if you just need to convert between Unicode and the locale's encoding) or
harder (if you need to convert between arbitrary encodings).
 
