UNICODE: reinventing the wheel with WSUCONV

Willow · Mar 12, 2012

After one weekend of work (and no cheating -- I only used Wikipedia!)
I am happy to announce I have developed my own UNICODE reading/writing
library in C++ (yes, I realize this group is for the C language. If
you know of a better group to use, let me know).

The library comes with a sample main() program that converts between
UTF-8, UTF-16 (big- and little-endian), and standard ASCII.

It's called WSUCONV and you can find it here:

http://code.google.com/p/netwidecc/downloads/list

It's under the New BSD license. One of the features is it supports
UNICODE file names on both UNIX-like OSes and Windows.

If anyone is so kind as to report any bugs or other problems you
discover, assuming you have an interest in UNICODE, that would be
greatly appreciated. I am developing a C compiler called NCC that
generates NASM code, and I wanted to accept UNICODE input source
files, and I have no problem reinventing the wheel at all--as long as
I'm learning a lot of stuff--hence this code.

Noob · Mar 12, 2012

Willow said:
I am happy to announce I have developed my own UNICODE reading/writing
library in C++ (yes, I realize this group is for the C language. If
you know of a better group to use, let me know).

comp.lang.c++ ?

Willow · Mar 13, 2012

Be aware that a BOM on UTF-8 often wrecks the file for its intended
use (e.g. on PHP it sends a BOM to the web browser and ensures that
subsequent header() calls FAIL). It probably would also wreck C
source as fed to gcc, even if gcc will (accidentally) accept UTF-8
in quoted strings and comments.

WSUCONV has a feature where it omits BOMs on UTF-8 output when the
source was UTF-8 unless the source had a UTF-8 BOM as well.
Since this was just a demonstration utility to show how to use the
library code, I think it's good enough for my purposes.
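
For readers following the BOM discussion, here is a minimal sketch of detecting a UTF-8 BOM at the start of a buffer. The function name utf8_bom_length is hypothetical and is not taken from WSUCONV; the only fact relied on is that the UTF-8 encoding of U+FEFF is the byte sequence EF BB BF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of WSUCONV): returns 3 if buf begins with
   the UTF-8 byte order mark EF BB BF, otherwise 0. The result can be used
   as the number of leading bytes to skip before decoding. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0)
        return sizeof bom;
    return 0;
}
```

A converter could remember whether this returned 3 for the source and use that to decide whether to write a BOM on UTF-8 output, which is the policy described above.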

I'm not aware of any system where the character 0x04 in a *FILE*
terminates the file (whether you're reading the file in binary or
text mode doesn't matter in UNIX). If it's coming from a terminal,
UNIX terminal drivers will interpret that as EOF (not an "EOF
character", which doesn't exist).

Point taken. Thanks for finding this problem, I fixed it in the
Subversion repository which now contains version 1.03 in the ncc/src/
tests/ folder.
You can find the latest version by clicking the "Source" tab from
here: http://code.google.com/p/netwidecc

I noticed on Windows that when I was reading from stdin, Ctrl-Z would
be properly detected as EOF only if it was at the beginning of a new
blank line. I believe this is intended behavior and I assume Linux
works similarly with Ctrl-D.

My question is, if I have a non-blank line on Linux then type text and
press Ctrl-D (where reading is done via fgetc()) and press Enter, will
I read a character 0x04 from stdin? Or what? Will fgetc() return EOF??

As long as Linux users are used to Ctrl-D not working except at the
beginning of a new blank line, I think the right thing to do is take
out the recognition of 0x04 as EOF. However on Windows, Windows users
are used to Ctrl-Z terminating input from an interactive program in
text mode, even if it's not at the beginning of a line--so I think if
we're in text mode (which applies only to Windows like platforms) then
0x1a should still be treated as EOF.

What I had to do was put code in there for when reading from stdin so
that if I enter text on a line and then hit Ctrl-Z and press Enter,
EOF is correctly detected. This applies to "text mode" on Windows-like
OSes where character 0x1a really means EOF (but you can also get EOF
from fgetc() returning EOF--this happens when character 0x1a is present
at the beginning of a line if I recall correctly). I wanted behavior
similar to "COPY CON FOO.TXT" from the Command Prompt of Windows.

I'm not sure if Ctrl-D will correctly indicate EOF from Linux if it is
not at the beginning of a new line of input. Will I get character 0x04
instead? I assumed I would, but I took out the recognition of character
0x04 as EOF from "text mode" because Linux doesn't really have a "text
mode" with line translations and such as Windows does.

In my BASH on Windows, CAT does strange things when I type "foo" and
then hit Ctrl-D. It duplicates the input, showing "foo" back at me,
but does not detect EOF!

All this is related to interactive mode for the NCC C/C++ compiler I
am writing. I want it to accept UTF-8 input via
ncc1 <utf8.txt
or read from STDIN and output NASM-compatible assembly code in an
interactive way.

When the user is done, if they hit Ctrl-D but aren't at the beginning
of a line, what happens on Linux?

Ben Bacarisse · Mar 13, 2012

Willow said:
I noticed on Windows that when I was reading from stdin, Ctrl-Z would
be properly detected as EOF only if it was at the beginning of a new
blank line.
I believe this is intended behavior and I assume Linux works similarly
with Ctrl-D.

No, both the mechanism and the behaviour are different.

My question is, if I have a non-blank line on Linux then type text and
press Ctrl-D (where reading is done via fgetc()) and press Enter, will
I read a
character 0x04 from stdin? Or what? Will fgetc() return EOF??

Not usually. First, ^D is arbitrary -- you can choose to use another
character if you like. Second, ^D won't normally be seen by your
program -- it is processed by the tty driver. It is this driver that
closes the input to your program in response to seeing ^D (or whatever
you've decided to use). If you do want to type ^D so your program will
see it, the driver usually has a "take the next character literally"
character. Hence if I type ^V^D I will get a 0x4 byte to be read.
Finally, ^D works mid-line as well as at the start, but if you want to
end the input mid-line you usually have to type two ^Ds.

As long as Linux users are used to Ctrl-D not working except at the
beginning of a new blank line, I think the right thing to do is take
out the recognition
of 0x04 as EOF.

0x4 does not mark the end of a file, and you should treat it exactly
like any other character! C provides a way to test when the input is
exhausted -- fgetc returns EOF (which is not equal to any character) and
that's how you know there is no more input.

However on Windows, Windows users are used to Ctrl-Z
terminating input from an interactive program in text mode, even if
it's not at the beginning
of a line--so I think if we're in text mode (which applies only to
Windows like platforms) then 0x1a should still be treated as EOF.

I don't think you have to take any special action. You certainly didn't
have to "in the old days".

What I had to do was put code in there for when reading from stdin so
that if I enter text on a line and then hit Ctrl-Z and press Enter,
EOF is
correctly detected. This applies to "text mode" on Windows-like OSes
where character 0x1a really means EOF (but you can also get EOF from
fgetc() returning
EOF--this happens when character 0x1a is present at the beginning of a
line if I recall correctly). I wanted behavior similar to "COPY CON
FOO.TXT" from
the Command Prompt of Windows.

I don't follow what you are saying but since I know little about modern
Windows, I could not help anyway.

I'm not sure if Ctrl-D will correctly indicate EOF from Linux if it is
not at the beginning of a new line of input. Will I get character 0x04
instead?

That's the same question as above. I think you misunderstand how
Unix-like systems deal with signalling the end of the input. If your
program gets a 0x4 byte it is because the user wanted your program to
get it, so treat it like any other input.

I
assumed I would, but I took out the recognition of character 0x04 as
EOF from "text mode" because Linux doesn't really have a "text mode"
with line translations
and such as Windows does.

The ^D mechanism works no matter what mode your C program is using,
though normally you don't get to choose the mode of stdin -- it's
pre-opened. If you are reading a genuine file (or stdin is not attached
to a tty) then ^D is just a character like any other.

In my BASH on Windows, CAT does strange things when I type "foo" and
then hit Ctrl-D. It duplicates the input, showing "foo" back at me,
but does not detect EOF!

bash and cat in Windows will follow Windows input methods.

All this is related to interactive mode for the NCC C/C++ compiler I
am writing. I want it to accept UTF-8 input via
ncc1 <utf8.txt
or read from STDIN and output NASM-compatible assembly code in an
interactive way.

When the user is done, if they hit Ctrl-D but aren't at the beginning
of a line, what happens on Linux?

It's all much simpler than you think it is. Read characters until you
get EOF. Linux people will know what to do and so will Windows people.
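
The "read until EOF" advice can be sketched as below. The function name copy_until_eof is hypothetical; the loop uses only standard fgetc() and fputc(), and it gives bytes 0x04 and 0x1A no special meaning of its own.

```c
#include <stdio.h>

/* Hypothetical helper: copies `in` to `out` until fgetc() reports EOF and
   returns the number of bytes copied. Bytes such as 0x04 and 0x1A are
   treated as ordinary data here; any end-of-input handling is left to the
   terminal driver or the C runtime (a Windows runtime in text mode may
   itself report EOF when it sees 0x1A). */
size_t copy_until_eof(FILE *in, FILE *out)
{
    size_t copied = 0;
    int c;  /* int rather than char, so EOF stays distinct from data */

    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out) == EOF)
            break;  /* stop on a write failure */
        copied++;
    }
    return copied;
}
```

A filter would typically call copy_until_eof(stdin, stdout) and leave it to the user to signal end of input in whatever way their platform expects.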

Kaz Kylheku · Mar 13, 2012

I noticed on Windows that when I was reading from stdin, Ctrl-Z would
be properly detected as EOF only if it was at the beginning of a new
blank line.
I believe this is intended behavior and I assume Linux works similarly
with Ctrl-D.

Unix does not work that way. Ctrl-D does execute its action in the middle of a
line.

Its action is: "wake up the process now which is waiting on the tty read, and
make it return to the caller all the bytes gathered so far during
this call (possibly zero)".

Various behaviors follow from this.
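
One such behavior can be sketched with the POSIX read() call. The function name drain_fd is hypothetical. The sketch assumes a POSIX system and treats a zero-length read as end of input, which is what a read on a terminal returns when ^D is typed with nothing pending on the line. Typed mid-line, ^D would instead make read() return the bytes gathered so far, and the loop would simply go around again.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: reads from fd until read() returns 0 and reports
   the total number of bytes seen, or -1 on a read error. */
long drain_fd(int fd)
{
    char buf[256];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n > 0) {
            total += n;         /* got some bytes; keep reading */
        } else if (n == 0) {
            return total;       /* zero-length read: end of input */
        } else if (errno != EINTR) {
            return -1;          /* genuine error */
        }
        /* EINTR: interrupted by a signal, retry the read */
    }
}
```

This is also consistent with the observation that ending input mid-line usually takes two ^Ds: the first delivers the partial line, and the second produces the zero-length read.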

James Kuyper · Mar 13, 2012

On 03/13/2012 12:32 AM, Ben Bacarisse wrote:
....

like any other character! C provides a way to test when the input is
exhausted -- fgetc returns EOF (which is not equal to any character) and
that's how you know there is no more input.

Keep in mind that EOF could also indicate an I/O error, not just EOF.
You need feof() or ferror() to disambiguate those possibilities (unless
you want to treat them both the same way).

When fgetc() successfully reads a character, it returns a value != EOF
on most (possibly all) real-world implementations. However, a conforming
implementation of C could have UCHAR_MAX > INT_MAX (which implies
CHAR_BIT >= 16), in which case fputc(EOF, stream) must necessarily write
a character to that stream, which if successfully read back by fgetc(),
would cause it to return EOF. If you wish to protect against this
(admittedly, extremely unlikely) possibility, you need to check both
feof() and ferror() if fgetc() returns EOF.
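
A minimal sketch of that check follows. The function name count_chars is hypothetical; feof() and ferror() are the standard calls named above. When fgetc() returns EOF and neither indicator is set, the sketch assumes the value was a genuine character, as in the unusual implementation just described.

```c
#include <stdio.h>

/* Hypothetical helper: counts the characters readable from `in`.
   Returns 0 when input ended normally and -1 on an I/O error. */
int count_chars(FILE *in, unsigned long *count)
{
    *count = 0;
    for (;;) {
        int c = fgetc(in);

        if (c == EOF) {
            if (ferror(in))
                return -1;  /* read error */
            if (feof(in))
                return 0;   /* genuine end of input */
            /* Neither indicator is set: a real character whose value
               happens to equal EOF, so fall through and count it. */
        }
        ++*count;
    }
}
```

On mainstream implementations the fall-through branch is never reached, so the function behaves like an ordinary read-until-EOF loop that also distinguishes errors from end of input.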
