UNICODE: reinventing the wheel with WSUCONV

Discussion in 'C Programming' started by Willow, Mar 12, 2012.

  1. Willow

    Willow Guest

    After one weekend of work (and no cheating -- I only used Wikipedia!)
    I am happy to announce I have developed my own UNICODE reading/writing
    library in C++ (yes, I realize this group is for the C language. If
    you know of a better group to use, let me know).

    The library comes with a sample main() program that converts between
    UTF-8, UTF-16 (big- and little-endian), and standard ASCII.

    It's called WSUCONV and you can find it here:

    http://code.google.com/p/netwidecc/downloads/list

    It's under the New BSD license. One of the features is it supports
    UNICODE file names on both UNIX-like OSes and Windows.

    If anyone is so kind as to report any bugs or other problems you
    discover, assuming you have an interest in UNICODE, that would be
    greatly appreciated. I am developing a C compiler called NCC that
    generates NASM code, and I wanted to accept UNICODE input source
    files, and I have no problem reinventing the wheel at all--as long as
    I'm learning a lot of stuff--hence this code.
     
    Willow, Mar 12, 2012
    #1

  2. Willow

    Noob Guest

    Willow wrote:
    > I am happy to announce I have developed my own UNICODE reading/writing
    > library in C++ (yes, I realize this group is for the C language. If
    > you know of a better group to use, let me know).


    comp.lang.c++ ?
     
    Noob, Mar 12, 2012
    #2

  3. Willow

    Willow Guest

    On Mar 12, 3:39 pm, (Gordon Burditt) wrote:
    > Be aware that a BOM on UTF-8 often wrecks the file for its intended
    > use (e.g. on PHP it sends a BOM to the web browser and ensures that
    > subsequent header() calls FAIL).  It probably would also wreck C
    > source as fed to gcc, even if gcc will (accidentally) accept UTF-8
    > in quoted strings and comments.

    WSUCONV has a feature where it omits BOMs on UTF-8 output when the
    source was UTF-8 unless the source had a UTF-8 BOM as well.
    Since this was just a demonstration utility to show how to use the
    library code, I think it's good enough for my purposes.
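
The BOM policy described above could be sketched roughly as follows. This is not WSUCONV's actual code; the names skip_utf8_bom and write_bom_if_input_had_one are hypothetical, and the sketch assumes a seekable input stream opened in binary mode.

```c
#include <stdio.h>

/* The three bytes of a UTF-8 byte order mark. */
static const unsigned char utf8_bom[3] = { 0xEF, 0xBB, 0xBF };

/* Hypothetical helper (not part of WSUCONV): consumes a UTF-8 BOM at
   the current position of a seekable stream.
   Returns 1 if a BOM was found and skipped, 0 otherwise. */
int skip_utf8_bom(FILE *fp)
{
    unsigned char buf[3];
    long start = ftell(fp);
    size_t n;

    if (start < 0)
        return 0;               /* not seekable: leave the stream untouched */
    n = fread(buf, 1, sizeof buf, fp);
    if (n == 3 && buf[0] == utf8_bom[0] && buf[1] == utf8_bom[1] &&
        buf[2] == utf8_bom[2])
        return 1;
    fseek(fp, start, SEEK_SET); /* no BOM: rewind to where we started */
    return 0;
}

/* Hypothetical helper: emits a BOM on output only when the input had
   one. Returns 0 on success, -1 on a write error. */
int write_bom_if_input_had_one(FILE *out, int input_had_bom)
{
    if (!input_had_bom)
        return 0;
    return fwrite(utf8_bom, 1, sizeof utf8_bom, out) == sizeof utf8_bom ? 0 : -1;
}
```

A converter would call skip_utf8_bom() once on the input, remember the result, and pass it to write_bom_if_input_had_one() before writing any converted text.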

    > I'm not aware of any system where the character 0x04 in a *FILE*
    > terminates the file (whether you're reading the file in binary or
    > text mode doesn't matter in UNIX).  If it's coming from a terminal,
    > UNIX terminal drivers will interpret that as EOF (not an "EOF
    > character", which doesn't exist).

    Point taken. Thanks for finding this problem, I fixed it in the
    Subversion repository which now contains version 1.03 in the ncc/src/
    tests/ folder.
    You can find the latest version by clicking the "Source" tab from
    here: http://code.google.com/p/netwidecc

    I noticed on Windows that when I was reading from stdin, Ctrl-Z would
    be properly detected as EOF only if it was at the beginning of a new
    blank line. I believe this is intended behavior and I assume Linux
    works similarly with Ctrl-D.

    My question is, if I have a non-blank line on Linux then type text and
    press Ctrl-D (where reading is done via fgetc()) and press Enter, will
    I read a character 0x04 from stdin? Or what? Will fgetc() return EOF?

    As long as Linux users are used to Ctrl-D not working except at the
    beginning of a new blank line, I think the right thing to do is take
    out the recognition of 0x04 as EOF. However, Windows users are used to
    Ctrl-Z terminating input from an interactive program in text mode,
    even if it's not at the beginning of a line--so I think if we're in
    text mode (which applies only to Windows-like platforms) then 0x1a
    should still be treated as EOF.

    What I had to do was put code in there for when reading from stdin so
    that if I enter text on a line and then hit Ctrl-Z and press Enter,
    EOF is correctly detected. This applies to "text mode" on Windows-like
    OSes where character 0x1a really means EOF (but you can also get EOF
    from fgetc() returning EOF--this happens when character 0x1a is
    present at the beginning of a line, if I recall correctly). I wanted
    behavior similar to "COPY CON FOO.TXT" from the Command Prompt of
    Windows.

    I'm not sure if Ctrl-D will correctly indicate EOF from Linux if it is
    not at the beginning of a new line of input. Will I get character 0x04
    instead? I assumed I would, but I took out the recognition of
    character 0x04 as EOF from "text mode" because Linux doesn't really
    have a "text mode" with line translations and such as Windows does.

    In my BASH on Windows, CAT does strange things when I type "foo" and
    then hit Ctrl-D. It duplicates the input, showing "foo" back at me,
    but does not detect EOF!

    All this is related to interactive mode for the NCC C/C++ compiler I
    am writing. I want it to accept UTF-8 input via
    ncc1 <utf8.txt
    or read from STDIN and output NASM-compatible assembly code in an
    interactive way.

    When the user is done, if they hit Ctrl-D but aren't at the beginning
    of a line, what happens on Linux?
     
    Willow, Mar 13, 2012
    #3
  4. Willow <> writes:
    <snip>
    > I noticed on Windows that when I was reading from stdin, Ctrl-Z would
    > be properly detected as EOF only if it was at the beginning of a new
    > blank line.
    > I believe this is intended behavior and I assume Linux works similarly
    > with Ctrl-D.


    No, both the mechanism and the behaviour are different.

    > My question is, if I have a non-blank line on Linux then type text and
    > press Ctrl-D (where reading is done via fgetc()) and press Enter, will
    > I read a character 0x04 from stdin? Or what? Will fgetc() return EOF??


    Not usually. First, ^D is arbitrary -- you can choose to use another
    character if you like. Second, ^D won't normally be seen by your
    program -- it is processed by the tty driver. It is this driver that
    closes the input to your program in response to seeing ^D (or whatever
    you've decided to use). If you do want to type ^D so your program will
    see it, the driver usually has a "take the next character literally"
    character. Hence if I type ^V^D I will get a 0x4 byte to be read.
    Finally, ^D works mid-line as well as at the start, but if you want to
    end the input mid-line you usually have to type two ^Ds.
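
To see which character the tty driver is currently using for this purpose, a program on a POSIX system can query the terminal settings. The following is a minimal sketch using termios; the function name tty_eof_char is hypothetical, and it assumes a POSIX environment (it will not build with a plain Windows C runtime).

```c
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: returns the terminal's configured end-of-input
   character (commonly 4, i.e. ^D), or -1 if fd is not a terminal or
   the terminal is not in canonical (line-editing) mode. */
int tty_eof_char(int fd)
{
    struct termios t;

    if (!isatty(fd) || tcgetattr(fd, &t) != 0)
        return -1;
    if (!(t.c_lflag & ICANON))
        return -1;              /* VEOF only has meaning in canonical mode */
    return t.c_cc[VEOF];
}
```

Calling tty_eof_char(STDIN_FILENO) from an interactive session reports the character the user would type; when stdin is redirected from a file or pipe the function returns -1, because no tty driver is involved.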

    > As long as Linux users are used to Ctrl-D not working except at the
    > beginning of a new blank line, I think the right thing to do is take
    > out the recognition of 0x04 as EOF.


    0x4 does not mark the end of a file, and you should treat it exactly
    like any other character! C provides a way to test when the input is
    exhausted -- fgetc returns EOF (which is not equal to any character) and
    that's how you know there is no more input.
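
A minimal sketch of such a loop, with a hypothetical helper name (count_bytes), might look like this. The bytes 0x04 and 0x1A get no special treatment; the only stopping condition is fgetc() returning EOF.

```c
#include <stdio.h>

/* Hypothetical helper: reads fp to the end and returns the number of
   bytes seen. Control bytes such as 0x04 are counted like any other. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c;                      /* int, not char, so EOF stays distinguishable */

    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

Calling count_bytes(stdin) works the same way whether input comes from a terminal, a pipe, or a redirected file; the platform decides when fgetc() reports EOF.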

    > However on Windows, Windows users are used to Ctrl-Z terminating
    > input from an interactive program in text mode, even if it's not at
    > the beginning of a line--so I think if we're in text mode (which
    > applies only to Windows like platforms) then 0x1a should still be
    > treated as EOF.


    I don't think you have to take any special action. You certainly didn't
    have to "in the old days".

    > What I had to do was put code in there for when reading from stdin so
    > that if I enter text on a line and then hit Ctrl-Z and press Enter,
    > EOF is correctly detected. This applies to "text mode" on Windows-like
    > OSes where character 0x1a really means EOF (but you can also get EOF
    > from fgetc() returning EOF--this happens when character 0x1a is
    > present at the beginning of a line if I recall correctly). I wanted
    > behavior similar to "COPY CON FOO.TXT" from the Command Prompt of
    > Windows.


    I don't follow what you are saying but since I know little about modern
    Windows, I could not help anyway.

    > I'm not sure if Ctrl-D will correctly indicate EOF from Linux if it is
    > not at the beginning of a new line of input. Will I get character 0x04
    > instead?


    That's the same question as above. I think you misunderstand how
    Unix-like systems deal with signalling the end of the input. If your
    program gets a 0x4 byte it is because the user wanted your program to
    get it, so treat it like any other input.

    > I assumed I would, but I took out the recognition of character 0x04
    > as EOF from "text mode" because Linux doesn't really have a "text
    > mode" with line translations and such as Windows does.


    The ^D mechanism works no matter what mode your C program is using,
    though normally you don't get to choose the mode of stdin -- it's
    pre-opened. If you are reading a genuine file (or stdin is not attached
    to a tty) then ^D is just a character like any other.

    > In my BASH on Windows, CAT does strange things when I type "foo" and
    > then hit Ctrl-D. It duplicates the input, showing "foo" back at me,
    > but does not detect EOF!


    bash and cat in Windows will follow Windows input methods.

    > All this is related to interactive mode for the NCC C/C++ compiler I
    > am writing. I want it to accept UTF-8 input via
    > ncc1 <utf8.txt
    > or read from STDIN and output NASM-compatible assembly code in an
    > interactive way.
    >
    > When the user is done, if they hit Ctrl-D but aren't at the beginning
    > of a line, what happens on Linux?


    It's all much simpler than you think it is. Read characters until you
    get EOF. Linux people will know what to do and so will Windows people.

    --
    Ben.
     
    Ben Bacarisse, Mar 13, 2012
    #4
  5. Willow

    Kaz Kylheku Guest

    On 2012-03-13, Willow <> wrote:
    > I noticed on Windows that when I was reading from stdin, Ctrl-Z would
    > be properly detected as EOF only if it was at the beginning of a new
    > blank line.
    > I believe this is intended behavior and I assume Linux works similarly
    > with Ctrl-D.


    Unix does not work that way. Ctrl-D does execute its action in the middle of a
    line.

    Its action is: "wake up the process now which is waiting on the tty read, and
    make it return to the caller all the bytes gathered so far during
    this call (possibly zero)".

    Various behaviors follow from this.
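
One of those behaviors can be sketched with the POSIX read() call directly: if ^D arrives when no bytes are pending, read() returns zero, and a return of zero is what callers (including the C library's stdio) take to mean end of input. The helper name drain_fd is hypothetical, and the sketch assumes a POSIX system.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: drains fd with read() and returns the total
   number of bytes received, or -1 on a read error. */
long drain_fd(int fd)
{
    char buf[256];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n > 0)
            total += n;         /* some bytes were delivered; keep reading */
        else if (n == 0)
            return total;       /* zero bytes delivered: end of input */
        else if (errno != EINTR)
            return -1;          /* a real error, not just an interrupted call */
    }
}
```

On a terminal, typing "foo" and then ^D makes one read() return 3; a second ^D with nothing pending makes the next read() return 0, which is why two ^Ds are needed to end input mid-line.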
     
    Kaz Kylheku, Mar 13, 2012
    #5
  6. Willow

    James Kuyper Guest

    On 03/13/2012 12:32 AM, Ben Bacarisse wrote:
    ....
    > like any other character! C provides a way to test when the input is
    > exhausted -- fgetc returns EOF (which is not equal to any character) and
    > that's how you know there is no more input.


    Keep in mind that a return value of EOF could also indicate an I/O
    error, not just end-of-file. You need feof() or ferror() to
    disambiguate those possibilities (unless you want to treat them both
    the same way).

    When fgetc() successfully reads a character, it returns a value != EOF
    on most (possibly all) real-world implementations. However, a conforming
    implementation of C could have UCHAR_MAX > INT_MAX (which implies
    CHAR_BIT >= 16), in which case fputc(EOF, stream) must necessarily write
    a character to that stream, which if successfully read back by fgetc(),
    would cause it to return EOF. If you wish to protect against this
    (admittedly, extremely unlikely) possibility, you need to check both
    feof() and ferror() if fgetc() returns EOF.
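
A sketch of a read loop that makes both checks is below. The names read_all, READ_END_OF_FILE and READ_IO_ERROR are hypothetical, chosen only for illustration.

```c
#include <stdio.h>

enum read_result { READ_END_OF_FILE, READ_IO_ERROR };

/* Hypothetical helper: reads fp until input ends, storing the number
   of characters read in *count. The return value says why reading
   stopped. */
enum read_result read_all(FILE *fp, long *count)
{
    *count = 0;
    for (;;) {
        int c = fgetc(fp);

        if (c == EOF) {
            if (feof(fp))
                return READ_END_OF_FILE;
            if (ferror(fp))
                return READ_IO_ERROR;
            /* Neither indicator is set: on an exotic implementation this
               is a genuine character whose value equals EOF, so count it. */
        }
        (*count)++;
    }
}
```

On mainstream implementations the fall-through case never triggers, so the loop behaves like the usual `!= EOF` test while also reporting read errors separately.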
    --
    James Kuyper
     
    James Kuyper, Mar 13, 2012
    #6
