How to read unicode file line by line on Linux platform

Discussion in 'C Programming' started by hezhenjie@gmail.com, Sep 3, 2005.

  1. Guest

    Hi, all:
    I need to parse a Unicode file and read its data one line at a
    time.
    On Windows I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and it
    works fine.

    However, when I port the code to Linux, it breaks.
    Linux only has fopen(), and fgetws() does not read the lines
    correctly; in fact, it returns nothing.

    I thought of using fread() instead, but it does not read data one line
    at a time.

    Is there a good way to solve this problem?

    Thanks~
     
    , Sep 3, 2005
    #1

  2. <> wrote in message
    news:...
    > Hi, all:
    > I just need to parse a unicode file, and assume to get data one line
    > by one line.
    > I use _wfopen(), fgetws(), wcslen(), wcsstr(), making it work
    > normally on Windows platform.
    >
    > However, when migrate it to Linux platform, issue occurs.
    > Linux only has fopen() function, and fgetws() could not correctly get
    > lines, in fact, it gets nothing.
    >
    > I thought to use fread() instead, but it could not get data one line by
    > one line.
    >
    > Is there any good way to solve this problem?


    Yes, go to www.unicode.org and get yourself the article "To the BMP and
    beyond!" by Muller of Adobe Systems, the Unicode FAQ, the Unicode standard
    and some charts. Find out how "code points" are stored in UTF-8 and UTF-16.
    Write code to read/write code points in the needed UTF from/to the file.
    Then process the file code point by code point. Most likely you'll only
    need to look for code points with values of 13 and 10 (i.e. the famous
    '\r' and '\n' :) to find out where the lines begin and end. But for full
    Unicode coverage, please do read the Unicode FAQ and standard.
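A minimal sketch of the "code point by code point" approach, assuming the file is UTF-16LE and is opened in binary mode with plain fopen(). The function names are hypothetical, not from any library. A leading BOM, if present, comes back as the code point U+FEFF and is left for the caller to skip.

```c
#include <stdint.h>
#include <stdio.h>

/* Read one 16-bit UTF-16LE code unit; -1 at EOF or on a truncated unit. */
static long read_unit_le(FILE *fp)
{
    int lo = fgetc(fp);
    int hi = fgetc(fp);
    if (lo == EOF || hi == EOF)
        return -1;
    return ((long)hi << 8) | lo;
}

/* Read one code point, combining surrogate pairs.
   Returns -1 at EOF; an unpaired surrogate is reported as U+FFFD. */
long read_codepoint_utf16le(FILE *fp)
{
    long u = read_unit_le(fp);
    if (u < 0)
        return -1;
    if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
        long v = read_unit_le(fp);
        if (v >= 0xDC00 && v <= 0xDFFF)
            return 0x10000 + ((u - 0xD800) << 10) + (v - 0xDC00);
        if (v >= 0)
            fseek(fp, -2L, SEEK_CUR);            /* put the unit back */
        return 0xFFFD;
    }
    if (u >= 0xDC00 && u <= 0xDFFF)              /* stray low surrogate */
        return 0xFFFD;
    return u;
}

/* Read one line into buf (at most cap code points). '\r' is dropped and
   the terminating '\n' is not stored; extra code points on an overlong
   line are discarded. Returns the number stored, or -1 if EOF was
   reached before anything was read. */
long read_line_utf16le(FILE *fp, uint32_t *buf, size_t cap)
{
    size_t n = 0;
    int got_any = 0;
    long cp;

    while ((cp = read_codepoint_utf16le(fp)) >= 0) {
        got_any = 1;
        if (cp == '\n')
            break;
        if (cp == '\r')
            continue;
        if (n < cap)
            buf[n++] = (uint32_t)cp;
    }
    return got_any ? (long)n : -1;
}
```

A caller would open the file with fopen(path, "rb") and call read_line_utf16le() in a loop until it returns -1. For UTF-16BE, only the byte order in read_unit_le() changes.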

    HTH
    Alex
     
    Alexei A. Frounze, Sep 3, 2005
    #2

  3. wrote:
    > Hi, all:
    > I just need to parse a unicode file, and assume to get data one line
    > by one line.


    My first guess at "unicode file" would be a file which contains some
    documentation on Unicode, kinda like this "unicode file" (not the link,
    but the actual file):

    http://www.unicode.org/faq/basic_q.html#a

    > I use _wfopen(), fgetws(), wcslen(), wcsstr(), making it work
    > normally on Windows platform.
    >
    > However, when migrate it to Linux platform, issue occurs.
    > Linux only has fopen() function, and fgetws() could not correctly get
    > lines, in fact, it gets nothing.
    >
    > I thought to use fread() instead, but it could not get data one line by
    > one line.


    So, with what encoding are the file's contents encoded? Note that "unicode"
    is not an answer. Possible answers are UTF-16LE, UTF-16BE, UTF-16 with
    BOM, UTF-8, UTF-7, ASCII, ISO-8859-1, ISO-2022-JP, Big5, etc.
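When the encoding is not known in advance, one common first step is to look for a byte order mark at the start of the file. The sketch below is an illustration only; the enum and function names are hypothetical. A file without a BOM comes back as ENC_UNKNOWN, which says nothing about whether it is UTF-8, ASCII, ISO-8859-1 or anything else.

```c
#include <stddef.h>

enum text_encoding {
    ENC_UNKNOWN,
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UTF32LE,
    ENC_UTF32BE
};

/* Inspect the first n bytes of a file for a byte order mark.
   UTF-32 is checked first because the UTF-32LE BOM (FF FE 00 00)
   starts with the UTF-16LE BOM (FF FE). This means a UTF-16LE file
   whose first character is U+0000 would be misreported as UTF-32LE. */
enum text_encoding detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return ENC_UTF32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return ENC_UTF32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;
}
```

Typical use would be to fread() the first four bytes of the file, pass them to detect_bom(), and then seek past the BOM before decoding.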

    I'll take a guess, though. Likely it's one of the UTF-16 encodings. In which
    case, note that for Linux the natural encoding meant for representing the
    Unicode character map is UTF-8. UTF-8 and UTF-16 are wildly different from
    the standpoint of C. You'll need to convert the file. A great C library
    for dealing with the myriad issues with Unicode and UTF is ICU:

    http://icu.sourceforge.net/
    http://www-306.ibm.com/software/globalization/icu/index.jsp
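For a small tool where pulling in ICU is not an option, the UTF-8 side of the conversion can be written by hand. This is a sketch, not ICU's API: encode_utf8() is a hypothetical helper that turns one already-decoded code point into UTF-8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out, which must have
   room for at least 4 bytes. Returns the number of bytes written, or 0
   if cp is not encodable (a surrogate, or a value above U+10FFFF). */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                              /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)             /* surrogates are invalid */
        return 0;
    if (cp < 0x10000) {                           /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                         /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Once the file is UTF-8, the ordinary byte-oriented functions such as fgets() and strstr() find '\n' and ASCII substrings correctly, because no byte of a multi-byte UTF-8 sequence is below 0x80.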

    If I sound harsh or condescending it's because Unicode and UTF require a
    significant rethinking of how one deals with text, and that cannot be
    overstated. It goes way beyond the differences between UTF-16 and UTF-8.
    And having to interoperate with broken software all day has hardened me.

    Also note that this is all beyond the scope of what comp.lang.c deals with.

    - Bill
     
    William Ahern, Sep 4, 2005
    #3
