How to read a Unicode file line by line on Linux

hezhenjie

Hi, all:
I need to parse a Unicode file and read its data one line at a
time.
I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and this works
correctly on Windows.

However, after porting the code to Linux, a problem occurs.
Linux only has fopen(), and fgetws() does not read the lines
correctly; in fact, it returns nothing.

I thought of using fread() instead, but it does not read data one line
at a time.

Is there a good way to solve this problem?

Thanks~
 
Alexei A. Frounze

Yes, go to www.unicode.org and get yourself the article "To the BMP and
beyond!" by Muller of Adobe Systems, Unicode FAQ, Unicode standard and some
charts. Find out how "code points" are stored in UTF-8 and UTF-16. Write
code to read/write code points in the needed UTF from/to the file. Then
process the file code point by code point. Most likely you'll only need to
look for code points with values of 13 and 10 (i.e. the famous '\r' and '\n'
:) to find out where the lines begin and end. But for full Unicode coverage,
please do read the Unicode FAQ and standard.

HTH
Alex
 
William Ahern

My first guess at "unicode file" would be a file which contains some
documentation on Unicode, kinda like this "unicode file" (not the link,
but the actual file):

http://www.unicode.org/faq/basic_q.html#a

So, with what encoding are the file's contents encoded? Note that "unicode"
is not an answer. Possible answers are UTF-16LE, UTF-16BE, UTF-16 with
BOM, UTF-8, UTF-7, ASCII, ISO-8859-1, ISO-2022-JP, Big5, etc.
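
When the encoding is unknown, one place to start is the byte order mark (BOM) at the beginning of the file, if there is one. The sketch below only checks for the UTF-8 and UTF-16 BOMs; the names bom_kind and detect_bom() are made up for this illustration, and the absence of a BOM says nothing definite about the encoding.

```c
#include <stddef.h>

/* Encodings this sketch can recognise from a byte order mark. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE };

/*
 * Inspect the first n bytes of a file and report which BOM, if any,
 * they start with. *bom_len receives the number of bytes to skip.
 * BOM_NONE only means "no BOM found"; the data may still be UTF-8,
 * BOM-less UTF-16, or something else entirely.
 * Limitation: a UTF-32LE BOM (FF FE 00 00) is reported as UTF-16LE.
 */
enum bom_kind detect_bom(const unsigned char *buf, size_t n, size_t *bom_len)
{
    *bom_len = 0;
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16LE;
    }
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16BE;
    }
    return BOM_NONE;
}
```

The first few bytes would typically come from an fread() on a stream opened in binary mode.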

I'll take a guess, though. Likely it's one of the UTF-16 encodings. In which
case, note that for Linux the natural encoding meant for representing the
Unicode character map is UTF-8. UTF-8 and UTF-16 are wildly different from
the standpoint of C. You'll need to convert the file. A great C library
for dealing with the myriad issues with Unicode and UTF is ICU:

http://icu.sourceforge.net/
http://www-306.ibm.com/software/globalization/icu/index.jsp
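
For a plain UTF-16 to UTF-8 conversion, a lighter alternative to ICU is the POSIX iconv interface that glibc provides. The sketch below is not ICU code; it assumes the input is UTF-16LE without a BOM and already in memory, and the function name utf16le_to_utf8() is made up for this illustration.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert in_len bytes of UTF-16LE text to UTF-8 in one call.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * converter cannot be opened, the input is malformed or incomplete,
 * or out is too small.
 */
size_t utf16le_to_utf8(const char *in, size_t in_len, char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;       /* iconv() takes char **, but does not modify the input bytes */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

Once the text is UTF-8, a line feed is the single byte 0x0A and never appears inside a multi-byte sequence, so the usual byte-oriented line reading works on the converted data.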

If I sound harsh or condescending, it's because Unicode and UTF require a
significant rethinking of how one deals with text, and this cannot be
overstated. It goes way beyond the differences between UTF-16 and UTF-8.
And having to interoperate with broken software all day has hardened me.

Also note that this is all beyond the scope of what comp.lang.c deals with.

- Bill
 
