Reading a UTF-8 string from a file with the read() function

Sergei · Aug 31, 2004

Hi,
I need to read a string from a UTF-8 encoded text file.
I know the byte position at which the string starts and its length (also
in bytes).
The problem is that the read(FILEHANDLE, SCALAR, LENGTH) function takes
LENGTH in characters, not bytes, when the file is opened as UTF-8.
I've tried opening the file in binary mode instead of UTF-8 so that I can
read the correct length, but then I can't process the string with
regular expressions correctly, because Perl treats it as binary data,
not UTF-8.
I've also tried reading the string with getc(), but that is
unacceptably slow.
Is there a solution?
Thanks a lot,
--Sergei

Brian McCauley · Aug 31, 2004

Sergei said:
I need to read a string from a UTF-8 encoded text file.
I know the byte position at which the string starts and its length (also
in bytes).
The problem is that the read(FILEHANDLE, SCALAR, LENGTH) function takes
LENGTH in characters, not bytes, when the file is opened as UTF-8.
I've tried opening the file in binary mode instead of UTF-8 so that I can
read the correct length, but then I can't process the string with
regular expressions correctly, because Perl treats it as binary data,
not UTF-8.

Is there a solution?

Read the string from the file as binary and then utf8::decode() it.
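A minimal sketch of that approach, assuming the byte offset and byte length are already known. The helper name read_utf8_slice and its parameters are hypothetical, chosen only for illustration. The file is opened with the :raw layer so that seek() and read() both count bytes, and utf8::decode() then converts the bytes to a character string in place.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: read $length bytes starting at byte $offset
# and return them decoded as a Perl character string.
sub read_utf8_slice {
    my ($path, $offset, $length) = @_;

    # :raw means no encoding layer, so read() counts bytes, not characters.
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    seek($fh, $offset, SEEK_SET) or die "Cannot seek in $path: $!";

    my $got = read($fh, my $bytes, $length);
    die "Cannot read $path: $!" unless defined $got;
    die "Short read from $path: got $got of $length bytes"
        unless $got == $length;
    close($fh);

    # utf8::decode() works in place and returns false if the bytes
    # are not valid UTF-8.
    utf8::decode($bytes) or die "Slice is not valid UTF-8";
    return $bytes;
}
```

After the call, length() and regular expressions operate on characters rather than bytes, which is what the original question needed.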

Sergei · Aug 31, 2004

Brian McCauley said:
...
Read the string from the file as binary and then utf8::decode() it.

Brian,
You are right. I did:
use Encode 'decode_utf8';
$Unicode = decode_utf8($bytes);
And it works!
Thanks a lot!
--Sergei

nobull · Sep 1, 2004

Sergei said:
use Encode 'decode_utf8';
$Unicode = decode_utf8($bytes);
And it works!

Yes, you can use Encode::decode_utf8() instead of the built-in
utf8::decode() if you like. Note: when called with a single argument,
Encode::decode_utf8() is simply a wrapper for utf8::decode().
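The two calls are used slightly differently, which a short sketch may make clearer: utf8::decode() converts its argument in place and returns a success flag, while Encode::decode_utf8() returns the decoded string. The variable names here are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $bytes = "caf\xC3\xA9";    # 5 bytes: "café" encoded as UTF-8

# Encode::decode_utf8() returns a new character string.
my $copy = decode_utf8($bytes);

# utf8::decode() modifies its argument in place and returns true
# when the bytes were valid UTF-8.
my $in_place = $bytes;
my $ok = utf8::decode($in_place);

# Byte string has 5 elements; both decoded strings have 4 characters.
printf "%d %d %d\n", length($bytes), length($copy), length($in_place);
```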

Sergei · Sep 3, 2004

nobull said:
Yes, you can use Encode::decode_utf8() instead of the built-in
utf8::decode() if you like. Note: when called with a single argument,
Encode::decode_utf8() is simply a wrapper for utf8::decode().

I didn't know I could use it like this.
This way it's even better.
Thanks!
(The others' messages were useful too. What an excellent thing this
newsgroup is! Thanks a lot, everybody!)

