How do I create a new text file with UTF-8 encoding?

bk · May 10, 2007

I use ActivePerl version 5.8.8.817 on Windows XP.

I am trying to create a new text file and add some content, but when I open it
in Notepad, it says it's an ANSI-encoded file. Why?

Here is my code snippet:

open my $fh, '>:encoding(UTF-8)', "testfile.txt";
print $fh "Welcome to Muppet Show\n";
close $fh;

What am I doing wrong?

Jürgen Exner · May 10, 2007

bk said:
I use ActivePerl version 5.8.8.817 on Windows XP.

I am trying to create a new text file and add some content, but when I open it
in Notepad, it says it's an ANSI-encoded file. Why?

open my $fh, '>:encoding(UTF-8)', "testfile.txt";
print $fh "Welcome to Muppet Show\n";
close $fh;

What am I doing wrong?

Your sample text has the identical byte sequence in ASCII, Windows-1252 (aka
ANSI), UTF-8, ISO-8859-1 (Latin-1), ISO-8859-15 (Latin-9), and probably a dozen
other encodings. Therefore your sample is useless for testing for the correct
encoding.
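
To see this concretely, here is a minimal sketch using the core Encode module.
The helper name hex_bytes and the second sample string are assumptions chosen
for illustration; they are not from the original post.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text and return its bytes as hex, e.g. "57 65".
sub hex_bytes {
    my ($encoding, $text) = @_;
    return join ' ', unpack '(H2)*', encode($encoding, $text);
}

my @encodings = qw(cp1252 iso-8859-1 iso-8859-15 UTF-8);

# "Welcome" is pure ASCII, so every encoding yields the same bytes.
# "Gr\x{FC}\x{DF}e" contains two non-ASCII letters (u-umlaut, sharp s),
# so its UTF-8 bytes differ from the single-byte encodings.
for my $text ("Welcome", "Gr\x{FC}\x{DF}e") {
    printf "%-12s %s\n", $_, hex_bytes($_, $text) for @encodings;
    print "\n";
}
```

For the first string all four lines of output should match; for the second,
only the UTF-8 line should differ.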

Notepad relies on the byte order mark (BOM) to identify Unicode files,
including UTF-8, where the BOM of course is meaningless and not used except
by Notepad itself. In not so many words: Notepad has no clue what it is
talking about. But for your sample text, neither would any other tool.

Step 1: use some sample text containing characters that are encoded as
different byte sequences in each encoding.
Step 2: don't use Notepad. Write to a (trivial) HTML file and then use a web
browser to view that file. There you can change the encoding and determine
whether those characters are displayed correctly for the desired encoding.
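
The two steps above can be sketched roughly as follows. The file name
encoding-test.html, the sub name write_test_page, and the sample characters are
assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: write a trivial HTML page containing $sample as UTF-8.
# The page deliberately declares no charset, so the encoding selected in the
# browser decides how the bytes are interpreted.
sub write_test_page {
    my ($file, $sample) = @_;
    open my $fh, '>:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    print {$fh} "<html><head><title>Encoding test</title></head>\n";
    print {$fh} "<body><p>$sample</p></body></html>\n";
    close $fh or die "Cannot close $file: $!";
}

# Step 1: sample text with non-ASCII characters
# (u-umlaut, sharp s, euro sign, Cyrillic Zhe).
my $sample = "Gr\x{FC}\x{DF}e, 100 \x{20AC}, \x{416}";

# Step 2: write it to an HTML file and open that file in a web browser.
write_test_page('encoding-test.html', $sample);
```

If the characters only look right when the browser is set to UTF-8, the file
really is UTF-8.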

In over 8 years as a software localization engineer and international program
manager, this has proven to be the only practical and reliable way to
identify the actual encoding of a file.

jue

Brian McCauley · May 10, 2007

Jürgen said:
Your sample text has the identical byte sequence in ASCII, Windows-1252 (aka
ANSI), UTF-8, ISO-8859-1 (Latin-1), ISO-8859-15 (Latin-9), and probably a dozen
other encodings. Therefore your sample is useless for testing for the correct
encoding.

Notepad relies on the byte order mark (BOM) to identify Unicode files,
including UTF-8, where the BOM of course is meaningless and not used except
by Notepad itself.

You mean Windows, not Notepad. Most Windows programs will recognise a
file with a UTF-8 BOM at the start as UTF-8.

In a situation where you've got a mixture of Windows-1252 and UTF-8
files knocking about, it's not a bad way to distinguish them. I'm
not saying I particularly liked Microsoft's unilateral adoption of the BOM
in UTF-8, but I have to admit it makes the best of a bad job.
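
As a rough illustration of that kind of check, the sketch below tests whether a
file starts with the three bytes EF BB BF, which is U+FEFF encoded as UTF-8.
The sub name has_utf8_bom is hypothetical, and this is not a claim about how
Windows programs implement their detection.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $file begins with the UTF-8 encoded BOM.
sub has_utf8_bom {
    my ($file) = @_;
    open my $fh, '<:raw', $file
        or die "Cannot open $file: $!";
    my $count = read $fh, my $head, 3;   # raw bytes, no decoding
    close $fh;
    return defined $count && $count == 3 && $head eq "\xEF\xBB\xBF";
}

# Report on every file named on the command line.
for my $file (@ARGV) {
    printf "%s: %s\n", $file,
        has_utf8_bom($file) ? 'UTF-8 BOM found' : 'no UTF-8 BOM';
}
```

A missing BOM does not prove the file is Windows-1252; it only means this
particular marker is absent.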

In Perl I'd like to be able to say something like

open my $fh, '>:encoding(UTF-8 BOM)', "testfile.txt";

But AFAIK I can't, and I just have to write the BOM myself:

print $fh "\x{FEFF}"; # BOM
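
Put together with the code from the original question, a minimal sketch might
look like this. The sub name write_utf8_with_bom is an assumption for this
example; the approach is simply the manual print shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: write @lines to $file as UTF-8, preceded by a BOM.
sub write_utf8_with_bom {
    my ($file, @lines) = @_;
    open my $fh, '>:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    print {$fh} "\x{FEFF}";   # the encoding layer writes this as EF BB BF
    print {$fh} @lines;
    close $fh or die "Cannot close $file: $!";
}

write_utf8_with_bom('testfile.txt', "Welcome to Muppet Show\n");
```

The BOM has to be the very first thing printed to the handle, and it should be
written only once per file.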

Jürgen Exner · May 10, 2007

Brian said:
In a situation where you've got a mixture of Windows-1252 and utf8
files knocking about then it's not a bad way to distinguish them. I'm
not saying I particularly liked Microsoft's unilateral adoption of BOM
in utf8 but I have to admit it makes the best of a bad job.

Fair enough, you've got a point.
However, calling it a _Byte_Order_ Mark in the context of UTF-8 is a misnomer
if there ever was one ;-)

jue
