Question regarding Encode

williams.wilkie · Jul 8, 2008

Hello! If this is the wrong group my apologies. I'll accept pointing
in the right direction if there is one.

We have several users in our company cutting and pasting from Word
into our CMS, and we now need to convert from "Windows-1252" to
"utf-8".

For in-place editing I have been using

perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

I now need to convert multiple files in a directory, and another
developer mentioned that if UTF-8 and Windows-1252 are intermixed, the
two character sets could be confused with each other.
Transliteration was suggested:

tr/\x92/\N{RIGHT SINGLE QUOTATION MARK}/;

for example.

What I am wondering is if that is indeed the case. I don't want to
have to resort to transliteration if it isn't necessary.

Maybe I need some kind of check to see if a file is encoded a certain
way before figuring out how to jump into it. I can't remember ever
using Encode before, and now we need it on a massive scale.

Any advice would be appreciated.
Wilkie

Flames go quietly to /dev/null

Ben Morrow · Jul 9, 2008

Quoth (e-mail address removed):

> We have several users in our company cutting and pasting from Word
> into our CMS, and we now need to convert from "Windows-1252" to
> "utf-8".
>
> For in-place editing I have been using
>
> perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt
>
> I now need to convert multiple files in a directory, and another
> developer mentioned that if UTF-8 and Windows-1252 are intermixed, the
> two character sets could be confused with each other.
> Transliteration was suggested:
>
> tr/\x92/\N{RIGHT SINGLE QUOTATION MARK}/;
>
> for example.
>
> What I am wondering is if that is indeed the case. I don't want to
> have to resort to transliteration if it isn't necessary.

I'm not quite sure what the concerns are here, but it sounds a lot like
superstition. If each file is consistent within itself, then from_to
will work perfectly well; you can use Encode::Guess to figure out
whether a file is UTF-8 or 1252, and since you're only using UTF-8 or an
8-bit superset of ASCII, it should be 100% reliable. It would be best to
feed the whole file to guess_encoding in one go (use File::Slurp rather
than <> or -p), and specify UTF-8 first on the list, so that pure ASCII
is guessed as utf8 rather than 1252 (since either is valid, and you
don't need to re-encode that file).
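
A minimal sketch of the detect-then-convert step, assuming each file is
either valid UTF-8 or Windows-1252 throughout. Rather than calling
guess_encoding, it swaps in a strict UTF-8 decode as the test, because
Encode::Guess can come back with an "ambiguous" error when more than one
suspect decodes cleanly. The helper name to_utf8 is made up for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes the raw bytes of a whole file and returns
# UTF-8 bytes. Assumes the input is either valid UTF-8 or Windows-1252.
sub to_utf8 {
    my ($bytes) = @_;

    # decode() with a CHECK value may modify its argument, so test a copy.
    # FB_CROAK makes decode() die on the first malformed sequence.
    my $copy = $bytes;
    my $is_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };

    # Already UTF-8 (pure ASCII lands here too): nothing to re-encode.
    return $bytes if $is_utf8;

    # Otherwise treat the bytes as Windows-1252 and re-encode.
    return encode('UTF-8', decode('cp1252', $bytes));
}
```

Windows-1252 text that happens to form valid UTF-8 sequences would be
left alone by this test, which is rare in ordinary prose but worth
knowing about.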

If some files contain some portions in UTF-8 and some portions in 1252,
then you have a serious problem whatever tool you use. My suggestion
would be to attempt to find blocks you can split the file into, where
each block is guaranteed to have a consistent encoding. Then you can
pass these blocks to guess_encoding individually.
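
One way the block-by-block idea could look, under the assumption (not
from the thread) that a single line is a block with a consistent
encoding. The check per block is a strict UTF-8 decode instead of
guess_encoding, and the names block_to_utf8 and mixed_to_utf8 are
hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: normalise one block to UTF-8 bytes.
sub block_to_utf8 {
    my ($block) = @_;
    my $copy = $block;    # decode() with CHECK may modify its argument
    return $block if eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return encode('UTF-8', decode('cp1252', $block));
}

# Split after each newline (keeping the newline with its line), convert
# every line on its own, and glue the result back together.
sub mixed_to_utf8 {
    my ($bytes) = @_;
    return join '', map { block_to_utf8($_) } split /(?<=\n)/, $bytes;
}
```

If a single line can itself mix the two encodings, a line is too coarse
a block and this sketch would re-encode the UTF-8 parts of it a second
time.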

Ben

worldcyclist · Jul 11, 2008

There are some heuristic algorithms to do just that, but to be honest I
would assume all data is in the same encoding unless you have proof
otherwise. If it isn't, your CMS *REALLY* screwed up.

I have seen this before with other CMSs where someone types something
and then cuts and pastes from Word, and then the data is mixed when
stored in MySQL. MySQL doesn't care what you have it encoded in, but the
problem comes when automated routines create XML files that are then
stored with mixed encoding (CMS data stored into MySQL, another routine
generates static XML files from the faulty data for usage by other
places).

Certainly makes the point that the data needs to be validated before
going into the db, but I can feel the poster's pain regarding this
issue.

Maybe specifying your IN and OUT filehandles as ':bytes' would help
(to preserve data and inhibit automated encoding that may result in
unexpected changes to your already formatted UTF-8). Once you have read
the data in, use the transliteration method you described before to
change things. I'm not a huge fan of using that method either, but
that's the way it was done not too many years ago.
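
A rough sketch of that byte-level route, with two liberties: it opens
the handles with ':raw' rather than ':bytes', and instead of a bare tr
it uses a substitution that steps over well-formed UTF-8 sequences. A
plain tr/\x92/.../ on raw bytes would also hit a 0x92 byte that is part
of a valid UTF-8 character (for example C3 92). The names
fix_stray_cp1252 and fix_file are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One well-formed UTF-8 sequence, matched on raw bytes.
my $utf8_seq = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: keep valid UTF-8 sequences untouched and re-encode
# every other byte as if it were Windows-1252.
sub fix_stray_cp1252 {
    my ($bytes) = @_;
    $bytes =~ s{ ($utf8_seq) | (.) }
               { defined $1 ? $1 : encode('UTF-8', decode('cp1252', "$2")) }gsex;
    return $bytes;
}

# Read and write through :raw so Perl applies no encoding layer itself.
sub fix_file {
    my ($in_path, $out_path) = @_;
    open my $in, '<:raw', $in_path or die "open $in_path: $!";
    my $bytes = do { local $/; <$in> };    # slurp the whole file
    close $in;
    open my $out, '>:raw', $out_path or die "open $out_path: $!";
    print {$out} fix_stray_cp1252($bytes);
    close $out or die "close $out_path: $!";
}
```

This is still a guess about intent: two Windows-1252 characters that
happen to form a valid UTF-8 sequence will be kept as the UTF-8
character.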

I'd like to see other suggestions on this one too.
JC
