Question regarding Encode


williams.wilkie

Hello! If this is the wrong group my apologies. I'll accept pointing
in the right direction if there is one.

We have several users in our company cutting and pasting from Word
into our CMS, and we now need to convert from "Windows-1252" to
"utf-8".

For in-place editing I have been using

    perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

I now need to convert multiple files in a directory, and another
developer mentioned that if UTF-8 and Windows-1252 are intermixed, the
two character sets could be confused with each other.
Transliteration was suggested:

tr/\x92/\N{RIGHT SINGLE QUOTATION MARK}/;

for example.

What I am wondering is if that is indeed the case. I don't want to
have to resort to transliteration if it isn't necessary.

Maybe I need some kind of check to see whether a file is encoded a
certain way before deciding how to handle it. I can't remember ever
using Encode before, and now we need it on a massive scale.

Any advice would be appreciated.
Wilkie

Flames go quietly to /dev/null
 

Ben Morrow

Quoth (e-mail address removed):
> We have several users in our company cutting and pasting from Word
> into our CMS, and we now need to convert from "Windows-1252" to
> "utf-8".
>
> For in-place editing I have been using
>
>     perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt
>
> I now need to convert multiple files in a directory, and another
> developer mentioned that if UTF-8 and Windows-1252 are intermixed, the
> two character sets could be confused with each other.
> Transliteration was suggested:
>
> tr/\x92/\N{RIGHT SINGLE QUOTATION MARK}/;
>
> for example.
>
> What I am wondering is if that is indeed the case. I don't want to
> have to resort to transliteration if it isn't necessary.

I'm not quite sure what the concerns are here, but it sounds a lot like
superstition. If each file is consistent within itself, then from_to
will work perfectly well; you can use Encode::Guess to figure out
whether a file is UTF8 or 1252, and since you're only using UTF8 or an
8bit superset of ASCII, it should be 100% reliable. It would be best to
feed the whole file to guess_encoding in one go (use File::Slurp rather
than <> or -p), and specify UTF8 first on the list, so that pure ASCII
is guessed as utf8 rather than 1252 (since either is valid, and you
don't need to re-encode that file).
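
As an illustration of the "check first, then convert" idea, here is a minimal sketch. It swaps in a different test from Encode::Guess: it tries a strict UTF-8 decode of the whole file and falls back to Windows-1252 only if that fails. One reason for the swap is that, as far as I understand it, Encode::Guess can report the suspects as ambiguous when a buffer decodes cleanly under more than one of them. The helper name to_utf8_bytes and the command-line driver are hypothetical, and the sketch assumes each file is consistently one encoding or the other.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return UTF-8 bytes for a buffer assumed to be either UTF-8 or Windows-1252.
sub to_utf8_bytes {
    my ($bytes) = @_;

    # Strict "UTF-8" (with the hyphen) rejects malformed sequences.
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $is_utf8 = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $bytes if $is_utf8;    # already valid UTF-8 (or pure ASCII)

    # Otherwise assume Windows-1252 and re-encode as UTF-8.
    return encode('UTF-8', decode('cp1252', $bytes));
}

# Hypothetical driver: convert each file named on the command line in place.
for my $file (@ARGV) {
    open my $in, '<:raw', $file or die "Cannot read $file: $!";
    my $bytes = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $fixed = to_utf8_bytes($bytes);
    next if $fixed eq $bytes;              # nothing to change

    open my $out, '>:raw', $file or die "Cannot write $file: $!";
    print {$out} $fixed;
    close $out or die "Cannot close $file: $!";
}
```

A Windows-1252 file whose high bytes happen to form valid UTF-8 sequences would be left untouched by this check. That is uncommon in ordinary text, but worth spot-checking on a sample before running it over a whole directory.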

If some files contain some portions in UTF8 and some portions in 1252,
then you have a serious problem whatever tool you use. My suggestion
would be to attempt to find blocks you can split the file into, where
each block is guaranteed to have a consistent encoding. Then you can
pass these blocks to guess_encoding individually.
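
A rough sketch of the block-by-block idea follows, under the assumption (which may not hold for your data) that each line is internally consistent, so a line can serve as the block. Instead of guess_encoding it uses a strict UTF-8 decode as the test for each block. The names fix_block and fix_mixed are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert one block: keep it if it is valid UTF-8, else treat it as cp1252.
sub fix_block {
    my ($block) = @_;
    my $is_utf8 = eval {
        decode('UTF-8', $block, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $is_utf8 ? $block : encode('UTF-8', decode('cp1252', $block));
}

# Assumption: each line has one consistent encoding.
# The lookbehind split keeps the trailing newline attached to each line.
sub fix_mixed {
    my ($bytes) = @_;
    return join '', map { fix_block($_) } split /(?<=\n)/, $bytes;
}
```

If a single line can itself contain both encodings, a line is too coarse a block and this will mangle the UTF-8 portion of that line, so a smaller unit would be needed.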

Ben

 

worldcyclist

There are some heuristic algorithms to do just that, but to be honest I
would assume all data is in the same encoding unless you have proof
otherwise. If it isn't, your CMS *REALLY* screwed up.

I have seen this before with other CMSs: someone types some text, then
cuts and pastes more from Word, and the data is mixed by the time it is
stored in MySQL. MySQL doesn't care what encoding you give it, but the
problem comes when automated routines create XML files that are then
stored with mixed encoding (CMS data is stored in MySQL, and another
routine generates static XML files from the faulty data for use
elsewhere).

It certainly makes the point that the data needs to be validated before
going into the db, but I can feel the poster's pain regarding this issue.

Maybe specifying your IN and OUT filehandles as ':bytes' would help
(to preserve the data and inhibit automatic encoding that may result in
unexpected changes to your already formatted UTF-8).
Once you have read the data in, use the transliteration method you
described before to change things. I'm not a huge fan of that method
either, but that's the way it was done not too many years ago.
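
One caution about byte-level replacement on files that may already contain UTF-8: the byte 0x92 is also a legal continuation byte inside multi-byte UTF-8 sequences, so a blind substitution can damage text that was fine. The sketch below illustrates this with a hypothetical naive_fix helper that replaces 0x92 with the UTF-8 bytes for RIGHT SINGLE QUOTATION MARK. It is only a demonstration of the hazard, not a recommended converter.

```perl
use strict;
use warnings;

# Byte-level replacement of 0x92 with the UTF-8 bytes for U+2019.
sub naive_fix {
    my ($bytes) = @_;
    $bytes =~ s/\x92/\xE2\x80\x99/g;
    return $bytes;
}

# A lone cp1252 apostrophe is converted as intended ...
my $cp1252 = "it\x92s";

# ... but U+0112 (E WITH MACRON) is the bytes C4 92 in UTF-8, so the same
# rule rewrites its second byte and leaves a sequence that is not valid UTF-8.
my $utf8 = "\xC4\x92";

# Print the resulting bytes in hex, separated by dots.
printf "%vX\n", naive_fix($cp1252);
printf "%vX\n", naive_fix($utf8);
```

So if the files really are mixed, checking whether a block is already valid UTF-8 before touching any bytes seems safer than transliterating unconditionally.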

I'd like to see other suggestions on this one too.
JC
 
