Question about Encode (Windows-1252 to utf-8)

williams.wilkie

Hello! I have recently been turned on to Encode. We have some folks
who are copying and pasting from Word straight into our CMS and the
need to convert from "Windows-1252" to "utf-8" is now critical.

For a one-liner I have been using this:

perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

It works well for editing in place.

My quandary is that I now need to tackle multiple files in a directory.
Another developer mentioned that if "UTF-8" and "Windows-1252" are
intermixed in a file, Encode may get confused, and that I should do a
transliteration instead, like this:

tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

Is that really true? And when it comes to opening and closing
filehandles for this, should I be using something like
"binmode OUTPUTFILEHANDLE, ':bytes';"?

I am impressed with Encode but any advice or words that anyone wants
to throw in would be greatly appreciated.

Wilkie
flames go quietly to /dev/null
 
Ted Zlatanov

On Tue, 8 Jul 2008 16:40:53 -0700 (PDT) (e-mail address removed) wrote:

ww> Hello! I have recently been turned on to Encode. We have some folks
ww> who are copying and pasting from Word straight into our CMS and the
ww> need to convert from "Windows-1252" to "utf-8" is now critical.

ww> For a one liner I have been using this....
ww> perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")'
ww> file1.txt file2.txt

ww> Works good for editing in place.

ww> My quandary is that now I need to tackle multiple files in a directory
ww> and another developer mentioned that if "UTF-8" and "Windows-1252" are
ww> intermixed in a file that it may get confused

Why don't you try it? If it doesn't work for you, post an example and
what fails.
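
A minimal sketch of what such an experiment might look like, using only the core Encode module. The sample bytes are made up for illustration: an "é" that is already UTF-8, followed by a Windows-1252 curly quote.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Illustrative input: "é" already encoded as UTF-8 (bytes C3 A9),
# followed by a Windows-1252 left double quote (byte 93).
my $mixed = "\xC3\xA9\x93";

my $copy = $mixed;
from_to($copy, "windows-1252", "utf-8");

# from_to treats every input byte as one Windows-1252 character, so the
# two UTF-8 bytes of "é" are re-encoded as "Ã©" (C3 83 C2 A9), while
# the lone 0x93 is converted correctly to U+201C (E2 80 9C).
print join(" ", map { sprintf "%02X", ord($_) } split //, $copy), "\n";
# prints: C3 83 C2 A9 E2 80 9C
```

So the conversion does not fail loudly on mixed input; the part that was already UTF-8 just ends up double-encoded.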

ww> and I should do a transliteration like..

ww> tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

I would avoid that solution; it's extremely dangerous compared to
Encode. You may destroy valid UTF-8 data.
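
To illustrate the danger, here is a small sketch with a made-up input. The byte 0x93 also appears inside perfectly valid UTF-8: "Ó" (U+00D3) is encoded as C3 93. The target character is written as \x{201C} here, which is the same code point as LEFT DOUBLE QUOTATION MARK.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Ó" (U+00D3) is the two bytes C3 93 in UTF-8, so a valid UTF-8 file
# can contain the byte 0x93 without containing any Windows-1252 quote.
my $utf8_bytes = "\xC3\x93";

# Byte-level transliteration, as suggested in the question.
(my $damaged = $utf8_bytes) =~ tr/\x93/\x{201C}/;

# The string now holds U+00C3 followed by U+201C instead of "Ó".
printf "U+%04X ", ord($_) for split //, $damaged;
print "\n";

# For comparison: decoding the untouched bytes gives the single
# character that was intended.
my $intact = decode("UTF-8", $utf8_bytes);
printf "U+%04X\n", ord($intact);
```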

ww> I wonder if that's really true and when it comes to open and closing
ww> file handles for this should I be using something like "binmode
ww> OUTPUTFILEHANDLE, ':bytes';"

Maybe, depending on the file contents. Again, try it.
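
One way to try it is to keep both filehandles in binary mode and do the conversion explicitly with Encode. This is only a sketch: convert_file is a hypothetical helper name, and it uses the ':raw' layer rather than ':bytes', on the assumption that the goal is simply to stop Perl from transforming the data on the way in or out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read a Windows-1252 file as raw bytes and write a
# UTF-8 copy to a separate path (no in-place editing).
sub convert_file {
    my ($in_path, $out_path) = @_;

    open my $in, '<:raw', $in_path or die "Cannot read $in_path: $!";
    my $bytes = do { local $/; <$in> };    # slurp the whole file
    close $in;

    # Bytes -> characters, then characters -> UTF-8 bytes.
    my $text = decode('cp1252', $bytes);

    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    print {$out} encode('UTF-8', $text);
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Assumed usage for a whole directory (file pattern is illustrative):
# convert_file($_, "$_.utf8") for glob '*.txt';
```

Writing to a new file keeps the original around, which helps if the input turns out not to be pure Windows-1252.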

Ted
 
Jürgen Exner

ww> My quandary is that now I need to tackle multiple files in a directory
ww> and another developer mentioned that if "UTF-8" and "Windows-1252" are
ww> intermixed in a file that it may get confused and I should do a
ww> transliteration like..

Unless the file format supports multiple encodings within the same file
(e.g. a MIME email), a file can have only one encoding.

ww> tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

Nuts!

ww> I am impressed with Encode but any advice or words that anyone wants
ww> to throw in would be greatly appreciated.

The only way to survive the encoding nightmare and stay sane is to
standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
recommend UTF-8, but that's up to you.
Any conversion between this standard format and other formats should
happen (if at all) _ONLY_ at the point of user interaction, e.g. to
support legacy email clients that don't handle UTF-8, or to accept input
from a web page in ISO 8859-15 or in a Greek, Arabic or Chinese
encoding. Of course, if at all possible, even this user interaction
should use the agreed-upon standard.

jue
(with a decade of internationalizing and localizing software)
 
worldcyclist

JE> Unless the file format supports multiple encodings within the same file
JE> (like e.g. a MIME email) a file can have only one encoding.

JE> The only way to survive the encoding nightmare and stay sane is to
JE> standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
JE> recommend UTF-8, but that's up to you.
JE> Any conversion between this standard format and other formats happens
JE> (if at all) _ONLY_ for user interaction, e.g. to support legacy email
JE> clients which don't support UTF-8 or accept input from a web page in ISO
JE> 8859-15 or even Greek, Arabic or Chinese or similar tasks. Of course, if
JE> at all possible even this user interaction should use the agreed-upon
JE> standard.

JE> jue
JE> (with a decade of internationalizing and localizing software)

I have seen this before with other CMSs: someone types some text, then
cuts and pastes more from Word, and the data is mixed by the time it is
stored in MySQL. MySQL doesn't care what encoding you hand it, but the
problem surfaces when automated routines generate static XML files from
the faulty data for use elsewhere, and those files end up with mixed
encodings.

This certainly makes the point that the data needs to be validated
before going into the db, but I can feel the poster's pain regarding
this issue.
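
As a rough idea of what such validation could look like, here is a sketch that checks whether a byte string is well-formed UTF-8 before it is stored. is_valid_utf8 is a hypothetical helper name; the check relies on Encode's strict "UTF-8" decoder together with the FB_CROAK flag, which makes decode die on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
# LEAVE_SRC keeps decode from modifying the caller's data.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9") ? "ok\n" : "reject\n";   # UTF-8 "café"
print is_valid_utf8("\x93quoted\x94") ? "ok\n" : "reject\n"; # raw cp1252 quotes
```

Input that fails the check could be rejected, or treated as Windows-1252 and converted, before it ever reaches the database.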

Maybe specifying your IN and OUT filehandles as ':bytes' would help
(to preserve the data and inhibit any automatic encoding that might
make unexpected changes to your already valid UTF-8).
Once you have read the data in, use the transliteration method you
described to change things. I'm not a huge fan of that method either,
but that's the way it was done not too many years ago.
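
A possible alternative to a plain tr, sketched under some assumptions: leave byte sequences that already look like well-formed multi-byte UTF-8 alone, and convert only the remaining high bytes from Windows-1252. fix_mixed is a hypothetical name, and the UTF-8 pattern is simplified (it does not reject overlong forms or surrogates). It is also a heuristic, since Windows-1252 text that happens to look like UTF-8 will be left as-is.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One multi-byte UTF-8 sequence (simplified structural check only).
my $utf8_multibyte = qr/
      [\xC2-\xDF][\x80-\xBF]
    | [\xE0-\xEF][\x80-\xBF]{2}
    | [\xF0-\xF4][\x80-\xBF]{3}
/x;

# Hypothetical helper: takes raw bytes that may mix UTF-8 and cp1252,
# returns UTF-8 bytes. ASCII is never touched.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{ ($utf8_multibyte) | ([\x80-\xFF]) }
               { defined $1 ? $1 : encode('UTF-8', decode('cp1252', $2)) }gex;
    return $bytes;
}

# UTF-8 "é" is preserved; the stray cp1252 quotes become U+201C / U+201D.
my $fixed = fix_mixed("\xC3\xA9 \x93ok\x94");
print join(" ", map { sprintf "%02X", ord($_) } split //, $fixed), "\n";
```

The data has to be read as raw bytes for this to work, which fits the idea of binary-mode filehandles above.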

I'd like to see other suggestions on this one too.
JC
 
