Question about Encode (Windows-1252 to utf-8)

Discussion in 'Perl Misc' started by williams.wilkie@gmail.com, Jul 9, 2008.

  1. Guest

    Hello! I have recently been turned on to Encode. We have some folks
    who are copying and pasting from Word straight into our CMS and the
    need to convert from "Windows-1252" to "utf-8" is now critical.

    For a one liner I have been using this....
    perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")'
    file1.txt file2.txt

    Works well for editing in place.

    My quandary is that now I need to tackle multiple files in a directory,
    and another developer mentioned that if "UTF-8" and "Windows-1252" are
    intermixed in a file, it may get confused and I should do a
    transliteration like:

    tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

    I wonder if that's really true. Also, when it comes to opening and
    closing file handles for this, should I be using something like
    "binmode OUTPUTFILEHANDLE, ':bytes';"?

    I am impressed with Encode but any advice or words that anyone wants
    to throw in would be greatly appreciated.

    Wilkie
    flames go quietly to /dev/null
    , Jul 9, 2008
    #1

  2. Ted Zlatanov Guest

    On Tue, 8 Jul 2008 16:40:53 -0700 (PDT) wrote:

    ww> Hello! I have recently been turned on to Encode. We have some folks
    ww> who are copying and pasting from Word straight into our CMS and the
    ww> need to convert from "Windows-1252" to "utf-8" is now critical.

    ww> For a one liner I have been using this....
    ww> perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")'
    ww> file1.txt file2.txt

    ww> Works good for editing in place.

    ww> My quandry is that now I need to tackle multiple files in a directory
    ww> and another developer mentioned that if "UTF-8" and "Windows-1252" are
    ww> intermixed in a file that it may get confused

    Why don't you try it? If it doesn't work for you, post an example and
    what fails.
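
    A minimal sketch of what "trying it" might look like, assuming the
    worrying case is text that is already UTF-8 being run through the same
    from_to() call a second time. The sample bytes are made up for
    illustration; only Encode's from_to() is used.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# U+201C LEFT DOUBLE QUOTATION MARK, already encoded as UTF-8 (3 bytes).
my $already_utf8 = "\xE2\x80\x9C";

# from_to() converts in place and treats every input byte as Windows-1252,
# so each byte of the existing UTF-8 sequence is re-encoded on its own.
my $converted = $already_utf8;
from_to($converted, "windows-1252", "utf-8");

printf "before: %d bytes, after: %d bytes\n",
    length($already_utf8), length($converted);
```

    This should report 3 bytes before and 7 after: the quote has been
    double-encoded, which is the "confusion" to watch for when a file
    already contains some UTF-8.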

    ww> and I should do a transliteration like..

    ww> tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

    I would avoid that solution; it's extremely dangerous compared to
    Encode. You may destroy valid UTF-8 data, because the byte 0x93 also
    occurs inside valid multi-byte UTF-8 sequences.
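
    To make the risk concrete, here is a small sketch. It uses a byte-level
    s/// as a stand-in for the proposed tr///, and the sample string is an
    assumption chosen for illustration: an EN DASH, whose UTF-8 encoding
    happens to end in the byte 0x93.

```perl
use strict;
use warnings;

# U+2013 EN DASH encoded as UTF-8 is the byte sequence E2 80 93.
my $en_dash = "\xE2\x80\x93";

# Blindly replace every 0x93 byte with the UTF-8 bytes for U+201C.
(my $mangled = $en_dash) =~ s/\x93/\xE2\x80\x9C/g;

# utf8::decode() returns false when its argument is not well-formed UTF-8.
# It works in place, so test a copy.
my $copy = $mangled;
my $still_valid = utf8::decode($copy) ? "yes" : "no";
print "still valid UTF-8: $still_valid\n";
```

    The dash was valid before the substitution and is not afterwards, since
    the replacement landed in the middle of a multi-byte sequence.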

    ww> I wonder if that's really true and when it comes to open and closing
    ww> file handles for this should I be using something like "binmode
    ww> OUTPUTFILEHANDLE, ':bytes';"

    Maybe, depending on the file contents. Again, try it.
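
    One alternative to raw handles plus from_to() is to let PerlIO encoding
    layers do the conversion. The sketch below assumes every file really is
    pure Windows-1252 and small enough to slurp; convert_file() and the
    "*.txt files in one directory" layout are hypothetical names and
    choices, not anything from the original one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite one Windows-1252 file as UTF-8, in place.
# Running it on a file that is already UTF-8 would double-encode it.
sub convert_file {
    my ($path) = @_;

    # The input layer decodes cp1252 bytes into Perl characters.
    open my $in, '<:encoding(cp1252)', $path or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };
    $text = '' unless defined $text;
    close $in;

    # The output layer encodes those characters as UTF-8 bytes.
    open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";
    return;
}

# Assumed usage: perl convert.pl /path/to/dir
my $dir = shift @ARGV;
if (defined $dir) {
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    for my $name (sort grep { /\.txt\z/ } readdir $dh) {
        convert_file("$dir/$name");
    }
    closedir $dh;
}
```

    With from_to() the program handles bytes itself, so raw handles are the
    natural match; with layers as above, Perl does the decoding and
    encoding at the handle. Mixing the two approaches on the same handle is
    what tends to produce surprises.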

    Ted
    Ted Zlatanov, Jul 9, 2008
    #2

  3. wrote:
    >My quandry is that now I need to tackle multiple files in a directory
    >and another developer mentioned that if "UTF-8" and "Windows-1252" are
    >intermixed in a file that it may get confused and I should do a
    >transliteration like..


    Unless the file format supports multiple encodings within the same file
    (e.g. a MIME email), a file can have only one encoding.

    >tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;


    Nuts!

    >I am impressed with Encode but any advice or words that anyone wants
    >to throw in would be greatly appreciated.


    The only way to survive the encoding nightmare and stay sane is to
    standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
    recommend UTF-8, but that's up to you.
    Any conversion between this standard format and other formats happens
    (if at all) _ONLY_ for user interaction, e.g. to support legacy email
    clients that don't support UTF-8, or to accept input from a web page in
    ISO 8859-15 or even a Greek, Arabic, or Chinese encoding. Of course, if
    at all possible, even this user interaction should use the agreed-upon
    standard.
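
    As a rough sketch of "convert only at the boundary", assuming UTF-8 is
    the agreed standard and that the encoding of each incoming piece of
    data is known: to_standard() is a hypothetical helper name, and the
    ISO 8859-15 sample byte is only an illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take bytes in a known source encoding and return
# the same text as UTF-8 bytes, ready to be stored in the standard format.
sub to_standard {
    my ($bytes, $source_encoding) = @_;
    my $chars = decode($source_encoding, $bytes);   # bytes -> characters
    return encode('UTF-8', $chars);                 # characters -> UTF-8
}

# 0xA4 is the euro sign in ISO 8859-15.
my $stored = to_standard("\xA4", 'iso-8859-15');
print unpack("H*", $stored), "\n";
```

    Everything behind that boundary can then assume UTF-8 and never needs
    to guess.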

    jue
    (with a decade of internationalizing and localizing software)
    Jürgen Exner, Jul 9, 2008
    #3
  4. Guest

    On Jul 9, 11:34 am, Jürgen Exner <> wrote:
    > wrote:
    > >My quandry is that now I need to tackle multiple files in a directory
    > >and another developer mentioned that if "UTF-8" and "Windows-1252" are
    > >intermixed in a file that it may get confused and I should do a
    > >transliteration like..

    >
    > Unless the file format supports multiple encodings within the same file
    > (like e.g. a MIME email) a file can have only one encoding.
    >
    > >tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

    >
    > Nuts!
    >
    > >I am impressed with Encode but any advice or words that anyone wants
    > >to throw in would be greatly appreciated.

    >
    > The only way to survive the encoding nightmare and stay sane is to
    > standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
    > recommend UTF-8, but that's up to you.
    > Any conversion between this standard format and other formats happens
    > (if at all)  _ONLY_ for user interaction, e.g. to support legacy email
    > clients which don't support UTF-8 or accept input from a web page in ISO
    > 8859-15 or even Greek, Arabic or Chinese or similar tasks. Of course, if
    > at all possible even this user interaction should use the agreed-upon
    > standard.
    >
    > jue
    > (with a decade of internationalizing and localizing software)


    I have seen this before with other CMSs, where someone types something
    and then cuts and pastes from Word, and the data is mixed when stored
    in MySQL. MySQL doesn't care what you have it encoded in, but the
    problem comes when automated routines create XML files that are then
    stored with mixed encoding (CMS data stored into MySQL, another routine
    generates static XML files from the faulty data for use elsewhere).

    Certainly makes the point that the data needs to be validated before
    going into the db, but I can feel the poster's pain regarding this
    issue.
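
    A minimal sketch of such a validation step, assuming the CMS hands over
    raw bytes and the database is meant to hold UTF-8 only. is_valid_utf8()
    is a hypothetical helper name; the Encode calls and constants are the
    standard ones.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops it
    # from modifying $bytes while it checks.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Sample inputs, made up for illustration.
for my $sample ("\xE2\x80\x9Cquoted\xE2\x80\x9D", "\x93quoted\x94") {
    printf "%s: %s\n", unpack("H*", $sample),
        is_valid_utf8($sample) ? "valid UTF-8" : "not valid UTF-8";
}
```

    Input that fails the check can be rejected or routed through a
    conversion step before it is stored.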

    Maybe specifying your IN and OUT filehandles as ':bytes' would help
    (to preserve data and inhibit automated encoding that may result in
    unexpected changes to your already formatted UTF-8). Once you have read
    the data in, use the transliteration method you described before to
    change things. I'm not a huge fan of using that method either, but
    that's the way it was done not too many years ago.
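
    For genuinely mixed data, another option is a heuristic along these
    lines: leave anything that already looks like a well-formed UTF-8
    sequence alone, and treat every other high byte as Windows-1252. This
    is only a sketch under that assumption. It cannot tell a real UTF-8
    character from Windows-1252 text that happens to form the same bytes,
    and fix_mixed() and cp1252_byte_to_utf8() are hypothetical names.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Well-formed multi-byte UTF-8 sequences (the ranges from RFC 3629).
my $utf8_multibyte = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Convert a single Windows-1252 byte to its UTF-8 byte sequence.
sub cp1252_byte_to_utf8 {
    my ($byte) = @_;
    return encode('UTF-8', decode('cp1252', $byte));
}

# Expects a byte string (read with :raw), returns a byte string.
sub fix_mixed {
    my ($bytes) = @_;
    # Valid UTF-8 sequences are kept as they are; stray high bytes are
    # assumed to be Windows-1252 and converted one at a time.
    $bytes =~ s{($utf8_multibyte)|([\x80-\xFF])}{
        defined $1 ? $1 : cp1252_byte_to_utf8($2)
    }ge;
    return $bytes;
}

# Windows-1252 curly quotes around "hi", followed by a UTF-8 EN DASH.
my $fixed = fix_mixed("\x93hi\x94 \xE2\x80\x93");
print unpack("H*", $fixed), "\n";
```

    Because the UTF-8 alternative is tried first, the 0x93 inside the EN
    DASH is left untouched while the stray 0x93 and 0x94 are converted.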

    I'd like to see other suggestions on this one too.
    JC
    , Jul 11, 2008
    #4
