[FR/EN] how to convert the characters ASCII(0-255) to ASCII(0-127)

Discussion in 'Perl Misc' started by Alextophi, Dec 29, 2005.

  1. Alextophi

    Alextophi Guest

    hello

    I cannot convert the characters of the log "C:\WINDOWS\SchedLgU.Txt";
    it is in extended ASCII (OEM) (0-255).

    - What is the method for converting it to ASCII (0-127)?

    thank you

    christophe
    Alextophi, Dec 29, 2005
    #1

  2. Paul Lalli

    Paul Lalli Guest

    Alextophi wrote:
    > I cannot convert the characters of the log "C:\WINDOWS\SchedLgU.Txt";
    > it is in extended ASCII (OEM) (0-255).
    >
    > - What is the method for converting it to ASCII (0-127)?


    That depends entirely on what you mean by "convert". What,
    specifically, are the conversions you want to make? If you simply want
    to remove all the non-ASCII characters from the file, try something
    like:

    perl -pi.bkp -e "s/[^[:ascii:]]//g" C:\WINDOWS\SchedLgU.Txt

    If you're looking for something more complex than that, you're going to
    have to be more explicit. What specific characters in the 128-255 range
    should become what specific characters in the 0-127 range?
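
    Inside a script rather than a one-liner, the same deletion could be
    sketched with tr. This is only an illustration: the sub name
    strip_non_ascii is made up, the /r flag needs Perl 5.14 or later, and
    the input is assumed to be a string of single bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character outside the 7-bit ASCII range.
# /c complements the search list, /d deletes the matched characters, and
# /r returns the result instead of modifying the argument.
sub strip_non_ascii {
    my ($text) = @_;
    return $text =~ tr/\x00-\x7F//cdr;
}

# In CP437/CP850, "t\x83che" is "tâche"; the accented byte is simply dropped.
print strip_non_ascii("t\x83che"), "\n";
```

    Note that this drops the accented letter entirely instead of replacing
    it with a base letter.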

    Paul Lalli
    Paul Lalli, Dec 29, 2005
    #2

  3. Alextophi

    Alextophi Guest

    Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    EXAMPLE:

    the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters (ex:
    "tâche" or "système"),

    $LINE =~ tr/\x8A/\x65/; # replace è with e
    $LINE =~ tr/\x83/\x61/; # replace â with a

    - how to replace all the ASCII characters?

    cordially Christophe
    Alextophi, Dec 29, 2005
    #3
  4. Samwyse

    Samwyse Guest

    Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    Alextophi wrote:
    > EXAMPLE:
    >
    > the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters (ex:
    > "tâche" or "système"),
    >
    > $LINE =~ tr/\x8A/\x65/; # replace è with e
    > $LINE =~ tr/\x83/\x61/; # replace â with a
    >
    > - how to replace all the ASCII characters?


    Are they wide ASCII, or extended ASCII? Your example (and your subject
    line) are talking about extended, not wide, characters. BTW, your code
    fragment can be shortened to this:
    $LINE =~ tr/\x8A\x83/\x65\x61/;

    What you want to do is a lossy transformation, so I doubt that there's
    any one "right" way to do it. From your example, I'd use this page:
    http://www.cplusplus.com/doc/papers/ascii.html
    and hand-build a 'tr' that does what you want. \xC0 through \xFF are
    fairly easy, the fun part is deciding what you want to do with
    "copyright" and "registered". If you'll be translating characters into
    strings ("copyright" into "(C)" and/or HTML entities) then you want a
    substitution table:

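    # Keys are byte values as assigned in iso-8859-1.
    my %xlate = (
        "\xA9" => "(C)",    # copyright sign
        "\xAE" => "(R)",    # registered sign
        "\xB1" => "+/-",    # plus-minus sign
        # add more lines as desired
    );
    my $from = join('', keys %xlate);
    # ...
    # Replace each listed character with its string from the table.
    $input =~ s/([$from])/$xlate{$1}/g;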
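
    Putting the 'tr' idea and the substitution table together, a hand-built
    mapping for the accented letters might look like the sketch below. It
    assumes the data is in CP437 or CP850, which agree on bytes 0x80
    through 0x9A; the sub name fold_dos_accents is hypothetical, and "ae"
    and "AE" for the ligatures are just one possible choice.

```perl
use strict;
use warnings;

# Multi-character replacements need a hash; tr can only map one-to-one.
my %multi = (
    "\x91" => "ae",    # æ in CP437/CP850
    "\x92" => "AE",    # Æ in CP437/CP850
);

sub fold_dos_accents {
    my ($text) = @_;
    # 0x80-0x90: Ç ü é â ä à å ç ê ë è ï î ì Ä Å É
    # 0x93-0x9A: ô ö ò û ù ÿ Ö Ü
    $text =~ tr/\x80-\x90\x93-\x9A/CueaaaaceeeiiiAAEooouuyOU/;
    $text =~ s/([\x91\x92])/$multi{$1}/g;
    return $text;
}

print fold_dos_accents("syst\x8Ame"), "\n";
```

    Bytes above 0x9A differ between the two code pages, so they are left
    alone here.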
    Samwyse, Dec 29, 2005
    #4
  5. Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    On Thu, 29 Dec 2005, Samwyse wrote:

    > Alextophi wrote:
    > > EXAMPLE:
    > >
    > > the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters


    There's no such thing. ASCII is definitively a 7-bit character
    coding: it has no character positions above 127 (nor any displayable
    characters above 126).

    There are countless 8-bit character codings which contain the ASCII
    characters in their lower half: each one of them that has been
    published has a definitive name. You can't make sense of an arbitrary
    stream of bytes unless and until you know just which coding you are
    dealing with. In this sense, it only spreads confusion to talk about
    "8-bit ASCII" or "wide ASCII" or "extended ASCII" as if those terms -
    apparently made-up for convenience by somebody who's never been
    exposed to the full range of codings - might designate an actual
    character coding.

    Are you attempting to designate an MS-DOS code page? - it seems that
    you are - for example, it might be codepage 437, the US National
    MS-DOS code page, which is consistent with your presentation, but so
    would other code pages, such as CP850, the "Latin1 Multinational" DOS
    code page.

    These, and other, MS-DOS code pages are documented at
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
    together with their cross-mappings into Unicode.
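
    Perl's core Encode module ships tables for these code pages, so once
    the coding is known the bytes can be decoded to Unicode and the accents
    folded away generically. The following is a sketch under assumptions,
    not a definitive recipe: the sub name dos_to_ascii is made up, "cp850"
    is only a guess at the log's coding, and replacing leftovers with "?"
    is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: bytes in a named code page -> plain ASCII text.
sub dos_to_ascii {
    my ($bytes, $codepage) = @_;
    my $text = decode($codepage, $bytes);    # bytes -> Unicode characters
    $text = NFD($text);                      # split "è" into "e" + combining grave
    $text =~ s/\p{Mn}//g;                    # drop the combining marks
    $text =~ s/[^\x00-\x7F]/?/g;             # anything still non-ASCII becomes "?"
    return $text;
}

print dos_to_ascii("syst\x8Ame", "cp850"), "\n";
```

    The result is still lossy; it just avoids typing the table by hand.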

    However, these newsgroup postings are (rightly) in iso-8859-1, which
    uses very different encodings of the accented letters. So one needs
    to keep a careful grasp.
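
    To make the difference concrete, this small sketch asks Perl's core
    Encode module which byte one and the same character gets in each
    coding. The helper name byte_in is invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the byte value (two hex digits) that a single
# character gets in the named encoding.
sub byte_in {
    my ($encoding, $char) = @_;
    return sprintf "%02X", ord(encode($encoding, $char));
}

my $e_grave = "\x{E8}";    # U+00E8, LATIN SMALL LETTER E WITH GRAVE
printf "cp850: %s  iso-8859-1: %s\n",
    byte_in("cp850", $e_grave), byte_in("iso-8859-1", $e_grave);
```

    This prints "cp850: 8A  iso-8859-1: E8", which is why a 'tr' built for
    one coding does nothing useful on text in the other.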

    > > (ex: "tâche" or "système"),
    > >
    > > $LINE =~ tr/\x8A/\x65/; # replace è with e
    > > $LINE =~ tr/\x83/\x61/; # replace â with a
    > >
    > > - how to replace all the ASCII characters?


    I read the question as really asking "how to replace all the
    *non*-ASCII characters".

    > Are they wide ASCII, or extended ASCII?


    Please, don't do that. We readers of the group have no clear idea
    which definitive character codings you are referring to under these
    baby-talk names.

    It's been my experience that, despite the underlying simplicity of the
    topic, character coding is something which causes endless confusion,
    which is only made worse by a refusal to call things by their proper
    names.

    > Your example (and your subject line)
    > are talking about extended, not wide, characters.


    As I say: out of what I'd interpret as plausible interpretations of
    8-bit ASCII-based codes (MS-DOS code pages, or iso-8859-something, or
    Windows-125x), the evidence points to an MS-DOS code page. If we're
    dealing with a Western context, then more precisely we'd be dealing
    with MS-DOS either CP437 or 850, or iso-8859-1, or Windows-1252.
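
    One way to check that reasoning is to decode the questioner's byte
    under each candidate coding and see which results are plausible. A
    minimal sketch using the core Encode module; the helper name
    codepoint_of is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one byte under a candidate encoding and
# report the Unicode code point it maps to.
sub codepoint_of {
    my ($encoding, $byte) = @_;
    return sprintf "U+%04X", ord(decode($encoding, $byte));
}

# 0x8A is the byte the original poster wants to turn into "e".
for my $encoding (qw(cp437 cp850 iso-8859-1 cp1252)) {
    printf "%-10s %s\n", $encoding, codepoint_of($encoding, "\x8A");
}
```

    Only the two DOS code pages give U+00E8 (è). Under iso-8859-1 the byte
    is a control character (U+008A), and under Windows-1252 it is Š
    (U+0160).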

    > http://www.cplusplus.com/doc/papers/ascii.html


    Hmmm, this chap also uses baby talk instead of the proper names of
    things.

    I've no argument with your code fragments, provided that the
    questioner has properly identified which MS-DOS code page they are
    dealing with; but I do urge you please, in an international forum, to
    use terms which make proper sense internationally.

    regards
    Alan J. Flavell, Dec 29, 2005
    #5
  6. Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    Alextophi wrote:
    > EXAMPLE:
    >
    > the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters


    There is no such thing as "wide ASCII".

    > (ex:
    > "tâche" or "système"),
    >
    > $LINE =~ tr/\x8A/\x65/; # replace è with e
    > $LINE =~ tr/\x83/\x61/; # replace â with a
    >
    > - how to replace all the ASCII characters?


    Did you mean to say "replace all the non-ASCII with ASCII characters?"
    You don't want to do that. Or do you really mean to rename Ms. Höra ("to
    hear") into Ms. Hora ("whore") or Österreich ("Austria") into Osterreich
    ("Easter Empire")?

    jue
    (who does not take kindly to his name being bastardized)
    Jürgen Exner, Dec 29, 2005
    #6
  7. Samwyse

    Samwyse Guest

    Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    Alan J. Flavell wrote:
    [snip]

    Alan, I am in awe of your skills in pedantry. In the future, I promise
    that I will *never* use the term "ASCII" to mean anything other than
    whatever it was you just said.
    Samwyse, Dec 30, 2005
    #7
  8. Eric Bohlman

    Eric Bohlman Guest

    Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    Samwyse wrote in news:pa1tf.38841$dO2.20814@newssvr29.news.prodigy.net:

    > Alan J. Flavell wrote:
    > [snip]
    >
    > Alan, I am in awe of your skills in pedantry. In the future, I promise
    > that I will *never* use the term "ASCII" to mean anything other than
    > whatever it was you just said.


    It's not pedantry. The subject of character encodings is one that simply
    can't be meaningfully discussed without using extremely precise language;
    "you know what I mean" simply won't cut it here because in fact different
    people will come up with *radically* different ideas of what you mean.
    "High ASCII" or "wide ASCII" mean different things to different people,
    because there is simply no common definition for them (which in turn comes
    from the fact that they're inherently contradictory).
    Eric Bohlman, Dec 30, 2005
    #8
  9. Re: how to convert the characters ASCII(0-255) to ASCII(0-127)

    On Fri, 30 Dec 2005, Eric Bohlman wrote:

    > Samwyse wrote in news:pa1tf.38841$dO2.20814@newssvr29.news.prodigy.net:
    >
    > > Alan, I am in awe of your skills in pedantry. In the future, I
    > > promise that I will *never* use the term "ASCII" to mean anything
    > > other than whatever it was you just said.

    >
    > It's not pedantry. The subject of character encodings is one that
    > simply can't be meaningfully discussed without using extremely
    > precise language;

    [...]

    Thanks. It might be worth adding, since the original poster is in
    .fr, that their data *might* be using the French MS-DOS code page
    (this doesn't seem to be listed amongst the Unicode cross-mapping
    tables - I'm sure it's listed in my old DOS manual in the office),
    although one of my French colleagues, back in MS-DOS days, told me
    that he preferred to use the French-Canadian code page instead - that
    would be:
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP863.TXT

    I already mentioned the possibility of CP850, the Latin1 Multinational
    code page. The original poster used the term "OEM", but a search for
    "OEM codepage" will easily reveal that there are *many* different
    MS-DOS "OEM" codepages: http://www.google.co.uk/search?q=oem codepage

    See also http://www.unicode.org/Public/MAPPINGS/VENDORS/IBM/readme.txt
    for some useful notes.
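
    Those mapping tables are plain text, so a lookup table can be built
    from one directly. The sketch below assumes the three-column layout
    used by the files in that directory (byte value, Unicode value, then a
    "#" comment with the character name). The sample lines are typed in
    here for illustration rather than read from CP863.TXT, and the sub name
    parse_mapping is made up.

```perl
use strict;
use warnings;

# Build a byte => Unicode code point hash from mapping-table lines.
sub parse_mapping {
    my (@lines) = @_;
    my %map;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($byte, $ucs) = $line =~ /^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4})/
            or next;                       # skip undefined or malformed entries
        $map{ hex($byte) } = hex($ucs);
    }
    return %map;
}

my %cp = parse_mapping(
    "# sample lines in the style of the mapping files",
    "0x83\t0x00e2\t#LATIN SMALL LETTER A WITH CIRCUMFLEX",
    "0x8a\t0x00e8\t#LATIN SMALL LETTER E WITH GRAVE",
);
printf "0x8A maps to U+%04X\n", $cp{0x8A};
```

    In practice the lines would come from reading the downloaded file, and
    the resulting hash shows at a glance where two code pages disagree.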

    > "you know what I mean" simply won't cut it here because in fact
    > different people will come up with *radically* different ideas of
    > what you mean. "High ASCII" or "wide ASCII" mean different things
    > to different people, because there is simply no common definition
    > for them (which in turn comes from the fact that they're inherently
    > contradictory).


    Quite.

    Things aren't helped by the fact that MS mischievously refer to their
    proprietary Windows character encoding(s) as "ANSI". On finding
    contradictory assertions about this, I researched further, and am
    convinced that the (US-)American National Standards Inst. has never
    published such a specification. After they had initially discussed a
    US specification for an ASCII-based 8-bit character coding, they
    wisely decided not to have one, and adopted the international
    iso-8859-1 specification instead.

    Not that it's directly relevant to the present question, but I
    concluded that a conscientious author would avoid referring to
    Windows-1252 (or to the Windows-125x family of codings) as "ANSI"
    character coding(s).

    best regards
    Alan J. Flavell, Dec 30, 2005
    #9