[FR/EN] how to convert the characters ASCII(0-255) to ASCII(0-127)

Alextophi

Hello,

I cannot convert the characters of the log "C:\WINDOWS\SchedLgU.Txt".
It is in extended ASCII (OEM) (0-255).

- What is the method to convert it to ASCII (0-127)?

Thank you,

Christophe
 
Paul Lalli

Alextophi said:
I cannot convert the characters of the log "C:\WINDOWS\SchedLgU.Txt",
it is extend ASCII (OEM) (0-255)

- which is the method to convert towards ASCII (0-127)?

That depends entirely on what you mean by "convert". What,
specifically, are the conversions you want to make? If you simply want
to remove all the non-ASCII characters from the file, try something
like:

perl -pi.bkp -e "s/[^[:ascii:]]//g" C:\WINDOWS\SchedLgU.Txt

If you're looking for more complex than that, you're going to have to
be more explicit. What specific characters in the 128-255 range should
become what specific characters in the 0-127 range?
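One way to answer that question is to survey the data first and see which bytes above 127 actually occur. A minimal sketch, assuming the log is read as raw bytes; the sub name high_byte_counts and the sample string are only illustrative:

```perl
use strict;
use warnings;

# Count every byte in the 0x80-0xFF range of a raw byte string.
sub high_byte_counts {
    my ($bytes) = @_;
    my %count;
    $count{ ord($1) }++ while $bytes =~ /([\x80-\xFF])/g;
    return \%count;
}

# Hypothetical sample; in practice, open the log, call binmode on the
# handle, and pass each line to high_byte_counts.
my $sample = "t\x83che syst\x8Ame t\x83che";
my $counts = high_byte_counts($sample);

printf "0x%02X: %d\n", $_, $counts->{$_}
    for sort { $a <=> $b } keys %$counts;
```

The resulting list of byte values is what you would then map, one by one, to replacements in the 0-127 range.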

Paul Lalli
 
Alextophi

EXAMPLE:

the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters (ex:
"tâche" or "système"),

$LINE =~ tr/\x8A/\x65/; # replace ... è > e
$LINE =~ tr/\x83/\x61/; # replace ... â > a

- how to replace all the ASCII characters?

cordially Christophe
 
Samwyse

Alextophi said:
EXAMPLE:

the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters (ex:
"tâche" or "système"),

$LINE =~ tr/\x8A/\x65/; # replace ... è > e
$LINE =~ tr/\x83/\x61/; # replace ... â > a

- how to replace all the ASCII characters?

Are they wide ASCII, or extended ASCII? Your example (and your subject
line) are talking about extended, not wide, characters. BTW, your code
fragment can be shortened to this:
$LINE =~ tr/\x8A\x83/\x65\x61/;

What you want to do is a lossy transformation, so I doubt that there's
any one "right" way to do it. From your example, I'd use this page:
http://www.cplusplus.com/doc/papers/ascii.html
and hand-build a 'tr' that does what you want. \xC0 through \xFF are
fairly easy, the fun part is deciding what you want to do with
"copyright" and "registered". If you'll be translating characters into
strings ("copyright" into "(C)" and/or HTML entities) then you want a
substitution table:

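my %xlate = (
    "\xA9" => "(C)",   # copyright sign
    "\xAE" => "(R)",   # registered sign
    "\xB1" => "+/-",   # plus-minus sign
    # add more lines as desired
);
# Build a character class from the keys of the table.
my $from = join('', keys %xlate);
# ...
$input =~ s/([$from])/$xlate{$1}/g;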
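For the one-character-to-one-character cases, a hand-built 'tr' could look like the sketch below. It assumes the input bytes are in CP437 or CP850 (which the \x8A and \x83 values in the example suggest, but which has not been confirmed), not the \xC0-\xFF positions of the page linked above. The sub name strip_accents_oem is made up for illustration, and only the range 0x80-0x8D is covered.

```perl
use strict;
use warnings;

# Assumption: the bytes are CP437 or CP850. In both code pages the
# range 0x80-0x8D holds, in order:
#   Ç ü é â ä à å ç ê ë è ï î ì
sub strip_accents_oem {
    my ($line) = @_;
    # Copy the line, then map each of the 14 bytes to a plain letter.
    (my $out = $line) =~ tr/\x80-\x8D/Cueaaaaceeeiii/;
    return $out;
}

print strip_accents_oem("t\x83che, syst\x8Ame\n");   # prints "tache, systeme"
```

Extending the two lists of the tr to the rest of the code page is tedious but mechanical; anything that should become more than one character still needs the substitution table.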
 
Alan J. Flavell

There's no such thing. ASCII is definitively a 7-bit character
coding: it has no character positions above 127 (nor any displayable
characters above 126).

There are countless 8-bit character codings which contain the ASCII
characters in their lower half: each one of them that has been
published has a definitive name. You can't make sense of an arbitrary
stream of bytes unless and until you know just which coding you are
dealing with. In this sense, it only spreads confusion to talk about
"8-bit ASCII" or "wide ASCII" or "extended ASCII" as if those terms -
apparently made-up for convenience by somebody who's never been
exposed to the full range of codings - might designate an actual
character coding.

Are you attempting to designate an MS-DOS code page? - it seems that
you are - for example, it might be codepage 437, the US National
MS-DOS code page, which is consistent with your presentation, but so
would other code pages, such as CP850, the "Latin1 Multinational" DOS
code page.

These, and other, MS-DOS code pages are documented at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
together with their cross-mappings into Unicode.
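Perl's core Encode module ships these mappings, so once the code page is known the bytes can be decoded to Unicode and reduced from there. The following is only a sketch: it assumes CP850 purely for illustration, the sub name oem_to_ascii is hypothetical, and dropping accents this way is a lossy choice that may not be what is wanted.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Decode bytes from a named code page, then reduce the text to ASCII.
sub oem_to_ascii {
    my ($bytes, $codepage) = @_;
    my $text = decode($codepage, $bytes);  # bytes -> Unicode characters
    $text = NFD($text);                    # e.g. è becomes e + combining grave
    $text =~ s/\p{Mn}//g;                  # drop the combining marks
    $text =~ s/[^\x00-\x7F]/?/g;           # no ASCII base letter: use '?'
    return $text;
}

# The code page name is an assumption; substitute the real one.
print oem_to_ascii("t\x83che syst\x8Ame\n", 'cp850');   # prints "tache systeme"
```

Characters with no decomposition, such as the box-drawing characters in the DOS code pages, end up as '?' here; a lookup table would be needed to do better.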

However, these newsgroup postings are (rightly) in iso-8859-1, which
uses very different encodings of the accented letters. So one needs
to keep a careful grasp.

I read the question as really asking "how to replace all the
*non*-ASCII characters".

Samwyse said:
Are they wide ASCII, or extended ASCII?

Please, don't do that. We readers of the group have no clear idea
which definitive character codings you are referring to under these
baby-talk names.

It's been my experience that, despite the underlying simplicity of the
topic, character coding is something which causes endless confusion,
which is only made worse by a refusal to call things by their proper
names.

Samwyse said:
Your example (and your subject line)
are talking about extended, not wide, characters.

As I say: out of what I'd interpret as plausible interpretations of
8-bit ASCII-based codes (MS-DOS code pages, or iso-8859-something, or
Windows-125x), the evidence points to an MS-DOS code page. If we're
dealing with a Western context, then more precisely we'd be dealing
with MS-DOS either CP437 or 850, or iso-8859-1, or Windows-1252.
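That evidence can be checked by decoding the two byte values from the questioner's example under each candidate coding. A small sketch using the core Encode module (the %seen hash is just a scratch variable for the illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\x83\x8A";   # the two byte values from the tr example
my %seen;

for my $enc (qw(cp437 cp850 iso-8859-1 cp1252)) {
    my $text = decode($enc, $bytes);
    # Record the Unicode code point each byte maps to in this coding.
    $seen{$enc} = join ' ',
        map { sprintf('U+%04X', ord($_)) } split //, $text;
    print "$enc: $seen{$enc}\n";
}
```

Under CP437 and CP850 the bytes come out as U+00E2 U+00E8 (â and è), matching "tâche" and "système". Under iso-8859-1 they are the control characters U+0083 and U+008A, and under Windows-1252 they are U+0192 and U+0160 (ƒ and Š), neither of which fits the example.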

Hmmm, this chap also uses baby talk instead of the proper names of
things.

I've no argument with your code fragments, provided that the
questioner has properly identified which MS-DOS code page they are
dealing with; but I do urge you please, in an international forum, to
use terms which make proper sense internationally.

regards
 
Jürgen Exner

Alextophi said:
EXAMPLE:

the log "C:\WINDOWS\SchedLgU.Txt", contains wide ASCII characters

There is no such thing as "wide ASCII".
(ex:
"tâche" or "système"),

$LINE =~ tr/\x8A/\x65/; # replace ... è > e
$LINE =~ tr/\x83/\x61/; # replace ... â > a

- how to replace all the ASCII characters?

Did you mean to say "replace all the non-ASCII with ASCII characters?"
You don't want to do that. Or do you really mean to rename Ms. Höra ("to
hear") into Ms. Hora ("whore") or Österreich ("Austria") into Osterreich
("Easter Empire")?

jue
(who does not take kindly to his name being bastardized)
 
Samwyse

Alan J. Flavell wrote:
[snip]

Alan, I am in awe of your skills in pedantry. In the future, I promise
that I will *never* use the term "ASCII" to mean anything other than
whatever it was you just said.
 
Eric Bohlman

Samwyse said:
Alan J. Flavell wrote:
[snip]

Alan, I am in awe of your skills in pedantry. In the future, I promise
that I will *never* use the term "ASCII" to mean anything other than
whatever it was you just said.

It's not pedantry. The subject of character encodings is one that simply
can't be meaningfully discussed without using extremely precise language;
"you know what I mean" simply won't cut it here because in fact different
people will come up with *radically* different ideas of what you mean.
"High ASCII" or "wide ASCII" mean different things to different people,
because there is simply no common definition for them (which in turn comes
from the fact that they're inherently contradictory).
 
Alan J. Flavell

Eric Bohlman said:
It's not pedantry. The subject of character encodings is one that
simply can't be meaningfully discussed without using extremely
precise language;
[...]

Thanks. It might be worth adding, since the original poster is in .fr,
that their data *might* be using the French MS-DOS code page
(this doesn't seem to be listed amongst the Unicode cross-mapping
tables - I'm sure it's listed in my old DOS manual in the office),
although one of my French colleagues, back in MS-DOS days, told me
that he preferred to use the French-Canadian code page instead - that
would be:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP863.TXT

I already mentioned the possibility of CP850, the Latin1 Multinational
code page. The original poster used the term "OEM", but a search for
"OEM codepage" will easily reveal that there are *many* different
MS-DOS "OEM" codepages: http://www.google.co.uk/search?q=oem+codepage

See also http://www.unicode.org/Public/MAPPINGS/VENDORS/IBM/readme.txt
for some useful notes.

Eric Bohlman said:
"you know what I mean" simply won't cut it here because in fact
different people will come up with *radically* different ideas of
what you mean. "High ASCII" or "wide ASCII" mean different things
to different people, because there is simply no common definition
for them (which in turn comes from the fact that they're inherently
contradictory).

Quite.

Things aren't helped by the fact that MS mischievously refer to their
proprietary Windows character encoding(s) as "ANSI". On finding
contradictory assertions about this, I researched further, and am
convinced that the (US-)American National Standards Inst. has never
published such a specification. After they had initially discussed a
US specification for an ASCII-based 8-bit character coding, they
wisely decided not to have one, and adopted the international
iso-8859-1 specification instead.

Not that it's directly relevant to the present question, but I
concluded that a conscientious author would avoid referring to
Windows-1252 (or to the Windows-125x family of codings) as "ANSI"
character coding(s).

best regards
 