Regex to remove non printable characters

Larry

Hi peeps,

I'd like to remove all characters with ascii values > 127 from a
string...that's to say i'd like to remove non printable chars...

is the following fine?

my $input =~ s/[^ -~]+//g;

thanks ever so much!
 

Jürgen Exner

Larry said:
I'd like to remove all characters with ascii values > 127 from a
string

ASCII is a 7-bit encoding system where the eighth bit is sometimes used as a
parity bit. There are no ASCII characters > 127, therefore your request
doesn't make sense.

Larry said:
...that's to say i'd like to remove non printable chars...

In case you are not talking about ASCII but about e.g. Windows-1252 or
ISO-Latin-x or any of the dozen other code pages that share the lower 128
characters with ASCII then please be advised that the vast majority of
those characters > 127 _ARE_ printable, at least in your typical commonly
used code pages.

The non-printable characters can be found in the lower part from 0x00 to
0x1F, no matter if ASCII or Windows-1252 or ISO-Latin-x or many, many
others.

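Therefore your request makes even less sense. Maybe you want to clarify
first what you are talking about?

Larry said:
is the following fine?

my $input =~ s/[^ -~]+//g;

The class [^ -~] matches every character outside the range from space (0x20)
to tilde (0x7E), so applied to an initialized variable it would remove
everything except printable ASCII: the control characters (including tab, LF
and CR) as well as everything above 0x7E. That is neither "non-ASCII" nor
"non-printable" alone, but both at once.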

jue
 

John W. Krahn

Larry said:
I'd like to remove all characters with ascii values > 127 from a
string

$input =~ s/[^[:ascii:]]+//g;

Larry said:
...that's to say i'd like to remove non printable chars...

$input =~ s/[^[:print:]]+//g;

Larry said:
is the following fine?

my $input =~ s/[^ -~]+//g;

my() creates a new variable with no contents so there is nothing for the
substitution operator to remove.

$ perl -wle'my $input =~ s/[^ -~]+//g;'
Use of uninitialized value in substitution (s///) at -e line 1.
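
A minimal sketch of the usual way around this, assuming the string already lives in a variable: declare and initialize first, then apply the substitution. The variable names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Declare and initialize first; the substitution needs an existing value.
my $input = "abc\tdef\x{80}ghi\n";

# Copy-and-modify idiom: $clean receives a copy of $input, then s/// edits the copy.
# [^ -~] matches any character outside 0x20 (space) .. 0x7E (tilde).
(my $clean = $input) =~ s/[^ -~]+//g;

print "$clean\n";    # prints "abcdefghi"
```

Here the tab, the 0x80 byte and the trailing newline are all removed, and $input itself is left untouched.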



John
 

Jürgen Exner

John W. Krahn said:
$input =~ s/[^[:ascii:]]+//g;

...that's to say i'd like to remove non printable chars...

$input =~ s/[^[:print:]]+//g;

is this fine?

$input =~ tr/\x80-\xFF//d;

Depends what you are looking for (you still didn't clarify).
It will remove non-ASCII characters in the typical 8-bit encodings.
It will _NOT_ remove non-printable characters.

Maybe you should make up your mind and let us know _which_ of these two
you are actually trying to do.

jue
 

Jürgen Exner

John W. Krahn said:
Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.

Good point. However, if you are splitting hairs, then let's be accurate:
Regular expressions match a string but they never remove anything as
requested by the OP. Therefore taking the OP's question literally is
nonsensical in the first place.

And he still didn't tell us if he wanted to remove non-ASCII or
non-printable, two very different categories which have no relationship with
each other whatsoever.

jue
 

Larry

Jürgen Exner said:
And he still didn't tell us if he wanted to remove non-ASCII or
non-printable, two very different categories which have no relationship with
each other whatsoever.

I have yet to understand the differences...in the meanwhile I think I'll
settle for the following:

tr/\x80-\xFF//d;

thanks
 

Jürgen Exner

Larry said:
I have yet to understand the differences..

Well, there is no commonality at all. It's two totally different things,
like colour and texture. A specific object can be green and smooth or green
and rough or blue and rough or blue and smooth or whatever combination you
can imagine.

Non-printable characters are characters that don't have a glyph assigned to
them and therefore cannot be printed. Another word for them is control
characters and they include e.g. line feed, carriage return, delete,
backspace, end-of-transmission, header start, etc., etc.
In ASCII and most other modern code pages the non-printable characters are
in the range 0x00 to 0x1F and 0x7F.


Non-ASCII characters on the other hand are characters that are not included
in the 7-bit ASCII encoding at all like e.g. symbols, graphics, and what
some people refer to as 'extended' characters like German umlauts, French
and Spanish accented characters, Scandinavian extended characters, but also
Greek, Cyrillic, Arabic, Chinese, ... characters. Basically anything you can
imagine that is not typically used in the English language or that's not on
a US typewriter.
That's not surprising because as the name suggests ASCII is an _AMERICAN_
Standard Code for Information Interchange and Lyndon B. Johnson surely
didn't care about the rest of the world when he mandated its use back in
1968.

For ISO-Latin-1, for example, those non-ASCII characters would be

Ax  NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
Bx  ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx  À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx  Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex  à á â ã ä å æ ç è é ê ë ì í î ï
Fx  ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

However almost all non-ASCII characters do have a glyph and obviously they
can be printed very well(*), just see the list above.
Or do you really think I would just omit the second letter of my first name
'Jürgen' when printing it?

*1: You could argue whether NBSP and in particular SHY are printable or
not, because they carry additional semantics on top of their (blank resp.
dash) glyphs.
*2: There are exceptions in the code pages for more exotic languages
(Arabic, Thai, Tamil, ...), where some characters may not have a glyph
assigned but instead alter the appearance and/or the meaning of
preceding or following characters.
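
To make the two categories concrete, here is a small sketch, assuming a byte string in ISO-Latin-1. The helper names strip_non_ascii, strip_control and show are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Remove bytes 0x80..0xFF (non-ASCII in the usual 8-bit encodings).
sub strip_non_ascii { my ($s) = @_; $s =~ tr/\x80-\xFF//d;     return $s }

# Remove the control characters 0x00..0x1F and 0x7F (non-printable).
sub strip_control   { my ($s) = @_; $s =~ tr/\x00-\x1F\x7F//d; return $s }

# Render a string with everything outside 0x20..0x7E shown as \xNN.
sub show {
    my ($s) = @_;
    $s =~ s/([^ -~])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

my $name = "J\xFCrgen\tExner\n";    # "Jürgen" in ISO-Latin-1, plus a tab and a newline

print show(strip_non_ascii($name)), "\n";    # Jrgen\x09Exner\x0A
print show(strip_control($name)),   "\n";    # J\xFCrgenExner
```

The first call drops the printable letter but keeps the tab and newline; the second does the opposite.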

jue
 

Larry

Jürgen Exner said:
Well, there is no commonality at all. It's two totally different things,
like colour and texture. A specific object can be green and smooth or green
and rough or blue and rough or blue and smooth or whatever combination you
can imagine.

ok...to me those are ascii printable chars:

#!/usr/bin/perl

use strict;
use warnings;

for my $k (33 .. 126)
{
print "$k => " . chr($k) . "\n";
}

plus chr(10) and chr(13)
 

Jürgen Exner

Larry said:
ok...to me those are ascii printable chars:

#!/usr/bin/perl

use strict;
use warnings;

for my $k (33 .. 126)
{
print "$k => " . chr($k) . "\n";
}

Agreed, those characters are the intersection of the set of printable
characters and the set of ASCII characters, except that commonly the space
character 0x20 is considered a printable character, too. It just has a blank
glyph.

Larry said:
plus chr(10) and chr(13)

This however conflicts with customary understanding. From "perldoc perlre"
on POSIX character classes:

print
Any alphanumeric or punctuation (special) character or space.

While on the other hand

cntrl
Any control character. Usually characters that don't produce output
as such but instead control the terminal somehow: for example
newline and backspace are control characters. All characters with
ord() less than 32 are most often classified as control characters
(assuming ASCII, the ISO Latin character sets, and Unicode).

It appears LF and CR are control characters, not printable characters. After
all why should LF be a printable character but its cousin FF not?
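
A quick way to check how Perl's POSIX classes sort a few ASCII code points, as a sketch only; classify() is a hypothetical helper and the list of code points is arbitrary.

```perl
use strict;
use warnings;

# Report which POSIX class a single ASCII code point falls into.
sub classify {
    my ($code) = @_;
    my $ch = chr $code;
    return 'cntrl' if $ch =~ /[[:cntrl:]]/;
    return 'print' if $ch =~ /[[:print:]]/;
    return 'neither';
}

# TAB, LF, FF, CR, space, 'A', '~', DEL
for my $code (9, 10, 12, 13, 32, 65, 126, 127) {
    printf "%3d => %s\n", $code, classify($code);
}
```

For these ASCII values, 9, 10, 12, 13 and 127 come out as cntrl, while 32, 65 and 126 come out as print.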

jue
 

Larry

Jürgen Exner said:
Agreed, those characters are the intersection of the set of printable
characters and the set of ASCII characters, except that commonly the space
character 0x20 is considered a printable character, too. It just has a blank
glyph.

by the way, I'd like to get rid of 0x00 also! The thing is that I'm
coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
0x10 and those ranging from 0x21 to 0x7E

is that doable?

thanks
 

Larry

Larry said:
I'm hopeless at hex values...let's say:

chr(10)
chr(13)
chr(32) to chr(126)

thanks

well, for the moment I'll go along with keeping those ranging from 0x20
to 0x7E ... so that I don't have to chomp and all...
 

Jürgen Exner

Larry said:
The thing is that I'm
coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
0x10 and those ranging from 0x21 to 0x7E

Thank you for calling me a person with a bad char.

*PLONK*

jue
 

Jürgen Exner

Larry said:
well, for the moment I'll go along with keeping those ranging from 0x20
to 0x7E ... so that I don't have to chomp and all...

What a concept!
I am giving up.

jue
 

Larry

Jürgen Exner said:
What a concept!
I am giving up.

please don't! it's xmas time after all...

i need this to get values (commands) from CGI->param and need to get rid
of those chars
 

Larry

Petr Vileta said:
$input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]//g;

thank you so much ... btw, what is chr (127) ??

I think I'll make it this way:

$input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;

thanks
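
For reference, chr(127) is DEL, a control character. The same filter can also be written as a whitelist, which may be easier to read: keep LF (0x0A), CR (0x0D) and 0x20 to 0x7E, and drop everything else. This is only a sketch; the sub name strip_bad_chars and the sample input are assumptions, not something from CGI.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: keep LF, CR and printable ASCII, drop everything else.
sub strip_bad_chars {
    my ($s) = @_;
    return '' unless defined $s;        # a missing value becomes an empty string
    $s =~ s/[^\x0A\x0D\x20-\x7E]//g;    # negated class = whitelist
    return $s;
}

my $raw   = "ls\x00 -l\x7F\r\n\xE9";    # made-up input with NUL, DEL and a Latin-1 byte
my $clean = strip_bad_chars($raw);

printf "%vd\n", $clean;                 # prints 108.115.32.45.108.13.10
```

Unlike the explicit \x7F-\xFF range, the negated class also removes characters above 0xFF if the string happens to hold decoded Unicode text.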
 

comp.llang.perl.moderated

Thank you for calling me a person with a bad char.

*PLONK*
Wow, I thought for sure you'd finish with a
smiley after that wonderful flash of wit....
Of course, maybe you were sitting in a bad
"char" :)
 
