Regex to remove non printable characters

Discussion in 'Perl Misc' started by Larry, Dec 22, 2007.

  1. Larry

    Larry Guest

    Hi peeps,

    I'd like to remove all characters with ascii values > 127 from a
    string...that's to say i'd like to remove non printable chars...

    is the following fine?

    my $input =~ s/[^ -~]+//g;

    thanks ever so much!
    Larry, Dec 22, 2007
    #1

  2. On Sat, 22 Dec 2007 03:54:33 +0100, Larry <> wrote:
    > I'd like to remove all characters with ascii values > 127 from a


    ASCII is a 7-bit encoding system where the eighth bit is sometimes used as a
    parity bit. There are no ASCII characters > 127, therefore your request
    doesn't make sense.

    >string...that's to say i'd like to remove non printable chars...


    In case you are not talking about ASCII but about e.g. Windows-1252 or
    ISO-Latin-x or any of the dozen other code pages that share the lower 128
    characters with ASCII then please be advised that the vast majority of
    those characters > 127 _ARE_ printable, at least in your typical commonly
    used code pages.

    The non-printable characters can be found in the lower part from 0x00 to
    0x1F, no matter if ASCII or Windows-1252 or ISO-Latin-x or many, many
    others.

    Therefore your request makes even less sense. Maybe you want to clarify
    first what you are talking about?

    >is the following fine?
    >
    >my $input =~ s/[^ -~]+//g;


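    Once $input actually holds something, that will remove every character
    outside the range from space (0x20) to tilde (0x7E), i.e. the control
    characters as well as everything above 0x7E. Whether that is what you want
    depends on whether you mean non-printable or non-ASCII.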
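
    As a quick check of what the character class in the question matches,
    here is a minimal sketch. The sample string is made up for illustration,
    and the substitution is applied to a copy so the original stays visible.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample: tab, lower-case text, an 8-bit character, newline.
    my $input = "\tabc xyz~ \xE9\n";

    # [ -~] is the range from space (0x20) to tilde (0x7E);
    # the leading ^ negates it, so everything outside that range is deleted.
    (my $cleaned = $input) =~ s/[^ -~]+//g;

    print "[$cleaned]\n";   # prints "[abc xyz~ ]"
    ```

    The lower-case letters survive; the tab, the 0xE9 byte and the newline
    are the characters that go away.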

    jue
    Jürgen Exner, Dec 22, 2007
    #2

  3. Larry wrote:
    >
    > I'd like to remove all characters with ascii values > 127 from a
    > string


    $input =~ s/[^[:ascii:]]+//g;


    >...that's to say i'd like to remove non printable chars...


    $input =~ s/[^[:print:]]+//g;


    > is the following fine?
    >
    > my $input =~ s/[^ -~]+//g;


    my() creates a new variable with no contents so there is nothing for the
    substitution operator to remove.

    $ perl -wle'my $input =~ s/[^ -~]+//g;'
    Use of uninitialized value in substitution (s///) at -e line 1.
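
    A minimal sketch of the fix, assuming the string already exists before
    the substitution runs (the sample value here is invented): declare and
    assign first, then bind the existing variable to s///.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input; in real code this would come from elsewhere.
    my $input = "caf\xE9\x07 ok";

    # No my() here: the substitution works on the variable declared above.
    $input =~ s/[^ -~]+//g;

    print "$input\n";   # prints "caf ok"
    ```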



    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
    John W. Krahn, Dec 22, 2007
    #3
  4. Larry

    Larry Guest

    In article <fe0bj.9527$wy2.5863@edtnps90>,
    "John W. Krahn" <> wrote:

    > $input =~ s/[^[:ascii:]]+//g;
    >
    >
    > >...that's to say i'd like to remove non printable chars...

    >
    > $input =~ s/[^[:print:]]+//g;


    is this fine?

    $input =~ tr/\x80-\xFF//d;
    Larry, Dec 22, 2007
    #4
  5. Dr.Ruud

    Dr.Ruud Guest

    Larry schreef:
    > John W. Krahn:


    > [remove non printable chars]
    > is this fine?
    > $input =~ tr/\x80-\xFF//d;


    No. How about chr(0x00)..chr(0x1F)?
    And characters > "\x{FF}"?

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Dec 22, 2007
    #5
  6. On Sat, 22 Dec 2007 05:53:18 +0100, Larry <> wrote:

    >In article <fe0bj.9527$wy2.5863@edtnps90>,
    > "John W. Krahn" <> wrote:
    >
    >> $input =~ s/[^[:ascii:]]+//g;
    >>
    >>
    >> >...that's to say i'd like to remove non printable chars...

    >>
    >> $input =~ s/[^[:print:]]+//g;

    >
    >is this fine?
    >
    >$input =~ tr/\x80-\xFF//d;


    Depends what you are looking for (you still didn't clarify).
    It will remove non-ASCII characters in the typical 8-bit encodings.
    It will _NOT_ remove non-printable characters.
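
    A small sketch of that difference, using an invented sample string:
    after the tr///, the 8-bit character is gone but the control characters
    are still there. The second tr/// has no /d and no replacement list, so
    it only counts.

    ```perl
    use strict;
    use warnings;

    my $input = "J\xFCrgen\x07\x00!";   # hypothetical: u-umlaut, BEL, NUL

    # Delete everything from 0x80 to 0xFF on a copy.
    (my $copy = $input) =~ tr/\x80-\xFF//d;

    # Count what is left in the control range 0x00-0x1F plus 0x7F.
    my $controls = ($copy =~ tr/\x00-\x1F\x7F//);

    printf "length %d, control characters left: %d\n", length($copy), $controls;
    # prints "length 8, control characters left: 2"
    ```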

    Maybe you should make up your mind and let us know _which_ of these two
    you are actually trying to do.

    jue
    Jürgen Exner, Dec 22, 2007
    #6
  7. On Sat, 22 Dec 2007 05:53:18 +0100
    Larry <> wrote:

    > In article <fe0bj.9527$wy2.5863@edtnps90>,
    > "John W. Krahn" <> wrote:
    >
    > > $input =~ s/[^[:ascii:]]+//g;
    > >
    > >
    > > >...that's to say i'd like to remove non printable chars...

    > >
    > > $input =~ s/[^[:print:]]+//g;

    >
    > is this fine?
    >
    > $input =~ tr/\x80-\xFF//d;


    Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.


    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
    John W. Krahn, Dec 23, 2007
    #7
  8. "John W. Krahn" <> wrote:
    >Larry <> wrote:
    >> is this fine?
    >>
    >> $input =~ tr/\x80-\xFF//d;

    >
    >Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.


    Good point. However, if you are splitting hairs, then let's be accurate:
    Regular expressions match a string but they never remove anything as
    requested by the OP. Therefore, taken literally, the OP's question is
    nonsensical in the first place.

    And he still didn't tell us if he wanted to remove non-ASCII or
    non-printable, two very different categories which have no relationship with
    each other whatsoever.

    jue
    Jürgen Exner, Dec 23, 2007
    #8
  9. Larry

    Larry Guest

    In article <>,
    Jürgen Exner <> wrote:

    > And he still didn't tell us if he wanted to remove non-ASCII or
    > non-printable, two very different categories which have no relationship with
    > each other whatsoever.


    I have yet to understand the differences...in the meanwhile I think I'll
    settle for the following:

    tr/\x80-\xFF//d;

    thanks
    Larry, Dec 24, 2007
    #9
  10. Larry <> wrote:

    >In article <>,
    > Jürgen Exner <> wrote:
    >
    >> And he still didn't tell us if he wanted to remove non-ASCII or
    >> non-printable, two very different categories which have no relationship with
    >> each other whatsoever.

    >
    >I have yet to understand the differences..


    Well, there is no commonality at all. It's two totally different things,
    like colour and texture. A specific object can be green and smooth or green
    and rough or blue and rough or blue and smooth or whatever combination you
    can imagine.

    Non-printable characters are characters that don't have a glyph assigned to
    them and therefore cannot be printed. Another word for them is control
    characters and they include e.g. line feed, carriage return, delete,
    backspace, end-of-transmission, header start, etc., etc.
    In ASCII and most other modern code pages the non-printable characters are
    in the range 0x00 to 0x1F and 0x7F.


    Non-ASCII characters on the other hand are characters that are not included
    in the 7-bit ASCII encoding at all like e.g. symbols, graphics, and what
    some people refer to as 'extended' characters like German umlauts, French
    and Spanish accented characters, Scandinavian extended characters, but also
    Greek, Cyrillic, Arabic, Chinese, ... characters. Basically anything you can
    imagine that is not typically used in the English language or that's not on
    a US typewriter.
    That's not surprising because as the name suggests ASCII is an _AMERICAN_
    Standard Code for Information Interchange and Lyndon B. Johnson surely
    didn't care about the rest of the world when he mandated its use back in
    1968.

    For e.g. ISO-Latin-1 those non-ASCII characters would be
    Ax  NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
    Bx  ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
    Cx  À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
    Dx  Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
    Ex  à á â ã ä å æ ç è é ê ë ì í î ï
    Fx  ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

    However almost all non-ASCII characters do have a glyph and obviously they
    can be printed very well(*), just see the list above.
    Or do you really think I would just omit the second letter of my first name
    'Jürgen' when printing it?

    *1: You could argue whether the NBSP and in particular SHY are printable or
    not because they have an additional semantic on top of their (blank resp.
    dash) glyphs.
    *2: There are exceptions in the code pages for more exotic languages
    (Arabic, Thai, Tamil, ...), where some characters may not have a glyph
    assigned but instead they alter the appearance and/or the meaning of
    preceding or following characters.
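
    The two categories can be put side by side as two separate tests on a
    character's code point. This is only a sketch: the helper names
    is_ascii and is_control are made up, and is_control uses the simple
    0x00-0x1F plus 0x7F definition given above, so it deliberately ignores
    the additional control range 0x80-0x9F that ISO-8859-1 defines.

    ```perl
    use strict;
    use warnings;

    # Two independent questions about a single character, by code point.
    sub is_ascii   { return ord($_[0]) < 0x80 }
    sub is_control { my $o = ord($_[0]); return $o < 0x20 || $o == 0x7F }

    for my $c ("A", "\n", "\xFC") {
        printf "0x%02X: %s, %s\n", ord($c),
            is_ascii($c)   ? "ASCII"         : "non-ASCII",
            is_control($c) ? "non-printable" : "printable";
    }
    # prints:
    # 0x41: ASCII, printable
    # 0x0A: ASCII, non-printable
    # 0xFC: non-ASCII, printable
    ```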

    jue
    Jürgen Exner, Dec 24, 2007
    #10
  11. Larry

    Larry Guest

    In article <>,
    Jürgen Exner <> wrote:

    > Well, there is no commonality at all. It's two totally different things,
    > like colour and texture. A specific object can be green and smooth or green
    > and rough or blue and rough or blue and smooth or whatever combination you
    > can imagine.


    ok...to me those are ascii printable chars:

    #!/usr/bin/perl

    use strict;
    use warnings;

    for my $k (33 .. 126)
    {
    print "$k => " . chr($k) . "\n";
    }

    plus chr(10) and chr(13)
    Larry, Dec 24, 2007
    #11
  12. Larry <> wrote:
    >ok...to me those are ascii printable chars:
    >
    >#!/usr/bin/perl
    >
    >use strict;
    >use warnings;
    >
    >for my $k (33 .. 126)
    >{
    > print "$k => " . chr($k) . "\n";
    >}


    Agreed, those characters are the intersection of the set of printable
    characters and the set of ASCII characters, except that commonly the space
    character 0x20 is considered a printable character, too. It just has a blank
    glyph.

    >plus chr(10) and chr(13)


    This however conflicts with customary understanding. From "perldoc perlre"
    on POSIX character classes:

    print
    Any alphanumeric or punctuation (special) character or space.

    While on the other hand

    cntrl
    Any control character. Usually characters that don't produce output
    as such but instead control the terminal somehow: for example
    newline and backspace are control characters. All characters with
    ord() less than 32 are most often classified as control characters
    (assuming ASCII, the ISO Latin character sets, and Unicode).

    It appears LF and CR are control characters, not printable characters. After
    all why should LF be a printable character but its cousin FF not?
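
    This can be checked directly against the POSIX classes. A minimal
    sketch; the %name hash is only there to label the output.

    ```perl
    use strict;
    use warnings;

    my %name = ("\n" => "LF", "\r" => "CR", "\f" => "FF", " " => "space");

    for my $c ("\n", "\r", "\f", " ") {
        # Test the control class first, then the printable class.
        my $class = $c =~ /[[:cntrl:]]/ ? "cntrl"
                  : $c =~ /[[:print:]]/ ? "print"
                  :                       "neither";
        print "$name{$c}: $class\n";
    }
    # prints:
    # LF: cntrl
    # CR: cntrl
    # FF: cntrl
    # space: print
    ```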

    jue
    Jürgen Exner, Dec 24, 2007
    #12
  13. Larry

    Larry Guest

    In article <>,
    Jürgen Exner <> wrote:

    > Agreed, those characters are the intersection of the set of printable
    > characters and the set of ASCII characters, except that commonly the space
    > character 0x20 is considered a printable character, too. It just has a blank
    > glyph.


    by the way, I'd like to get rid of 0x00 also! The thing is that I'm
    coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
    0x10 and those ranging from 0x21 to 0x7E

    is that doable?

    thanks
    Larry, Dec 24, 2007
    #13
  14. Larry

    Larry Guest

    In article <>,
    Larry <> wrote:

    > 0x20 0x13
    > 0x10 and those ranging from 0x21 to 0x7E


    I'm hopeless at hex values...let's say:

    chr(10)
    chr(13)
    chr(32) to chr(126)

    thanks
    Larry, Dec 24, 2007
    #14
  15. Larry

    Larry Guest

    In article <>,
    Larry <> wrote:

    > I'm hopeless at hex values...let's say:
    >
    > chr(10)
    > chr(13)
    > chr(32) to chr(126)
    >
    > thanks


    well, for the moment I'll go along with keeping those ranging from 0x20
    to 0x7E ... so that I don't have to chomp and all...
    Larry, Dec 24, 2007
    #15
  16. Larry <> wrote:
    > Jürgen Exner <> wrote:
    > The thing is that I'm
    >coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
    >0x10 and those ranging from 0x21 to 0x7E


    Thank you for calling me a person with a bad char.

    *PLONK*

    jue
    Jürgen Exner, Dec 24, 2007
    #16
  17. Larry <> wrote:

    >In article <>,
    > Larry <> wrote:
    >
    >> I'm hopeless at hex values...let's say:
    >>
    >> chr(10)
    >> chr(13)
    >> chr(32) to chr(126)
    >>
    >> thanks

    >
    >well, for the moment I'll go along with keeping those ranging from 0x20
    >to 0x7E ... so that I don't have to chomp and all...


    What a concept!
    I am giving up.

    jue
    Jürgen Exner, Dec 24, 2007
    #17
  18. Larry

    Larry Guest

    In article <>,
    Jürgen Exner <> wrote:

    > What a concept!
    > I am giving up.


    please don't! it's xmas time after all...

    i need this to get values (commands) from CGI->param and need to get rid
    of those chars
    Larry, Dec 24, 2007
    #18
  19. Larry

    Larry Guest

    In article <fkj72b$1n26$>,
    "Petr Vileta" <> wrote:

    > my $input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]//g;


    thank you so much ... btw, what is chr (127) ??

    I think I'll make it this way:

    $input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;
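
    One thing a list of characters to delete does not cover is anything
    above 0xFF, as raised earlier in the thread. A possible alternative,
    sketched here under the assumption that only LF, CR and 0x20-0x7E
    should survive, is to negate the set of characters to keep. The sub
    name strip_bad_chars and the sample string are invented for
    illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper; keeps only LF, CR and 0x20-0x7E.
    sub strip_bad_chars {
        my ($text) = @_;
        $text =~ s/[^\x0A\x0D\x20-\x7E]//g;
        return $text;
    }

    my $raw = "ls\x00 -l\x{263A}\x7F\r\n";   # NUL, a wide character, DEL

    # The delete-list version leaves U+263A in place.
    (my $blacklisted = $raw) =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;

    printf "delete-list kept %d characters, keep-list kept %d\n",
        length($blacklisted), length(strip_bad_chars($raw));
    # prints "delete-list kept 8 characters, keep-list kept 7"
    ```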

    thanks
    Larry, Dec 25, 2007
    #19
  20. On Dec 24, 11:04 am, Jürgen Exner <> wrote:
    > Larry <> wrote:
    > > Jürgen Exner <> wrote:
    > > The thing is that I'm
    > >coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
    > >0x10 and those ranging from 0x21 to 0x7E

    >
    > Thank you for calling me a person with a bad char.
    >
    > *PLONK*
    >

    Wow, I thought for sure you'd finish with a
    smiley after that wonderful flash of wit....
    Of course, maybe you were sitting in a bad
    "char" :)

    --
    Charles DeRykus
    comp.llang.perl.moderated, Dec 25, 2007
    #20
