ways to check for octets outside of the safe ASCII range?

Discussion in 'Perl Misc' started by Ivan Shmakov, Dec 8, 2011.

  1. Ivan Shmakov

    Ivan Shmakov Guest

    I wonder, what's the (time-)efficient way to check an octet string
    for "ASCII safety"?

    The string is a POSIX filename, and POSIX is known to allow for
    arbitrary octet sequences (except those with ASCII NUL codes)
    for filenames. The tool I'm developing would store such
    filenames in an encoding-agnostic way (i. e., as BLOB's), unless
    it's certain that those are "safe ASCII."

    The check I've used in [1] is like:

    ## count the "unsafe" octets (outside of the [32, 126] range)
    my $unsafe
    = grep { $_ < 32 || $_ > 126 } (unpack ("C*", $filename));

    but I'm curious if there's a way better than unpacking the octet
    sequence into a vector (Perl list)?
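One possible alternative, sketched here under the assumption that `$filename` really is an octet string (no code points above 0xff), is to let `tr///` do the counting. With the `/c` flag and an empty replacement list it only counts the characters outside the given range and leaves the string unchanged, so no intermediate list is built. The sub names are made up for illustration; the second sub is the unpack/grep check from the post, kept for comparison.

```perl
use strict;
use warnings;

# Count octets outside the printable ASCII range [32, 126].
# tr/// with /c (complement) and an empty replacement list only counts
# the matching characters; it does not modify the string.
sub count_unsafe_octets {
    my ($filename) = @_;
    return ($filename =~ tr/\x20-\x7e//c);
}

# Reference: the unpack/grep check from the post, for comparison.
sub count_unsafe_octets_unpack {
    my ($filename) = @_;
    my $count = grep { $_ < 32 || $_ > 126 } unpack("C*", $filename);
    return $count;
}

# Print both counts side by side for a few sample names.
for my $name ("plain.txt", "with space", "tab\there", "caf\xc3\xa9") {
    printf "%d %d\n", count_unsafe_octets($name),
                      count_unsafe_octets_unpack($name);
}
```

Whether this is actually faster for typical filename lengths would need measuring, for example with the core Benchmark module.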

    TIA.

    [1] news:
    http://groups.google.com/group/alt.sources/msg/0ae6c64f26aea630

    --
    FSF associate member #7257
    Ivan Shmakov, Dec 8, 2011
    #1

  2. Ivan Shmakov writes:
    > I wonder, what's the (time-)efficient way to check an octet string
    > for "ASCII safety"?
    >
    > The string is a POSIX filename, and POSIX is known to allow for
    > arbitrary octet sequences (except those with ASCII NUL codes)
    > for filenames. The tool I'm developing would store such
    > filenames in an encoding-agnostic way (i. e., as BLOB's), unless
    > it's certain that those are "safe ASCII."
    >
    > The check I've used in [1] is like:
    >
    > ## count the "unsafe" octets (outside of the [32, 126] range)
    > my $unsafe
    > = grep { $_ < 32 || $_ > 126 } (unpack ("C*", $filename));
    >
    > but I'm curious if there's a way better than unpacking the octet
    > sequence into a vector (Perl list)?


    Assuming that ASCII is taken for granted, an obvious other idea would
    be

    $filename =~ /[\x0-\x20\x7f-\xff]/

    This will probably also need a 'use bytes'.
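A minimal sketch of the regex test wrapped in two predicates, assuming the input is a plain octet string as returned by readdir and friends. The sub names are hypothetical. The first uses the ranges from the pattern above (space counts as unsafe), the second allows space so that it agrees with the original [32, 126] check. Unlike the counting approach, a match can stop at the first offending octet.

```perl
use strict;
use warnings;

# True if the octet string contains anything outside [0x21, 0x7e],
# i.e. the ranges from the pattern above (space counts as unsafe).
sub has_unsafe_strict {
    my ($octets) = @_;
    return $octets =~ /[\x00-\x20\x7f-\xff]/ ? 1 : 0;
}

# Variant agreeing with the original [32, 126] check (space allowed).
sub has_unsafe {
    my ($octets) = @_;
    return $octets =~ /[\x00-\x1f\x7f-\xff]/ ? 1 : 0;
}

# Print the result of both predicates for a few sample names.
for my $name ("plain.txt", "with space", "caf\xc3\xa9") {
    printf "%d %d\n", has_unsafe_strict($name), has_unsafe($name);
}
```

Note that a scalar holding code points above 0xff would slip past both classes, so this only answers the question for genuine byte strings.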
    Rainer Weikusat, Dec 8, 2011
    #2

  3. Ben Morrow writes:
    > Quoth Rainer Weikusat:
    >> Ivan Shmakov writes:
    >> > I wonder, what's the (time-)efficient way to check an octet string
    >> > for "ASCII safety"?
    >> >
    >> > The string is a POSIX filename, and POSIX is known to allow for
    >> > arbitrary octet sequences (except those with ASCII NUL codes)
    >> > for filenames. The tool I'm developing would store such
    >> > filenames in an encoding-agnostic way (i. e., as BLOB's), unless
    >> > it's certain that those are "safe ASCII."
    >> >
    >> > The check I've used in [1] is like:
    >> >
    >> > ## count the "unsafe" octets (outside of the [32, 126] range)
    >> > my $unsafe
    >> > = grep { $_ < 32 || $_ > 126 } (unpack ("C*", $filename));
    >> >
    >> > but I'm curious if there's a way better than unpacking the octet
    >> > sequence into a vector (Perl list)?

    >>
    >> Assuming that ASCII is taken for granted, an obvious other idea would
    >> be
    >>
    >> $filename =~ /[\x0-\x20\x7f-\xff]/

    >
    > $filename !~ /[^[:ascii:]]/
    >
    > is clearer, and works properly against Unicode strings.


    Additionally, it doesn't work (in the sense that it would solve the
    problem). This includes that it is not supposed to 'work properly
    against unicode strings' aka 'let non-printable octets slip through if
    they happen to be part of utf8 multibyte characters'.

    [rw@error]~ $perl -e 'print " " =~ /[[:ascii:]]/, "\n"'
    1
    [rw@error]~ $perl -e 'print "\x1" =~ /[[:ascii:]]/, "\n"'
    1
    [rw@error]~ $perl -e 'print "\x7f" =~ /[[:ascii:]]/, "\n"'
    1

    A simpler way to test whether a string contains 'non-printable octets'
    would be

    $filename =~ /[^[:print:]]/

    except -- unfortunately space and htab (0x20 and 9) are printable (I
    don't quite understand why space is considered to be a 'safe'
    character while \t is not, hence I assumed that ' ' was also supposed
    to be excluded).
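It may be worth checking what the class actually accepts before relying on it. As far as I can tell, Perl's `[:print:]` matches space but not tab, since tab is a control character. The sketch below assumes Perl 5.14 or later for the `/a` modifier, which restricts POSIX classes to the ASCII range so that high octets are never treated as printable, whatever the string's internal representation. The sub name is made up for illustration.

```perl
use strict;
use warnings;

# /a (Perl 5.14+) restricts POSIX classes to the ASCII range, so high
# octets never count as printable.
sub has_nonprintable {
    my ($octets) = @_;
    return $octets =~ /[^[:print:]]/a ? 1 : 0;
}

# Report how each sample octet is classified.
my %samples = (space => " ", tab => "\t", del => "\x7f", high => "\xe9");
for my $label (sort keys %samples) {
    printf "%-5s %s\n", $label,
        has_nonprintable($samples{$label}) ? "flagged" : "ok";
}
```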

    >> This will probably also need a 'use bytes'.

    >
    > 'use bytes' is always wrong.


    A statement of the form 'xxx is always wrong' is always wrong when
    referring to some kind of existing feature. The 'use bytes'
    documentation states

    When "use bytes" is in effect [...] each string is treated as
    a series of bytes

    Since the OP was looking for 'ASCII safety of an octet string',
    treating a string as 'series of bytes' seems to be exactly what is
    necessary for that. So, what's the problem with that (and, just out of
    curiosity who believes this documented Perl feature should not be used
    for what technical reasons which are applicable to actual problems?).

    I admit that I'm so far rather convinced that 'not using use bytes' is
    'always wrong' for the problems I have to deal with (which usually
    involve strings of bytes and not 'characters' as arbitrarily defined,
    redefined and undefined by some US committee).
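For what it's worth, byte semantics can also be obtained without the pragma by making sure the scalar really holds octets before testing it. This is only a sketch of that idea, not a verdict on the disagreement: `utf8::downgrade` is a core function available without `use utf8`, and the sub name is hypothetical.

```perl
use strict;
use warnings;

# Return the number of octets outside [32, 126], or undef if the scalar
# holds code points above 0xff and so cannot be an octet string.
sub count_unsafe_checked {
    my ($string) = @_;
    # utf8::downgrade converts the internal representation to one byte
    # per character; with a true second argument it returns false
    # instead of dying when that is not possible.
    return undef unless utf8::downgrade($string, 1);
    return ($string =~ tr/\x20-\x7e//c);
}

# Three samples: plain ASCII, one high octet, one wide character.
for my $sample ("plain", "caf\xe9", "smile \x{263a}") {
    my $count = count_unsafe_checked($sample);
    print defined $count ? "$count\n" : "not an octet string\n";
}
```

Because the sub works on a copy of its argument, the caller's string is left in whatever internal form it had.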
    Rainer Weikusat, Dec 8, 2011
    #3
  4. Shmuel (Seymour J.) Metz wrote:
    >On 12/08/2011 at 02:52 PM, Rainer Weikusat said:
    >
    >>Assuming that ASCII is taken for granted, an obvious other idea would
    >>be

    >
    >> $filename =~ /[\x0-\x20\x7f-\xff]/

    >
    >Space is valid in file names.
    >
    > $filename =~ /[\x0-\x1f\x7f-\xff]/
    >
    >BTW, does POSIX limit file names to ASCII, or are, e.g., ISO-8859-1
    >accented letters, allowed?


    AFAIK (and I may be wrong) POSIX supports any Unicode file name.
    Therefore the OP's approach of looking at isolated octets is a sure
    way to ask for trouble.

    jue
    Jürgen Exner, Dec 9, 2011
    #4
  5. Ben Morrow writes:
    > Quoth Rainer Weikusat:
    >> Ben Morrow writes:
    >>
    >> > $filename !~ /[^[:ascii:]]/
    >> >
    >> > is clearer, and works properly against Unicode strings.

    >>
    >> Additionally, it doesn't work (in the sense that it would solve the
    >> problem).

    > <snip>
    >> A simpler way to test whether a string contains 'non-printable octets'
    >> would be
    >>
    >> $filename =~ /[^[:print:]]/

    >
    > You're right.
    >
    >> except -- unfortunately space and htab (0x20 and 9) are printable (I
    >> don't quite understand why space is considered to be a 'safe'
    >> character while \t is not, hence I assumed that ' ' was also supposed
    >> to be excluded).

    >
    > Space is an ordinary single-width character like any other, it just
    > happens not to have any ink in its glyph. Tab is a control character
    > that (typically) produces a context-dependent amount of whitespace.
    >
    > For example, an app that wanted to know whether it was safe to assume 1
    > column per byte would treat space like 'A', but not tab.


    Both space and \t (and \v, \r and \n, here taken to be C escape
    sequences mapped to ASCII) are whitespace characters. An application
    which wanted to know whether a filename can safely be fed to
    something which breaks its input into words separated by whitespace
    characters would treat them all differently from any non-whitespace
    character (e.g., encoding them in some form, such as URL encoding,
    so that 'splitting on whitespace' produces the correct
    results).
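A rough sketch of that kind of encoding, assuming octet-string input and URL-style percent escapes. The sub names are made up for illustration. Every octet outside [0x21, 0x7e] is escaped, and so is '%' itself so that the result can be decoded again.

```perl
use strict;
use warnings;

# Percent-encode every octet outside [0x21, 0x7e], plus '%' (0x25)
# itself so that the encoding stays reversible.
sub encode_filename {
    my ($octets) = @_;
    $octets =~ s/([^\x21-\x24\x26-\x7e])/sprintf('%%%02X', ord $1)/ge;
    return $octets;
}

# Reverse the encoding: turn each %XX escape back into one octet.
sub decode_filename {
    my ($encoded) = @_;
    $encoded =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $encoded;
}

# Encoded names contain no whitespace, so a split on whitespace
# recovers them intact.
my @names  = ("a b", "tab\there", "100%");
my $line   = join " ", map { encode_filename($_) } @names;
my @parsed = map { decode_filename($_) } split ' ', $line;
print "$line\n";
```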

    Depending on the unknown context of the original question, both
    interpretations could make sense (arguably, yours makes more sense
    because it is not based on the assumption that space was erroneously
    included).

    >> >> This will probably also need a 'use bytes'.
    >> >
    >> > 'use bytes' is always wrong.

    >>
    >> A statement of the form 'xxx is always wrong' is always wrong when
    >> referring to some kind of existing feature. The 'use bytes'
    >> documentation states
    >>
    >> When "use bytes" is in effect [...] each string is treated as
    >> a series of bytes

    >
    > Yes, I know that. The general opinion among those who actually know how
    > these things work (which doesn't include me) is that both the design and
    > the implementation are buggy, and the pragma needs to be deprecated and
    > then removed. I'm not making these things up, I'm simply relaying the
    > opinion of those perl developers who are actively working on perl's
    > Unicode implementation.


    If these people are not aware that Perl scalars don't necessarily
    store 'character strings' but also arbitrary binary data, and if they
    actually want to remove the ability to use them in this way from the
    language based on their ignorance of the existence of a world beyond
    text processing, they're crackpots and their opinions as irrelevant as
    "laymen's babbling" about any topic usually is.

    Sorry guys, computer networks do exist and XML is not the universal
    messaging data format. You may be convinced that this is terribly
    wrong and really shouldn't be this way, but then - please - go find
    yourself some soapbox and preach the true gospel to the nonbelievers
    elsewhere, leaving people who have to interoperate with the real world
    alone ...

    [...]

    >> Since the OP was looking for 'ASCII safety of an octet string',
    >> treating a string as 'series of bytes' seems to be exactly what is
    >> necessary for that. So, what's the problem with that (and, just out of
    >> curiosity who believes this documented Perl feature should not be used
    >> for what technical reasons which are applicable to actual problems?).

    >
    > Go find the relevant p5p threads if you want examples. There are quite a
    > few of them, as I recall...


    I don't even know what you consider to be relevant and I'm certainly
    not in the mood for trying to guess what the unknown source you
    claimed to be referring to could possibly be. That's a 08/15
    propaganda trick: Stay vague enough that people have to supply
    sensible interpretations of your statement using their own knowledge/
    experience and thus mistakenly believe they agree with you while
    they're actually agreeing with themselves.

    He who refers to authorities should name them.

    >> I admit that I'm so far rather convinced that 'not using use bytes' is
    >> 'always wrong' for the problems I have to deal with (which usually
    >> involve strings of bytes and not 'characters' as arbitrarily defined,
    >> redefined and undefined by some US committee).

    >
    > I was inclined to think the same thing, until I learned that it's not
    > that simple and, while 'use bytes' seems like an attractive idea, it
    > doesn't appear to be possible to make it work properly.


    Perl has always supported using scalars for binary data, and if the
    people who 'work on the Perl unicode implementation' cannot make that
    work correctly without breaking this feature, this would hint at the
    fact that either 'unicode support' cannot be implemented correctly or
    (more likely) the people who happen to dabble in this area are not
    competent enough to produce useful results.
    Rainer Weikusat, Dec 9, 2011
    #5
  6. Ivan Shmakov

    Ivan Shmakov Guest

    POSIX vs. filename encoding

    >>>>> Jürgen Exner writes:
    >>>>> Shmuel (Seymour J.) Metz wrote:


    [Somehow, I believe that this discussion is more appropriate for
    news:comp.unix.programmer. Set Followup-To: there.]

    […]

    >> BTW, does POSIX limit file names to ASCII, or are, e.g., ISO-8859-1
    >> accented letters, allowed?


    > AFAIK (and I may be wrong) POSIX supports any Unicode file name.
    > Therefore the OP's approach of looking at isolated octets is a sure
    > way to ask for trouble.


    AIUI, POSIX filenames are arbitrary octet strings. They can be
    in any encoding (e.g., ISO-8859-1, UTF-8, KOI8-R) as long as it
    doesn't make use of the \000 octet (i.e., UTF-16, UTF-32,
    etc. cannot be used; which is, roughly, the very reason behind
    UTF-8.)
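The NUL-octet point can be illustrated with the core Encode module. This is a small sketch with a made-up sample string and a hypothetical helper name: it encodes the same two characters in several encodings and reports which results contain a \000 octet.

```perl
use strict;
use warnings;
use Encode qw(encode);

# True if the octet string contains a NUL (\000) octet.
sub contains_nul {
    my ($octets) = @_;
    return index($octets, "\0") >= 0 ? 1 : 0;
}

my $name = "A\x{e9}";    # "A" followed by e-acute (U+00E9)
for my $encoding (qw(UTF-8 ISO-8859-1 UTF-16BE UTF-32BE)) {
    my $octets = encode($encoding, $name);
    # Show the encoding name, the octets in hex, and the NUL verdict.
    printf "%-10s %-18s %s\n", $encoding, unpack("H*", $octets),
        contains_nul($octets) ? "contains NUL" : "no NUL";
}
```

The 16-bit and 32-bit encodings pad ASCII-range characters with zero octets, which is what rules them out for filenames.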

    In particular, it's perfectly possible for different users of
    the same multi-user system (and filesystem) to stick to
    different encodings. The software they use will interpret the
    filenames according to the locale settings in effect for that
    particular user (or, actually, for that particular application.)
    This may, indeed, fail if one user tries to access another
    user's files without tweaking his or her locale to match the
    other user's preference.

    (That's why my software has to be encoding-agnostic.)

    --
    FSF associate member #7257
    Ivan Shmakov, Dec 12, 2011
    #6
