ways to check for octets outside of the safe ASCII range?

Ivan Shmakov · Dec 8, 2011

I wonder, what's the (time-)efficient way to check an octet string
for "ASCII safety"?

The string is a POSIX filename, and POSIX is known to allow for
arbitrary octet sequences (except those with ASCII NUL codes)
for filenames. The tool I'm developing would store such
filenames in an encoding-agnostic way (i. e., as BLOB's), unless
it's certain that those are "safe ASCII."

The check I've used in [1] is like:

## count the "unsafe" octets (outside of the [32, 126] range)
my $unsafe
= grep { $_ < 32 || $_ > 126 } (unpack ("C*", $filename));

but I'm curious if there's a way better than unpacking the octet
sequence into a vector (Perl list)?
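One option that avoids building a list at all is Perl's tr/// operator, which can count characters in place. What follows is only a minimal sketch, assuming the "safe" range is exactly [32, 126] as in the snippet above; count_unsafe is a hypothetical helper name, and no timings have been measured here.

```perl
use strict;
use warnings;

# Hypothetical helper: count octets outside the "safe" range [0x20, 0x7E].
# With the /c modifier, an empty replacement list and no /d, tr/// only
# counts the characters in the complemented set; the string is not modified.
sub count_unsafe {
    my ($filename) = @_;
    return ($filename =~ tr/\x20-\x7E//c);
}

print count_unsafe("plain name.txt"), "\n";   # no unsafe octets
print count_unsafe("tab\there\x80"), "\n";    # tab and 0x80 are both unsafe
```

A non-zero result would then mean "store as a BLOB".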

TIA.

[1] http://groups.google.com/group/alt.sources/msg/0ae6c64f26aea630

Rainer Weikusat · Dec 8, 2011

Ivan Shmakov said:
I wonder, what's the (time-)efficient way to check an octet string
for "ASCII safety"?

The string is a POSIX filename, and POSIX is known to allow for
arbitrary octet sequences (except those with ASCII NUL codes)
for filenames. The tool I'm developing would store such
filenames in an encoding-agnostic way (i. e., as BLOB's), unless
it's certain that those are "safe ASCII."

The check I've used in [1] is like:

## count the "unsafe" octets (outside of the [32, 126] range)
my $unsafe
= grep { $_ < 32 || $_ > 126 } (unpack ("C*", $filename));

but I'm curious if there's a way better than unpacking the octet
sequence into a vector (Perl list)?

Assuming that ASCII is taken for granted, an obvious other idea would
be

$filename =~ /[\x0-\x20\x7f-\xff]/

This will probably also need a 'use bytes'.
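The regex idea can be wrapped up as follows. This is a sketch rather than a definitive answer: has_unsafe_octet is a hypothetical name, the class is written as a complement of the original [32, 126] range (so space counts as safe, unlike in the pattern above), and it assumes $name already holds octets, for instance as returned by readdir, in which case no pragma should be needed.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name contains any octet outside [0x20, 0x7E].
# Assumes $name is an octet string, so each "character" the regex sees
# is one octet with a value in 0..255.
sub has_unsafe_octet {
    my ($name) = @_;
    return ($name =~ /[^\x20-\x7E]/) ? 1 : 0;
}

for my $name ("plain name.txt", "tab\there", "caf\xE9") {
    printf "%-16s %s\n", join(' ', map { sprintf '%02X', $_ } unpack 'C*', $name),
        has_unsafe_octet($name) ? "unsafe" : "safe";
}
```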

Rainer Weikusat · Dec 8, 2011

Ben Morrow said:
Quoth Rainer Weikusat said:

Ivan Shmakov said:

I wonder, what's the (time-)efficient way to check an octet string
for "ASCII safety"?

The string is a POSIX filename, and POSIX is known to allow for
arbitrary octet sequences (except those with ASCII NUL codes)
for filenames. The tool I'm developing would store such
filenames in an encoding-agnostic way (i. e., as BLOB's), unless
it's certain that those are "safe ASCII."

The check I've used in [1] is like:

## count the "unsafe" octets (outside of the [32, 126] range)
my $unsafe
= grep { $_ < 32 || $_ > 126 } (unpack ("C*", $filename));

but I'm curious if there's a way better than unpacking the octet
sequence into a vector (Perl list)?

Assuming that ASCII is taken for granted, an obvious other idea would
be

$filename =~ /[\x0-\x20\x7f-\xff]/

$filename !~ /[^[:ascii:]]/

is clearer, and works properly against Unicode strings.

Additionally, it doesn't work (in the sense that it would solve the
problem). This includes that it is not supposed to 'work properly
against unicode strings' aka 'let non-printable octets slip through if
they happen to be part of utf8 multibyte characters'.

[rw@error]~ $perl -e 'print " " =~ /[[:ascii:]]/, "\n"'
1
[rw@error]~ $perl -e 'print "\x1" =~ /[[:ascii:]]/, "\n"'
1
[rw@error]~ $perl -e 'print "\x7f" =~ /[[:ascii:]]/, "\n"'
1

A simpler way to test whether a string contains 'non-printable octets'
would be

$filename =~ /[^[:print:]]/

except -- unfortunately space and htab (0x20 and 9) are printable (I
don't quite understand why space is considered to be a 'safe'
character while \t is not, hence I assumed that ' ' was also supposed
to be excluded).
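Which octets the POSIX class [:print:] actually accepts can be checked directly. The sketch below assumes a plain byte string with no locale or Unicode pragmas in effect. As far as I know, Perl's [:print:] accepts space but not tab (tab being a control character), so the result for 0x09 may differ from what the paragraph above expects; running it on the Perl in question settles it.

```perl
use strict;
use warnings;

# Record and report which of a few sample octets [:print:] accepts.
my %printable;
for my $octet (0x09, 0x1F, 0x20, 0x41, 0x7E, 0x7F) {
    $printable{$octet} = (chr($octet) =~ /[[:print:]]/) ? 1 : 0;
    printf "0x%02X %s\n", $octet, $printable{$octet} ? "print" : "non-print";
}
```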

'use bytes' is always wrong.

A statement of the form 'xxx is always wrong' is always wrong when
referring to some kind of existing feature. The 'use bytes'
documentation states

When "use bytes" is in effect [...] each string is treated as
a series of bytes

Since the OP was looking for 'ASCII safety of an octet string',
treating a string as 'series of bytes' seems to be exactly what is
necessary for that. So, what's the problem with that (and, just out of
curiosity who believes this documented Perl feature should not be used
for what technical reasons which are applicable to actual problems?).

I admit that I'm so far rather convinced that 'not using use bytes' is
'always wrong' for the problems I have to deal with (which usually
involve strings of bytes and not 'characters' as arbitrarily defined,
redefined and undefined by some US committee).
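To make concrete what the pragma changes, here is a small sketch comparing length under character semantics, under 'use bytes', and after an explicit utf8::encode. The variable names are made up for illustration, and the sketch takes no side in the argument; it only shows two ways of getting at the octets of a string that holds a character above 0xFF.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";              # one character, U+263A

# Character semantics: the string is one character long.
my $chars = length $smiley;

# Under 'use bytes', length reports the size of the internal
# (UTF-8) representation instead.
my $octets_via_pragma = do { use bytes; length $smiley };

# Alternative without the pragma: encode a copy to UTF-8 octets explicitly.
my $encoded = $smiley;
utf8::encode($encoded);
my $octets_explicit = length $encoded;

print "$chars $octets_via_pragma $octets_explicit\n";
```

For a string that already holds only octets (such as a filename read from the filesystem), all three numbers should coincide.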

Jürgen Exner · Dec 9, 2011

Shmuel (Seymour J.) Metz said:
on 12/08/2011 said:

Assuming that ASCII is taken for granted, an obvious other idea would
be

$filename =~ /[\x0-\x20\x7f-\xff]/

Space is valid in file names.

$filename =~ /[\x0-\x1f\x7f-\xff]/

BTW, does POSIX limit file names to ASCII, or are, e.g., ISO-8859-1
accented letters, allowed?

AFAIK (and I may be wrong) POSIX supports any Unicode file name.
Therefore the OPs approach to look at isolated octets is a sure way to
ask for trouble.

jue

Rainer Weikusat · Dec 9, 2011

Ben Morrow said:
Quoth Rainer Weikusat said:

Ben Morrow said:

$filename !~ /[^[:ascii:]]/

is clearer, and works properly against Unicode strings.

Additionally, it doesn't work (in the sense that it would solve the
problem).

A simpler way to test whether a string contains 'non-printable octets'
would be

$filename =~ /[^[:print:]]/

You're right.

except -- unfortunately space and htab (0x20 and 9) are printable (I
don't quite understand why space is considered to be a 'safe'
character while \t is not, hence I assumed that ' ' was also supposed
to be excluded).

Space is an ordinary single-width character like any other, it just
happens not to have any ink in its glyph. Tab is a control character
that (typically) produces a context-dependent amount of whitespace.

For example, an app that wanted to know whether it was safe to assume 1
column per byte would treat space like 'A', but not tab.

Both space and \t (and \v, \r and \n, here supposed to be C escape
sequence mapped to ASCII) are whitespace characters and an application
which wanted to know whether it was safe to assume that a filename can
be fed to something which breaks its input into words separated by
whitespace characters would treat them all differently from any
non-whitespace character (eg, encoding them in some form, such as URL
encoding, so that 'splitting on whitespace' produces the correct
results).
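The "encode them in some form" idea could look roughly like this. It is a sketch under assumptions: encode_whitespace is a hypothetical helper, the set of whitespace octets is taken to be 0x09-0x0D plus 0x20, and '%' is encoded as well so that the mapping stays reversible.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode ASCII whitespace (and '%' itself)
# in a filename given as an octet string.
sub encode_whitespace {
    my ($name) = @_;
    (my $encoded = $name) =~ s/([%\x09-\x0D\x20])/sprintf('%%%02X', ord $1)/ge;
    return $encoded;
}

# Two filenames joined into one whitespace-separated line, then split again.
my $line  = join ' ', map { encode_whitespace($_) } ("my file.txt", "tab\tname");
my @words = split ' ', $line;
print "$_\n" for @words;
```

After encoding, splitting on whitespace yields one word per filename, which is the property the paragraph above is after.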

Depending on the unknown context of the original question, both
interpretations could make sense (arguably, yours makes more sense
because it is not based on the assumption that space was erroneously
included).

This will probably also need a 'use bytes'.

'use bytes' is always wrong.

A statement of the form 'xxx is always wrong' is always wrong when
referring to some kind of existing feature. The 'use bytes'
documentation states

When "use bytes" is in effect [...] each string is treated as
a series of bytes

Yes, I know that. The general opinion among those who actually know how
these things work (which doesn't include me) is that both the design and
the implementation are buggy, and the pragma needs to be deprecated and
then removed. I'm not making these things up, I'm simply relaying the
opinion of those perl developers who are actively working on perl's
Unicode implementation.

If these people are not aware that Perl scalars don't necessarily
store 'character strings' but also arbitrary binary data, and if they
actually want to remove the ability to use them in this way from the
language based on their ignorance of the existence of a world beyond
text processing, they're crackpots and their opinions as irrelevant as
"laymen's babbling" about any topic usually is.

Sorry guys, computer networks do exist and XML is not the universal
messaging data format. You may be convinced that this is terribly
wrong and really shouldn't be in this way, but then - please - go find
yourself some soapbox and preach the true gospel to the nonbelievers
elsewhere, leaving people who have to interoperate with the real world
alone ...

[...]

Go find the relevant p5p threads if you want examples. There are quite a
few of them, as I recall...

I don't even know what you consider to be relevant and I'm certainly
not in the mood for trying to guess what the unknown source you
claimed to be referring to could possibly be. That's a 08/15
propaganda trick: Stay vague enough that people have to supply
sensible interpretations of your statement using their own knowledge/
experience and thus mistakenly believe to agree with you while they're
actually agreeing with themselves.

He who refers to authorities should name them.

I was inclined to think the same thing, until I learned that it's not
that simple and, while 'use bytes' seems like an attractive idea, it
doesn't appear to be possible to make it work properly.

Perl has supported using scalars for binary data since ever and if the
people who 'work on the Perl unicode implementation' cannot make that
work correctly without breaking this feature, this would hint at the
fact that either 'unicode support' cannot be implemented correctly or
(more likely) the people who happen to dabble in this area are not
competent enough to produce useful results.

Ivan Shmakov · Dec 12, 2011

[Somehow, I believe that this discussion is more appropriate for
news:comp.unix.programmer. Set Followup-To: there.]

[…]

AFAIK (and I may be wrong) POSIX supports any Unicode file name.
Therefore the OPs approach to look at isolated octets is a sure way
to ask for trouble.

AIUI, POSIX filenames are arbitrary octet strings. They can be
in any encoding (e. g., ISO-8859-1, UTF-8, koi8-r) as long as it
doesn't make use of the \000 octet (i. e., UTF-16, UTF-32,
etc. cannot be used; which is, roughly, the very reason behind
UTF-8.)

In particular, it's perfectly possible for different users of
the same multi-user system (and filesystem) to stick to
different encodings. The software they use will interpret the
filenames according to the locale settings in effect for that
particular user (or, actually, for that particular application.)
Which may, indeed, fail if one user will try to access different
users' files without tweaking his or her locale to match the
other user's preference.

(That's why my software has to be encoding-agnostic.)
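The point about one octet sequence meaning different things to different users can be illustrated with the core Encode module. This is only a sketch; the two octets are an arbitrary example, not taken from any real filename.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xC3\xA9";   # two octets, as they might appear in a filename

# The same octets interpreted under two different encodings.
my $as_utf8   = decode('UTF-8',      $octets);
my $as_latin1 = decode('ISO-8859-1', $octets);

printf "UTF-8:      %d character(s), first is U+%04X\n",
    length($as_utf8), ord($as_utf8);
printf "ISO-8859-1: %d character(s), first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
```

Storing the raw octets, rather than either decoded form, is what keeps the tool independent of such choices.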
