regexp for matching a string with mandatory underscores

Peter J. Holzer · Jan 10, 2012

RTFS. tr/// is pretty highly optimised; in particular, the 'count
characters' case has its own implementation that does no copying, except
when counting non-SvUTF8 characters in a SvUTF8 string. In that case
obviously every character of the string being counted has to be
individually converted to UTF-32, so there's no allocation but there is
effectively copying.
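
For readers unfamiliar with the idiom, here is a minimal sketch of the "count characters" form of tr/// discussed above. The variable names are made up for illustration. utf8::upgrade is used only to force the UTF-8 internal representation; the snippet shows that the result is the same either way, not which code path perl takes internally.

```perl
use strict;
use warnings;

# With an empty replacement list and no /d, tr/// only counts:
# the string itself is left unchanged.
my $name        = "foo_bar_baz";
my $plain_count = ($name =~ tr/_//);

# Same characters, but stored with the internal UTF-8 flag set.
my $upgraded = $name;
utf8::upgrade($upgraded);
my $upgraded_count = ($upgraded =~ tr/_//);

print "plain: $plain_count, upgraded: $upgraded_count\n";
```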

This sort of inefficiency is unavoidable when using UTF-8 as an internal
representation, which is why certain people are trying so hard to make
perl's internal representation opaque. Everyone now knows that using
UTF-8 was a mistake, but it can't be fixed until people get used to
keeping their fingers out.

In Pike (like Perl a vaguely C-like interpreted language) strings always
consist of elements of equal length: All characters in a string
are either 1 byte or 2 bytes or 4 bytes in length. That may waste some
space if you have a string with lots of ASCII characters and one 💩 in
it, but it makes most string operations simpler.
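
The space trade-off can be sketched in Perl. This is only an illustration of the "all elements have the same width" idea, not Pike's actual implementation; element_width and storage_bytes are hypothetical helpers.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Width in bytes that every element would need if the whole string
# had to use one fixed element size (1, 2 or 4 bytes).
sub element_width {
    my ($str) = @_;
    return 1 if $str eq '';
    my $max = max(map { ord($_) } split //, $str);
    return $max <= 0xFF ? 1 : $max <= 0xFFFF ? 2 : 4;
}

# Total payload size under that model.
sub storage_bytes {
    my ($str) = @_;
    return length($str) * element_width($str);
}

my $ascii = "a" x 1000;
my $mixed = $ascii . "\x{1F4A9}";   # one character outside the BMP

printf "ascii: %d bytes\n", storage_bytes($ascii);
printf "mixed: %d bytes\n", storage_bytes($mixed);
```

A single wide character quadruples the storage here, but in exchange the n-th character always sits at a known offset.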

Theoretically, Perl could switch to such a model without breaking
programs (except XS code). Practically ...

hp

Ilya Zakharevich · Jan 10, 2012

RTFS. tr/// is pretty highly optimised; in particular, the 'count
characters' case has its own implementation that does no copying, except
when counting non-SvUTF8 characters in a SvUTF8 string. In that case
obviously every character of the string being counted has to be
individually converted to UTF-32, so there's no allocation but there is
effectively copying.

What makes this "obvious"? I see absolutely no need for this...
Unless you mean "copying one char at a time", not copying the whole
string. And such things MUST be documented (since in the presence of
tie()ing they are not implementation details).

This sort of inefficiency is unavoidable when using UTF-8 as an internal
representation, which is why certain people are trying so hard to make
perl's internal representation opaque. Everyone now knows that using
UTF-8 was a mistake, but it can't be fixed until people get used to
keeping their fingers out.

Why do you think it is an inefficiency? Today's machines are even more
constrained by memory than the machines of 10 years ago... (In proportion
to the amount of data one may [and so does] store on disk.)

In the tied (or more generally magic) case, perl calls FETCH to update
the string stored in the scalar, does the tr/// on that string, then
calls STORE to update the magic. A tie implementation that's being
careful about copying will have no additional problems because of tr///.

Now I'm absolutely confused... Are you still discussing tr/foo//
here? Do you say it WOULD call STORE?

And "being careful about copying" brings no imagery here. What
EXACTLY do you mean by that?

IMO, an operation which has the semantics of reading should NOT call
STORE on tied data...
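
The question can be checked empirically rather than argued. The following is a minimal sketch, assuming a hypothetical CountingScalar tie class that simply counts calls. No expected STORE count is given for the counting case, since that is exactly the point under dispute and may depend on the perl version.

```perl
use strict;
use warnings;

package CountingScalar;

# Hypothetical tie class: holds a value and counts FETCH/STORE calls.
sub TIESCALAR {
    my ($class, $value) = @_;
    return bless { value => $value, fetch => 0, store => 0 }, $class;
}

sub FETCH {
    my ($self) = @_;
    $self->{fetch}++;
    return $self->{value};
}

sub STORE {
    my ($self, $new) = @_;
    $self->{store}++;
    $self->{value} = $new;
}

package main;

my $obj = tie my $s, 'CountingScalar', 'foo_bar_baz';

# Count-only form: empty replacement list, no /d.
my $count = ($s =~ tr/_//);

printf "count=%d FETCH=%d STORE=%d\n",
    $count, $obj->{fetch}, $obj->{store};
```

Running the same script with a modifying form such as tr/_/-/ gives a baseline for comparison.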

Ilya

Rainer Weikusat · Jan 10, 2012

[...]

Using UTF-8 certainly makes the code a lot hairier, and I suspect
that costs more than the memory. You end up converting
character-at-a-time to UTF-32 practically every time you do anything
with that string, rather than being able to use fast
interfaces like wmemchr(3).
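
The difference in access pattern can be sketched in Perl on encoded byte buffers. This is an illustration under assumptions, not what perl does internally: utf32_char_at and utf8_offset_of_char are hypothetical helpers, and the UTF-8 walker assumes well-formed input.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "na\x{EF}ve_\x{1F4A9}_x";
my $utf8  = encode('UTF-8',    $text);
my $utf32 = encode('UTF-32LE', $text);

# Fixed width: character n lives at byte offset 4*n.
sub utf32_char_at {
    my ($buf, $n) = @_;
    return unpack('V', substr($buf, 4 * $n, 4));
}

# UTF-8: the offset of character n is only known after walking
# over the lead bytes of all preceding characters.
sub utf8_offset_of_char {
    my ($buf, $n) = @_;
    my $off = 0;
    for (1 .. $n) {
        my $lead = ord(substr($buf, $off, 1));
        $off += $lead < 0x80 ? 1
              : $lead < 0xE0 ? 2
              : $lead < 0xF0 ? 3
              :                4;
    }
    return $off;
}

printf "char 6 in UTF-32: U+%X\n", utf32_char_at($utf32, 6);
printf "char 6 starts at UTF-8 byte %d\n", utf8_offset_of_char($utf8, 6);
```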

Sometimes, life is just mean. Couldn't the people who invented UTF-8
in 1993 for use on their incredibly fast machines have foreseen how
much slower hardware was going to become in the next 19 years?
