XML::LibXML UTF-8 toString() -vs- nodeValue()

sln

-----------------------------------------------------

Hey, first off, I really appreciate your responses, especially the Unicode ones.
I am newly interested in this and am learning, but I understand a good portion.
Below, I'm just going to briefly clarify some of my previous statements.

Thanks!
-sln

-----------------------------------------------------

I guess this phrasing skipped a few things. 'Streams' is not really a
stand-alone definition of anything, but shorthand for doing operations
on file descriptors in the kernel, via an API (POSIX). Certainly there
is no regular-expression engine, or anything like that, in the kernel.
You don't need regexps at all to parse XML (or any other language).
And you certainly don't need to do them on streams, since you can always
read the next block or line from the stream and append it to your
buffer.

You certainly don't need regexps to parse XML, and you certainly don't need
regexps to do string comparisons on XML. 'Stream processing', however, has a
more abstract meaning. Basically it means processing locally disposable data
while traversing a buffer of a kernel file descriptor, not waiting for the
end of the file/low-level I/O, device, pipe, or whatever the descriptor refers to.
You certainly can't do that in the kernel. The key is that a small user buffer
is populated as the 'stream' passes through it. The buffer is either fixed size
or expands and contracts slightly as necessary to process events as they are
parsed, in computer time, not necessarily real time.
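
For illustration, a minimal sketch (mine, not from the original posts) of that
small user buffer: read chunks from the descriptor and consume whatever
complete records have accumulated, keeping only the unconsumed tail.

use strict;
use warnings;

open my $fh, '<', 'input.log' or die $!;   # hypothetical input file
my $buf = '';
while (sysread($fh, $buf, 4096, length $buf)) {
    # consume complete lines; a partial last line stays in $buf
    while ($buf =~ s/\A([^\n]*)\n//) {
        process_line($1);                  # hypothetical handler
    }
}
process_line($buf) if length $buf;         # trailing data without newline

sub process_line { print "got: $_[0]\n" }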

The machinations of 'buffering', as it seems to indicate some delineation in
your mind, have nothing to do with 'stream' parsing or processing, only with
the notion of incremental processing.

Some foolish people conflate XML parsing and regular expressions at some high
level of language abstraction, which totally misses the point.

Regular expressions used for parsing XML are no different from simple string comparison
of token punctuation. It is for that reason that I made my statement.

There are many examples of push/pull stream-oriented processors.

Some references on stream-oriented processing of XML (SAX or near-SAX compliant):
http://en.wikipedia.org/wiki/Expat_(XML)
http://en.wikipedia.org/wiki/Streaming_Transformations_for_XML

[paragraph moved]
On the other hand, I think you don't know what a stream is:

open my $fh, '<', 'test.xml';

Now $fh refers to a stream.

No, not really; it refers to a file descriptor.
Please show me how you can apply a regexp to
this stream. Solutions which don't count:

As I said, there is 'no formal definition' of a stream. By all accounts
a 'stream' is an abstract concept, akin to a tree watching water flow by,
a near-static observer of fluidic motion.
* reading chunks from the stream into a scalar variable and then
applying the regexp to this variable (because then you apply it to a
string (as I wrote), not a stream.

Again, what is a stream? In this use, it's an abstraction consisting of
buffering and processing layers in fluidic motion, in a continuous manner.
A 'string' has nothing to do with anything.
* writing your own regexp engine (since Perl is a general purpose
programming language, you can of course write that but we were
talking about Perl's builtin regexp).

But regexes have nothing to do with streams per se; there is only a limited,
fixed API (soon to be expanded) that deals with file descriptors
(or Microsoft's FILE*). So, you can skip this process.
pack and unpack are Perl functions. They can only be applied to strings,
not streams. If you don't mean these functions but something else, be
more specific. And I have no idea what a "regex stream" might be. A
stream composed of regexps? A stream with special support for regexps?
A stream split into records with a regexp?

Remember, 'stream' is an abstract concept, and so is a 'record'.
For the record, stream parsing/processing is grabbing from 1 to a user-defined
amount of characters/data, using an API that works on the file descriptor's kernel data,
to match a pattern on which to process. This requires user-space buffering.
The concept of 'stream' processing is the antithesis of processing a complete data set.

Stream-parsing XML can be as simple as reading 1 character at a time, buffering until
a key character is found that may represent a character used in the closure of a statement,
processing that possibility, then clearing the buffer or continuing to buffer. It can also
depend on the parsing state of the XML processor. The result is the same:
cars are taken off the track and processed. Most XML 'state' processors will stop at
(nearly) the first point of syntax error (MSXML does this). Regular expressions offer
a distinct advantage in this regard: they will/can continue processing to report other errors
and advance the stream, but they do not enjoy the speed that, say, Expat does. Stream-processing
XML has unique advantages over trees (although trees are now windowed) and enables
multi-level filters.
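
As a toy sketch of that one-character-at-a-time idea (my example, not sln's
actual code): accumulate until a '>' may close a tag, hand the buffered
fragment to a handler, then clear the buffer.

use strict;
use warnings;

open my $xml, '<', 'test.xml' or die $!;   # hypothetical file
my $buf = '';
while (defined(my $ch = getc $xml)) {
    $buf .= $ch;
    if ($ch eq '>') {          # possible closure of a tag
        handle_fragment($buf);
        $buf = '';
    }
}

sub handle_fragment { print "fragment: $_[0]\n" }
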
I, ah, think you're missing what Unicode is.

I know quite well what Unicode is - I found characterset issues
fascinating ever since I turned on an Apple ][ in 1984 and it identified
itself as "Apple ÜÄ". I've read Rob Pike's paper in the early 90s and
the full unicode standard (version 2.0) in the late 90s. And I've
discussed character encoding matters (including Unicode) a lot on
various newsgroups and mailinglists over the years and fixed a few
encoding related problems in various pieces of software.
Ah, back to my original argument, Unicode!
It was not a beef with Unicode, not at all, but it got me very interested in it.

I didn't want to use pack/unpack templates that had no variability.
I needed to do pattern searches on 32-bit integers, plain and simple.
It had nothing to do with Unicode at all. For instance, if I found a numeric
256 (32-bit integer) in a stream of 32-bit integers, I wanted to grab
the 5th following 32-bit integer in the stream no matter what its value was.
This is the simple explanation; the real one involved complex variability.
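
One way to do that lookup without regexps at all (an illustration with
made-up data, not sln's real code) is to unpack the 32-bit integers and scan:

use strict;
use warnings;

my $data = pack 'L*', (1, 256, 10, 20, 30, 40, 50, 60, 99);
my @ints = unpack 'L*', $data;
for my $i (0 .. $#ints) {
    if ($ints[$i] == 256 && $i + 5 <= $#ints) {
        print "5th integer after 256: $ints[$i + 5]\n";   # prints 50
        last;
    }
}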

So I looked at Unicode and Perl's utf-8 as the internal default,
as character representations of 32-bit integers, to be used in regular expressions.
I didn't start with 'encodings'. In other words, encoding had nothing to do with
what I wanted to do. I understand there are encodings that translate to the code points,
in the particular Unicode flavor you want: 8/16/32, endianness and byte order mark.
The octets are the 1-6 byte (8-bit) result of the encoding.

The code points run in ranges within 0-(2**32 - 1), but they do run in ranges (UTF-32 has no code points in between).
Between those ranges you run into Unicode internal controls and reserved attributes (BOM, endianness, etc.).

I guess I don't care about encoding if I can internalize (Perl's utf-8) the full range
of 32-bit integers as characters to be used in regular expressions, then extract them back to 32-bit
integers to be used elsewhere.
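
A sketch of that chr()-based idea (my example; values past 0x10FFFF draw
"utf8" warnings, hence the no-warnings line):

use strict;
use warnings;
no warnings 'utf8';

my @ints = (7, 256, 10, 20, 30, 40, 50, 60);
my $str  = join '', map { chr } @ints;

# find 256, skip 4 characters, capture the 5th one after it
if ($str =~ /\x{100}.{4}(.)/s) {
    print "matched value: ", ord($1), "\n";   # prints 50
}
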
......
I thought I had posted some code when I responded to this one ^^^^. Guess I didn't.
I will post a clipped follow-up code sample.
Code is always nice because it is unambiguous (unlike the English
language). However, keep in mind that this is a discussion group, not a
code repository. Any code example longer than 50 lines or so is unlikely
to be read.


I've read that several times (and criticized it here, too).


If you think this is a fight where one of us has to win and the other to
capitulate, I'll stop now.

hp


I hope you understand what my meaning is now, 'capitulate' is just a word.

Thank you!

-sln
 
Peter J. Holzer

Before anything else, I beg your and everyone else's pardon. For some
weird reason, I'd called "tokens" "literals". Now I feel much better.

No. Almost all encodings today are supersets of US-ASCII.

Consider these two programs: *SKIP*
$ perl -Mutf8 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
Wide character in print at -e line 1.
фыва
� *SKIP*
{2775:24} [0:0]$ perl -Mencoding=latin1 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
фыва
�

use encoding also sets the binmode for STDOUT and STDERR, so you won't

No, it doesn't (s/STDERR/STDIN/)

Yes, that was a typo. Sorry.

{5665:37} [0:0]$ perl -Mencoding=utf8 -wle 'print STDERR "фыва"'
Wide character in print at -e line 1.
фыва
get a warning here. Again, I was talking only about compile time
effects, not run time, so I didn't mention that (you can read the manual
yourself).

I fail to see any compile time effects -- either in those two above or
this one below

Well, you aren't looking for any compile time effects, so you won't see
any :).

Let's compare 4 programs, which are all essentially the same:

#!/usr/bin/perl
use XXX ###
use warnings;
use strict;

my $greeting = "Καλημέρα κόσμε";
dumpstr($greeting);

sub dumpstr {
    my ($s) = @_;

    print utf8::is_utf8($s) ? "char" : "byte";
    print "[", length($s), "]";
    print ":";
    for (split //, $s) {
        printf " %#02x", ord($_);
    }
    print "\n";
}
__END__

The differences are in the encoding of the source file (UTF-8 vs.
ISO-8859-7) and the line marked "use XXX ###" above.

1) encoded in UTF-8, contains "use utf8;"
prints:

char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5

2) encoded in UTF-8, no "use utf8;"
prints:

byte[27]: 0xce 0x9a 0xce 0xb1 0xce 0xbb 0xce 0xb7 0xce 0xbc 0xce 0xad
0xcf 0x81 0xce 0xb1 0x20 0xce 0xba 0xcf 0x8c 0xcf 0x83 0xce 0xbc 0xce
0xb5

3) encoded in ISO-8859-7, contains "use encoding 'ISO-8859-7';"
prints:

char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5

4) encoded in ISO-8859-7, no "use encoding 'ISO-8859-7';"
prints:

byte[14]: 0xca 0xe1 0xeb 0xe7 0xec 0xdd 0xf1 0xe1 0x20 0xea 0xfc 0xf3
0xec 0xe5

As you can see, in the two cases where "use utf8" resp. "use encoding"
was used, the string constant was converted to a character string: The
so-called utf8 flag is on, the first character ("Κ") is U+039A ("GREEK
CAPITAL LETTER KAPPA"). In the other two cases the string is left as an
uninterpreted byte string: (0xCE 0x9A) is the UTF-8 encoding of a Kappa,
(0xCA) is the ISO-8859-7 encoding of a Kappa.

You can verify that the compiler really converts the string constant
(and doesn't insert a call to encode which is evaluated at run-time)
with -MO=Concise.
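
For example (my invocation; the exact output varies by perl version), the
constant shows up as a single const op whose value is already the decoded
character string, with no run-time decode call in sight:

$ perl -MO=Concise -e 'use utf8; my $s = "Καλημέρα κόσμε";'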


My understanding is based on this -- C<perldoc perlunicode>

"use encoding" needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode
model: implicit upgrading from byte strings to Unicode strings
assumes that they were encoded in ISO 8859-1 (Latin-1), but
Unicode strings are downgraded with UTF-8 encoding.

This paragraph is confusing. I have a vague idea what the author wanted
to say but even then it's not quite correct. I doubt somebody can
understand this paragraph unless they already exactly understood the
problems before.

This happens because the first 256 code points in Unicode happen
to agree with Latin-1.

If encoding is unknown, it's treated as latin1, even if it's not.

This has nothing to do with "use utf8" and "use encoding". The
"implicit upgrading" which is mentioned here happens (for example) when
you concatenate a byte string to a character string. But then the result
*is* a character string, not a byte string.

Byte strings are *not* implicitly assumed to be ISO-8859-1, as you can
easily check by matching against a character class:

% perl -le '$_ = "\x{FC}"; print /\w/ ? "yes" : "no"'
no
% perl -le '$_ = "\x{FC}"; utf8::upgrade($_); print /\w/ ? "yes" : "no"'
yes

So, in a byte string the code point 0xFC does not count as a word
character, but in a character string it does. If byte strings were
assumed to be ISO-8859-1, then 0xFC would be a word character, so
obviously it isn't. Instead, byte strings are assumed to be some
superset of US-ASCII:

% perl -le '$_ = "\x{6C}"; print /\w/ ? "yes" : "no"'
yes

0x6C is a letter ("l") in ASCII, but 0xFC isn't (ASCII defines only
0x00-0x7F).

(I hear that somebody's working to change this to reduce the differences
in behaviour between byte and character strings)

But it didn't.

It does for me. If I change "use encoding 'ISO-8859-7'" to "use utf8"
in my ISO-8859-7 encoded file, I get a lot of warnings.
You want to say C<"\x{C0}\x{B0}"> is well-formed UTF-8?

Sort of: It decodes cleanly to U+0030. But the canonical (shortest)
encoding of U+0030 is "\x{30}", and UTF-8 generating programs MUST
always produce the canonical encoding. UTF-8 consuming programs should
complain if they encounter a non-canonical encoding. Perl behaves a bit
weirdly here: It doesn't complain when it reads the string, but it does
complain on some operations on it, e.g. ord(). I consider that a bug.
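
Encode can be used to check this (my sketch, not hp's code): the strict
"UTF-8" codec (with the hyphen) rejects the overlong form outright.

use Encode qw(decode);

my $ok = eval { decode("UTF-8", "\xC0\xB0", Encode::FB_CROAK); 1 };
print $ok ? "accepted\n" : "rejected: $@";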

Have you ever seen a program text where tokens are a mix of ASCII and
non-ASCII characters? I have.

I usually stick to using English names for my subs and variables. But if
I was using German names I might as well use umlauts. Mathematical
symbols might also be handy. I would have a problem if my colleague used
Chinese, though ;-).

(I already wanted to use € in a variable name (it contained a monetary
amount in Euro), but € isn't a word character. OTOH, $ isn't either, so
I guess that's fair.)


Quoting C<perldoc utf8>

Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8. The utility functions described below
are directly usable without "use utf8;".

I believe I already said that once or twice in this thread.

My understanding of "script" is the program text outside of any quotes in
it.

Bullshit. A script is the complete program text, including any string
constants, numeric constants, comments, the __DATA__ stream, if any.
Why would a string constant in a script not be part of it?

But... here be dragons...

{3335:27} [0:0]$ echo 'фыва' | xxd
0000000: d184 d18b d0b2 d0b0 0a .........
{3356:28} [0:0]$ echo 'фыва' | recode utf8..ucs-2-internal |xxd
0000000: 4404 4b04 3204 3004 0a00 D.K.2.0...
{3414:29} [0:1]$ perl -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'

You've mixed up the endianness. 'ф' is U+0444, not U+4404.

Yes, my fault. And why did you skip the next line? It behaves the same
way with the endianness fixed.

You mean:

{3415:30} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.

That doesn't fix the endianness, and it behaves completely differently.
"perl -Mencoding=ucs2" can't work, as I already explained to sln.

hp
 
sln

Code is always nice because it is unambiguous (unlike the English
language). However, keep in mind that this is a discussion group, not a
code repository. Any code example longer than 50 lines or so is unlikely
to be read.
hp

Hey Peter,

Thanks for your insight. Any other Unicode gurus out there who would like to steer me
in a more right direction would be greatly appreciated.

I didn't want to start a new thread, so this is a branch of a recent one.
I finally have a grasp of encodings, albeit a small but significant glimmer.

My goal is to convert 32-bit binary data to Unicode characters that can
be used as both the data and the pattern in regular expression matching.

I believe I am halfway there, with the binary to Unicode (Perl internal utf8) conversion,
which at least will do (I think) from 0..10FFFF (hex). We need 0..(2**32 - 1), but the
references I looked at are pretty old (@2002) so this may be possible now.
Formulating a regex via sprintf of binary may be a little more involved, but I am
encouraged by the code below. The data side of the regex is not a problem. And if
the data conversion to strings is not a problem, then it's not going to be a problem
constructing pattern strings.

A sample is below. Note that some of the characters in the utf8 string output translate to
'?' characters in my Agent reader.
TIA.

-sln

==========================================================================================
## fr7.pl
##
## References: UTF-32 - http://www.unicode.org/reports/tr19/tr19-9.html (@2002)
## (todo add more.. the links are endless)
##

use warnings;
use strict;

use Encode;
#use Encode::Unicode; # strange, 'decode()' is not found, thought it might be a base class, but not so.

binmode STDOUT, ':utf8';

# This is a shortened sample to test we get back what we put in. At the least we want to get an
# uninterrupted output from 0..10FFFF (hex). We need 0..(2**32 - 1), don't know how but this is a start.
# my @ar = (ord('a'),20000,20001,20002,20003,20004,20005,23336,20007,20008,20009,30000);

# Supersize it, inject some non-utf-8 (that's why we are using utf-32) between code points to test
# conversion. Hey, it works! We get back what we put in.
my @ar = (ord('a'),0 .. 300);

# ----------------------------

print "\n";
print "Numeric array:\n@ar\n";

my $str = "\x{ff}\x{fe}\x{0}\x{0}" . pack 'L*', @ar;
# my $str = "\x{0}\x{0}\x{fe}\x{ff}".$str; # Apparently, the endianness is different on my machine.
# This is going to be an issue since we need strings.
print "\n";
dumpstr_d ("Packed 'L*' string with UTF-32 BOM prepended - ", $str);

print "\n";
$str = decode("UTF-32", $str);
dumpstr_d ("Decoded UTF-32. String - ", $str);

print "\n";
print "utf8 string:\n", $str ,"\n";

# ----------------------------

sub dumpstr_h {
    my ($comment,$s) = @_;
    print "(Hex Dump) $comment";
    print utf8::is_utf8($s) ? "char" : "byte";
    print "[", length($s), "]";
    print ":\n";
    for (split //, $s) {
        printf "%#02x ", ord($_);
    }
    print "\n";
}

sub dumpstr_d {
    my ($comment,$s) = @_;
    print "(Decimal Dump) $comment";
    print utf8::is_utf8($s) ? "char" : "byte";
    print "[", length($s), "]";
    print ":\n";
    for (split //, $s) {
        printf "%d ", ord($_);
    }
    print "\n";
}

__END__

c:\temp>perl fr7.pl

Numeric array:
97 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 8
2 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
287 288 289 290 291 292 293 294 295 296 297 298 299 300

(Decimal Dump) Packed 'L*' string with UTF-32 BOM prepended - byte[1212]:
255 254 0 0 97 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0
0 0 8 0 0 0 9 0 0 0 10 0 0 0 11 0 0 0 12 0 0 0 13 0 0 0 14 0 0 0 15 0 0 0 16 0
0 0 17 0 0 0 18 0 0 0 19 0 0 0 20 0 0 0 21 0 0 0 22 0 0 0 23 0 0 0 24 0 0 0 25 0
0 0 26 0 0 0 27 0 0 0 28 0 0 0 29 0 0 0 30 0 0 0 31 0 0 0 32 0 0 0 33 0 0 0 34
0 0 0 35 0 0 0 36 0 0 0 37 0 0 0 38 0 0 0 39 0 0 0 40 0 0 0 41 0 0 0 42 0 0 0 43
0 0 0 44 0 0 0 45 0 0 0 46 0 0 0 47 0 0 0 48 0 0 0 49 0 0 0 50 0 0 0 51 0 0 0 5
2 0 0 0 53 0 0 0 54 0 0 0 55 0 0 0 56 0 0 0 57 0 0 0 58 0 0 0 59 0 0 0 60 0 0 0
61 0 0 0 62 0 0 0 63 0 0 0 64 0 0 0 65 0 0 0 66 0 0 0 67 0 0 0 68 0 0 0 69 0 0 0
70 0 0 0 71 0 0 0 72 0 0 0 73 0 0 0 74 0 0 0 75 0 0 0 76 0 0 0 77 0 0 0 78 0 0
0 79 0 0 0 80 0 0 0 81 0 0 0 82 0 0 0 83 0 0 0 84 0 0 0 85 0 0 0 86 0 0 0 87 0 0
0 88 0 0 0 89 0 0 0 90 0 0 0 91 0 0 0 92 0 0 0 93 0 0 0 94 0 0 0 95 0 0 0 96 0
0 0 97 0 0 0 98 0 0 0 99 0 0 0 100 0 0 0 101 0 0 0 102 0 0 0 103 0 0 0 104 0 0 0
105 0 0 0 106 0 0 0 107 0 0 0 108 0 0 0 109 0 0 0 110 0 0 0 111 0 0 0 112 0 0 0
113 0 0 0 114 0 0 0 115 0 0 0 116 0 0 0 117 0 0 0 118 0 0 0 119 0 0 0 120 0 0 0
121 0 0 0 122 0 0 0 123 0 0 0 124 0 0 0 125 0 0 0 126 0 0 0 127 0 0 0 128 0 0 0
129 0 0 0 130 0 0 0 131 0 0 0 132 0 0 0 133 0 0 0 134 0 0 0 135 0 0 0 136 0 0 0
137 0 0 0 138 0 0 0 139 0 0 0 140 0 0 0 141 0 0 0 142 0 0 0 143 0 0 0 144 0 0 0
145 0 0 0 146 0 0 0 147 0 0 0 148 0 0 0 149 0 0 0 150 0 0 0 151 0 0 0 152 0 0 0
153 0 0 0 154 0 0 0 155 0 0 0 156 0 0 0 157 0 0 0 158 0 0 0 159 0 0 0 160 0 0 0
161 0 0 0 162 0 0 0 163 0 0 0 164 0 0 0 165 0 0 0 166 0 0 0 167 0 0 0 168 0 0 0
169 0 0 0 170 0 0 0 171 0 0 0 172 0 0 0 173 0 0 0 174 0 0 0 175 0 0 0 176 0 0 0
177 0 0 0 178 0 0 0 179 0 0 0 180 0 0 0 181 0 0 0 182 0 0 0 183 0 0 0 184 0 0 0
185 0 0 0 186 0 0 0 187 0 0 0 188 0 0 0 189 0 0 0 190 0 0 0 191 0 0 0 192 0 0 0
193 0 0 0 194 0 0 0 195 0 0 0 196 0 0 0 197 0 0 0 198 0 0 0 199 0 0 0 200 0 0 0
201 0 0 0 202 0 0 0 203 0 0 0 204 0 0 0 205 0 0 0 206 0 0 0 207 0 0 0 208 0 0 0
209 0 0 0 210 0 0 0 211 0 0 0 212 0 0 0 213 0 0 0 214 0 0 0 215 0 0 0 216 0 0 0
217 0 0 0 218 0 0 0 219 0 0 0 220 0 0 0 221 0 0 0 222 0 0 0 223 0 0 0 224 0 0 0
225 0 0 0 226 0 0 0 227 0 0 0 228 0 0 0 229 0 0 0 230 0 0 0 231 0 0 0 232 0 0 0
233 0 0 0 234 0 0 0 235 0 0 0 236 0 0 0 237 0 0 0 238 0 0 0 239 0 0 0 240 0 0 0
241 0 0 0 242 0 0 0 243 0 0 0 244 0 0 0 245 0 0 0 246 0 0 0 247 0 0 0 248 0 0 0
249 0 0 0 250 0 0 0 251 0 0 0 252 0 0 0 253 0 0 0 254 0 0 0 255 0 0 0 0 1 0 0 1
1 0 0 2 1 0 0 3 1 0 0 4 1 0 0 5 1 0 0 6 1 0 0 7 1 0 0 8 1 0 0 9 1 0 0 10 1 0 0
11 1 0 0 12 1 0 0 13 1 0 0 14 1 0 0 15 1 0 0 16 1 0 0 17 1 0 0 18 1 0 0 19 1 0 0
20 1 0 0 21 1 0 0 22 1 0 0 23 1 0 0 24 1 0 0 25 1 0 0 26 1 0 0 27 1 0 0 28 1 0
0 29 1 0 0 30 1 0 0 31 1 0 0 32 1 0 0 33 1 0 0 34 1 0 0 35 1 0 0 36 1 0 0 37 1 0
0 38 1 0 0 39 1 0 0 40 1 0 0 41 1 0 0 42 1 0 0 43 1 0 0 44 1 0 0

(Decimal Dump) Decoded UTF-32. String - char[302]:
97 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 8
2 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
287 288 289 290 291 292 293 294 295 296 297 298 299 300

utf8 string:
a ????
?¤????¶§?????????? !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]
^_`abcdefghijklmnopqrstuvwxyz{|}~¦-Ç-ü-é-â-ä-à-å-ç-ê-ë-è-ï-î-ì-Ä-Å-É-æ-Æ-ô-ö-ò-û
-ù-ÿ-Ö-Ü-¢-£-¥-P-ƒ-á-í-ó-ú-ñ-Ñ-ª-º-¿-¬-¬-½-¼-¡-«-»-¦-¦-¦-¦-¦-¦-¦-+-+-¦-¦-+-+-+-+
-++Ç+ü+é+â+ä+à+å+ç+ê+ë+è+ï+î+ì+Ä+Å+É+æ+Æ+ô+ö+ò+û+ù+ÿ+Ö+Ü+¢+£+¥+P+ƒ+á+í+ó+ú+ñ+Ñ+ª
+º+¿+¬+¬+½+¼+¡+«+»+¦+¦+¦+¦+¦+¦+¦+++++¦+¦++++++++++-Ç-ü-é-â-ä-à-å-ç-ê-ë-è-ï-î-ì-Ä
-Å-É-æ-Æ-ô-ö-ò-û-ù-ÿ-Ö-Ü-¢-£-¥-P-ƒ-á-í-ó-ú-ñ-Ñ-ª-º-¿-¬-¬-½-¼
 
Peter J. Holzer

Regexes in 'split' can be done on a stream; for example:

open (SMTPD, "$out_name") or return undef;
undef $/;
my ($header, $body) = split (/\n{2,}/, <SMTPD>, 2);

That reads the complete contents from the stream into a
(temporary) scalar and then passes the scalar to split. The split
function still works on this string, not a stream.

awk does apply a regexp to a stream in one specific instance: You can
specify the record separator as a regexp and getline will read from the
stream until it finds a match (or EOF). perldoc perlvar mentions this:

Remember: the value of $/ is a string, not a regex. awk has to
be better for something. :)

A more general example would be the code generated by lex: You describe
your tokens with regexps and the generated code reads one token after
the other from the stream by applying those regexps to the stream.
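
A rough sketch of that lex-style loop (mine, with a hypothetical input
file): keep a buffer, match token regexps anchored at its start, and
refill from the stream whenever no complete token is available yet.

use strict;
use warnings;

open my $fh, '<', 'input.txt' or die $!;
my ($buf, $eof) = ('', 0);
while (length $buf || !$eof) {
    $eof = 1 unless sysread($fh, $buf, 4096, length $buf);
    # consume tokens from the front; a token ending flush with the
    # buffer may continue in the next chunk, so hold it until EOF
    while ($buf =~ /\A(\w+|\s+|[^\s\w])/ && ($eof || $+[0] < length $buf)) {
        print "token: <$1>\n";
        substr($buf, 0, $+[0], '');
    }
}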

hp
 
Eric Pozharski

I've thought a lot. I should admit, whenever I see C<use 'utf8';>
instead of C<use encoding 'utf8';> I'm going nuts. Look at what we've
got here

*SKIP*
Let's compare 4 programs, which are all essentially the same:
*SKIP*
The differences are in the encoding of the source file (UTF-8 vs.
ISO-8859-7) and the line marked "use XXX ###" above.

1) encoded in UTF-8, contains "use utf8;"
prints:

char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5
*SKIP*
3) encoded in ISO-8859-7, contains "use encoding 'ISO-8859-7';"
prints:

char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5
*SKIP*

And with C<use encoding 'utf8';> you'll get the same character string,
and lots of other useful stuff. (I just can't get why anyone would need
implicit upgrade of scalars into characters and yet then maintain wide
IO.) But my point isn't that F<encoding.pm> outperforms F<utf8.pm>.
I'm scared. I consider F<utf8.pm> a kind of Pandora's box. Read this, if
you can

проц запросить {
    мое ($имяфайла) = @_;
    если (существует $ЗАГАЛ{$имяфайла}) {
        вернуть 1 если $ЗАГАЛ{$имяфайла};
        прекратить "Сбой компиляции в запросить";
    }
    мое ($настоящийфайл,$результат);
    ИТЕР: {
        длякаждого $префикс (@ЗАГАЛ) {
            $настоящийфайл = "$префикс/$имяфайла";
            если (-ф $настоящийфайл) {
                $ЗАГАЛ{$имяфайла} = $настоящийфайл;
                $результат = делать $настоящийфайл;
                последний ИТЕР;
            }
        }
        прекратить "$имяфайла не найдено в \@ЗАГАЛ";
    }
    если ($@) {
        $ЗАГАЛ{$имяфайла} = неопред;
        прекратить $@;
    } другое (!$результат) {
        удалить $ЗАГАЛ{$имяфайла};
        прекратить "не ИСТИНА возвращена из $имяфайла";
    } иначе {
        вернуть $результат;
    }
}

I admit, it's impossible to write this with F<utf8.pm> alone
(F<overload.pm> comes to mind; however, I can't comment on this, I haven't
used it). I looked for simple yet rich code, and found this piece
the most telling. I bet you've seen this before; you use it constantly.
Yet can you name it?

Someone could say: "Who the heck would need that stupidity?" Idiots. It
still surprises me how many idiots are around. They would scream:
"Look! What cool stuff! I have to learn nothing!"

You can say: "Eric, what strange stuff do you smoke? That's
impossible." I think you're wrong. I've come to the conclusion
(overoptimistic?) that the idiots around you are the same as those around
me. So they would scream. (BTW, I don't smoke, I pipe "Prima optima
light".)

[ Lots of irrelevant stuff below, can easily be skipped ]

*SKIP*
This paragraph is confusing. I have a vague idea what the author wanted
to say but even then it's not quite correct. I doubt somebody can
understand this paragraph unless they already exactly understood the
problems before.



This has nothing to do with "use utf8" and "use encoding". The
"implicit upgrading" which is mentioned here happens (for example) when
you concatenate a byte string to a character string. But then the result
*is* a character string, not a byte string.

Byte strings are *not* implicitely assumed to be ISO-8859-1, as you can
easily check by matching against a character class:

% perl -le '$_ = "\x{FC}"; print /\w/ ? "yes" : "no"'
no
% perl -le '$_ = "\x{FC}"; utf8::upgrade($_); print /\w/ ? "yes" : "no"'
yes

For those unaware of what happened:

perl -MDevel::Peek -wle '
$_ = "\x{FC}";
Dump $_;
utf8::upgrade($_);
Dump $_;
print' | recode latin1..utf8
SV = PV(0x92556d0) at 0x9280470
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x926ca60 "\374"\0
  CUR = 1
  LEN = 4
SV = PV(0x92556d0) at 0x9280470
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x9269068 "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 3
ü

*SKIP*
It does for me. If I change "use encoding 'ISO-8859-7'" to "use utf8"
in my ISO-8859-7 encoded file, I get a lot of warnings.

Yes, it does. Since I typed the examples on the command line, I'd gone with
those hex escapes. They don't warn. If B<perl> finds *bytes* with the high
bit set (so they aren't valid utf8) while being in any kind of utf8 encoding
mode, then it really complains (and complains a lot).

*SKIP*
I believe I already said that once or twice in this thread.



Bullshit. A script is the complete program text, including any string
constants, numeric constants, comments, the __DATA__ stream, if any.
Why would a string constant in a script not be part of it?

Yes, I should agree that "script" in general means this. That's my
understanding of what was said (or meant) here by this word.

*SKIP*
You mean:

{3415:30} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.

That doesn't fix the endianness, and it behaves completely differently.
"perl -Mencoding=ucs2" can't work, as I already explained to sln.

This fixes endianness?

{56061:37} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{0444}\x{044b}\x{0432}\x{0430}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.

However, since I don't understand why it "can't work", I won't complain
about why it can't "locate object method".

[ Lots of irrelevant stuff above, can easily be skipped ]

However, in spite of confessing to being scared of F<utf8.pm> features, I
promise to rant anytime I find C<use utf8;> instead of
C<use encoding 'utf8';>
 
Peter J. Holzer

I've thought a lot. I should admit, whenever I see C<use 'utf8';>
instead of C<use encoding 'utf8';> I'm going nuts.

I'm going nuts when I see "use encoding". It does way too many magic
things and most of them only half-baked. Here be dragons! Don't use that
stuff, if you value your sanity.

And with C<use encoding 'utf8';> you'll get the same character string,
and lots of other useful stuff.

Correction: Lots of stuff which looks useful at first glance but which
works in subtly different ways than you expect (and some stuff which you
simply don't expect). "use utf8" OTOH does only one thing, and it does
it well.
(I just can't get why anyone would need
implicit upgrade of scalars into characters and yet then maintain wide
IO.) But my point isn't that F<encoding.pm> outperforms F<utf8.pm>.
I'm scared. I consider F<utf8.pm> a kind of Pandora's box. Read this, if
you can

проц запросить {
мое ($имяфайла) = @_; [...]
}

I admit, it's impossible to write this with F<utf8.pm> alone

Right. "sub" still is "sub", not "проц", and "my" is still "my", not
"мое". Your example is more like a Russian(?) equivalent to
Lingua::Romana::perligata.

And frankly, "проц запросить" is only marginally less readable to me
than "sub zaprosit". I need a dictionary for both, and "запросить" at
least has the advantage that I can actually find it in a Russian
dictionary :). If you want your software to be maintainable by authors
from other countries, stick to English and write "sub request". If you
want to use Russian names you might as well go all the way and use
cyrillic letters instead of a transliteration into latin letters which
neither those who speak Russian nor those who don't speak Russian
understand easily.

I bet you've seen this before,

I've seen German versions of BASIC in the 1980s. They weren't a huge
success, to put it mildly. About the only successful localization of a
programming language I can think of is MS-Excel (and I guess it's
debatable whether this is even a programming language (without VBA) - is it
Turing-complete?).

Someone could say "Who the heck would need that stupidity?" Idiots. It
still surprises me how many idiots are around. They would scream:
"Look! What a cool stuff! I have to learn nothing!"

Idiots indeed, if they think learning a few dozen keywords is the
hardest part in learning a programming language. In fact I think the
main reason why localized programming languages are so unpopular is that
people figure out that it doesn't make any difference whether you
declare a lexical variable with "my", "мое", or "mein": You have to
learn the keyword anyway and you have to learn what it means and how to
use it.

Yes, it does. Since I typed the examples on the command line, I'd gone with
those hex escapes. They don't warn.

Of course not. They are just US-ASCII, which is a subset of UTF-8 and
ISO-8859-x.
If B<perl> finds *bytes* with the high bit set (so they aren't valid utf8)
while being in any kind of utf8 encoding mode then it really complains
(and complains a lot).
Yup.



Yes, I should agree that "script" in general means this. That's my
understanding of what was said (or meant) here by this word.

I don't see how you get that understanding. Clearly string constants in
a script *are* interpreted as utf8 if you use "use utf8".

You mean:

{3415:30} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.

That doesn't fix the endianness, and it behaves completely differently.
"perl -Mencoding=ucs2" can't work, as I already explained to sln.

This fixes endianness?

{56061:37} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{0444}\x{044b}\x{0432}\x{0430}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.

Yes, that fixes the endianness.

However, since I don't understand why it "can't work",

It can't work because -Mencoding=ucs2 says that the source code is
encoded in ucs2. So your script would have to begin with the byte
sequence

FE FF 00 70 00 72 00 69 00 6e 00 74 00 20 ...

to be interpreted as "print ...". But it doesn't (and it can't because
you cannot pass arguments with embedded null bytes in Unix).
It begins with

70 72 69 6e 74 20

which doesn't seem to be anything useful in UCS2 (U+2074 is SUPERSCRIPT
FOUR, but the rest is unassigned in both big and little endian).
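
One can check that arithmetic with Encode (my one-liner, reading the bytes
of "print " as little-endian UTF-16):

$ perl -MEncode -le 'printf "U+%04X\n", ord $_ for split //, decode("UTF-16LE", "print ")'
U+7270
U+6E69
U+2074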

It becomes worse if you use "use encoding 'ucs2';" inside a script:
You would have to start the script in US-ASCII so that "use encoding
'ucs2';" is recognized and then switch to UCS2: So you need to mix two
different incompatible encodings in the same script: Good luck finding
an editor which supports this. And you don't have to, anyway, because if
you want to encode your scripts in UTF-16, you can just do it and perl
will notice it automatically (but Unix won't recognize the hashbang any
more, so you don't want to do this on Unix - you might on windows,
though).

hp
 
sln

[snip]
My goal is to convert 32-bit binary data to Unicode characters that can
be used as both the data and the pattern in regular expression matching.

I believe I am halfway there, with the binary to Unicode (Perl internal utf8) conversion,
which at least will do (I think) from 0..10FFFF (hex). We need 0..(2**32 - 1), but the
references I looked at are pretty old (@2002) so this may be possible now.

This should do it. Values from 0..(2**32 - 1) are indeed possible, since UTF-32
translates into utf-8 code points (1-6 bytes). I let it run for a while from different
points. Works fine.
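
For what it's worth (my check, not part of sln's script), Perl's chr()
itself accepts the full 32-bit range, so the top end of the round trip
can be verified without pack/decode at all:

use strict;
use warnings;
no warnings 'utf8';                # values past 0x10FFFF warn otherwise

my $c = chr(0xFFFF_FFFF);          # 2**32 - 1
printf "%u\n", ord $c;             # prints 4294967295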

Regular expressions work fine as well. I now see this as a good tool that can be
used in various categories: statistics, number series, and others.

Enjoy!

-sln

-----------------------------

## fr10.pl
## --
## Tests cases 1,2.
## 32-bit binary pattern searching using Regexp.
##

use warnings;
use strict;

use Encode;

binmode STDOUT, ':utf8';

#use constant BOM32 => "\x{ff}\x{fe}\x{0}\x{0}"; # L.Endian
use constant BOM32 => 65279;

# :: TEST 1 - just print each array utf32 value with its decoded utf8 ord.

my @adata1 = (ord('a'), 0 .. 300, BOM32);
my $str;

for (@adata1)
{
$str = decode("UTF-32", pack 'L*', (BOM32, $_));
# print "$_ ",ord( $str ),"\n";
}

# :: TEST 2 - Generate multiple regexps with utf32 decoded chars from the array.
# - Apply them to a decoded string made up of the array.
# - This is just to test how the range of all numbers might behave as decoded chars in regexps.
# - In reality a single regexp would probably be applied across the whole string globally,
# advancing pos() to $-[0] +1 after each match.

my @adata2 = (120000,21,22,23,24,25,26,27,28,29,ord('a'),31, 55..300);
$str = decode("UTF-32", pack 'L*', (BOM32, @adata2));

print "\nstr = ",$str,"\nlength = ",length($str),"\n";

foreach my $val (@adata2)
{
    # alternative pattern, identical result
    # my @tt = map {quotemeta} split //, decode("UTF-32", pack 'L*', (BOM32, $val, $val+1, $val+2, $val+7, $val+10));
    # my $pattern = sprintf "(%s%s%s).{0,5}([%s-%s])", @tt;

    my @tt = map {quotemeta} split //, decode("UTF-32", pack 'L*', (BOM32, $val, $val+2, $val+7, $val+10));
    my $pattern = sprintf "([%s-%s]{3}).{0,5}([%s-%s])", @tt;

    print "\nval = $val, params = @tt, pattern = $pattern\n";

    if ( $str =~ /($pattern)/ )
    {
        print "matched:\n";
        print " 1 = '@{[getordsplit ($1)]}', length = " . length($1) . "\n";
        print " 2 = '@{[getordsplit ($2)]}', length = " . length($2) . "\n";
        print " 3 = '@{[getordsplit ($3)]}', length = " . length($3) . "\n";
    }
}

# split string, return array
sub getordsplit { return map {ord $_} split //, shift if @_; ()}
sub getcharsplit { return split //, shift if @_; ()}

__END__
 
Eric Pozharski

I'm going nuts when I see "use encoding". It does way too many magic
things and most of them only half-baked. Here be dragons! Don't use that
stuff, if you value your sanity.

I'm puzzled. However, I should admit that I have yet to find those dark
corners of C<use encoding 'utf8';>.
And with C<use encoding 'utf8';> you'll get the same character string,
and lots of other useful stuff.

Correction: Lots of stuff which looks useful at first glance but which
works in subtly different ways than you expect (and some stuff which you
simply don't expect). "use utf8" OTOH does only one thing, and it does
it well.
(I just can't get why anyone would need
implicit upgrade of scalars into characters and yet then maintain wide
IO.) But my point isn't that F<encoding.pm> outperforms F<utf8.pm>.
I'm scared. I consider F<utf8.pm> a kind of Pandora's box. Read this, if
you can

проц запросить {
мое ($имяфайла) = @_; [...]
}

I admit, it's impossible to write this with F<utf8.pm> alone

Right. "sub" still is "sub", not "проц", and "my" is still "my", not
"мое". Your example is more like a Russian(?) equivalent to
Lingua::Romana::perligata.

And frankly, "проц запросить" is only marginally less readable to me
than "sub zaprosit". I need a dictionary for both, and "запросить" at
least has the advantage that I can actually find it in a Russian
dictionary :). If you want your software to be maintainable by authors
from other countries, stick to English and write "sub request". If you
want to use Russian names you might as well go all the way and use
cyrillic letters instead of a transliteration into latin letters which
neither those who speak Russian nor those who don't speak Russian
understand easily.

So you suggest that localizing Perl (or actually any other language) is a
kind of conspiracy by online dictionary providers? I didn't think of it this
way; I should consider it.
I've seen German versions of BASIC in the 1980s. They weren't a huge
success, to put it mildly. About the only successful localization of a
programming language I can think of is MS-Excel (and I guess it's
debatable whether this is even a programming language (without VBA) - is it
Turing-complete?).

That's in case you have an option. There are places where you have no option.
Idiots indeed, if they think learning a few dozen keywords is the
hardest part in learning a programming language. In fact I think the
main reason why localized programming languages are so unpopular is that
people figure out that it doesn't make any difference whether you
declare a lexical variable with "my", "мое", or "mein": You have to
learn the keyword anyway and you have to learn what it means and how to
use it.

Then keep a dictionary handy, while they are cheap.


*SKIP*
It can't work because -Mencoding=ucs2 says that the source code is
encoded in ucs2. So your script would have to begin with the byte
sequence

FE FF 00 70 00 72 00 69 00 6e 00 74 00 20 ...

to be interpreted as "print ...". But it doesn't (and it can't because
you cannot pass arguments with embedded null bytes in Unix).
It begins with

70 72 69 6e 74 20

which doesn't seem to be anything useful in UCS2 (U+2074 is SUPERSCRIPT
FOUR, but the rest is unassigned in both big and little endian).

It becomes worse if you use "use encoding 'ucs2';" inside a script:
You would have to start the script in US-ASCII so that "use encoding
'ucs2';" is recognized and then switch to UCS2: So you need to mix two
different incompatible encodings in the same script: Good luck finding
an editor which supports this. And you don't have to, anyway, because if
you want to encode your scripts in UTF-16, you can just do it and perl
will notice it automatically (but Unix won't recognize the hashbang any
more, so you don't want to do this on Unix - you might on windows,
though).

Ouch. Shame on me.

I don't do Windows.
 
Peter J. Holzer

I'm puzzled. However, I should admit that I have yet to find those dark
corners of C<use encoding 'utf8';>.

I found quite a few of them, but that was in the early times of 5.8.x,
so some of them may have been bugs which have since been fixed and some
of them may have been caused by my lack of understanding how the (at the
time) new "utf8" strings were supposed to work. However, I think that
encoding.pm (the module itself and its documentation) added
significantly to my confusion instead of lessening it. It took me some
time to figure things out.

But a couple of points which immediately come to mind while scrolling
through perldoc encoding:

* encoding automatically pushes an :encoding() layer on STDIN and
STDOUT, but not on STDERR. Why not?

* And why does it do that at all? The user of the script may use a
completely different locale than the author, so mixing source code
encoding with runtime I/O conversion just isn't a good idea.
(and "use encoding ':locale'" is just a disaster waiting to happen,
but at least the pod warns you)

* All strings are converted, even if they contain \x escapes. So
"\xF1\xD1\xF1\xCC" is a string with *two* characters if "encoding
'euc-jp'" is in effect. This makes it quite hard to write scripts
which work on binary data.

* utf8::upgrade and utf8::downgrade aren't symmetric.

I fail to see how utfizing both literals and symbols makes F<utf8.pm>
do one thing. I don't say that it doesn't do it well.

It doesn't utf8ize "both literals and symbols", it simply reads the
complete source text in character mode (with :encoding(utf8)). A very
simple operation, and everything else is a logical consequence of
that. So "\x{D0}\x{BF}" is still a string of 2 bytes (because it
reads "double-quote backslash open-brace capital D 0
close-brace backslash ..." and that parses to two bytes by the normal
perl parsing rules), but "п" parses as one character because that reads
"double-quote cyrillic-small-letter-pe double-quote". And a variable $п
is allowed because п is a letter and Perl allows letters, digits and the
underscore in variable names. No surprises there.
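
Assuming a UTF-8 encoded source file, a four-line illustration of that
difference (my example):

use utf8;
my $escaped = "\x{D0}\x{BF}";   # two characters, U+00D0 and U+00BF
my $literal = "п";              # one character, U+043F
print length($escaped), " ", length($literal), "\n";   # prints "2 1"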

(I just can't get why anyone would need
implicit upgrade of scalars into characters and yet then maintain wide
IO.) But my point isn't that F<encoding.pm> outperforms F<utf8.pm>.
I'm scared. I consider F<utf8.pm> a kind of Pandora's box. Read this, if
you can

проц запросить {
мое ($имяфайла) = @_; [...]
}

I admit, it's impossible to write this with F<utf8.pm> alone

Right. "sub" still is "sub", not "проц", and "my" is still "my", not
"мое". Your example is more like a Russian(?) equivalent to
Lingua::Romana::perligata.

And frankly, "проц запросить" is only marginally less readable to me
than "sub zaprosit". I need a dictionary for both, and "запросить" at
least has the advantage that I can actually find it in a Russian
dictionary :). If you want your software to be maintainable by authors
from other countries, stick to English and write "sub request". If you
want to use Russian names you might as well go all the way and use
cyrillic letters instead of a transliteration into latin letters which
neither those who speak Russian nor those who don't speak Russian
understand easily.

So you suggest that localizing Perl (or actually any other language) is a
kind of conspiracy by online dictionary providers?

No, not at all. What I am saying is that people will use localized
variable and subroutine names, and write comments in their native
language, no matter if the programming language makes it easy or not.
Sometimes this is because they don't speak English too well, sometimes
it's because the problem domain is language-specific (for example, when
I was a teaching assistant at the university, we often wrote programs to
help in some administrative task - students here are identified by a
"Matrikelnummer". According to the dictionary, that's "enrollment
number" in English, but it simply didn't make sense to use a variable
$enrollment_number, because *nobody* who had to maintain that program
would know what an "enrollment number" is, but everybody would know what
$matrikelnummer means (you simply can't study or work at an Austrian
university without knowing that)). Now, using German variable names is
simple: We use only 7 non-ASCII letters, and there are standardized and
well-known transliterations for all of them. So it isn't a great
inconvenience if we need to write $uebung instead of $übung. But in
Russian (or Greek, or Japanese, or Hindi, ...) the entire Alphabet is
non-ASCII, and there are usually many transliteration systems. So if a
Russian (or Greek, or Japanese, or Indian, ...) programmer wants to use
some word from their native tongue, they need to choose one
transliteration which is not natural to them and may be almost
unreadable for the next programmer who has to maintain it.

Since this is simply a fact of life, it is much better to allow
non-English languages to be written in their native alphabet than to
force transliteration into US-ASCII. At least then programs written in
Russian are readable to people speaking Russian instead of being
unreadable for everybody except the programmer.

That's in case you have an option. There are places where you have no option.

I don't understand your objection. I was relating the historic fact that
localized programming languages (i.e., programming languages where the
keywords (and to a lesser amount, the grammar) were localized, so that
you would write "wenn ... dann ... sonst ..." instead of "if ... then
... else") were a failure. People had the option to use them, but they
didn't want to.


hp
 
Eric Pozharski

I found quite a few of them, but that was in the early times of 5.8.x,
so some of them may have been bugs which have since been fixed and some
of them may have been caused by my lack of understanding how the (at the
time) new "utf8" strings were supposed to work. However, I think that
encoding.pm (the module itself and its documentation) added
significantly to my confusion instead of lessening it. It took me some
time to figure things out.

But a couple of points which immediately come to mind while scrolling
through perldoc encoding:

* encoding automatically pushes an :encoding() layer on STDIN and
STDOUT, but not on STDERR. Why not?

That puzzles me too.
* And why does it do that at all? The user of the script may use a
completely different locale than the author, so mixing source code
encoding with runtime I/O conversion just isn't a good idea.
(and "use encoding ':locale'" is just a disaster waiting to happen,
but at least the pod warns you)

Mostly, "Why not?" is a bad idea.
* All strings are converted, even if they contain \x escapes. So
"\xF1\xD1\xF1\xCC" is a string with *two* characters if "encoding
'euc-jp'" is in effect. This makes it quite hard to write scripts
which work on binary data.

That's really a challenge

perl -Mencoding=utf8 -MDevel::Peek -wle '
$x = "\x{AB}\x{DC}";
Dump $x'
SV = PV(0x96aa6d0) at 0x96c63e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x96c1a88 "\357\277\275\357\277\275"\0 [UTF8 "\x{fffd}\x{fffd}"]
  CUR = 6
  LEN = 8

Or not

perl -Mencoding=utf8 -MDevel::Peek -wle '
open my $y, "<", "../foo.bravo" or die $!;
$x = <$y>;
Dump $x' <~/foo.bravo
SV = PV(0x95396d0) at 0x95dc230
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x9550378 "\306\331\327\301\n"\0
  CUR = 5
  LEN = 80

(just for reference)

{62509:74} [0:0]$ xxd ~/foo.bravo
0000000: c6d9 d7c1 0a .....
* utf8::upgrade and utf8::downgrade aren't symmetric.

I've noted, but... F<encoding.pm> is wrong exactly how?

*SKIP*
It doesn't utf8ize "both literals and symbols", it simply reads the
complete source text in character mode (with :encoding(utf8)). A very
simple operation, and everything else is a logical consequence of
that. So "\x{D0}\x{BF}" is still a string of 2 bytes (because it
reads "double-quote backslash open-brace capital D 0
close-brace backslash ..." and that parses to two bytes by the normal
perl parsing rules), but "п" parses as one character because that reads
"double-quote cyrillic-small-letter-pe double-quote". And a variable $п
is allowed because п is a letter and Perl allows letters, digits and the
underscore in variable names. No surprises there.

So F<utf8.pm> utfizes symbols by accident. At least that wasn't the
intention.

In that context I understand "symbol" this way (quoting F<perlglossary.pod>)

symbol
Generally, any "token" or "metasymbol". Often used more
specifically to mean the sort of name you might find in a
"symbol table".

*SKIP*
No, not at all. What I am saying is that people will use localized
variable and subroutine names, and write comments in their native
language, no matter if the programming language makes it easy or not.
Sometimes this is because they don't speak English too well, sometimes
it's because the problem domain is language-specific (for example, when

Then they must. I don't say "should". I'm unaware of any other
language that nicely fits in 7 bits. Calling it "US-ASCII" is pure accident.
That seemingly contradicts my point about having an option. Yes, but there
must be something common for all. By accident -- it's English.
I was a teaching assistant at the university, we often wrote programs to
help in some administrative task - students here are identified by a
"Matrikelnummer". According to the dictionary, that's "enrollment
number" in English, but it simply didn't make sense to use a variable
$enrollment_number, because *nobody* who had to maintain that program
would know what an "enrollment number" is, but everybody would know what
$matrikelnummer means (you simply can't study or work at an Austrian
university without knowing that)). Now, using German variable names is
simple: We use only 7 non-ASCII letters, and there are standardized and
well-known transliterations for all of them. So it isn't a great
inconvenience if we need to write $uebung instead of $übung. But in
Russian (or Greek, or Japanese, or Hindi, ...) the entire Alphabet is
non-ASCII, and there are usually many transliteration systems. So if a
Russian (or Greek, or Japanese, or Indian, ...) programmer wants to use
some word from their native tongue, they need to choose one
transliteration which is not natural to them and may be almost
unreadable for the next programmer who has to maintain it.

So you have no real experience with an alphabetical mix. My point isn't
a language mix; I have no problem with this. A symbol is no more than a
letter sequence (+digits, +underscore). If someone is going to fight
some code (be it written by anyone), then this one must understand
the problem. Reading variable names is the least problem then.
Since this is simply a fact of life, it is much better to allow
non-English languages to be written in their native alphabet than to
force transliteration into US-ASCII. At least then programs written in
Russian are readable to people speaking Russian instead of being
unreadable for everybody except the programmer.

And they aren't anyway.
I don't understand your objection. I was relating the historic fact that
localized programming languages (i.e., programming languages where the
keywords (and to a lesser amount, the grammar) were localized, so that
you would write "wenn ... dann ... sonst ..." instead of "if ... then
... else") were a failure. People had the option to use them, but they
didn't want to.

Then read it again (maybe my English failed this time, again). Those
provided with germanized BASIC (is it right?) had an option. The option to
reject it. Sometimes there's no option. You just don't know what it
means to have no option.
 
Peter J. Holzer

I found quite a few of them, but that was in the early times of 5.8.x,
so some of them may have been bugs which have since been fixed and some
of them may have been caused by my lack of understanding how the (at the
time) new "utf8" strings were supposed to work. However, I think that
encoding.pm (the module itself and its documentation) added
significantly to my confusion instead of lessening it. It took me some
time to figure things out.

But a couple of points which immediately come to mind while scrolling
through perldoc encoding:
[...]
* utf8::upgrade and utf8::downgrade aren't symmetric.

I've noted, but... F<encoding.pm> is wrong exactly how?

It isn't "wrong". It is documented. But it is surprising and illogical
behaviour. A source of subtle bugs because the programmer most likely
won't think of that. And I think encoding.pm is full of such crannies. I
learned to avoid it pretty quickly.

*SKIP*

So F<utf8.pm> utfizes symbols by accident. At least that wasn't the
intention.

I'm not sure what you mean by "utfize", but if you mean: "Symbols can
contain all Unicode letters and digits, not just A-Z, a-z, 0-9", then
that's quite intentional, not an accident. But it is a logical
consequence of interpreting the source code as a sequence of Unicode
characters instead of a sequence of ASCII characters. So, as a
programmer you don't have to remember that "use utf8" decodes string
constants from UTF-8 *and* that it allows all Unicode letters and digits
in symbols *and* that the DATA stream has an ':encoding(utf8)' layer
*and* whatever else may be affected. You have to remember one thing
only: Your source code consists of Unicode characters encoded in UTF-8
(or UTF-EBCDIC). Period. Nothing else. Clean and simple.
*SKIP*

Then they must. I don't say "should". I'm unaware of any other
language that nicely fits in 7 bits. Calling it "US-ASCII" is pure accident.

You are contradicting yourself. First you say that English is the only
language that fits nicely into US-ASCII, then you say that US-ASCII is
called US-ASCII by accident. It isn't. US-ASCII was developed by an
American institute to write English texts. It is no accident at all that
it only contains characters which are frequently used in English
(technical) texts. And it is no accident that it is called ASCII -
"American Standard Code for Information Interchange". The US- in front
is somewhat redundant, but there were a lot of variants of ASCII (e.g., the
ISO-646 encodings), so that serves as a reminder that this is indeed the
original American version of the American code.
That seemingly contradicts my point about having an option. Yes, but there
must be something common for all. By accident -- it's English.

Yes. English. Not ASCII. If you write Russian in ASCII I understand it
just as little as if you write it in Cyrillic.

If you can write your programs in English, please do. Especially if you
plan to make it open source. Almost every programmer on the world has at
least a basic grasp of English. But if for some reason you have to write
in Russian, then I think you should use the Cyrillic alphabet, not the
Latin alphabet. That will make it easier for those who understand
Russian and even for those who don't (because then at least they can
paste the stuff into an online dictionary and get a translation).

So you have no real experience with an alphabetical mix.

No, for several reasons:

* I only speak German and English. Both use the Latin alphabet.
(German uses a handful of extra letters)
* Most open source software is in English, and our own software is in
English, too.
* Until a few years ago, most programming languages were US-ASCII only
(I think Java was the first popular programming language which
defined a larger source character set, except for specialized
languages like APL - that was in 1996).
My point isn't a language mix; I have no problem with this.

I have. A program where all the identifiers, comments etc. are written
in Portuguese or Polish is hard to figure out if you don't speak the
language. That they use the Latin alphabet doesn't help much (except
that I have an intuitive (though very probably wrong) idea of how to
pronounce them).
A symbol is no more than a
letter sequence (+digits, +underscore). If someone is going to fight
some code (be it written by anyone), then that person must understand
the problem. Reading variable names is the least of the problems then.

Reading variable names (and subroutine names, and comments) is the
*first* step to figuring out what a program does. The most effective
step in an obfuscation program is to replace all identifiers with
meaningless identifiers like 'a0001', 'a0002', etc. and to strip
comments. Everything else is easily reverted.

Then read it again (maybe my English failed this time, again). Those
who were offered a germanized version (is that the right word?) had an
option. The option to reject it. Sometimes there's no option. You just
don't know what it means to have no option.

I didn't mean that every single programmer had this option. If you work
in a shop which writes software in FORTRAN-77 (I was talking about the
1980's, remember) you don't have the option to choose your programming
language. C or COBOL is just as unavailable as a germanized version of
FORTRAN.

But the industry as a whole had the option, and it rejected it (with the
single exception of spreadsheet formula languages). The industry still
has the option. There are new scripting languages all the time, and
every few years one of them becomes really popular. So introducing
new languages in general is still possible. But all the popular
programming languages are based on English. Obviously there is no need
to localize the few dozen keywords - even if you don't speak English at
all, learning what "if" and "sub" do is not a problem (and the latter
isn't a proper English word anyway, so the English speaker has to learn
it as well).

hp
 
E

Eric Pozharski

It isn't "wrong". It is documented. But it is surprising and illogical
behaviour. A source of subtle bugs because the programmer most likely
won't think of that. And I think encoding.pm is full of such crannies. I
learned to avoid it pretty quickly.

OK, let's leave it as a point about F<encoding.pm>'s unnamed dragons.

*SKIP*
I'm not sure what you mean by "utfize", but if you mean: "Symbols can
contain all Unicode letters and digits, not just A-Z, a-z, 0-9", then
that's quite intentional, not an accident. But it is a logical
consequence of interpreting the source code as a sequence of Unicode
characters instead of a sequence of ASCII characters. So, as a
programmer you don't have to remember that "use utf8" decodes string
constants from UTF-8 *and* that it allows all Unicode letters and digits
in symbols *and* that the DATA stream has an ':encoding(utf8)' layer

Which seems to be undocumented, BTW. However, after your explanation, I
think that it can't be any other way.
*and* whatever else may be affected. You have to remember one thing
only: Your source code consists of Unicode characters encoded in UTF-8
(or UTF-EBCDIC). Period. Nothing else. Clean and simple.

I wasn't talking about what to remember. I'm talking about "doing one
thing". I think that neither F<utf8.pm> nor F<encoding.pm> does one thing.

You are contradicting yourself. First you say that English is the only
language that fits nicely into US-ASCII, then you say that US-ASCII is
called US-ASCII by accident. It isn't. US-ASCII was developed by an
American institute to write English texts. It is no accident at all that
it only contains characters which are frequently used in English
(technical) texts. And it is no accident that it is called ASCII -
"American Standard Code for Information Interchange". The US- in front
is somewhat redundant, but there were a lot of variants of ASCII (e.g., the
ISO-646 encodings), so it serves as a reminder that this is indeed the
original American version of the American code.

(maybe I wasn't verbose enough this time) English fits in a 7-bit
encoding, whatever encoding it would have been. It could have been any
other encoding (I did some reading on ASCII history (yes, I know
Wikipedia is a vague source)). It could not have been any other language.
Yes. English. Not ASCII. If you write Russian in ASCII, I understand it
just as little as if you write it in Cyrillic.

If you can write your programs in English, please do. Especially if you

That "if" (the latter one) is somewhat offending.
plan to make it open source. Almost every programmer in the world has at

That "open" is somewhat offending.
least a basic grasp of English. But if for some reason you have to write

"Quotation needed (tm)". Or define "programmer".
in Russian, then I think you should use the Cyrillic alphabet, not the
Latin alphabet. That will make it easier for those who understand
Russian and even for those who don't (because then at least they can
paste the stuff into an online dictionary and get a translation).

I beg to differ. I have no problem understanding what code does
(compared to what it was supposed to do, as described in comments and
symbol names) as long as any reasonable block of code fits on screen.
When it doesn't -- I become way slow.

*SKIP*
I have. A program where all the identifiers, comments etc. are written
in Portuguese or Polish is hard to figure out if you don't speak the
language. That they use the Latin alphabet doesn't help much (except
that I have an intuitive (though very probably wrong) idea how to
pronounce them).

And here we have another difference between us. I look inside others'
code mostly when I have problems with it, and sometimes when
documentation is incomplete, or seemingly wrong, or there's no
documentation at all. I don't look inside out of pure curiosity. And
you know what? I bet you know. There are no comments.

So if someday I come across comments written in Chinese, or Turkish, or
Portuguese, or whatever else, I'll just pretend there are no comments
(as always). But I'm trying to imagine what I would do if the code were
written in Korean. Maybe someday, when F<utf8.pm> makes its way to the
masses.

OK, read this (it depends on your context, of course; it's possible
you would get it even without I<-Mstrict> or I<-Mwarnings>):

perl -Mutf8 -le '
print "vvv";
@OEM = qw/ 1 2 3 /;
print "@ОЕМ";
print "^^^";'
vvv

^^^
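
(A clarifying note of mine, not from the original exchange: the @OEM
being assigned is spelled with Latin letters, while the @ОЕМ being
printed is Cyrillic О, Е, М -- a different, never-assigned array, hence
the empty line between vvv and ^^^. Under I<-Mstrict> the Cyrillic array
would not even compile; the failure would look something like:

perl -Mutf8 -Mstrict -le 'my @OEM = qw/ 1 2 3 /; print "@ОЕМ";'
Global symbol "@ОЕМ" requires explicit package name at -e line 1.)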


*SKIP*
I didn't mean that every single programmer had this option. If you work
in a shop which writes software in FORTRAN-77 (I was talking about the
1980's, remember) you don't have the option to choose your programming
language. C or COBOL is just as unavailable as a germanized version of
FORTRAN.

But the industry as a whole had the option, and it rejected it (with the
single exception of spreadsheet formula languages). The industry still

Watch what you're saying. Industry, community, society, whatever you
call it -- it isn't just a mix of protoplasm.
has the option. There are new scripting languages all the time, and
every few years one of them becomes really popular. So introducing
new languages in general is still possible. But all the popular
programming languages are based on English. Obviously there is no need
to localize the few dozen keywords - even if you don't speak English at
all, learning what "if" and "sub" do is not a problem (and the latter
isn't a proper English word anyway, so the English speaker has to learn
it as well).

(I'm still unclear) I know what it means -- having no option. At
all.

p.s. Are we still talking Perl?
 
S

sln

Peter has just pointed out that utf8.pm does exactly one thing: it
pushes a :utf8 (*not* :encoding(utf8): this is important, as it means
your source mustn't contain invalid sequences) layer on the DATA
filehandle at the point it is called. Everything else that happens is
simply a natural consequence of that.
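
A minimal sketch of the practical difference between the two layers (my
example; the file name is invented). The :utf8 layer merely flags the
incoming bytes as UTF-8 without validating them, while :encoding(utf8)
routes them through Encode, which complains about malformed sequences:

use strict;
use warnings;

open my $trusting, '<:utf8', 'data.txt' or die $!;            # no validation
open my $checked, '<:encoding(utf8)', 'data.txt' or die $!;   # warns on bad bytes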


Admittedly I don't speak it, but I'm fairly sure modern Greek would fit
into 7 bits (if one didn't need to also encode Roman characters).


How so? Those planning to release as open source need to be more
careful to make their code comprehensible to people they've never met.
Someone writing internal proprietary code for a company where $Language
is spoken can reasonably assume any maintenance programmer will speak
$Language; this doesn't apply to open-source code.


Name two languages whose primary documentation isn't in English. (Two
because I can name one: Ruby. Even so, I would wager most Ruby code is
written in English.)

Ben

Un-fucking-real. Any fucking thing this complicated should be fucking
removed from ANY admittance to, and support from, the language.

Perl should absolutely be ashamed of itself. After reading all this bullshit,
I'm very ready to piss off from Perl forever. Apparently, it's a fucked-up
language that requires incomprehensible iterations in its usage.

Makes me sick.
-sln
 
E

Eric Pozharski

Peter has just pointed out that utf8.pm does exactly one thing: it
pushes a :utf8 (*not* :encoding(utf8): this is important, as it means
your source mustn't contain invalid sequences) layer on the DATA
filehandle at the point it is called. Everything else that happens is
simply a natural consequence of that.

Then what do I do: do I turn the key, or do I open the door?

*SKIP*
How so? Those planning to release as open source need to be more
careful to make their code comprehensible to people they've never met.
Someone writing internal proprietary code for a company where $Language
is spoken can reasonably assume any maintenance programmer will speak
$Language; this doesn't apply to open-source code.

I just distinguish "open" and "free".
Name two languages whose primary documentation isn't in English. (Two
because I can name one: Ruby. Even so, I would wager most Ruby code is
written in English.)

РАЯ and РАПИРА. I had thought about ФОКАЛ too, but just discovered that
while its documentation (the documentation reachable for me) was in pure
Russian (while the language itself was pure English), it happened to be
copied(?) (stolen(?) or bought(?)) from FOCAL (of 1969). OTOH, the
implementation was surely broken; so either that was a reimplementation
(in which case it constitutes a stand-alone language) or the БК-0010 was
actually a PDP-8? Puzzled.

And one more thing: both languages have their syntax russianized, but
the symbols aren't. Weird.
 
P

Peter J. Holzer

OK, let's leave it as a point about F<encoding.pm>'s unnamed dragons.

*SKIP*

Which seems to be undocumented, BTW. However, after your explanation, I
think that it can't be any other way.


I wasn't talking about what to remember. I'm talking about "doing one
thing". I think that neither F<utf8.pm> nor F<encoding.pm> does one thing.

What it does and what I have to remember are the same thing at the
interface level. I don't have to know how it is implemented. "use utf8"
may be implemented by sending carrier pigeons to the Oracle of Delphi,
for all I care. But what "use utf8" *does* in an observable manner is
exactly one thing: It turns my source code from a byte stream into a
character stream (encoded in UTF-8).

(maybe I wasn't verbose enough this time) English fits in a 7-bit
encoding, whatever encoding it would have been. It could have been any
other encoding (I did some reading on ASCII history (yes, I know
Wikipedia is a vague source)). It could not have been any other language.

German fits very nicely into 7 bits, too (we have 7 more letters, but
who needs an @, a \ or three sets of brackets? Or 33 control
characters?). So does Russian or Greek (if you only care about the
Russian language, you need only the Cyrillic alphabet, not both the
Cyrillic and the Latin alphabet).
That "if" (the latter one) is somewhat offending.

Firstly, this was a generic "you", I wasn't speaking about you
personally. But even if I was, I don't see why that should be offensive.
You were asserting several times that sometimes there is no choice. So
why do you think it is offensive if I agree that sometimes there is no
choice? If your employer or client insists on the local language, you
can't use English. The only option you have then is to quit or reject
the contract.

That "open" is somewhat offending.

Again, I don't see why. Proprietary software is likely to be maintained
by programmers who speak the same language as the original programmer
(especially if the programmer was forced to use this language by company
policy). Free software OTOH will be maintained by people all over the
world.

"Quotation needed (tm)". Or define "programmer".

Anybody who writes programs of more than trivial complexity on a regular
basis.

And here we have another difference between us. I look inside others'
code mostly when I have problems with it, and sometimes when
documentation is incomplete, or seemingly wrong, or there's no
documentation at all.

So do I.
I don't look inside out of pure curiosity. And
you know what? I bet you know. There are no comments.

But the subroutine and variable names are almost always at least related
to their meaning. Sometimes they are too short and cryptic, and
sometimes they are misleading, but usually you get a good impression of
what the programmer was trying to achieve from the identifiers alone,
without analysing the algorithm in detail. Just try one of those code
obfuscators which turn all subroutine names into "s0001",
"s0002", etc. and all variable names into "v0001", "v0002", etc., and
then try to understand the program. It is possible, of course, but it is
a *lot* harder.
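
A contrived Perl illustration of that point (the names are invented for
the purpose): the two subroutines below implement the same algorithm,
but only one of them tells you what it is for:

use strict;
use warnings;

# readable: the identifiers carry the intent
sub monthly_interest { my ($balance, $rate) = @_; return $balance * $rate / 12 }

# obfuscated: the same code after mechanical renaming
sub s0001 { my ($v0001, $v0002) = @_; return $v0001 * $v0002 / 12 }

print monthly_interest(1200, 0.05), "\n";   # 5
print s0001(1200, 0.05), "\n";              # also 5 -- but who knew?
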
OK, read this (it depends on your context, of course; it's possible
you would get it even without I<-Mstrict> or I<-Mwarnings>):

perl -Mutf8 -le '
print "vvv";
@OEM = qw/ 1 2 3 /;
print "@ОЕМ";
print "^^^";'
vvv

^^^

This is deliberately obfuscated.
(I'm still unclear)

Yes.

hp
 
E

Eric Pozharski

Anybody who writes programs of more than trivial complexity on a regular
basis.

Agreed. And being paid or not is orthogonal to this definition, which I
agree with too.

I've thought a lot about this whole thread (again). If you had gone
with a quotation, there would have been almost nothing to say. But
you went with a definition. Fair enough. But then I have a
problem with the conclusion.

Consider this. You've met someone; he doesn't speak your languages, and
you don't speak his. How are you going to find out whether he's a
programmer, a barber, or the president of some banana republic? You have
no means.

I assure you, there are lots of programmers who speak only those
languages they speak outside of programming. There are those who speak
only one language, the one they've spoken from birth. (personal
experience) For 10 years I studied French at school; I know for sure
there are 10 verbs that in Passé Composé are conjugated with the verb
"aller"(?) while the others go with the verb "etre" (spelling is
incorrect); and I don't know that list; and I never did. After 10 years.
And what is even more surprising, there are those who teach English and
don't speak it either. Or any other language. I've met them.

And there are lots of them. Once I spotted a message with a statement I
remember to this day: "NTU KPI -- 11 thousand cretins yearly" (I can't
find that message; Google fails even to find phrases that I saw
yesterday). I don't think there are 11 thousand of them, that they all
are cretins, or that they are even cretins at all. That's panic. We just
believe in some "famous" high-quality education. It was never even of
mediocre quality. But it's easier to believe than to face the truth.

Look, there's a culture where learning is suspicious. Where being
smart (not even brilliant) is an insult. We've just gotten used to it.

But then, if your statement were based on a quote, that would be a
problem with the source. But that conclusion is yours. How did you
manage to arrive at that idea?

I've thought a lot. The only way I've found is based on the statistical
observation that almost all (maybe without the "almost") of those you
know who fit the definition speak English (they don't have a "basic
grasp", they really speak it fluently). And then you expand that
observation over the whole Universe. Doesn't that seem insane?

It's not.

I've just gotten used to those arguments for the sake of argument
itself. Almost every such argument ends with the statement: "Eric! Shut
up! Do what I said!" (BTW, it was never expressed quite this way.)

I give up. I have nothing left to learn from this thread.

*CUT*
 
