How to decode JavaScript's encodeURIComponent in Perl.


Cloink

This is an answer, not a question. It's an answer for people like me
who struggle with the Perl language and all its myriad idiosyncrasies.

It's simple:-
----------------------
# Encode is in the standard Perl distribution (a core module since 5.8)
use Encode;
# emulate é, e-acute, as encoded in js: encodeURIComponent('é')
my $str = '%C3%A9'; # nothing special about this example
$str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
# DO NOT just "print $str;" - the "decode" is all-important - it comes
# from the "Encode" module that's been "use"d.
print decode('UTF-8',$str);
----------------------

Now that you know, you are very welcome to stop reading. I advise it.

First, beware of documentation that refers to the regexp substitution
WITHOUT the decode. To my pretty little (ASCII) head, once we've done
"$str =~ s//", then $str is a string, so let's print it and look at it.
NO!! (And I never use more than one single exclamation mark.) If you
don't decode() it, it's wrong: after the substitution $str is still a
BYTE stream, and decode() is what turns that byte stream into real
characters instead of a list of single-byte, ASCII-ish ones.
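
To make that concrete, here's a minimal sketch using the same '%C3%A9'
example; the only thing it adds to the snippet above is a look at
length() before and after the decode:
----------------------
use strict;
use warnings;
use Encode;

my $str = '%C3%A9';
$str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

print length($str), "\n";            # 2 -- Perl still sees two separate bytes
my $chars = decode('UTF-8', $str);   # treat those bytes as a UTF-8 byte stream
print length($chars), "\n";          # 1 -- one character, e-acute
----------------------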

I say it's simple. It ain't. It's taken me a lot of heartache to find
someone to speak to me in plain English. The above example will very
probably do the job unless you are dealing with, ooh, I don't know,
this is where my eyes go hazy and my mind wanders to bikini-clad women.
Come on Clark, pull yourself together. I'll say "complicated"
character sets, but I acknowledge that if you *are* dealing with what I
have described as a "complicated" character set, it is almost certainly
very familiar to you and therefore very UN-complicated.

But here's the deal. What on the face of it seems like a simple
conundrum is actually a complicated conundrum, and Dan Kogai has
presumably put an awful lot of work into the Encode package/module
(sorry if "package" and "module" aren't actually interchangeable words,
in my world they are -- or maybe I meant something else again -- you
get the idea).

So use his module. Package. Thing. It works. Almost always, but I
can't guarantee it, and I don't think Dan can either - but not because
he doesn't know what he's doing - read on.

Now, thanks to Mark, here is an enormous amount of description as to
exactly why this seemingly simple subject is a very complicated
subject. (Mark has kindly emailed me and I feel his very useful,
tutoring, you-can-understand-it-if-you-speak-English comments on the
matter deserve a wider audience than moi alone.)

So. I'm going now, the rest of this is Mark's email to me. It really
is very elucidating.

Bye!
From Mark:

Clark, I read you, and to quote a former U.S. president, "I feel your
pain." (in my best Arkansas accent)

I'll look at the documentation for URI::Escape::uri_unescape to see
if it's confused or confusing. It might be an "apples and
oranges" question.

The problem you've found is a tough nut to crack. From the beginning
of time (maybe an exaggeration, at least since the creation of the
Internet), URIs have been an ASCII-only thing. The very name, American
Standard Code for Information Interchange, is insulting to people when
they find they can't use their own native character-set in a URI,
domain name, etc. What kind of "world standard" excludes the
majority of the world's population? I'm not unsympathetic.

So now we have Unicode -- a vastly superior term, to some people,
simply because it does not include the word "American" -- to save
us from ourselves.

Stuff like URI::Escape::uri_unescape was written to support
long-established standards that understand only ASCII. There's a vast
amount of existing software with the same problem. It's not been
updated to allow for Unicode. It's legacy software, but it works, and
it's safe, and secure (mostly). Even with Unicode, the common
language of the Internet name system is still US-ASCII (no pound or
euro characters, please).

Here's the trick... URI encoding is described in RFC 2396 (see
http://rfc.net/rfc2396.html for a printable copy). It defines ASCII as
the valid character set and allows for encoding of single-byte (8 bits)
non-ASCII characters. However, in UTF-8 a single Unicode character can
take two, three, or even four bytes. URI encoding by itself can't handle
that situation; it only knows about individual bytes. Right now, we're
dealing with UTF-8. We can trick URI encoding into thinking it's dealing
with simple single-byte characters when, in fact, the actual characters
are multi-byte Unicode. Here's a web page that
explains UTF-8 in lots of detail.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
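
To see the trick in action, here's a small sketch (escape_bytes is just
an illustrative helper, and it's cruder than encodeURIComponent because
it escapes every byte, ASCII included, but the principle is the same):
----------------------
use strict;
use warnings;
use Encode;

# Percent-encode every byte of the UTF-8 form of a string: one %XX per byte,
# which is how a multi-byte character survives an ASCII-only encoding.
sub escape_bytes {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    $bytes =~ s/(.)/sprintf('%%%02X', ord($1))/seg;
    return $bytes;
}

print escape_bytes("\x{E9}"),   "\n";   # e-acute   -> %C3%A9    (2 bytes)
print escape_bytes("\x{20AC}"), "\n";   # euro sign -> %E2%82%AC (3 bytes)
----------------------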

Let me digress a little bit. Basic e-mail is text only. It doesn't
allow pictures because they're binary data, not simple text. We can
trick e-mail into thinking a picture is text. The result looks
something like this. You may have seen text like this in the
"source" of an e-mail message.

/9j/4AAQSkZJRgABAQEASABIAAD/4SUmRXhpZgAASUkqAAgAAAAVAA8BAgAYAAAACgEAABABAgAM
AAAAIgEAABIBAwABAAAAAQAkmRoBBQABAAAALgEAABsBBQABAAAANgEAACgBAwABAAAAAgBtGDEB
....

As long as the lines stay in the correct order, we can turn the text
back into a picture. Rearrange the lines, and the result is garbage
even if the encoding is correct. URI encoding only works as it was
designed, but we can trick it into thinking Unicode is ASCII.
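
(The gibberish above is Base64. Purely for the curious, the
binary-to-text trick itself is a one-liner with the core MIME::Base64
module; a sketch:)
----------------------
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $binary = "\x00\x01\xFF\xFE";        # a few arbitrary bytes standing in for a picture
my $text   = encode_base64($binary);    # plain ASCII lines, safe for text-only e-mail
print $text;
print "round trip ok\n" if decode_base64($text) eq $binary;
----------------------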

With Unicode, the grouping of the bytes is important. For example,
byte-1 must be grouped with byte-2. Left to its own devices, URI
encoding turns valid Unicode into a sequence of encoded, but ungrouped,
single-byte characters. The grouping of bytes that's required for
valid Unicode is lost. Am I making sense... it is confusing. I guess I
should draw a picture (worth many words). The bottom line is, we have
to keep the Unicode bytes in the right groups and in the right
sequence. URI encoding alone destroys valid Unicode.
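
Here's the euro sign (three UTF-8 bytes) as a concrete sketch: keep the
bytes together, in order, and decode() groups them back into one
character; shuffle them and you get garbage:
----------------------
use strict;
use warnings;
use Encode;

# In order, as a group: three bytes become one character, U+20AC (euro sign)
printf "U+%04X\n", ord(decode('UTF-8', "\xE2\x82\xAC"));

# Same three bytes, wrong order: no longer valid UTF-8, so the decode
# quietly hands back U+FFFD replacement characters instead
my $garbage = decode('UTF-8', "\xAC\xE2\x82");
print $garbage =~ /\x{FFFD}/ ? "garbage\n" : "ok\n";
----------------------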

Ah, ha! (light bulb above my head) I've got an analogy... remember
algebra... (4 * 3) + 5 = 17, is different than 4 * (3 + 5) = 32. The
numbers are the same, the order is the same, but with different
grouping, the results are different. Unicode is like that. Different
groupings result in a different character set or totally invalid
Unicode. There are ambiguous cases that can't be resolved if one
doesn't know the original sequence and grouping.

Unicode has been sneaking up slowly. Surprisingly, it's not a simple
one-to-one correspondence from URI to Unicode. Consider this URI
(Don't go there! Well, you may if you wish, but there's no need),

http://www.bankofamerica.com/

With strict ASCII, there's no problem. However, if someone uses, for
example, a Cyrillic character set to register the same domain, we run
into trouble. Cyrillic 'a' looks like ASCII 'a' when it's in
a browser address line, but the two characters are entirely different
in Unicode. A single character of Unicode could send Bank of America
customers to a web site in the farthest reaches of Siberia.
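
The two letters really are different characters that merely look alike;
a tiny sketch of the code points involved:
----------------------
use strict;
use warnings;

my $latin    = "a";          # U+0061 LATIN SMALL LETTER A
my $cyrillic = "\x{0430}";   # U+0430 CYRILLIC SMALL LETTER A -- looks the same in print

printf "U+%04X vs U+%04X\n", ord($latin), ord($cyrillic);   # U+0061 vs U+0430
print $latin eq $cyrillic ? "same\n" : "different characters\n";
----------------------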

See "The Cyrillic Charset Soup" at
http://czyborra.com/charsets/cyrillic.html for an idea of the scope of
the problem. Unicode is intended to fix this mess, but it's a slow,
evolving process.

In general, here's the scheme of things. It's the order that's
important. Using JavaScript with perl adds an extra layer of
complexity. From the perl point of view (a sketch in Perl follows the list):

1. Get a fully encoded string
2. Decode URI encoding
3. Decode Unicode with error checking
4. If you expect a particular character set at this point, do a
sanity check
5. Be happy
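
Here's the Perl sketch promised above, covering steps 1 to 3
(decode_uri_component is just an illustrative name, not a standard API;
note that encodeURIComponent encodes a space as %20, not '+', so no
'+'-to-space step is needed). The sanity check of step 4 is sketched a
little further down:
----------------------
use strict;
use warnings;
use Encode;

sub decode_uri_component {
    my ($encoded) = @_;                  # 1. the fully encoded string from JavaScript

    (my $octets = $encoded) =~           # 2. undo the URI (percent) encoding
        s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

    # 3. decode the UTF-8 byte stream, dying loudly on malformed input
    return decode('UTF-8', $octets, Encode::FB_CROAK);
}

my $text = decode_uri_component('%C3%A9');   # e-acute
# 4. sanity-check $text here if you expect a particular character set
# 5. be happy
----------------------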


If someone wants to slip in a bogus URI, it will most likely include
obscure Unicode characters or entirely invalid Unicode. This is where
David's comments apply. I called it a sanity check. David explained
the same thing in a different way. The Cyrillic character problem with
the letter 'a' is the tip of the iceberg. This is where the trouble
begins. From a pragmatic point of view, it's *probably* sufficient to
error check and do a minimal sanity check if you're working with
URIs. If you're working with form data or text in general, you need
more sanity checking.
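
What the sanity check looks like depends entirely on what you're
expecting; here's a minimal sketch, assuming (purely as an illustration)
that the decoded text is supposed to be ordinary Latin-script text plus
digits and light punctuation:
----------------------
use strict;
use warnings;

# Illustrative whitelist only: Latin-script letters, digits, whitespace and
# a little punctuation. Widen or narrow it to match what you actually expect.
sub looks_sane {
    my ($text) = @_;
    return $text =~ /\A[\p{Latin}0-9\s.,;:'"!?()\/-]*\z/;
}

print looks_sane("caf\x{E9}")   ? "ok\n" : "suspicious\n";   # ok
print looks_sane("b\x{0430}nk") ? "ok\n" : "suspicious\n";   # suspicious: Cyrillic 'a'
----------------------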

Going the other way, from valid Unicode to a fully encoded URI, looks
like this (again, a sketch follows the list):

1. Get a string assuming it's in the correct character set, i.e.,
valid Unicode
2. Encode Unicode with error checking for good measure
3. Encode with URI encoding
4. Pray!
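
And the sketch for this direction (encode_uri_component is again just an
illustrative name; if your URI::Escape is recent enough, uri_escape_utf8
does much the same job):
----------------------
use strict;
use warnings;
use Encode;

sub encode_uri_component {
    my ($chars) = @_;                    # 1. a string of (supposedly valid) characters

    # 2. character string -> UTF-8 byte string, dying on anything unencodable
    my $octets = encode('UTF-8', $chars, Encode::FB_CROAK);

    # 3. percent-encode every byte except RFC 2396's "unreserved" characters,
    #    the same set encodeURIComponent leaves alone
    $octets =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf('%%%02X', ord($1))/eg;
    return $octets;
}

print encode_uri_component("caf\x{E9}"), "\n";   # caf%C3%A9
# 4. pray (or, better, test)
----------------------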


It should work every time, but... If you need to depend on a valid URI
that can't be spoofed, do some error checking. See:

http://search.cpan.org/~dankogai/Encode-2.18/Encode.pm#Handling_Malformed_Data
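
The short version of that page: by default, decode() quietly turns
anything malformed into U+FFFD, the replacement character, which is no
help when you're checking untrusted input; pass Encode::FB_CROAK and it
dies instead, which eval turns into a failure you can act on. A sketch:
----------------------
use strict;
use warnings;
use Encode;

my $truncated = "\xC3";    # a lone lead byte: malformed (cut-off) UTF-8

# Default behaviour: no error, the bad byte silently becomes U+FFFD
printf "U+%04X\n", ord(decode('UTF-8', $truncated));

# Strict behaviour: dies, so the failure can be trapped and handled
my $checked = eval { decode('UTF-8', $truncated, Encode::FB_CROAK) };
print defined $checked ? "ok\n" : "rejected: $@";
----------------------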

There are some Unicode "gotchas" lurking in perl. Before perl
version 5.6, Unicode was a patch-up job. From version 5.6 onward (and
especially from 5.8), Unicode is well supported. In most cases, perl is
smart enough to do
the right thing; it makes the right guess; it does what most people
would expect. I suppose it's intuitive in that way. For the complete
story, check "perldoc perlunicode".
 

Cloink

[The only reason for this reply is so that the post also gets hits if
you search for "encodeURI" (without the "Component" on the end). I
hope, but I can't tell 'til I post it then google-group it.]
 

John Bokma

Cloink said:
[The only reason for this reply is so that the post also gets hits if
you search for "encodeURI" (without the "Component" on the end). I
hope, but I can't tell 'til I post it then google-group it.]

Don't do that.
 
