UTF-8 module

Jerry Maguire

Hi,
Is there any module to encode a URL into UTF-8 format? I have looked at CPAN
and could not find anything.
At the moment I am using the MIME::Base64 Perl module to do URL
encoding/decoding, but I also need UTF-8 encoding/decoding.


Thanks
Jerry
 
Bob Walton

Jerry said:
Hi,
Is there any module to encode a URL into UTF-8 format? I have looked at CPAN
and could not find anything.
At the moment I am using the MIME::Base64 Perl module to do URL
encoding/decoding, but I also need UTF-8 encoding/decoding.


Thanks
Jerry

UTF-8 support is native in Perl 5.8. See:

perldoc perluniintro

and

perldoc perlunicode
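
For a quick feel of what the native support looks like, here is a minimal sketch using the core Encode module (bundled with Perl 5.8). The sample string is just an illustration, not anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string containing one non-ASCII character (U+00F1).
my $chars = "se\x{F1}or";

# encode() turns characters into UTF-8 octets; decode() reverses it.
my $octets = encode('UTF-8', $chars);
my $back   = decode('UTF-8', $octets);

# The one non-ASCII character takes two octets in UTF-8.
printf "%d characters, %d octets\n", length($chars), length($octets);
print "round trip ok\n" if $back eq $chars;
```

Note that this only converts between characters and octets; turning those octets into %HH escapes for a URL is a separate step.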
 
pkent

Presumably what's needed is that every character whose ord() value is
above 127 decimal needs to be subjected to some kind of unpack() or
sprintf to yield the relevant number of %HH%HH... encoded ASCII
<snip>
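
A rough sketch of that idea, assuming the usual approach of encoding to UTF-8 octets first and then escaping each octet with sprintf. The helper name is hypothetical, and the set of characters left unescaped (the RFC 3986 "unreserved" set) is my own choice for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a character string as UTF-8 octets.
sub uri_escape_utf8_sketch {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);
    # Leave unreserved ASCII characters alone; escape every other octet.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

print uri_escape_utf8_sketch("se\x{F1}or manuel"), "\n";
# prints se%C3%B1or%20manuel
```

For real use, the URI::Escape module on CPAN provides uri_escape_utf8, which is probably a better starting point than a hand-rolled regex.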

Another thing I should point out is this code in CGI/Util.pm:


sub unescape {
....
$todecode =~ s/%(?:([0-9a-fA-F]{2})|u([0-9a-fA-F]{4}))/
defined($1)? chr hex($1) : utf8_chr(hex($2))/ge;
....
(from http://search.cpan.org/src/JHI/perl-5.8.0/lib/CGI/Util.pm )

This implies to me that escapes of the form:

name=%u2112;age=23

will be accepted by CGI.pm and decoded as utf-8 2-octet sequences. I
can't find this documented anywhere though, after a brief search.
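
To see what such an escape would decode to, here is a small sketch with the same regex shape as the snippet above. It is not the CGI::Util code: the helper name is hypothetical, and plain chr() stands in for CGI::Util's own utf8_chr(), whose exact behaviour I have not checked.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: decode %HH and %uHHHH escapes into characters.
sub unescape_sketch {
    my ($s) = @_;
    $s =~ s/%(?:([0-9a-fA-F]{2})|u([0-9a-fA-F]{4}))/
        defined($1) ? chr(hex($1)) : chr(hex($2))/ge;
    return $s;
}

my $value = unescape_sketch('%u2112');
printf "U+%04X, %d octet(s) as UTF-8\n",
    ord($value), length(encode('UTF-8', $value));
# prints U+2112, 3 octet(s) as UTF-8
```

So at least under these assumptions, U+2112 comes out as three octets in UTF-8, not two; the octet count depends on the code point.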

P
 
Julian Mehnle

pkent said:
The problem I can foresee tripping people up is when you're encoding a
character into a 2+ octet representation and one octet happens to be
equal to a 'special' character, e.g. an '='.

This won't happen, since every octet that is part of a UTF-8 multibyte
character sequence is >= 0x80. That's the beauty of UTF-8. ;-)
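
That property is easy to check for a few samples. A minimal sketch, assuming the core Encode module; the three code points are arbitrary picks that need 2, 3 and 4 octets respectively.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points needing 2, 3 and 4 octets in UTF-8.
my @samples = (0xF1, 0x2112, 0x1F600);

my $all_high = 1;
for my $cp (@samples) {
    my @octets = unpack('C*', encode('UTF-8', chr($cp)));
    # An octet below 0x80 could be mistaken for '=', '&', '%' and so on.
    $all_high = 0 if grep { $_ < 0x80 } @octets;
    printf "U+%04X => %s\n", $cp,
        join(' ', map { sprintf '%02X', $_ } @octets);
}
print $all_high ? "no octet below 0x80\n" : "found an ASCII-range octet\n";
```

A spot check like this is no proof, of course, but it matches how UTF-8 is defined: lead and continuation octets of a multibyte sequence all have the high bit set.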
 
pkent

Alan J. Flavell said:
On Sat, Jul 26, pkent inscribed on the eternal scroll:
The two formats are evidently designed to be equivalent in some sense,
but there might be subtle issues, I'm not certain. One difference is
the representation of a space ("+" in forms submission format, "%20"
in URIencoding format).
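
The space difference can be sketched as two decoders that differ only in how they treat '+'. Both helper names are hypothetical, and this ignores character encoding entirely; it only shows the escaping rules.

```perl
use strict;
use warnings;

# Hypothetical helper: decode %HH escapes and nothing else.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Hypothetical helper: forms-submission decoding, where '+' means space.
sub form_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;    # must run before %2B is turned back into '+'
    return percent_decode($s);
}

my $raw = 'a+b%20c%2Bd';
print "form: '", form_decode($raw), "'\n";      # form: 'a b c+d'
print "uri:  '", percent_decode($raw), "'\n";   # uri:  'a+b c+d'
```

The order inside form_decode matters: translating '+' after the %HH step would also turn a literal, escaped plus sign into a space.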

AIUI, and ICBW, a '+' in a URL is basically equivalent to a '%20'. That
said, I haven't actually checked the RFCs, but everything I've seen shows
that this is the case. It's a handy special case, though.

In my limited experiments (e.g. submitting an HTML form where I'd typed
in some text with accented chars) IE6 will create a URL like:

http://example.com/cgi-bin/t.pl?spanish=se%C0%A2or%20manuel

I would never use IE as my reference implementation - it deliberately
violates some mandatory requirements of the applicable
specifications.[1]

Neither would I use IE6 as a general reference implementation, but as
most of our users use IE we tend to try things out quite early on it.
Although I didn't mention that Mozilla did the same thing when we tried.
I'm not saying they're right but the behaviour supports the 'high order
char becomes two octets %hh%hh in url' theory.

Oh the fun we've had with perl5.6.1, XML::Parser, high-order characters,
web browsers and CGI.

P
 
Alan J. Flavell

AIUI, and ICBW, a '+' in a URL is basically equivalent to a '%20'.

This is the problem when folks raise issues that are off-topic - and
without consulting the relevant specifications.
That said, I haven't actually checked the RFCs

I can assure you that I did so, directly before posting what I had
said.
Neither would I use IE6 as a general reference implementation, but as
most of our users use IE we tend to try things out quite early on it.

But that wasn't my point, though.
Although I didn't mention that Mozilla did the same thing when we tried.

Sure, and on the whole I'd trust Mozilla to get things right, though
I've seen the occasional unfortunate compromise ("we've got to do it
this way because web pages made for IE look bad otherwise").

But I think I cited enough references to specifications too.
I'm not saying they're right but the behaviour supports the 'high order
char becomes two octets %hh%hh in url' theory.

Yes, but not necessarily two, in general!

Oh the fun we've had with perl5.6.1, XML::Parser, high-order characters,
web browsers and CGI.

More so than with 5.8 ?

cheers
 
