Efficient URL-decoding.

P

Peter Jansson

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific


//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
std::eek:stringstream L;
for(std::string::size_type x=0;x<l.size();++x)
switch(l[x])
{
case('+'):
{
L<<' ';
break;
}
case('%'): // Convert all %xy hex codes into ASCII characters.
{
const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
x+=2; // Skip over hex.
if(hexstr=="26" || hexstr=="3D")
// Do not alter URL delimeters.
L<<'%'<<hexstr;
else
{
std::istringstream hexstream(hexstr);
int hexint;
hexstream>>std::hex>>hexint;
L<<static_cast<char>(hexint);
}
break;
}
default: // Copy anything else.
{
L<<l[x];
break;
}
}
return L.str();
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------



Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?


Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
 
P

Phlip

Peter said:
Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?

Where are its unit tests? And note that refactors which clean the code up
will often squeeze out flab, too...
 
R

rossum

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.
You need to check that there is enough space left after the % sign and
that l[x + 1] and l[x + 2] are both hex digits. Use isxdigit(char) to
test them.

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = Mc^2.
2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',
Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.
3. Code use iostreams, so it's locale specific
Remove the stringstreams.


As a complete alternative you could use a single replace() method with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
char newChar) { ... }

std::string URLdecode(const std::string& URLin)
std::string URLout(URLin);
replace(URLout, "+", ' ');
replace(URLout, "%20", ' ');
replace(URLout, "%2f", '/');
replace(URLout, "%2F", '/');
replace(URLout, "%3a", ':');
replace(URLout, "%3A", ':');
replace(URLout, "%3f", '?');
replace(URLout, "%3F", '?');

// %25 = '%' last to avoid problems like %252f
replace(URLout, "%25", '%');

return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).

rossum
//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
std::eek:stringstream L;
for(std::string::size_type x=0;x<l.size();++x)
switch(l[x])
{
case('+'):
{
L<<' ';
break;
}
case('%'): // Convert all %xy hex codes into ASCII characters.
{
const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
x+=2; // Skip over hex.
if(hexstr=="26" || hexstr=="3D")
// Do not alter URL delimeters.
L<<'%'<<hexstr;
else
{
std::istringstream hexstream(hexstr);
int hexint;
hexstream>>std::hex>>hexint;
L<<static_cast<char>(hexint);
}
break;
}
default: // Copy anything else.
{
L<<l[x];
break;
}
}
return L.str();
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------



Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?


Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
 
P

Peter Jansson

rossum said:
Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

You need to check that there is enough space left after the % sign and
that l[x + 1] and l[x + 2] are both hex digits. Use isxdigit(char) to
test them.

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = Mc^2.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.

3. Code use iostreams, so it's locale specific

Remove the stringstreams.


As a complete alternative you could use a single replace() method with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
char newChar) { ... }

std::string URLdecode(const std::string& URLin)
std::string URLout(URLin);
replace(URLout, "+", ' ');
replace(URLout, "%20", ' ');
replace(URLout, "%2f", '/');
replace(URLout, "%2F", '/');
replace(URLout, "%3a", ':');
replace(URLout, "%3A", ':');
replace(URLout, "%3f", '?');
replace(URLout, "%3F", '?');

// %25 = '%' last to avoid problems like %252f
replace(URLout, "%25", '%');

return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).

rossum

Thank you for your input. Well I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?

Any more ideas/comments?

Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/


//------------- code begins ------------------------------
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
std::string URLout;
const char* in(URLin.c_str());
for(;*in;in++)
{
if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
{
if(*in=='+')
URLout+=' ';
else
URLout+=*in;
}
else // Convert all %xy hex codes into ASCII characters.
{
if( (in[1]=='2' && in[2]=='6')
|| (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
{ // Do not alter URL delimeters.
URLout+='%';
URLout+=in[1];
URLout+=in[2];
}
else
URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
in+=2;
}
}
return URLout;
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------
 
?

=?iso-8859-1?q?Kirit_S=E6lensminde?=

Peter said:
Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific

URI encoding and decoding is a really tricky issue, quite apart from
the issues in the code that you have (that others have already helped
with). I think maybe I can shed some light on your third question
though. It's all a little off topic for this forum, but important. I
hope everybody indulges me :)

The URI is split into several parts, but for HTTP(S) the only parts
that will be encoded are the file specification and the query string
(if present). Note that they are encoded _differently_. They are
seperated by a single question mark (?).

If you are decoding somebody else's URI then stop right now. You don't
know enough about it to be able to work it out. By all means give it a
go, but don't expect that it will make any sense and _don't_ manipulate
it before using it. You will break something.

The file specfication (which is what it looks like you're decoding) can
be in any locale. These days people are tending to drift towards UTF-8,
but it's by no means universal. If it's a URI you've encoded yourself
then you should know which locale you used.

For the query string, the format is described in the HTML specs, _but_
only browsers have to use that format when they create a query string
from a form submission using GET. Any other URI creation can be
different and in fact the W3C recommends a slightly different format
that doesn't cause common entity problems in HTML. The biggest
difference between the file specification and the query string encoding
is the space. In the file spec they are '%20' and in a query string
they are '+' - many people get this wrong (most clearly Technorati who
have wrecked the C++ blogging community through this error). If you're
trying to decode somebody else's query string you can't rely on the
format at all.

Now, the final thing you need to think about is security. You need to
check and double check the URI for all sorts of things. What checks you
need depends on where the URI came from and what you're doing with it.

I have a number of aticles on my web site about these issues:

Problems with IIS due to HTTP.SYS decoding the file specification:
http://www.kirit.com/Getting the correct Unicode path within an ISAPI filter

Another problem caused in the custom 404 handling:
http://www.kirit.com/Errors in IIS's custom 404 error handling

Another encoding fault leading to problems with redirections:
http://www.kirit.com/Response.Redirect and encoded URIs

And even the W3C doesn't get it right:
http://www.kirit.com/W3C's CSS validation service

These articles should give you a feel for the sorts of things that can
go wrong. They link in many places to the relevant RFCs and other specs
and notes. There are some other articles on Unicode that also touch on
other aspecs of these issues. There are also a load of tricky URIs that
you should be able to deal with properly.


So, again, I know this is strictly off-topic for this group. I hope
nobody minds too much though as the issues are pretty important.


K
 
M

Michiel.Salters

Peter said:
I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?

Tricky code, but it accidentily works ;)

The reason it works is that you use c_str(). That tacks a \0 on the end
of the char[] returned, even though it's not present in the string
itself.
Now, \0 is not a hexdigit, which implies you won't check after that \0.
Also, you only increment in if you know you saw three non-\0
charachters
which means you haven't hit the end yet. Therefore, it is portable.
Some
comments would be nice, though ;)

BTW: call URLout.reserve(URLin.size()). The input size is a very good
estimate
of the output size. And you seem to have some old headers lying around.

HTH,
Michiel Salters
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
std::string URLout;
const char* in(URLin.c_str());
for(;*in;in++)
{
if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
{
if(*in=='+')
URLout+=' ';
else
URLout+=*in;
}
else // Convert all %xy hex codes into ASCII characters.
{
if( (in[1]=='2' && in[2]=='6')
|| (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
{ // Do not alter URL delimeters.
URLout+='%';
URLout+=in[1];
URLout+=in[2];
}
else
URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
in+=2;
}
}
return URLout;
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top