Which is the better encoding method?

mistral · Jul 12, 2006

What is the difference between the two encoding methods below, and
which one can be considered more "web safe", i.e. fully retaining the
functionality of the original source code without the danger of the
original characters being misinterpreted (the code contains long
Registry entries, ActiveX, etc.)?

A.

<script
type="text/javascript">document.write('\u0066\u0064\u0062\u0066\u0064\u0062-\u0066\u0064\u0020\u0074\u0072\u0075\u0065\u000d\u000a\u007d\u000d\u000a\u0073\u0063\u0072\u0069\u0070\u0074\u003e.....')</script>

B.

<script language="text/javascript">

</script>

thanks.

mistral

Lasse Reichstein Nielsen · Jul 12, 2006

mistral said:
What is difference between two encoding methods below

I think the difference should be obvious. One uses two-character MIME
escapes and the other uses four-character character literal escapes.

and what method can be considered more "web safe", fully retaining
functionality of the original source code, without the danger of
misinterpretation of original code characters (code contains long
Registry entries, activeX, etc).

Either should work.

If the encoded text contains non-ASCII characters with a Unicode code
point above 255, the mime encoding will use four-character escapes as
well:
escape("\u0101") == "%u0101"
Likewise, characters below codepoint 256 can be escaped using literals
like \x22, so there is really not much difference in size.
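
To make the size comparison concrete, here is a minimal sketch. It relies on the legacy global escape(), which is deprecated but still shipped by browsers and Node.js; the two sample characters are picked purely for illustration.

```javascript
// Legacy escape(): code units up to 255 become %XX, anything above becomes %uXXXX.
const belowEscaped = escape("\x22");   // the double quote, code 34
const aboveEscaped = escape("\u0101"); // code 257

// The same characters written as string-literal escape sequences (as source text).
const belowLiteral = "\\x22";
const aboveLiteral = "\\u0101";

console.log(belowEscaped, belowEscaped.length, belowLiteral.length); // %22 3 4
console.log(aboveEscaped, aboveEscaped.length, aboveLiteral.length); // %u0101 6 6
```

So the percent form saves one character per escape below 256 and nothing above it.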

The big question is why you try to escape the completely normal
characters at all.

/L

Richard Cornford · Jul 12, 2006

mistral said:
What is difference between two encoding methods below

Neither are "encoding methods" (they both resemble (futile) attempts at
obfuscation).
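
As a rough illustration of why this is futile as obfuscation, the sketch below reverses the \uXXXX form with one regular expression. The helper name decodeUnicodeEscapes is made up for this example, and the sample is just the first three escapes from method A.

```javascript
// A short sample in the style of method A, kept as raw text (backslashes doubled).
const obfuscated = "\\u0066\\u0064\\u0062";

// Hypothetical helper: replace every \uXXXX sequence with the character it names.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

const revealed = decodeUnicodeEscapes(obfuscated);
console.log(revealed); // fdb
```

Anyone who can view the page source can do the same, or simply let the browser evaluate the string.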

and what method can be considered more "web safe",

"Web safe" has no meaning in relation to encoding or what you show
below.

fully retaining functionality of the original source code, without the danger
of misinterpretation of original code characters (code contains long
Registry entries, activeX, etc).

<snip>

There is no direct relationship between the act of
document.write-ing obfuscated strings and the effective execution of
source code (beyond syntax errors in the actual source and runtime
errors generated in any operations performed by it).

It is often said that there is an inverse relationship between the
desire to conceal source code on the web and the worth of that source
code.

Richard.

Bart Van der Donck · Jul 12, 2006

Lasse said:
I think the difference should be obvious. One uses two-character MIME
escapes and the other uses four-character character literal escapes.

Sorry for the nitpick, but MIME (quoted-printable) escapes are actually
in the following format, e.g. for the equals sign:

=3D

whereas the correct description for the OP's notation is 'URL
encoding':

%3D
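
A small sketch of the two notations side by side. encodeURIComponent() is the built-in for URL encoding; quotedPrintableEscape is a hypothetical helper written only for this comparison (quoted-printable, per RFC 2045, uses an equals sign followed by two uppercase hex digits).

```javascript
// URL (percent) encoding, as produced by the built-in encodeURIComponent().
const urlForm = encodeURIComponent("=");

// Hypothetical helper: quoted-printable style escape for one ASCII character.
function quotedPrintableEscape(ch) {
  const hex = ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0");
  return "=" + hex;
}
const mimeForm = quotedPrintableEscape("=");

console.log(urlForm, mimeForm); // %3D =3D
```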

[...]
If the encoded text contains non-ASCII characters with a Unicode code
point above 255, the mime encoding will use four-character escapes as
well:
escape("\u0101") == "%u0101"
Likewise, characters below codepoint 256 can be escaped using literals
like \x22, so there is really not much difference in size.

Yes, but those code points do not necessarily represent the same
character in the \x80-\x9F range. My test seems to show that even
MSIE prefers ISO-8859-1 instead of the expected Windows-1252 there.

Lasse Reichstein Nielsen · Jul 12, 2006

Bart Van der Donck said:
Sorry for the nitpick,

I'm not in a position to complain about nitpicking

Thank you for the correction.

[%hh vs \xhh]

Yes, but those code points do not necessarily represent the same
character in the \x80-\x9F range. My test seems to show that even
MSIE prefers ISO-8859-1 instead of the expected Windows-1252 there.

A quick test shows that if n is a number between 128 and 255, and
hh is a hex representation of it, then the following gives the same
result:
String.fromCharCode(n)
"\xhh"
"\u00hh"
unescape("%hh")
unescape("%u00hh")
(which is a string with .charCodeAt(0)==n, however much sense that
makes).

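Testcode:
---
// Check that every way of writing code unit n yields a one-character
// string whose charCodeAt(0) is n, for the whole range 128..255.
for (var i = 128; i <= 255; i++) {
  var hex = i.toString(16);              // always two digits in this range
  var s = String.fromCharCode(i);
  var l = eval('"\\x' + hex + '"');      // string literal "\xhh"
  var ll = eval('"\\u00' + hex + '"');   // string literal "\u00hh"
  var u = unescape("%" + hex);
  var ul = unescape("%u00" + hex);
  if (s.charCodeAt(0) != i ||
      l.charCodeAt(0) != i ||
      ll.charCodeAt(0) != i ||
      u.charCodeAt(0) != i ||
      ul.charCodeAt(0) != i) {
    alert("Error for value: " + i);
  }
}
---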

I have not tested what that character means, but charCodeAt() is
expected to return a code point, which is defined as "a 16-bit
unsigned value used to represent a single 16-bit unit of UTF-16 text."
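
A quick way to see that "16-bit unit" definition at work is a character outside the 16-bit range. This is only a sketch: codePointAt() is a later (ES2015) addition that was not available when this was written, and the sample character U+1F600 is an arbitrary choice.

```javascript
// U+1F600 lies outside the 16-bit range, so it is stored as a surrogate pair.
const astral = "\uD83D\uDE00";

const unitCount = astral.length;             // number of 16-bit units
const firstUnit = astral.charCodeAt(0);      // first UTF-16 code unit
const fullCodePoint = astral.codePointAt(0); // full Unicode code point (ES2015+)

console.log(unitCount, firstUnit.toString(16), fullCodePoint.toString(16));
// 2 d83d 1f600
```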

/L

mistral · Jul 12, 2006


I'm still not entirely clear on this encoding. So, which output
encoding is preferable for obfuscating: ASCII, European ASCII
(ISO-8859-1), or Unicode (UTF-8 or UTF-16)? And what about unescape?
Most obfuscators use unescape.

Mistral

Bart Van der Donck · Jul 12, 2006

Lasse said:
Bart Van der Donck said:

Yes, but those code points do not necessarily represent the same
character in the \x80-\x9F range. My test seems to show that even
MSIE prefers ISO-8859-1 instead of the expected Windows-1252 there.

A quick test shows that if n is a number between 128 and 255, and
hh is a hex representation of it, then the following gives the same
result:
String.fromCharCode(n)
"\xhh"
"\u00hh"
unescape("%hh")
unescape("%u00hh")
(which is a string with .charCodeAt(0)==n, however much sense that
makes).
[...]

The code point table would probably be identical across all these
commands, it's probably decided by the js engine itself. It doesn't
look like the page's own charset has any influence. I didn't find a way
to force charCodeAt() to a specific code page either.

It appears that even Microsoft follows some standards in this matter.

Based upon their Windows-1252 character set (which they try to
dictate as much as they can though), one would expect that

alert('\x83')

(0x83 being 131 in decimal) would return

ƒ

But instead, they use:

alert('ƒ'.charCodeAt(0))

which returns 402, thus corresponding to code point 402 (Unicode, above
255) instead of Microsoft's "own invented" proprietary 131
(Windows-1252).
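
The difference can be spelled out in a few lines. This is only a sketch: cp1252Excerpt is a hand-written two-entry excerpt of the Windows-1252 table, included as an assumption for illustration, not something the js engine provides.

```javascript
// Assumed excerpt of the Windows-1252 table: byte value -> Unicode code point.
const cp1252Excerpt = { 0x80: 0x20ac, 0x83: 0x0192 };

const literal = "\x83";                    // string-literal escape
const literalCode = literal.charCodeAt(0); // 131, a C1 control character in Unicode
const florinCode = "\u0192".charCodeAt(0); // 402, LATIN SMALL LETTER F WITH HOOK

// An explicit mapping is needed to get from the Windows-1252 byte to the character.
const mapped = String.fromCharCode(cp1252Excerpt[0x83]);

console.log(literalCode, florinCode, mapped === "\u0192"); // 131 402 true
```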

But then again, 131 seems to be present in FF/MSIE/NS numeric html
entities though (which one wouldn't expect anymore then, IMO):

document.write('ƒ is &fnof; and &#402; and &#131;')

Richard Cornford · Jul 12, 2006

Bart said:
Lasse said:

Bart Van der Donck said:

Yes, but those code points do not necessarily represent the same
character in the \x80-\x9F range. My test seems to show that even
MSIE prefers ISO-8859-1 instead of the expected Windows-1252 there.

A quick test shows that if n is a number between 128 and 255, and
hh is a hex representation of it, then the following gives the same
result:
String.fromCharCode(n)
"\xhh"
"\u00hh"
unescape("%hh")
unescape("%u00hh")
(which is a string with .charCodeAt(0)==n, however much sense that
makes).
[...]

The code point table would probably be identical across all these
commands, it's probably decided by the js engine itself.

<quote cite="ECMA 262, 3rd Ed. Section 6">
6 Source Text

ECMAScript source text is represented as a sequence of characters in
the Unicode character encoding, version 2.1 or later, using the UTF-16
transformation format. The text is expected to have been normalised to
Unicode Normalised Form C (canonical composition), as described in
Unicode Technical Report #15. Conforming ECMAScript implementations
are not required to perform any normalisation of text, or behave as
though they were performing normalisation of text, themselves.

SourceCharacter ::
any Unicode character

ECMAScript source text can contain any of the Unicode characters. All
Unicode white space characters are treated as white space, and all
Unicode line/paragraph separators are treated as line separators.
Non-Latin Unicode characters are allowed in identifiers, string
literals, regular expression literals and comments.
</quote>

It doesn't look like the page's own charset has any influence.

The/a character set asserted by an HTTP content type header would
probably be employed in deciding how to translate incoming javascript
source into the "of characters in the Unicode character encoding" that
is needed prior to the tokenisation of the code.
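
A rough sketch of that translation step, assuming an environment with the TextDecoder global (modern browsers and Node.js). The same two bytes turn into different character sequences depending on the declared charset; after that the script only ever sees the resulting UTF-16 units.

```javascript
// The two bytes that encode 'é' in UTF-8.
const bytes = new Uint8Array([0xc3, 0xa9]);

// Decoded as UTF-8: one character.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// Decoded as ISO-8859-1: each byte maps directly to the code point of the same value.
const asLatin1 = String.fromCharCode(...bytes);

console.log(asUtf8.length, asUtf8.charCodeAt(0));     // 1 233
console.log(asLatin1.length, asLatin1.charCodeAt(0)); // 2 195
```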

I didn't find a way
to force charCodeAt() to a specific code page either.

<snip>

You wouldn't, as by the time you are dealing with javascript you are
past the point where the normalisation to Unicode has happened and so
code pages are not an issue.

Richard.

Bart Van der Donck · Jul 12, 2006

Richard said:
<quote cite="ECMA 262, 3rd Ed. Section 6">
6 Source Text

ECMAScript source text is represented as a sequence of characters in
the Unicode character encoding, version 2.1 or later, using the UTF-16
transformation format. The text is expected to have been normalised to
Unicode Normalised Form C (canonical composition), as described in
Unicode Technical Report #15. Conforming ECMAScript implementations
are not required to perform any normalisation of text, or behave as
though they were performing normalisation of text, themselves.

SourceCharacter ::
any Unicode character

ECMAScript source text can contain any of the Unicode characters. All
Unicode white space characters are treated as white space, and all
Unicode line/paragraph separators are treated as line separators.
Non-Latin Unicode characters are allowed in identifiers, string
literals, regular expression literals and comments.
</quote>

I'll get back after my first ECMAScript study, okay

The/a character set asserted by an HTTP content type header would
probably be employed in deciding how to translate incoming javascript
source into the "of characters in the Unicode character encoding" that
is needed prior to the tokenisation of the code.

I had to read that sentence 5 times, but, yes, I'd say this is a
correct representation. One side remark though. I'd say browsers
should normally accept the stream in the offered character set, as you
said. For example, setting the output stream and <meta http-equiv/> to
ASCII should prevent a character like 'é' from being displayed. And yes,
MSIE seems to implement this correctly:

http://www.dotinternet.be/temp/ascii.pl

But Firefox seems to ignore the charset rules and displays the
character anyway. The code:

#!/usr/bin/perl
# Serve a page declared as ASCII that nevertheless contains 'é'.
print <<'HTM';
Content-Type: text/html; charset=ascii

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ascii"/>
</head>
<body>
é
</body>
</html>
HTM

Interesting!

RC · Jul 12, 2006

mistral wrote:

I'm still not entirely clear on this encoding. So, which output
encoding is preferable for obfuscating: ASCII, European ASCII
(ISO-8859-1), or Unicode (UTF-8 or UTF-16)? And what about unescape?
Most obfuscators use unescape.

<html><head>
<meta http-equiv=Content-Type content="text/html; charset=UTF-8">
</head>
<body>
do whatever you want

</body></html>

You can also try charset=UTF-16, etc.
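
One observation that may help with the charset choice: both escape forms from the original post consist only of ASCII characters, so they are byte-for-byte identical in any ASCII-compatible charset (ASCII, ISO-8859-1, UTF-8). A minimal sketch, where toUnicodeEscapes is a hypothetical helper and the sample string is arbitrary:

```javascript
// Hypothetical helper: rewrite every character of a string as a \uXXXX escape.
function toUnicodeEscapes(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += "\\u" + text.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return out;
}

const escaped = toUnicodeEscapes("caf\u00e9");

// Every character of the escaped form is plain ASCII (code below 128).
const asciiOnly = /^[\x00-\x7f]*$/.test(escaped);

console.log(escaped, asciiOnly); // \u0063\u0061\u0066\u00e9 true
```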
