Generic innerHTML functionality and other minor questions...


Luke Matuszewski

VK wrote:
You may keep your document in UTF-8 format and ask your server admin
to turn off the encoding mechanics so that the document is not
double-encoded while being sent. But why would you want to do this? It
definitely would not improve your text's portability, because you would
have to make the same agreement with each server admin. The HTTP
protocol just doesn't work this way, and I have great doubts that you
will manage to change HTTP standards on all servers across the globe
:)

Effectively your idea is similar to keeping all executables on your
computer in base64 text format and turning off your browser's base64
encoder so you could upload your files onto the server manually. It is
also doable, but why in heaven's name? ;-)

You really don't understand me, and I doubt you will. I quoted some
meaningful text from the W3C, which says that the server of a document
tries to 'detect' the encoding of the document (or it is set manually
by the admin, or [see my previous post]) - so if it can somehow detect
that my document (e.g. an HTML page) is in UTF-8, then no transformation
is needed if it sends the data using Content-Type: text/html; charset=UTF-8.
As far as I know, a standard HTTP server tries to detect the encoding
of my document, and then all it needs to do is set the charset in the
Content-Type header to the detected encoding - NO TRANSFORMATION (or
double transformation).

Apart from servers like Apache httpd, there is a (historical)
convention that if the server cannot somehow detect the encoding of the
served file, it will use ISO-8859-1.
Again, if you are writing a PHP page, a JSP page or a page in some
other language, you CAN EXPLICITLY SET THE CONTENT-TYPE HEADER. In JSP
you can do that via the <%@ page %> directive at the top of the page.

Unicode is the standard - its main role is to assign an integer value
(a code point) to each character of the national languages, while a UTF
specifies how to transform that character value (which may be up to 4
bytes long) into a byte stream - e.g. used to produce a raw file.
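
For example, the Polish letter ł has the Unicode code point U+0142, and
its UTF-8 encoding is the two bytes 0xC5 0x82. A minimal illustration in
JavaScript:

var lStroke = "\u0142";              // Polish ł, code point U+0142
alert(lStroke.length);               // 1 - one character in the string
alert(lStroke.charCodeAt(0));        // 322 (0x142) - the code point value
alert(encodeURIComponent(lStroke));  // "%C5%82" - its UTF-8 byte sequence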

BR
Luke.
 

VK

Luke,

You asked for the most secure format to distribute JavaScript files
containing characters outside basic ASCII.

You've got the answer: \uFFFF Unicode escape sequences. If you think
that UTF-8 pre-encoded text is more secure in this respect, you are
welcome to try it across servers - it's a free world, man! :)

If you look at my posts around here you'll see that I'm the last man on
the block to prove anything by quoting documents. As I have said
several times: if you have a question, run an experiment. Make a
JavaScript file with Polish text in it, convert it into UTF-8, create a
document holder with a UTF-8 Content-Type declaration and distribute it
on free hosting providers across the globe. Get the text into your
browser and check the result. Do the same with the \u-escaped variant
and check the result. You see - it is rather easy and it doesn't
involve any quotations, cross-references, document version comparisons
and other academic toys. And most importantly, it gives some
*practical* answers.

If you don't want to experiment, then this particular topic has been
explored to its limits, I guess.
 

Luke Matuszewski

VK wrote:
Luke,

You asked for the most secure format to distribute JavaScript files
containing characters outside basic ASCII.

No, I asked about the escaping mechanism - especially how (or whether)
it works with the XMLHttpRequest object when using GET. My confusion
was over this particular case:
If I have a page which is reported by the browser/user agent as having
the encoding e.g. Windows-1250, and I take a value from this page in
JavaScript, e.g.

var str = selectElem.options[selectElem.selectedIndex].value;

and use it as a request parameter in the open() call (GET), e.g.

x.open("GET", "/someUrl?MyRequestParameter="+str, true);

then in what encoding will that MyRequestParameter value arrive at the
server side?
Now I know that all strings in JavaScript 1.0 are in Unicode, and in
particular my parameter would be in Unicode (UTF-8, I presume)...
so a value sent by GET needs to be escaped (because when using the GET
method only a limited set of characters is permitted in the URL string;
e.g. even the space ' ' is escaped as %20 and @ is escaped as %40 ...
(%xx - xx being a hex value)).
The escape() method properly escapes ' ', @ and the other basic
characters (probably taken from US-ASCII). But when I want to send some
Latin character like the Polish ł (l with stroke), I have to first
encode it via encodeURLComponent (added in ECMAScript v3).
When I use POST this is not a problem, since the URL and the POSTed
data need no escaping.
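
A sketch of the safe GET pattern discussed here (assuming x is an
already-created XMLHttpRequest object and selectElem the <select>
reference, as above; "/someUrl" is a placeholder):

var str = selectElem.options[selectElem.selectedIndex].value;
// encodeURIComponent percent-encodes using UTF-8 regardless of the
// page's charset, so the server always knows what to expect:
x.open("GET", "/someUrl?MyRequestParameter=" + encodeURIComponent(str), true);
x.send(null);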

To see the escaping mechanism in the GET method of the HTTP protocol,
try this: open your browser and in the Location (URL) field type:

http://www.google.com/?myRequestParam=Some Value

and you will see

http://www.google.com/?myRequestParam=Some%20Value

and a page with the text:

Not Found
The requested URL /?myRequestParam=Some%20Value was not found on this
server.

Got it?

The only restriction here is that all request parameters sent via the
open() (GET) or send() (POST) method are sent using the UTF-8 format,
so at the server side my script must be aware of it - e.g. in a servlet
I must call request.setCharacterEncoding("UTF-8"); before taking any
parameters from the HTTP request.
(Personally I use a filter which always sets
request.setCharacterEncoding("UTF-8"); so I don't have to do it in
every servlet.)

It may be a limitation that the XMLHttpRequest object does GET using
the UTF-8 transformation format.

I have done some testing on the Apache httpd server - from httpd.conf
I learned that its default charset is set by
#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset UTF-8

so the default charset in the Content-Type header is UTF-8.

...so I made two files:
- one saved in the Windows-1250 encoding (Central Europe);
- one saved in the UTF-8 encoding;
and opened them in my browser... only the file saved as UTF-8 was
displayed correctly.
Both files were served with the Content-Type header charset set to
UTF-8 (and the browser's View->Encoding was set to that value too ->
UTF-8) - so again, no transformation on the server side, and no file
charset detection was used (as stated in the W3C HTML specification).
 

Thomas 'PointedEars' Lahn

Luke said:
I have read in the FAQ from jibbering about the generic DynWrite, but
I also realized that it uses only the innerHTML feature of HTML
objects.
(1) Is there a DOM function which is very similar to the innerHTML
property, e.g. (my guess) setInnerNodeAsText or something...?

`innerHTML' is a DOM feature, and I think one that can be considered a
feature of "DOM Level 0" (IE3+/NN3+). It is just not a feature of the
_W3C_ DOM, and AFAIK the latter provides no equivalent for it. I think
that is because including it would require describing how the method
should handle invalid markup, as it would be nonsensical to provide a
method that allows one to destroy the DOM tree, the very thing it
operates on. So far, assignments to `innerHTML' are not checked for
well-formedness, hence the Gecko DOM disallows write access to it for
documents served with an XML document type, such as XHTML served as
application/xhtml+xml; in that case, you must do as Michael described.

You can use W3C DOM Level 3 Core's `textContent' attribute/property for
objects that implement the Node interface (includes all HTML element
objects) but that is restricted to plain text content, and I think that
is a Good Thing after all. `textContent' is supported in more recent
Mozilla/5.0 based UAs. It is on the wish list for Opera 9.

<URL:http://www.w3.org/TR/DOM-Level-3-Core/>
<URL:http://developer.mozilla.org/en/docs/DOM:element.textContent>
<URL:http://www.mozilla.org/docs/dom/reference/levels.html>
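
A hedged sketch of a plain-text setter along those lines: it prefers
the W3C DOM Level 3 `textContent' property and falls back to Core DOM
methods where it is not supported (the helper name setTextContent is
mine, not part of any standard):

function setTextContent(el, text)
{
  if (typeof el.textContent != "undefined")
  {
    // W3C DOM Level 3 Core
    el.textContent = text;
  }
  else
  {
    // Fallback: remove all child nodes, then append a single text node
    while (el.firstChild)
    {
      el.removeChild(el.firstChild);
    }
    el.appendChild(document.createTextNode(text));
  }
}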


HTH

PointedEars
 

Thomas 'PointedEars' Lahn

You SHOULD encode all reserved characters unless they serve their intended
purpose. You SHOULD NOT encode unreserved characters. See RFC3986.
- so if I pass something like this:

x.open("GET",
       "/bw/My_Strust_Action/SelectChanged.do?"
       + escape(selectElem.name + "="
                + selectElem.options[selectElem.selectedIndex].value),
       true);

Furthermore, if I know that the value and name properties are taken
from my <select> element used on a page with UTF-8 encoding, then they
should be escaped according to that encoding, so e.g. a Latin character
like the Polish ł (l with stroke) would be encoded as %C5%82 in UTF-8
and as %B3 in the Windows-1250 encoding.

Are my assumptions right?

Not exactly.

1) As I explained before, UTF-8 is a *transport encoding*, thus UTF-8
char sequences exist only during their travel time from server to
browser and from browser to server.

This is utter nonsense again. UTF means Unicode _Transformation_
Format; its use is not restricted to a certain transport channel, nor
has UTF anything to do with client-server communication.

By the time you are able to operate on strings using JavaScript,
UTF-8's mission is already completed and all char sequences have been
converted to the relevant Unicode chars.

Nonsense! A transformation format is needed to encode Unicode chars at
code points up to U+10FFFF in 8- or 16-bit code units, including script
code string literals. It is only that this happens completely
transparently to the user of string literals, as in a supporting
implementation _all_ characters in the string literal are encoded using
UTF-16. Which is why "\uABCD" is a string of length 1 there, although
two bytes are required to store it.
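
This is easy to observe in any conforming implementation:

var s = "\uABCD";
alert(s.length);         // 1 - one UTF-16 code unit, hence "one character"
alert(s.charCodeAt(0));  // 43981 (0xABCD) - transparently decoded
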
So you never need to bother with UTF-8 transformations unless you
decide to emulate an HTTP server using JavaScript.

Nonsense.

2) The escape / unescape methods work only with ASCII characters.

No, they also work for ISO-8859-xx characters; however, that is not
specified.

For Unicode transformations you have to use the encodeURIComponent /
decodeURIComponent methods.

This would be true if one replaced "transformations" with
"percent-encoding".
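
For comparison (output as produced by implementations supporting both
functions, e.g. JavaScript 1.5):

var s = "\u0142";              // Polish ł
alert(escape(s));              // "%u0142" - non-standard %uXXXX form
alert(encodeURIComponent(s));  // "%C5%82" - UTF-8 percent-encoding per RFC 3986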


PointedEars
 

Thomas 'PointedEars' Lahn

VK said:
Luke said:
encodeURI or encodeURIComponent() work very well, producing %C5%82.
It happens every time, no matter what charset was used in the
Content-Type of the HTTP response.
(Because every string in JavaScript is in Unicode format - even the
forms["myForm"].elements["myFormField"].value string.)

You've got it! From within JavaScript everything is Unicode, no matter
what encoding is used. [...]

You mean that even though you gave him the exact _opposite_ advice, he
managed to see through it. And now you are confirming that he is right,
without admitting that you were utterly wrong. There is a word for
that: hypocrisy.


PointedEars
 

Thomas 'PointedEars' Lahn

Luke said:
No, I asked about the escaping mechanism - especially how (or whether)
it works with the XMLHttpRequest object when using GET. [...]
Now I know that all strings in JavaScript 1.0 are in Unicode

No, Unicode support was not included before JavaScript 1.3 (NN 4.06).
Since then, strings are encoded using UTF-16, in accordance with
ECMAScript. The current JavaScript version is JavaScript 1.6 (in
Mozilla/5.0 rv:1.8b+, hence Firefox 1.5). Unicode support for Internet
Explorer was probably not included before version 5.5/Win, which was
the first to support JScript 5.5, the first JScript version to support
encodeURIComponent().

<URL:http://docs.sun.com/source/816-6408-10/whatsnew.htm>
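
Where older user agents must still be supported, a feature test avoids
runtime errors (a minimal sketch; the fallback choice is the author's):

var str = "some value";
var encoded;
if (typeof encodeURIComponent == "function")
{
  encoded = encodeURIComponent(str);  // UTF-8 percent-encoding, JavaScript 1.5+
}
else
{
  encoded = escape(str);  // legacy fallback; unsafe beyond US-ASCII
}
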
and in particular my parameter would be in Unicode (UTF-8, I
presume)...

My expectation is instead that ASCII percent-encoding (as described in
RFC 3986) will be used for characters below code point 0x80, and UTF-8
percent-encoding will be used for the rest. I tried `ł]' (where ł is
fortunately AltGr+l here :)) and submitted it -- %C5%82%5D was used if
UTF-8 was set as the Character Encoding in the View menu before.[1] I
assume this will be triggered by the default response header you
configured.
However, I suggest that you use server-side scripting instead to set

Content-Type: text/html; charset=UTF-8

only if you need it. Serving all documents, even non-UTF-8-encoded
ones, as UTF-8-encoded is probably harmful.

[1] Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922
Firefox/1.0.7 (Debian package 1.0.7-1) Mnenhy/0.7.2.0
so a value sent by GET needs to be escaped (because when using the GET
method only a limited set of characters is permitted in the URL string;
e.g. even the space ' ' is escaped as %20 and @ is escaped as %40 ...
(%xx - xx being a hex value)).
The escape() method properly escapes ' ', @ and the other basic
characters (probably taken from US-ASCII). But when I want to send some
Latin character like the Polish ł (l with stroke), I have to first
encode it via encodeURLComponent (added in ECMAScript v3).

encodeURIComponent(), you are right on the rest.
When I use POST this is not a problem, since the URL and the POSTed
data need no escaping.

That is not entirely true. I think if your POST request included
Unicode characters, it would be necessary to declare them as such,
probably with

Accept-Charset: UTF-8,*


PointedEars
 

RobG

one said:
Hey, that's interesting. I tried it, though, in Firefox/Gecko. A
window.alert(text) gets me the text I wanted to insert, but the text on
the HTML page does not change.

How do I get the text change to take effect?

<p onclick="changeTextContent(this);">Click on me...</p>

<script type="text/javascript">
function changeTextContent(el)
{
  if (el.textContent)
  {
    el.textContent = (
      prompt('Current text is: ' + el.textContent + '\n'
             + 'Enter new text or click \'Cancel\' to '
             + 'keep current text')
      || el.textContent);
  }
}
</script>
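
A hedged variant that also covers IE's proprietary innerText, since the
sketch above silently does nothing where textContent is unsupported
(untested; the helper name is mine):

function changeText(el, text)
{
  if (typeof el.textContent == "string")
  {
    el.textContent = text;  // W3C DOM Level 3 Core
  }
  else if (typeof el.innerText == "string")
  {
    el.innerText = text;    // proprietary MSHTML equivalent
  }
}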
 

Luke Matuszewski

Thomas 'PointedEars' Lahn wrote:
No, they also work for ISO-8859-xx characters; however, that is not
specified.

This would be true if one replaced "transformations" with
"percent-encoding".

But this "percent-encoding" is called the escaping mechanism in the RFC
documents. Also, there is one thing to remember: when using
encodeURIComponent/decodeURIComponent, the argument is a Unicode string
(written in memory using UTF-16, as the spec says) and it is
transformed using UTF-8. This is a limitation, because the escaping
mechanism in GET forms (forms with method="get") works depending on the
charset value in the Content-Type HTTP header of the document served by
the server, so:
- if the charset is Windows-1250, then the Polish ł (l with stroke) is
encoded as %B3;
- if the charset is UTF-8, then the Polish ł is encoded as %C5%82 (as
produced by encodeURIComponent).

Far better would be an encodeURIComponent that took a second argument,
charset, which in turn would specify the encoding to use for the
escaping mechanism, but I don't know whether one exists (provided by IE
or by the SpiderMonkey engine in Mozilla-based browsers).
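
As far as I know no such second parameter exists in either
implementation, but one can approximate it in script with a lookup
table, at the price of maintaining that table. A hypothetical sketch
(the helper encodeWindows1250 and its deliberately partial mapping are
mine, for illustration only):

// Map Unicode characters to their Windows-1250 percent-encoded bytes.
// Only a few sample entries; a real table would cover the whole code page.
var cp1250 = { "\u0142": "%B3", "\u0105": "%B9", "\u015B": "%9C" };

function encodeWindows1250(str)
{
  var result = "";
  for (var i = 0; i < str.length; i++)
  {
    var ch = str.charAt(i);
    if (cp1250[ch])
    {
      result += cp1250[ch];              // table-driven, code-page specific
    }
    else
    {
      result += encodeURIComponent(ch);  // US-ASCII is identical in both
    }
  }
  return result;
}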

BR.
Luke.
 

Michael Winter

On 23/11/2005 09:16, Luke Matuszewski wrote:

[snip]
[The] escaping mechanism in GET forms (forms with method="get") works
depending on the charset value in the Content-Type HTTP header of a
document served by the server [...]

There should be no escaping at all. If the GET transfer method is in
use, data should be limited to 7-bit ASCII. Anything else is undefined.
In practice, user agents do encode data, but the problem is that, unlike
with POST, there is no way to specify a charset parameter.

The most sensible approach for user agents would be to /always/ use
UTF-8, particularly as RFC 3986 (URI Generic Syntax) requires it for
certain URI components, creating consistent behaviour. Unfortunately,
they don't. Alternatively, avoid GET when transmitting multilingual data.

I refer you to <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
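
With XMLHttpRequest and POST, the encoding can at least be declared
explicitly (a sketch; the URL is a placeholder, and IE 6 would need the
ActiveX equivalent instead of the constructor):

var x = new XMLHttpRequest();
var str = "warto\u015B\u0107";  // sample multilingual value ("wartość")
x.open("POST", "/someUrl", true);
x.setRequestHeader("Content-Type",
                   "application/x-www-form-urlencoded; charset=UTF-8");
x.send("MyRequestParameter=" + encodeURIComponent(str));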

[snip]

Mike
 

Thomas 'PointedEars' Lahn

one said:
Hey, that's interesting. I tried it, though, in Firefox/Gecko. A
window.alert(text) gets me the text I wanted to insert, but the text
on the HTML page does not change.

How do I get the text change to take effect?

WFM, Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922
Firefox/1.0.7 (Debian package 1.0.7-1) Mnenhy/0.7.2.0.

Why it does not work for you is impossible to say without you showing
the error message received, or the User-Agent and source code used.

<URL:http://validator.w3.org/>
<URL:http://diveintomark.org/archives/2003/05/05/why_we_wont_help_you>
<URL:http://jibbering.com/faq/#FAQ4_43>


PointedEars
 

Luke Matuszewski

Thomas 'PointedEars' Lahn wrote:
That is not entirely true. I think if your POST request would include
Unicode characters, it would be necessary to declare them as such, probably
with

Accept-Charset: UTF-8,*
No, Accept-Charset influences only the message body of the response.
The Status-Line and HTTP headers of the response are all constructed
from ISO-8859-1.
A general HTTP response stream consists of:
I. Status-Line
II. Message-Headers (optional)
III. Blank Line
IV. Message Body
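
For example, a minimal response stream (illustrative): everything above
the blank line is ISO-8859-1; only the body below it is in the charset
the header declares.

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8

<html>...</html>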
 

Luke Matuszewski

Michael Winter wrote:
There should be no escaping at all.

Yes, but here theory and practice are evidently not the same - the
escaping mechanism is used even on US-ASCII characters like the space
' ' (%20) or @ (%40):
http://czyborra.com/charsets/iso646-us.gif

but as a consequence, browser implementations and newer server-side
scripts have extended the escaping mechanism to all encodings they
support (so e.g. the Polish ł is encoded in UTF-8 as %C5%82).

Nice article ;)
 

Michael Winter

Luke Matuszewski wrote:

Yes, but here theory and practice are evidently not the same - the
escaping mechanism is used even on US-ASCII characters like the space
' ' (%20) or @ (%40)

That's entirely different, and not what I was referring to. Certain
US-ASCII characters are reserved within URI components, and the URI
syntax RFCs specify how they are to be treated. Characters from outside
this repertoire, including Unicode characters, are not specified, nor
is there universal agreement in practice. As Flavell concludes, "all
other things being equal, this form submission content-type should be
avoided for serious i18n work."

This part of the thread started due to your concern about using
XMLHttpRequest, so idempotence isn't really an issue, and XMLHttpRequest
objects are known (from what I've read) to always transform data using
UTF-8.

Anyway, as far as the HTML side of things is concerned, you should ask
in comp.infosystems.www.authoring.html and consider reviewing archived
material from that group (as they've no doubt discussed it all before).

[snip]

Mike
 

Luke Matuszewski

Michael Winter wrote:
Anyway, as far as the HTML side of things is concerned, you should ask
in comp.infosystems.www.authoring.html and consider reviewing archived
material from that group (as they've no doubt discussed it all before).

Yep sure.

The best way to use HTML forms with i18n is to serve them with a
Content-Type specifying charset=UTF-8. Then, when there is a form
without an accept-charset attribute, like:

<form action="url" method="get">

</form>

its contents (the values of its fields) are encoded in UTF-8 and then
escaped (%xx, as XMLHttpRequest does - as you suggested).
The implementation of XMLHttpRequest may differ between user agents -
life is rich in such situations - so it is best always to encode
parameters using encodeURI or encodeURIComponent (which are fully
Unicode-compliant, unlike their old counterpart escape()) when using
the GET HTTP method, as in the sketch below.
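
A sketch of that rule applied to a whole form (the helper name
buildQuery is mine; it assumes simple named controls with a value
property):

function buildQuery(form)
{
  var parts = [];
  for (var i = 0; i < form.elements.length; i++)
  {
    var el = form.elements[i];
    if (el.name)
    {
      parts.push(encodeURIComponent(el.name) + "="
                 + encodeURIComponent(el.value));
    }
  }
  return parts.join("&");
}

// Usage, e.g.:
// x.open("GET", "url?" + buildQuery(document.forms["myForm"]), true);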

Best Regards.
Luke.
 
