Patient Guy
Coding patterns for regular expressions is completely unintuitive, as far
as I can see. I have been trying to write a script that produces an array
of the attribute components of an HTML element.
Consider the example of the HTML element TABLE with the following
attributes producing sufficient complexity within the element:
<table id="machines" class="noborders inred"
style="margin:2em 4em;background-color:#ddd;">
Note that the HTML was created as a string in code, so there are NO
newlines ('\n') in the string, as there would be if a file were parsed;
newlines are not an issue. The only whitespace is the space character ' '
itself, required to delimit the element components.
I want to write an RE containing parenthesized substring matching that
neatly orders attribute components. The resulting array, after the
execution of the string .match() method upon the example, should look as
follows:
attrs = [ "id", "machines", "class", "noborders inred", "style",
"margin:2em 4em;background-color:#ddd;" ]
I can then march down the array (in steps of 2) setting attributes
(name=value) to the element using standard DOM interface methods, right?
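A minimal sketch of that pairwise walk, assuming the array really is flat and starts with a name at index 0. The helper name applyAttrs is hypothetical, and the only DOM call it relies on is the standard setAttribute:

```javascript
// Hypothetical helper (not part of the code in this post): walks a flat
// [name, value, name, value, ...] array and sets each pair on an element.
function applyAttrs(elemNode, attrs) {
  // Start at index 0 so the first name is not skipped, and stop before a
  // trailing name that has no value after it.
  for (var j = 0; j + 1 < attrs.length; j += 2) {
    elemNode.setAttribute(attrs[j], attrs[j + 1]);
  }
  return elemNode;
}

// In a browser this could be called as:
//   applyAttrs(document.createElement("table"), attrs);
```

Any object with a setAttribute(name, value) method will do, which makes the loop easy to check outside a browser.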
In approaching the writing of the RE, I have to take into account the
characters permitted to form the attribute name and the attribute value.
I assume a start to the RE pattern as:
<attribute name>=<attribute value>
I then try to find the right RE pattern for <attribute name>, keeping in
mind what the legal characters are for attribute names according to the
HTML standard ("recommendation"):
[A-Za-z0-9\-]+
I believe this pattern conforms to the standard for attribute values:
[,;'":!%A-Za-z0-9\s\.\-]+
That pattern tries to be more exclusive than inclusive, although I think
just about every character on the planet, including a newline, is
acceptable in an attribute value, at least the kind one might see in an
HTML document.
I also have to take into account that the <attribute value> may be
delimited by appropriate characters, the single quote and double quote
(which it should be, according to the HTML "recommendation").
So with all this information, assuming it is correct, writing the RE
should be as easy, if not painless, as falling off a chair:
attrsRE = /([A-Za-z0-9\-]+)=['"]?([,;'":!%A-Za-z0-9\s\.\-]+)/ig;
This was only the first of the tens of variations I have been writing on
the RE to make it work, which it has not, up to now. I have included
special expression controls, such as '?=' and '?:' only recently
introduced in JS1.5, but I would prefer not to include RE special
characters that will break in interpreters not doing version 1.5. The
above variation actually completely ignores the parenthesized substring
matching: it will produce an array that looks like this:
attrs = [ "id=\"machines", "class=\"noborders inred",
"style=\"margin:2em 4em;background-color:#ddd;" ]
I have come to the conclusion that perhaps the global flag (/.../g) and
parenthesized substring matching do not really work together, or are
mutually exclusive, because I don't recall ever seeing examples of their
combined use in the official JavaScript guide or reference. I suppose as a
general rule, it is best not to push the ability of the interpreter to
handle extremely complex tasks in a single JS statement, but to break
them down into simpler tasks in multiple JS statements, right?
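For what it is worth, a small sketch of the multi-statement route. String .match() with the g flag returns only the whole matches, while calling the standard RegExp exec() method in a loop keeps the parenthesized substrings for each match. The function name collectAttrs and the pattern below are assumptions for illustration, not a full treatment of HTML attribute syntax; the pattern avoids '?:' and '?=' on purpose:

```javascript
// With the g flag, match() returns whole matches only; the groups are dropped.
var wholeMatches = "a=1 b=2".match(/(\w)=(\d)/g);   // ["a=1", "b=2"]

// Hypothetical helper: collects a flat [name, value, ...] array by calling
// exec() repeatedly, which advances through the string via re.lastIndex.
function collectAttrs(tagText) {
  // Group 1 is the name. Group 2 is the raw value, which is one of:
  // double-quoted (group 3), single-quoted (group 4), or unquoted (group 5).
  var re = /([A-Za-z0-9\-]+)=("([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
  var attrs = [];
  var m;
  while ((m = re.exec(tagText)) !== null) {
    // Only one of groups 3 to 5 takes part in a given match.
    attrs.push(m[1], m[3] || m[4] || m[5] || "");
  }
  return attrs;
}

var sample = 'table id="machines" class="noborders inred" ' +
             'style="margin:2em 4em;background-color:#ddd;"';
var attrs = collectAttrs(sample);
```

Matching the value as "everything up to the closing quote" sidesteps the need to enumerate which characters are legal inside an attribute value.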
Anyway, the code fragment with numbered lines below represents my code
that is supposed to deal with finding a start tag (end tags are
identified in code preceding this fragment) and handling its attributes.
I have thrown up my hands after hours and hours (over several days)
reading and reading, searching the Internet, and trying to find
variations that work.
1: elem = stringPtr.match(/<([^>]+)/);
2: tag = elem[1].match(/(\w+)/);
3: if (verifyElem(tag[1]) == true)
4: {
5:     elemNode = document.createElement(tag[1]);
6:     if (levelNode != null)
7:         levelNode.appendChild(elemNode);
8:     if (isContainer(tag[1]) == true)
9:     {
10:         levelNode = elemNode;
11:         levelTagName[level++] = tag[1];
12:     }
13:     if ((attrs = elem[1].match(attrsRE)) != null)
14:         for (j = 1; j < attrs.length; j += 2)
15:             elemNode.setAttribute(attrs[j], attrs[j + 1]);
16: }
NOTES
Line 1 contains a completely unintuitive RE that matches one and only one
tag, and every character inside it. It was kindly provided by Martin
Honnen.
The element name itself is taken in line 2, its validity determined in a
function call in line 3 (function not shown), and the DOM element node
created and made a part of the document fragment in lines 5-7. If
the element can contain text and elements, an administrative procedure is
done in lines 8-12.
Then it's on to dealing with attributes in lines 13-15.