Regular Expressions Challenge

Patient Guy

Coding patterns for regular expressions is completely unintuitive, as far
as I can see. I have been trying to write a script that produces an array
of attribute components within an HTML element.

Consider the example of the HTML element TABLE with the following
attributes producing sufficient complexity within the element:

<table id="machines" class="noborders inred"
style="margin:2em 4em;background-color:#ddd;">

Note that the HTML was created as a string in code, and thus there are NO
newlines ('\n') in the string, as there would be if a file were parsed, so
newlines are not an issue. The only whitespace is the space character ' '
itself, required to delimit the element components.

I want to write an RE containing parenthesized substring matching that
neatly orders attribute components. The resulting array, after the
execution of the string .match() method upon the example, should look as
follows:

attrs = [ "id", "machines", "class", "noborders inred", "style",
"margin:2em 4em;background-color:#ddd;" ]

I can then march down the array (in steps of 2) setting attributes
(name=value) to the element using standard DOM interface methods, right?

In approaching the writing of the RE, I have to take into account the
characters permitted to form the attribute name and the attribute value.

I assume a start to the RE pattern as:

<attribute name>=<attribute value>

I then try to find the right RE pattern for <attribute name>, keeping in
mind what the legal characters are for attribute names according to the
HTML standard ("recommendation"):

[A-Za-z0-9\-]+

I believe this pattern conforms to the standard for attribute values:

[,;'":!%A-Za-z0-9\s\.\-]+

That pattern tries to be more exclusive than inclusive, although I think
just about every character on the planet, including a newline, is
acceptable in an attribute value, at least the kind one might see in an
HTML document.

I also have to take into account that the <attribute value> may be
delimited by appropriate characters, the single quote and double quote
(which it should be, according to the HTML "recommendation").

So with all this information, assuming it is correct, writing the RE
should be as easy, if not painless, as falling off a chair:

attrsRE = /([A-Za-z0-9\-]+)=['"]?([,;'":!%A-Za-z0-9\s\.\-]+)/ig;

This was only the first of the tens of variations I have been writing on
the RE to make it work, which it has not, up to now. I have included
special expression controls, such as '?=' and '?:' only recently
introduced in JS1.5, but I would prefer not to include RE special
characters that will break in interpreters not doing version 1.5. The
above variation actually completely ignores the parenthesized substring
matching: it will produce an array that looks like this:

attrs = [ "id=\"machines", "class=\"noborders inred",
"style=\"margin:2em 4em;background-color:#ddd;" ]

I have come to the conclusion that perhaps the use of the global flag
(/.../g) and parenthesized substring matching does not really work, or is
mutually exclusive, because I don't recall ever seeing examples of its
use in the official JavaScript guide or reference. I suppose as a
general rule, it is best not to push the ability of the interpreter to
handle extremely complex tasks in a single JS statement, but to break
them down into simpler tasks in multiple JS statements, right?
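
For what it's worth, the global flag and parenthesized groups are not mutually exclusive. With /g, the string .match() method returns only the whole matches, which is the array shown above; calling the RE's .exec() method in a loop yields the groups for each match in turn. A minimal sketch, assuming every attribute value is quoted (collectAttrs is a made-up name, and the pattern is a simplified stand-in, not the one from the post):

```javascript
// Hypothetical helper: returns a flat [name, value, name, value, ...] array.
// Assumes every value is wrapped in double or single quotes.
function collectAttrs(tagText) {
  // group 1: name; group 3: double-quoted value; group 4: single-quoted value
  var attrRE = /([A-Za-z0-9\-]+)\s*=\s*("([^"]*)"|'([^']*)')/g;
  var attrs = [];
  var m;
  // Each exec() call resumes where the previous match ended (attrRE.lastIndex).
  while ((m = attrRE.exec(tagText)) !== null) {
    attrs.push(m[1]);
    attrs.push(m[3] !== undefined ? m[3] : m[4]);
  }
  return attrs;
}

var attrs = collectAttrs('table id="machines" class="noborders inred" ' +
                         'style="margin:2em 4em;background-color:#ddd;"');
// attrs is now:
// ["id", "machines", "class", "noborders inred",
//  "style", "margin:2em 4em;background-color:#ddd;"]
```

This avoids '?=' and '?:' entirely, so it should not depend on the JS1.5 additions mentioned above.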

Anyway, the code fragment with numbered lines below represents my code
that is supposed to deal with finding a start tag (end tags are
identified in code preceding this fragment) and handling its attributes.
I have thrown up my hands after hours and hours (over several days)
reading and reading, searching the Internet, and trying to find
variations that work.

1: elem = stringPtr.match(/<([^>]+)/);
2: tag = elem[1].match(/(\w+)/);
3: if (verifyElem(tag[1]) == true)
4: {
5:     elemNode = document.createElement(tag[1]);
6:     if (levelNode != null)
7:         levelNode.appendChild(elemNode);
8:     if (isContainer(tag[1]) == true)
9:     {
10:         levelNode = elemNode;
11:         levelTagName[level++] = tag[1];
12:     }
13:     if ((attrs = elem[1].match(attrsRE)) != null)
14:         for (j = 1; j < attrs.length; j += 2)
15:             elemNode.setAttribute(attrs[j], attrs[j + 1]);
16: }

NOTES
Line 1 contains a completely unintuitive RE that matches one and only one
tag, and every character inside it. It was kindly provided by Martin
Honnen.
The element name itself is taken in line 2, its validity determined in a
function call in line 3 (function not shown), and the DOM element node
created and made a part of the document fragment in lines 5-7. If
the element can contain text and elements, an administrative procedure is
done in lines 8-12.
Then it's on to dealing with attributes in lines 13-15.
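
One detail in lines 13-15: the loop starts at j = 1, so if attrs is the flat [name, value, name, value, ...] array described at the top of the post, the first name is skipped and every pair is shifted by one. A small sketch of the same loop starting at index 0 (applyAttrs is an assumed name; the array is taken to start with a name):

```javascript
// Hypothetical helper: walk a flat [name, value, ...] array in steps of 2.
// elemNode is anything with a DOM-style setAttribute(name, value) method.
function applyAttrs(elemNode, attrs) {
  // "j + 1 < attrs.length" guards against a trailing name with no value.
  for (var j = 0; j + 1 < attrs.length; j += 2) {
    elemNode.setAttribute(attrs[j], attrs[j + 1]);
  }
}

// In a browser:
//   var table = document.createElement("table");
//   applyAttrs(table, ["id", "machines", "class", "noborders inred"]);
```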
 
Matthew Lock

Patient said:
Coding patterns for regular expressions is completely unintuitive, as far
as I can see.

Regular expressions are unintuitive because pattern matching is
unintuitive.

I can't recommend the following book enough. After I read the first 3
chapters I have never struggled with regex since:
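http://www.oreilly.com/catalog/regex/

Patient said:
[A-Za-z0-9\-]+

You can represent the above more compactly as [\w-]+ (bear in mind that \w
also matches the underscore, so the two are not strictly identical).

Patient said:
I believe this pattern conforms to the standard for attribute values: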

[,;'":!%A-Za-z0-9\s\.\-]+

That pattern tries to be more exclusive than inclusive, although I think
just about every character on the planet, including a newline, is
acceptable in an attribute value, at least the kind one might see in an
HTML document.

Don't forget the hash/pound symbol "#" for hex colour values,
like:

<body bgcolor="#ffffff">

Parsing HTML by hand with regex is notoriously difficult to get right.
If you are doing it to analyse HTML in the wild I would stick with
letting the browser's DOM parse it.
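
A rough sketch of what "letting the browser parse it" can look like, assuming the browser supports the innerHTML property on elements. parseWithBrowser is a hypothetical helper name, and the document object is passed in as a parameter only to keep the function self-contained:

```javascript
// Hypothetical helper: hand the markup to the browser instead of a regex.
// doc is the document object; html is the markup string.
function parseWithBrowser(doc, html) {
  var holder = doc.createElement("div"); // detached, never displayed
  holder.innerHTML = html;               // the browser's own parser runs here
  return holder;                         // holder.childNodes are the parsed nodes
}

// In a browser:
//   var holder = parseWithBrowser(document,
//       '<table id="machines" class="noborders inred"></table>');
//   holder.firstChild.getAttribute("id");   // "machines"
```

The parsed nodes can then be moved into the page with appendChild, so no attribute regex is needed at all.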

Good luck
 
osfameron

Matthew said:
I can't recommend the following book enough. After I read the first 3
chapters I have never struggled with regex since:
http://www.oreilly.com/catalog/regex/

Seconded. Of course, depending on your needs, an introductory chapter
on regexes in any Perl, JavaScript or similar book might do for you.
(Though if you're trying to parse HTML with regular expressions, you may
not fall into that category.)

Matthew said:
Parsing HTML by hand with regex is notoriously difficult to get right.
If you are doing it to analyse HTML in the wild I would stick with
letting the browser's DOM parse it.

Seconded. Actually, people tend to say it's impossible. I think the
O'Reilly book goes into why. You'd be better off writing an HTML parser
(which could of course make heavy use of regexes internally). This is
the advice that is regularly brought up on Perl newsgroups. (And bear
in mind that Perl hackers tend to love regexes, and love doing twisted,
clever things with them).

The advice to give up and use another parser (the browser's DOM, as
above) is a good idea.

(Don't give up on regular expressions though - for a certain class of
problems that don't necessarily include HTML, they are indispensable).
 
Fred Oz

Patient said:
Coding patterns for regular expressions is completely unintuitive, as far
as I can see. I have been trying to write script that produces an array
of attribute components within an HTML element.
[...]

Why are you parsing HTML? Are you reading HTML from somewhere,
then replacing the HTML with DOM create element commands?

Are you reading from the current document? If so, every element
has an "attributes" property that returns a collection of all the
attributes on that element.
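
A short sketch of reading that collection, assuming a DOM element whose attributes collection exposes length and indexed items with name and value (listAttributes is a made-up name):

```javascript
// Hypothetical helper: flatten el.attributes into [name, value, name, value, ...].
function listAttributes(el) {
  var out = [];
  for (var i = 0; i < el.attributes.length; i++) {
    out.push(el.attributes[i].name);
    out.push(el.attributes[i].value);
  }
  return out;
}

// In a browser:
//   listAttributes(document.getElementById("machines"));
```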

If you are dealing with HTML as ASCII text, how will you deal
with single word attributes such as "checked"?
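
To make the single-word case concrete, here is a sketch of a pattern in which the "=value" part is optional and the value may be unquoted. It is an illustration only: collectAttrsLoose is a made-up name, the input is assumed to be the text after the tag name, and reporting a bare attribute with its own name as the value is a choice made here, not something from the thread:

```javascript
// Hypothetical helper: attrText is the part of the tag AFTER the tag name.
function collectAttrsLoose(attrText) {
  // group 1: name; group 2: optional "=value" part;
  // group 4: double-quoted, group 5: single-quoted, group 6: unquoted value
  var re = /([A-Za-z][A-Za-z0-9\-]*)(\s*=\s*("([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g;
  var out = [];
  var m, value;
  while ((m = re.exec(attrText)) !== null) {
    if (m[2] === undefined) {
      value = m[1];                 // bare attribute such as "checked"
    } else if (m[4] !== undefined) {
      value = m[4];
    } else if (m[5] !== undefined) {
      value = m[5];
    } else {
      value = m[6];
    }
    out.push(m[1]);
    out.push(value);
  }
  return out;
}

var loose = collectAttrsLoose('type="checkbox" checked name=q1');
// loose is ["type", "checkbox", "checked", "checked", "name", "q1"]
```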

What's the point?
 
Matthew Lock

osfameron said:
Seconded. Actually, people tend to say it's impossible. I think the
O'Reilly book goes into why.

Yeah one of the reasons it's impossible is that keeping track of HTML
comments and possible nested comments requires a state machine. Other
things which are pretty difficult with regex are JavaScript blocks, and
attributes with escaped quotes in them.
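
A small illustration of how quickly the simple approach breaks, using the RE from line 1 of the original post: a ">" inside a quoted attribute value ends the match early. The second pattern is only a sketch of a quote-aware variant (the sample string is invented), and it still knows nothing about comments or script blocks:

```javascript
var sample = '<a title="x > y" href="z">';

// The RE from the original post stops at the first ">", even inside quotes.
var naive = sample.match(/<([^>]+)/);
// naive[1] is 'a title="x '

// Treat quoted strings as single units so a ">" inside them is skipped.
var quoteAware = sample.match(/<(([^>"']|"[^"]*"|'[^']*')+)/);
// quoteAware[1] is 'a title="x > y" href="z"'
```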
 
Matthew Lock

Patient said:
I am trying to write a completely client-side script that allows one to
create/write questions in making an examination and which allows a
test-taker to take the exam. The script clearly involves accessing the
filesystem, and the user will deal with that.

Be careful with a completely client-side approach to exams, as all the
answers will have to be stored in the test-taker's browser, making it
possible for the test-taker to cheat.

Patient said:
1) make an HTML Document Fragment (or root DIV element for browsers that
break on the DOM standard)
2) hang all containing nodes and text off that
3) find the code that must read the parts of the HTML text string and
make sense of it, short of doing a character-by-character reading of the
string

You have lost me somewhere, what do you want to do exactly? Allow the
exam writer to specify HTML code that will be attached to the document
at some stage?

Patient said:
Someone has done it (such as the programmers who wrote that part of the
code that parses the text for browsers). You are correct about single
word attributes, and this probably makes the construction of the regular
expression pattern enormously difficult, but does it make it
impossible?

Yes, but when the browser makers did it, they probably used a recursive
descent parser rather than regular expressions. Besides, just because
*some* programmers have done it, doesn't mean that it is within the
reach of you or me.

I would say that a "parser" that can parse real world HTML would be
practically impossible with just regular expressions.
 
Patient Guy

Matthew said:
Be careful with a completely client-side approach to exams, as all the
answers will have to be stored in the test-taker's browser, making it
possible for the test-taker to cheat.

Actually, I was more than a little confusing here by saying it is client-side only
because the browser's functionality (ability to render HTML, interpret
scripts, style the content) is basically being used as a stand-alone
application on the system.

What will be done here is that the teacher will create the exam (see next
response paragraph for details of the interface) on the computer. Then
the exam will be opened by the teacher or the examinee, and the examinee
will be set in front of the computer and take the examination.

The test-writing/creating feature is as follows: HTML coding is used to
produce standard form controls (textbox or textarea, radio, checkboxes,
buttons) that ask the teacher to indicate the type of question (multiple
choice, fill-in or short answer), the correct answer to the question (if
a choice type question), and a textbox for the question itself. The test
writer hits a button, the form input data is properly formatted (possibly
encrypted), and stored to permanent media. There are at least two levels
to the interface: one for general settings and options and a broad view
of the file being worked (list of questions), and the specific interface
level just described for composing the exam question.

On the test-taking side, the test file is accessed (possibly decrypted),
and presented in the format using a form (with controls) that accepts the
answer of the user. Only one question and its place for answer appears
on a screen, with standard 'next' and 'previous' buttons for the examinee
to go from question to question. A timer might be started either to
measure the amount of time an examinee takes to answer a question, or to
impose a limit on total exam time. Sure the timer might be defeated by
sophisticated users in various ways, but if it is really called for, I
can write features that timestamp the viewing/opening of a question and
its answering. This test module is not intended to withstand every assault
by the most sophisticated user. My original motivation for writing
this whole thing was as a tool to assist in the education of my 9-year
old daughter, who should use the computer for more than playing
"Spiderman 2."

When the examinee finishes, the test can be automatically scored if it is
one in which all the answers can be determined by the system. Besides
the examinee's score, I may also present the examiner the total time used
to take the test, as well as the time for each question, as this can be
an indicator of sticking points.

Matthew said:
You have lost me somewhere, what do you want to do exactly? Allow the
exam writer to specify HTML code that will be attached to the document
at some stage?

The exam writer will interact with a standard browser form (rendered in
HTML). The script will read in the form control settings from the HTML
form (including the textbox that contains text---I allow for the exam
writer to include tags to format his text, such as with bold,
super/subscript, etc.), and store the written exam content in a file on
disk, the file format of my own creation, much like any database program
creator makes a database (or data) file having its own specialized
format. I may even encrypt the disk-stored data with a simple encryption
algorithm (the key embedded in the script itself rather than given by the
user), assuming that it is really a concern.

The script on the test-taking side reads in the file and holds its
contents, then presents the questions in the HTML browser, formatting
appearance using HTML/CSS according to the examiner's options settings when
the test was created. Thus a block of the page (DIV, DocFrag) is
dynamically updated, and thus a function is necessary to build a document
fragment tree with element nodes and content.

[...] impossible?

Matthew said:
Yes, but when the browser makers did it, they probably used a recursive
descent parser rather than regular expressions. Besides, just because
*some* programmers have done it, doesn't mean that it is within the
reach of you or me.

I would say that a "parser" that can parse real world HTML would be
practically impossible with just regular expressions.

Based on the we-don't-want-to-even-think-of-going-there responses I am
getting about trying to build document fragments and plant them in an
existing document, I am thinking of trying another approach. The advice
here is that I should use the browser's own built-in capabilities of
presenting HTML, meaning that I have to use document.write() statements,
right? Well document.write() statements are explicitly or implicitly
preceded by document.open() and followed by document.close() statements,
correct? And they erase any of the previous contents of a window,
correct? So if I have a browser's presentation area ("client window
area"), I have to figure a way to make multiple windows out of it, with
the content of some of the windows being static (that is, content I
always want to be present on screen) and the content of one or more
other windows being dynamic. I think the use of HTML frames works in this
case, and I'll just have to put a <noframes> warning that the
mini-application will not work on frames-incapable browsers.

What do you think? Is that a reasonable solution, or is it worth it to
write a simple HTML parser that uses ONE or MORE regular expressions to
divide up the work of reading HTML?
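
For comparison, document.write() is not the only way to get new content on screen: the DOM methods already used in the code fragment earlier in the thread (createElement, appendChild) can swap the contents of one element while the rest of the page stays put, with no frames involved. A minimal sketch, with replaceContents and the "question" id as assumed names:

```javascript
// Hypothetical helper: empty one container element, then show newNode in it.
// Nothing outside the container is touched.
function replaceContents(container, newNode) {
  while (container.firstChild) {
    container.removeChild(container.firstChild);
  }
  container.appendChild(newNode);
}

// In a browser, with <div id="question"></div> somewhere in the page:
//   var p = document.createElement("p");
//   p.appendChild(document.createTextNode("Question 2 of 10"));
//   replaceContents(document.getElementById("question"), p);
```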
 
