Regex (?(?{CODE})) has too many branches


James Taylor

I have a web page in $page and I'm trying to scrape out a
particular table of information from it. There are many
tables in the page and some of them are nested. The table
I'm interested in is unique in that it satisfies both of the
following two conditions:

1. It does not contain any nested tables
2. It does contain cells with class="whiteHeading"

I feel it ought to be possible to extract the table with a
single regex and, even if only as a learning exercise, I'd
like to know how to achieve it. This is what I've got so far:

my ($table) = $page =~ m{
    (?> <table\b.*?> (.*?) </table> )   # Get table without backtracking
    (?(?{
        $1 !~ /<table\b/i &&            # Must not contain another table
        $1 =~ /\bclass="whiteHeading"/i # and must contain white headings
    }) | _FAIL_ )
}six;

Unfortunately, this generates the following error:

/\bclass="whiteHeading"/: Switch (?(condition)... contains too many
branches at myprog line 66.

Is my mistake obvious? How else can I match a portion that
does NOT contain a particular substring?
Thanks.
 

James Taylor

James said:
my ($table) = $page =~ m{
    (?> <table\b.*?> (.*?) </table> )   # Get table
    (?(?{
        $1 !~ /<table\b/i &&            # Mustn't contain a table
        $1 =~ /\bclass="whiteHeading"/i # Must have white headings
    }) | _FAIL_ )
}six;

Unfortunately, this generates the following error:

/\bclass="whiteHeading"/: Switch (?(condition)... contains too
many branches at myprog line 66.


The regex engine is not re-entrant, [snip]

Use index().


Aha! Thanks for that. After spending some time trying to work
out why index() wasn't working, I eventually tried index
without brackets and got the following to compile and run:

my ($table) = $page =~ m{

    # Get a table without backtracking
    (?> <table\b.*?> (.*?) </table> )

    # Put a lowercase copy of its content in $content
    (?{ local $content = lc $1; })

    # Check it does not contain another table
    (?(?{ -1 == index $content, '<table' }) | --FAIL-- )

    # Debug message
    (?{ print "Top level table found\n"; })

    # Check it does contain a white heading
    (?(?{ -1 != index $content, 'class="whiteHeading"' }) | --FAIL-- )

    # Debug message
    (?{ print "Matched\n" })

}six;

This prints "Top level table found\n" 25 times and falls through
without matching. Further investigation reveals that at the point
$1 is assigned to $content, it is still set to the result of
a *previous* match from further up in my program! This is why
both assertions fail to find what they're looking for, of course.
However, the Camel book (3rd ed) demonstrates (on page 213)
that code blocks *can* access backreferences from earlier in
the current match. So, what am I doing wrong?
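
Here's the minimal test of the Camel's claim, for anyone who wants
to try it on their own perl. On a modern perl it prints "inside: foo";
whether it does so on this 5.005_03 port is exactly what's in question:

    # A code block should see the $1 captured earlier in the
    # *current* match attempt (Camel, 3rd ed, p213).
    'foobar' =~ m{ (foo) (?{ print "inside: $1\n" }) bar }x;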

I'm starting to worry that this may be a bug in my particular
copy of perl. I'm running the RISC OS port which reports as:

This is perl, version 5.005_03 built for arm-riscos

Copyright 1987-1999, Larry Wall

RISC OS port by Andrew Black and Nicholas Clark (1998),
Steve Ellacott (1996), Luke Taylor (1995) and Paul Moore (1990).
Release 1.13

Unfortunately, there isn't a more up to date version available
for RISC OS (at least not one that works fully) so I'm stuck.
Can anyone help?
Thanks.
 

Big and Blue

James Taylor ([email protected]) wrote on MMMMCCCXLI September
`' However, the Camel book (3rd ed) demonstrates (on page 213)
`' that code blocks *can* access backreferences from earlier in
`' the current match. So, what am I doing wrong?

(Wild guessing here...)

Don't you need \1 rather than $1 for that?
 

James Taylor

(Wild guessing here...)

Don't you need \1 rather than $1 for that?

I believe the only place where it is correct to use \1 is in the
non-code parts of the match pattern. The insides of (?{ }) are
normal Perl code and so use the $1 syntax.
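
A minimal illustration of the difference (assuming a reasonably
modern perl):

    # \1 is a backreference *within* the pattern; $1 is ordinary Perl,
    # usable inside (?{ }) or after the match completes.
    'abab' =~ m{ (ab) \1                              # \1 re-matches "ab"
                 (?{ print "code block sees: $1\n" }) # prints "ab"
               }x;
    print "after the match: $1\n";                    # also "ab"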
 

James Taylor

I'd write that as (untested):

m{ <table\b [^"'>]* (?: (?: "[^"]*" | '[^']*' ) [^"']*)* >
   [^<c]* (?: (?: <(?!table) | c(?!lass="whiteHeading") ) [^<c]* )*
   </table>
 }xi;


Wow, that looks pretty clever, and it gets away without using
any code blocks too, which must be an efficiency improvement.
So that I can be sure I've understood what this is doing,
I'm going to number each line and comment it. Perhaps you'd
be kind enough to point out any misunderstandings. Thanks.


 1  m{
 2    <table\b                # Start of begin table tag
 3    [^"'>]*                 # inside of tag but avoiding quotes
 4    (?:                     # zero or more of...
 5      (?:                   # either...
 6        "[^"]*"             # a double quoted attribute value
                              # which may include a > char
 7      |                     # or...
 8        '[^']*'             # a single quoted attribute value
                              # which may include a > char
 9      )
10      [^"']*                # more tag between quoted bits (but we must
                              # add a > char to stop it running away)
11    )*
12    >                       # End of begin table tag
                              # Table content follows (should be captured):
13    [^<c]*                  # content between tags or possible class=
14    (?:                     # zero or more of...
15      (?:                   # either...
16        <                   # a tag
17        (?!table)           # which isn't a begin table tag
18      |                     # or...
19        c                   # a 'c'
20        (?!lass="whiteHeading")  # which isn't a class="whiteHeading"
21      )
22      [^<c]*                # more non-tag non-c content
23    )*
24    </table>                # End table tag
25  }xi;
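
As a quick sanity check, here's a hypothetical one-line table with
no whiteHeading cell in sight; the pattern matches it, which is
relevant to the criteria question I raise below:

    my $html = '<table border="1"><tr><td>plain</td></tr></table>';
    print "matched\n" if $html =~ m{
        <table\b [^"'>]* (?: (?: "[^"]*" | '[^']*' ) [^"']*)* >
        [^<c]* (?: (?: <(?!table) | c(?!lass="whiteHeading") ) [^<c]* )*
        </table>
    }xi;
    # Prints "matched" despite the absence of class="whiteHeading".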


I've never seen a > inside an attribute value that hasn't been
converted to a &gt; entity. Wouldn't it be an error anyway?
To guard against missing close quotes on attribute values I would
prefer to regard the first > as a tag terminator regardless of
whether all quotes have been closed correctly. I could therefore
simplify the regex by removing lines 4-11. Would this be sensible,
or would it actually contravene the formal syntax rules?
It'll fail to match if there's a `class="whiteHeading"'
outside a tag

Yes, but this is never likely to happen on the specific page
I'm scraping, so I'm happy to fudge over that.
something like `<!-- <table> -->' in the text,

If commented out portions of HTML become a problem, I can
simply strip out the comments before applying the table
finding regex under discussion.

The main problem I see with this regex is that it seems to
capture only tables that do NOT contain class="whiteHeading",
but my two criteria for selecting the right table are that:

1. It does not contain any nested tables
2. It DOES contain cells with class="whiteHeading"

Is there a simple way to ensure class="whiteHeading" *is* present
within the table whilst still using your clever trick for
avoiding nested tables? I cannot simply change line 20 into a
positive look-ahead because there are other class="" attributes
which have nothing to do with my selection criteria. I suppose I
could flag the occurrence of a class="whiteHeading" using yet
more code blocks like this:

                              # Table content follows (should be captured):
13    [^<c]*                  # content between tags or possible class=
13a   (?{ local $found = 0 }) # whiteHeading not yet found
14    (?:                     # zero or more of...
15      (?:                   # either...
16        <                   # a tag
17        (?!table)           # which isn't a begin table tag
17a     |                     # or...
17b       class="whiteHeading"     # a class="whiteHeading"
17c       (?{ $found = 1 })        # which we flag as found
18      |                     # or...
19        c                   # a 'c'
20        (?!lass="whiteHeading")  # which isn't a class="whiteHeading"
21      )
22      [^<c]*                # more non-tag non-c content
23    )*
23a   (?(?{ $found }) | --FAIL-- ) # Backtrack unless whiteHeading found


However, I'm reluctant to go back to code blocks now that
you've shown me how to avoid them. Is there a better way?
 

James Taylor

James said:
I've never seen a > inside an attribute value that hasn't been
converted to a &gt; entity. Wouldn't it be an error anyway?

No. You seldom need to escape a > in HTML. About the only
time you need to escape a > is in the ]]> token - and there's no
mainstream browser that can handle <![INCLUDE [ ... ]]> in a
meaningful way anyway.

<*whoosh*> That's the sound of that paragraph going way over my head.
So, you're willing to mismatch correctly written HTML in order
to deal with incorrectly written HTML?

Well, I'm just trying to match the sophistication of the
solution to the size of the problem and save some complexity
where it isn't needed. It's a case of pragmatism over perfection.
A sledgehammer's great for building railroads, but not
appropriate for cracking nuts. (If you see what I mean.)
I'd worry more about using whitespace around the equal sign in

class="whiteHeading"

or that single quotes (or no quotes at all) are used.
Or 'class' in capitals.

Yes, I accept all your concerns, and if I were writing code
for other people to use I'd put more time into making it
work in all circumstances. However, I'm in the lucky position
of writing only for myself and I can rewrite as necessary if
the details of the web page I'm scraping should change.
Indeed, even if I were using a perfect HTML parser, it would
still be impossible to guard against the page format changing
sufficiently to break the seek and scrape code, so I'll have
to monitor it and update it as needed anyway.
But an attribute might contain '<!--', and another attribute might
contain '-->'. What's in between is not an HTML comment.

Shocking! It may be legitimate to put unescaped angle brackets
in attribute values but, frankly, anyone who does so is asking
for trouble. Fortunately, I can modify my scrape code as needed,
but others may not have that luxury. Anyone who actually produces
HTML with unescaped angle brackets in unusual places is not
writing robust defensive code and probably deserves what they get.
Anyway, I'd use something like (untested):


m{ <table\b [^"'>]* (?: (?: "[^"]*" | '[^']*' ) [^"']*)* >
   [^<]* (?: <(?!table) [^<]* )*
   class="whiteHeading"
   [^<]* (?: <(?!table) [^<]* )*
   </table>
 }xi;

Okay, I tried that but got the following output:

Fatal signal received: EMT trap
A core dump will now follow ...

stack backtrace:

pc: d700c sp: f257c __unixlib_internal_post_signal()
pc: 66530 sp: f25bc regcppush()
pc: 6810c sp: f2678 regmatch()
pc: 6810c sp: f2734 regmatch()
pc: 6810c sp: f27f0 regmatch()
pc: 6810c sp: f28ac regmatch()
pc: 6810c sp: f2968 regmatch()
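
(For reference: the same pattern compiles and matches on a modern
perl, which suggests the crash is specific to this elderly build.
A minimal harness with made-up data, capturing parens added for
extraction:

    my $page = '<table><tr><td class="whiteHeading">x</td></tr></table>';
    if ($page =~ m{ ( <table\b [^"'>]* (?: (?: "[^"]*" | '[^']*' ) [^"']*)* >
                      [^<]* (?: <(?!table) [^<]* )*
                      class="whiteHeading"
                      [^<]* (?: <(?!table) [^<]* )*
                      </table> ) }xi) {
        print "matched: $1\n";
    }
)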
 

Tad McClellan

James Taylor said:
James said:
I've never seen a > inside an attribute value that hasn't been
converted to a &gt; entity. Wouldn't it be an error anyway?

No. You seldom need to escape a > in HTML. About the only
time you need to escape a > is in the ]]> token - and there's no
mainstream browser that can handle <![INCLUDE [ ... ]]> in a
meaningful way anyway.

<*whoosh*> That's the sound of that paragraph going way over
my head. I assume that <!INCLUDE> is an SGML thing.


Yes, it is called a "marked section".

Is it
also relevant to HTML?


Since HTML is an "SGML application", all SGML things apply to HTML
things, despite the fact that the most common processors (ie. browsers)
are not spec-compliant.



Not for folks that pay attention to specifications.

However, such folks are exceedingly rare in the WWW realm...

It may be legitimate to put unescaped angle brackets
in attribute values but, frankly, anyone who does so is asking
for trouble.


Not if they are using code that complies with the specification.

Anyone who actually produces
HTML with unescaped angle brackets in unusual places is not
writing robust defensive code and probably deserves what they get.


Firstly, HTML is _data_, not code.

And any HTML-processing code that cannot handle unescaped angle brackets
in unusual places is not robust defensive code, and anyone who uses
such code probably deserves what they get. :)

And it kinda sounds like you are leaning toward producing such code...

but I have a feeling that something more fundamental is awry.


Attempting to treat a "context free grammar" as if it was a
"regular grammar" is fundamentally awry.

IOW, attempting to use regexes rather than a real parser is
what's complicating things.
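
For the record, here's a sketch of what the parser-based version can
look like, using the event API of the CPAN HTML::Parser module. The
bookkeeping here is illustrative, not canonical: for every currently
open table it tracks the content so far, whether a whiteHeading cell
has been seen, and whether a nested table has been seen.

use strict;
use warnings;
use HTML::Parser;

my $page = do { local $/; <> };   # read the HTML from a file or stdin

my @open;    # one [content, has_heading, has_nested] per open <table>
my $table;   # first table with a whiteHeading cell and no nested table

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr, $text) = @_;
        if ($tag eq 'table') {
            $_->[2] = 1 for @open;    # every enclosing table is now nested
            push @open, [ '', 0, 0 ];
            return;
        }
        $_->[0] .= $text for @open;
        if (($attr->{class} || '') eq 'whiteHeading') {
            $_->[1] = 1 for @open;
        }
    }, 'tagname, attr, text' ],
    end_h => [ sub {
        my ($tag, $text) = @_;
        if ($tag eq 'table') {
            my $t = pop @open or return;
            $table = $t->[0] if !defined $table && $t->[1] && !$t->[2];
            return;
        }
        $_->[0] .= $text for @open;
    }, 'tagname, text' ],
    default_h => [ sub { $_->[0] .= $_[0] for @open }, 'text' ],
);

$p->parse($page);
$p->eof;
print defined $table ? "$table\n" : "no match\n";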
 

James Taylor

James said:
Abigail said:
James Taylor wrote:

I've never seen a > inside an attribute value that hasn't been
converted to a &gt; entity. Wouldn't it be an error anyway?

No. You seldom need to escape a > in HTML. About the only
time you need to escape a > is in the ]]> token - and there's no
mainstream browser that can handle <![INCLUDE [ ... ]]> in a
meaningful way anyway.

<*whoosh*> That's the sound of that paragraph going way over
my head. I assume that <!INCLUDE> is an SGML thing. Is it
also relevant to HTML? Where can I read up on this?

Since HTML is an SGML application, anything SGML is relevant
to HTML. Not that 99% of the web authors or browser programmers
care about that though.

My O'Reilly HTML book, "HTML & XHTML: The Definitive Guide"
(4th ed.), has this to say on page 9:

"The problem with SGML is that it is so broad and
all-encompassing that mere mortals cannot use it. Using SGML
effectively requires very expensive and complex tools that
are completely beyond the scope of regular people who just
want to bang out an HTML document in their spare time. As a
result, HTML and other language standards adhere to some,
but not all SGML standards, eliminating many of the more
esoteric features so that HTML is readily usable and used."

Even the W3C advise against things like <![INCLUDE [ ... ]]>:

http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.3.3

So, if the HTML standards themselves and other authorities
who I respect say that HTML doesn't have to be, and isn't in
practice, a complete SGML application, then I feel it's not
entirely unreasonable for me to take that advice on board.

I certainly feel that for the current, one-off, just for myself,
scrape of a web page, it would not be appropriate for me to
spend weeks or months implementing a perfect SGML parser.
Nor do I feel my tiny self-sufficient script needs to start
C<use>ing packages of parser modules, even assuming I could
successfully install them on RISC OS without a compilation
which is doubtful. All I need is a simple but effective
regex for extracting the table I desire from a particular
page *as it stands now*. If I run my program again in the
future and discover the page has changed enough to break the
scrape, then I'll just alter my program without fuss.

I thank you very much for your help with this task, and on
another level I appreciate your perfectionism; however, it is
also important to keep a sense of perspective and match the
size of the solution to the size of the problem.
In fact, properly defining what you want to match is 95%
of writing your regex.

I couldn't agree more and, if I've not been clear, it's my
fault, sorry. Here's a quick attempt at restating it:

Find and capture the first occurrence of text between
'<table' and '</table>' which does not contain another
'<table' but does contain 'class="whiteHeading"'.
There is no need to interpret the input at any higher
level than a sequence of bytes.
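
Taking that restatement literally, a byte-level sketch using
nothing but index() and substr() would be (untested, and the
names are mine):

    my $table;                  # first qualifying table, if any
    my $pos = 0;
    while ((my $start = index($page, '<table', $pos)) >= 0) {
        my $end = index($page, '</table>', $start);
        last if $end < 0;                    # no close tag left: give up
        my $inner = index($page, '<table', $start + 6);
        if ($inner >= 0 && $inner < $end) {
            $pos = $inner;                   # nested: restart at inner table
            next;
        }
        my $body = substr($page, $start, $end - $start);
        if (index($body, 'class="whiteHeading"') >= 0) {
            $table = $body;                  # both criteria met
            last;
        }
        $pos = $end + length '</table>';     # keep scanning after this table
    }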
Huh? They're programming against a rigorously defined standard.
And following it. You can't be more defensive than that.

Well, I beg to differ. Many standards leave plenty of room
for choice and interpretation, and are nowhere near as
rigorous as one might wish. Standards bodies often leave
wiggle room as a *feature*. In this particular case, it is
clear to me that angle brackets within quoted values could
just as legitimately be encoded as &lt; &gt; entities and
that doing so would be more robust than not doing so. Given
a completely free choice, a good programmer would anticipate
the difficulty someone else might have in parsing the HTML
(especially given the pragmatic concerns over missing
quotes) and would therefore encode the angle brackets.
Anyway, if you want to cut corners, go ahead. Don't expect
my sympathy. Or help. The web is already ruined enough by
people with your attitude.

I think that's a bit harsh, and I think you misunderstand my
attitude. I wouldn't cut corners in production code or in
my public HTML output because doing so would impoverish
society as a whole. However, I *would* cut corners in code
I write as a quick hack for my own purposes. Perl makes the
easy things easy so that you can get your job done. If you
were honest, I think you'd admit to cutting corners too.

As for what ruins the web, you've really pressed a trigger
for me. I abhor the mindless way in which the accessibility
of the web is being eroded by people who unthinkingly use
the most maximally fragile HTML, JavaScript, CSS, Flash, and
any other bleeding edge technology they get their hands on
without considering its adverse effects in older or otherwise
less capable browsers. In stark contrast I take extreme care
only to use the most robust, minimally fragile, gracefully
degrading, and pragmatic code possible. As a result my software
and web sites are fully accessible and look good in *all*
browsers. I've given up trying to educate the bleeding edge
crowd because they're too braindead and recalcitrant to help.
No. And I don't get a core dump.

Hmmm, curious. I'd better ask in another thread.

Thanks for all your help.
 

James Taylor

Yes, it is called a "marked section".


Since HTML is an "SGML application", all SGML things apply
to HTML things, despite the fact that the most common
processors (ie. browsers) are not spec-compliant.

Well, from what I was able to find on this matter, opinion seems
to favour the view that HTML is, for practical purposes, only
a subset of SGML. Marked sections such as <![INCLUDE [ ... ]]>
appear to be specifically deprecated/discouraged in HTML.
Publishing HTML with SGML includes in it is asking for trouble.
Not for folks that pay attention to specifications.

No, my point is that I would find it shocking to see HTML
written like that because, whether the author is an SGML
expert or not, it demonstrates a decision not to write
robust HTML likely to work in every browser, or possibly a
blindness to such practicalities. We live in a world full of
far too much fragile HTML as it is, and we should expect
higher standards of interoperability for the sake of our
freedom. There is a social responsibility to be as inclusive
as possible with HTML markup that few website authors seem
to care about. For a website author to defiantly claim
"my site is SGML compliant so it must be your browser that's
broken" is as ignorant as him saying "my site works in the
latest version of MSIE, so I don't care about your browser".
I could get really cross with people like that because, not
only are they turning the web into shit for those of us not
using the latest browser, but they are also impoverishing
all minority platforms to the benefit of monopolies which
have sufficient finances to keep up with the ever increasing
development costs. Minority platforms continue to die away
as the monopoly becomes ever stronger. It's like watching
people merrily destroying a rainforest to replace its rich
diversity with a rat infested dump. Grrr... I could get all
RMS about this. We're raising the bar of complexity for
little gain and throwing away our technological freedoms
without thinking. Soon we'll all be using the same bloated
buggy browser and talking in NewSpeak!
HTML is _data_, not code.

I realise that HTML is not a programming language, but the
term "code" is widely used and understood as a description
of it. I was lax, perhaps, but pedantry tends to have a
greater negative effect on discourse.
attempting to use regexes rather than a real parser is
what's complicating things.

The HTML parsers I've looked at would all involve more work
to use for this specific task and would be slower to run too.
This is what I've settled on, and it works quite well:

my $table;
while ($page =~ m{ (?> <table\b [^>]* > (.*?) </table> ) }xsig) {
    $table = $1;                                # candidate table content
    last if $table !~ /<table\b/i &&            # no nested table
            $table =~ /class="whiteHeading"/i;  # has a whiteHeading cell
}
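
A self-contained check with made-up data (a nested pair of tables
followed by the target) behaves as intended:

    my $page = '<table><tr><td><table><tr><td>inner</td></tr></table>'
             . '</td></tr></table>'
             . '<table><tr><td class="whiteHeading">Price</td></tr></table>';

    my $table;
    while ($page =~ m{ (?> <table\b [^>]* > (.*?) </table> ) }xsig) {
        $table = $1;
        last if $table !~ /<table\b/i &&
                $table =~ /class="whiteHeading"/i;
    }
    print "$table\n";   # <tr><td class="whiteHeading">Price</td></tr>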

Thanks to all.
 

James Taylor

Warning: This thread is seriously off-topic now.
In fact, I'll change the subject to reflect this.


Furthermore, I cannot imagine a "good, anticipating programmer"
writing &lt; instead of '<' inside quotes for the "difficulties"
someone might have parsing HTML, and then leaving off the quotes.

Perhaps, but the parser has to cope with HTML from both experts
*and* amateurs. To cope with missing quotes from amateurs,
it might choose to treat the first > as the tag terminator
then look to see if it can make sense of the tag contents.

It occurs to me that you may be confusing this hypothetical
parser with one that I would write, but that's not my point.
You may also be confusing the poor HTML that the parser is
trying to defend against as something I might write, but
that isn't my point either.

My point is that anyone writing HTML for public consumption
must be mindful of the fact that it will be interpreted by
all manner of parsers; some good, some bad, some good but old,
some new but poor, and some that for pragmatic reasons are
operating in quirks mode to defend against poor HTML. So,
as long as the HTML author doesn't produce invalid HTML,
they should make sensible choices to guard against likely
faults in HTML parsers (unless they don't give a damn about
the web in the first place, of course).

I really hate that attitude. Instead of blaming the people
writing buggy browsers, or other bad parsers, you blame the
people writing HTML!

That's because web developers should know better and anyway
can make corrections easily, whereas on many minority
computing platforms there is no choice of web browser. The
user has to use what's available, or write their own which
is usually not practical or even possible.

Given that the standards allow a free choice of whether to
write angle brackets in attribute values either raw or as
character entities, it doesn't make sense to mindlessly
choose the technique that will cause *more* problems rather
than less for those receiving your HTML. This is just one
example of where a socially conscious choice can improve the
web for everyone. There are plenty of other areas where
careful selection of the correct technique can make a big
difference to accessibility and usability, but it requires
the HTML author to have reasonable clue about the chronology
of the introduction of new technologies, how widespread
support for them is, and how well older browsers degrade.

Which of the following techniques would *you* choose:

1. Encode smart quotes as:
   a) Unicode entities like &#8217; etc.
   b) Windows codepage 1252 entities like &#146;
   c) convert them to ASCII ' and &quot;

2. Site navigation:
   a) superkewl Flash app the user has to learn first
   b) javascripted rollover images with no alt text
   c) static images with alt text and plain href links
   d) normal textual links

3. Links that you want the user to open in another window:
   a) using a javascript: scheme so you can position the
      window and strip it of the normal controls
   b) using <a href="whatever" target="_blank">
   c) normal href links so the user can make their
      own choice about whether they want a new window

4. Glossy company branding to impress people:
   a) a Flash splash page with wizzy animations (after all nobody
      still uses dial-up, and who wants to get indexed anyway)
   b) an animated GIF that constantly draws the eye
   c) a static image of the company logo in one corner

5. Named anchors within a page:
   a) use <div id="name"> and to hell with older browsers
   b) use <a name="name"> which works everywhere

6. Headings, subheadings, and bold:
   a) use <div class="head">, <div class="subhead">
      and <span class="emphasis">
   b) use <h1 class="head">, <h2 class="head">
      and <b class="emphasis">

7. Multimap.com-style application where the positional
   relationship between page elements is important:
   a) use positional CSS for everything
   b) use HTML tables

8. An HTML form:
   a) with a javascript button to submit it
   b) with a real submit button

I could probably go on to fill a book with a list of these
sort of choices, but then I'm an experienced web developer
with sufficient clue. The vast majority of kiddie web
deeziners out there would be completely oblivious to the
existence of a choice, and anyway would pick (a) from every
selection just because it's the newest wizzy technology that
gives them the maximum scope for creativity, "so it *has* to
be the right choice doesn't it". It would never even occur
to them that using minimal new technology to achieve their
goals is better than using the maximally new and fragile
technique. Such deeziners have little understanding of what
they're doing (beyond the use of whatever wysiwyg editor
they're using) and they care even less about social niceties,
such as ensuring accessibility to the widest audience,
keeping their HTML and graphics small, neat and efficient
for the benefit of dial-up modem users, web caches, etc, or
allowing people to view the site on any browser, at any font
size, or in any window size. They don't care about allowing
people (or bots) to automatically crawl and scrape the site,
in fact they probably think that's a *bad* thing and would
prefer everyone to enter their site only from the front page
so they can throw the right combination of popup advertising
at the hapless suckers!

Not only does the appallingly fragile construction of most
websites reduce the general quality of the web, but it also
imposes a pressure of extinction on minority browsers and
platforms that don't have sufficient market share and
financial muscle to keep up with the grubby complexity that
results from this. Furthermore, it raises the barrier for
entry to anyone wishing to write their own browser, crawler,
or other web client. Despite being a competent Linux user
and fan, I still do the majority of my work on an alternative
platform (RISC OS) because its GUI usability *completely*
outclasses anything available on Linux. (Of course, I have
it networked to my Linux box for the best of both worlds.)
Alternative platforms have much to offer and, just as we
should look after the bio-diversity of the rainforest, we
should avoid needlessly killing off computing platforms in a
mindless lust for the latest kewl thing, otherwise we'll
look back and wonder why we didn't see it coming when some
megacorp owns the world and there are no freedoms left.

So you see, you might think it's a small thing, but when I
see someone advocating the use of fragile markup (needlessly)
in the full knowledge that some browsers won't cope with it
and suggesting that browsers should just get fixed and
upgraded, I hope you now see why I oppose this socially
harmful and myopic attitude as a matter of utmost principle.
 

axel

James Taylor said:
Warning: This thread is seriously off-topic now.
In fact, I'll change the subject to reflect this.
Which of the following techniques would *you* choose:

Some of your [snipped] examples have nothing to do with HTML as
such, but just various things in webpages.
I could probably go on to fill a book with a list of these
sort of choices, but then I'm an experienced web developer
with sufficient clue. The vast majority of kiddie web
deeziners out there would be completely oblivious to the
existence of a choice, and anyway would pick (a) from every
selection just because it's the newest wizzy technology that
gives them the maximum scope for creativity, "so it *has* to
be the right choice doesn't it". It would never even occur

That is their problem... as their clients may soon realise.
to them that using minimal new technology to achieve their
goals is better than using the maximally new and fragile
technique. Such deeziners have little understanding of what
they're doing (beyond the use of whatever wysiwyg editor
they're using) and they care even less about social niceties,
such as ensuring accessibility to the widest audience,
keeping their HTML and graphics small, neat and efficient
for the benefit of dial-up modem users, web caches, etc, or
allowing people to view the site on any browser, at any font
size, or in any window size. They don't care about allowing
people (or bots) to automatically crawl and scrape the site,
in fact they probably think that's a *bad* thing and would
prefer everyone to enter their site only from the front page
so they can throw the right combination of popup advertising
at the hapless suckers!
Not only does the appallingly fragile construction of most
websites reduce the general quality of the web, but it also
imposes a pressure of extinction on minority browsers and
platforms that don't have sufficient market share and
financial muscle to keep up with the grubby complexity that
results from this.

Is that not a reason to keep to standards?
Furthermore, it raises the barrier for
entry to anyone wishing to write their own browser, crawler,
or other web client. Despite being a competent Linux user
and fan, I still do the majority of my work on an alternative
platform (RISC OS) because its GUI usability *completely*
outclasses anything available on Linux. (Of course, I have
it networked to my Linux box for the best of both worlds.)
Alternative platforms have much to offer and, just as we
should look after the bio-diversity of the rainforest, we
should avoid needlessly killing off computing platforms in a
mindless lust for the latest kewl thing, otherwise we'll
look back and wonder why we didn't see it coming when some
megacorp owns the world and there are no freedoms left.

Which computing platforms are being killed off? Hardly Solaris -
I have both Sparc and Intel editions at home and they run quite
well without regard to any HTML standard. HP-UX? A system I would
run Oracle on, not a web browser.
So you see, you might think it's a small thing, but when I
see someone advocating the use of fragile markup (needlessly)
in the full knowledge that some browsers won't cope with it
and suggesting that browsers should just get fixed and
upgraded, I hope you now see why I oppose this socially
harmful and myopic attitude as a matter of utmost principle.

No. If a browser cannot cope with good markup - it is the
fault of the browser. Being able to cope with bad markup is
a plus sign.

I design the webpages from my own site (no, I am not going to
make a plug for it as it would be of little interest to anyone)
to be Lynx viewable - except those containing photographs.

Axel
 

James Taylor

James said:
Which of the following techniques would *you* choose:

Some of your [snipped] examples have nothing to do with HTML as
such, but just various things in webpages.

Yes, but they illustrate my point that a good web developer
should write defensively (and that very few people do so).
That is their problem... as their clients may soon realise.

Agreed, although their clients generally have less clue than
they do and only bother to test the site on MSIE. The
clients certainly don't bother to look "under the hood" to
check whether the site has been defensively coded.
Is that not a reason to keep to standards?

Sure. I'm certainly not advocating deviation from any standards.
I'm advocating creating pages *within* the standards in such a
way that degrades gracefully, and uses the minimum technology
to achieve the requirements of the site. I'm suggesting
people learn a social conscience and try to be inclusive to
older browsers and simpler hand-coded web clients etc. I'm
saying don't make it impossible for anything but the latest
bleeding edge bloatware browser to access, otherwise that'll
be the only browser that gets sufficient development time to
keep up and all the other platforms will perish. People have
been compounding the Microsoft monopoly like that for far too
long as it is, and the result is that a crap system thrives
while excellent alternative systems wither on the vine.
Which computing platforms are being killed off?

Well, personally I use RISC OS. It blows most other systems
out of the water for sheer fluid productivity, but shortage
of software development has left it struggling to keep up
with modern web technology. Not being able to access many
important websites has driven people away from the platform,
reducing the market share, thus reducing the potential sales
for developers so they develop for other systems instead,
leading to a shortage of developers and thus a vicious
circle of decline. One of the primary forces behind my
beloved platform's decline is the thoughtless use of newer
web technology by most web designers, and by thoughtless I
mean things like javascript sniffers that test for MSIE and
NN then raise an error if it doesn't match and also don't
have any alternative content between the <noscript> tags.
The idiots who do that kind of thing not only harm their
own site and the web in general, but their social
irresponsibility also wounds all minority browsers and
platforms (including mine) and for that they should be shot.
Hardly Solaris - I have both Sparc and Intel editions at home
and they run quite well without regard to any HTML standard.
HP-UX? A system I would run Oracle on, not a web browser.

If you think "alternative" necessarily means a unix derivative,
you've been leading a very insular lifestyle.
No. If a browser cannot cope with good markup - it is the
fault of the browser. Being able to cope with bad markup is
a plus sign.

I understand this point of view, but it overlooks the social
responsibility web authors should feel to be as inclusive as
possible when publishing. Perhaps my command of English is
not sufficient to be persuasive, or perhaps I'm just not making
myself clear, but I can't believe I'm the only web developer
to understand this essential principle. God I hope not!
I design the webpages from my own site (no, I am not going to
make a plug for it as it would be of little interest to anyone)
to be Lynx viewable - except those containing photographs.

Good start.
 

James Taylor

This is one of the most remarkable pieces of bullshit I've
ever seen.

Really? Is that a good thing? ;-)
The only reason people (used) to be able "to get away" with
not using a closing quote was the losing coding talents of
Marc Andreessen and his seven little dwarves.

What makes you think Mosaic derivatives are the only
browsers that behaved that way?
Well, it was you who didn't know you could use a '>' inside
an attribute value, and it was you who said HTML authors would
"get what they deserve" if they used '>' inside an attribute
value (instead of &gt;), so you were giving me all the reasons
to assume you would write an HTML parser in such a way.

You're entitled to your opinion. I could defend myself
by saying that if I were writing a proper parser I'd start
with the standards. But, as I said, THIS IS NOT MY POINT.
Well, there *have* been browsers that *didn't* do entity
expansion inside attribute values, so if you would
"defend" against those, you wouldn't write '&gt;' inside
an attribute value, but '>'.

Sheesh! Sounds like the only safe thing to do is avoid the use
of '>' in any form within attribute values. Thankfully, I don't
believe there are any situations where it is actually
necessary to put angle brackets in HTML attribute values.
And that's how bad web browsers stay in business. Why
bother fixing your shitty product, if everyone else will
fix their correct code?

I don't understand why you're being so unforgiving towards
browser producers. On many platforms there is little choice
of browser and, when a better one comes along, users have
good reason to be grateful *even* if it's not perfect. In
fact, I don't believe any browser on any platform is ever
likely to be perfect. That might be an irritating wrinkle in
reality for you, but a reality nevertheless.

I'm not saying browsers shouldn't get improved, I'm just saying
that HTML authors should be considerate. I hate the attitude
that says "damn anyone stuck using a less capable browser, as
I don't intend to make the slightest effort to include them".
People with that attitude have ruined the web, and supported
the big bloatware browsers along with the monopolistic
platforms that run them to the detriment of any alternatives.
It's myopic, socially harmful, irresponsible and ignorant.
 
