a little parsing challenge â˜º

Xah Lee · Jul 21, 2011

I thought I'd have some fun with multi-processing:

Nice joke. â˜º

Here's a sane version:

https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

hi thomas,

i still cant get your code to work. I have a dir named xxdir with a
single test file xx.txt,with this content:

foo[(])bar

when i run your code
py3 validate_brackets_Thomas_Jollans_2.py

it simply exit and doesn't seem to do anything. I modded your code to
print the file name it's proccessing. Apparently it did process it.

my python isn't strong else i'd dive in. Thanks.

I'm on Python 3.2.1. Here's a shell log:

h3@H3-HP 2011-07-21 05:20:30 ~/web/xxst/find_elisp/validate matching
brackets
py3 validate_brackets_Thomas_Jollans_2.py
h3@H3-HP 2011-07-21 05:20:34 ~/web/xxst/find_elisp/validate matching
brackets
py3 validate_brackets_Thomas_Jollans_2.py
c:/Users/h3/web/xxst/find_elisp/validate matching brackets/xxdir
\xx.txt
h3@H3-HP 2011-07-21 05:21:59 ~/web/xxst/find_elisp/validate matching
brackets
py3 --version
Python 3.2.1
h3@H3-HP 2011-07-21 05:27:03 ~/web/xxst/find_elisp/validate matching
brackets

Xah

Xah Lee · Jul 21, 2011

just installed py3.
there seems to be a bug.
in this file

there's a mismatched double curly quote. at position 28319.

Click to expand...

the python code above doesn't seem to spot it?

Click to expand...

here's the elisp script output when run on that dir:

Click to expand...

Error file: c:/Users/h3/web/xahlee_org/p/time_machine/tm-ch04.html
["“" 28319]
Done deal!

Click to expand...

That script doesn't check that the balance is zero at the end of file.

Patch:

--- ../xah-raymond-old.py 2011-07-19 20:05:13.000000000 +0200
+++ ../xah-raymond.py 2011-07-19 20:03:14.000000000 +0200
@@ -16,6 +16,8 @@
elif c in closers:
if not stack or c != stack.pop():
return i
+ if stack:
+ return i
return -1

def scan(directory, encoding='utf-8'):

Thanks a lot for the fix Raymond.

Though, the code seems to have a minor problem.
It works, but the report is wrong.
e.g. output:

30068: c:/Users/h3/web/xahlee_org/p/time_machine\tm-ch04.html

that 30068 position is the last char in the file.
The correct should be 28319. (or at least point somewhere in the file
at a bracket char that doesn't match.)

Today, i tried 3 more scripts. 2 fixed python3 versions, 1 ruby, all
failed again. I've reported the problems i encounter at python or ruby
newsgroups. If you are the author, a fix is very much appreciated.
I'll get back to your code and eventually do a blog of summary of all
different lang versions.

Am off to test that elaborate perl regex now... cross fingers.

Xah. Mood: quite discouraged.

Thomas Jollans · Jul 21, 2011

I thought I'd have some fun with multi-processing:

Click to expand...

Nice joke. â˜º

Here's a sane version:

https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

Click to expand...

hi thomas,

i still cant get your code to work. I have a dir named xxdir with a
single test file xx.txt,with this content:

foo[(])bar

when i run your code
py3 validate_brackets_Thomas_Jollans_2.py

it simply exit and doesn't seem to do anything. I modded your code to
print the file name it's proccessing. Apparently it did process it.

my python isn't strong else i'd dive in. Thanks.

Curious. Perhaps, in the Windows version of Python, subprocesses don't
use the same stdout? Windows doesn't have fork() (how could they
survive?), so who knows. Try replacing
ex.submit(process_file, fullname)
with
process_file(fullname)
for a non-concurrent version.

Xah Lee · Jul 21, 2011

2011-07-21

I don't know why, but I just had to try it (even though I don't usually
use Perl and had to look up a lot of stuff). I came up with this:

/(?|
Â Â Â ($)(?&matched)([\}\]â€â€ºÂ»ã€‘ã€‰ã€‹ã€ã€]|$) |
Â Â Â (\{)(?&matched)([$\]â€â€ºÂ»ã€‘ã€‰ã€‹ã€ã€]|$) |
Â Â Â (\[)(?&matched)([\)\}â€â€ºÂ»ã€‘ã€‰ã€‹ã€ã€]|$) |
Â Â Â (â€œ)(?&matched)([\)\}\]â€ºÂ»ã€‘ã€‰ã€‹ã€ã€]|$) |
Â Â Â (â€¹)(?&matched)([\)\}\]â€Â»ã€‘ã€‰ã€‹ã€ã€]|$) |
Â Â Â (Â«)(?&matched)([\)\}\]â€â€ºã€‘ã€‰ã€‹ã€ã€]|$) |
Â Â Â (ã€)(?&matched)([\)\}\]â€â€ºÂ»ã€‰ã€‹ã€ã€]|$) |
Â Â Â (ã€ˆ)(?&matched)([\)\}\]â€â€ºÂ»ã€‘ã€‹ã€ã€]|$) |
Â Â Â (ã€Š)(?&matched)([\)\}\]â€â€ºÂ»ã€‘ã€‰ã€ã€]|$) |
Â Â Â (ã€Œ)(?&matched)([\)\}\]â€â€ºÂ»ã€‘ã€‰ã€‹ã€]|$) |
Â Â Â (ã€Ž)(?&matched)([\)\}\]â€â€ºÂ»ã€‘ã€‰ã€‹ã€]|$))
(?(DEFINE)(?<matched>(?:
Â Â Â $(?&matched)$ |
Â Â Â \{(?&matched)\} |
Â Â Â \[(?&matched)\] |
Â Â Â â€œ(?&matched)â€ |
Â Â Â â€¹(?&matched)â€º |
Â Â Â Â«(?&matched)Â» |
Â Â Â ã€(?&matched)ã€‘ |
Â Â Â ã€ˆ(?&matched)ã€‰ |
Â Â Â ã€Š(?&matched)ã€‹ |
Â Â Â ã€Œ(?&matched)ã€ |
Â Â Â ã€Ž(?&matched)ã€ |
Â Â Â [^$\{\[â€œâ€¹Â«ã€ã€ˆã€Šã€Œã€Ž$\}\]â€â€ºÂ»ã€‘ã€‰ã€‹ã€ã€]++)*+))
/sx;

If the pattern matches, there is a mismatched bracket. $1 is set to the
mismatched opening bracket. $-[1] is its location. $2 is the mismatched
closing bracket or '' if the bracket was never closed. $-[2] is set to
the location of the closing bracket or the end of the string if the
bracket wasn't closed.

I didn't write all that manually; it was generated with this:

my @open = ('$','\{','\[','â€œ','â€¹','Â«','ã€','ã€ˆ','ã€Š','ã€Œ','ã€Ž');
my @close = ('$','\}','\]','â€','â€º','Â»','ã€‘','ã€‰','ã€‹','ã€','ã€');

'(?|'.join('|',map
{'('.$open[$_].')(?&matched)(['.join('',@close[0..($_-1),($_+1)..$#close]). ']|$)'}
(0 .. $#open)).')(?(DEFINE)(?<matched>(?:'.join('|',map
{$open[$_].'(?&matched)'.$close[$_]} (0 ..
$#open)).'|[^'.join('',@open,@close).']++)*+))'

Thanks for the code.

are you willing to make it complete and standalone? i.e. i can run it
like this:

perl Rouslan_Korneychuk.pl dirPath

and it prints any file that has mismatched pair and line/column number
or the char position?

i'd do it myself but so far i tried 5 codes, 3 fixes, all failed. Not
a complain, but it does take time to gather the code, of different
langs by different people, properly document their authors and
original source urls, etc, and test it out on my envirenment. All
together in the past 3 days i spent perhaps a total of 4 hours running
several code and writing back etc and so far not one really worked.

i know perl well, but your code is a bit out of the ordinary â˜º. If
past days have been good experience, i might dive in and study for
fun.

Xah

Ian Kelly · Jul 21, 2011

Thanks a lot for the fix Raymond.

That fix was from Thomas Jollans, not Raymond Hettinger.

Though, the code seems to have a minor problem.
It works, but the report is wrong.
e.g. output:

30068: c:/Users/h3/web/xahlee_org/p/time_machine\tm-ch04.html

that 30068 position is the last char in the file.
The correct should be 28319. (or at least point somewhere in the file
at a bracket char that doesn't match.)

Previously said:
If a file has mismatched matching-pairs, the script will display the
file name, and the line number and column number of the first
instance where a mismatched bracket occures. (or, just the char number
instead (as in emacs's “point”))

I submit that as the file contains no mismatched brackets (only an
orphan bracket), the output is correct to specification (indeed you
did not define any output for this case), if not necessarily useful.

In other words, stop being picky. You may be willing to spend an hour
or moe on this, but that doesn't mean anybody else is. Raymond gave
you a basically working Python solution, but forgot one detail.
Thomas fixed that detail for you but didn't invest the time to rewrite
somebody else's function to get the output "correct". Continuing to
harp on it at this point is verging on trolling.

Xah Lee · Jul 21, 2011

Ok. Here's a preliminary report.

ã€ˆLisp, Python, Perl, Ruby â€¦ Code to Validate Matching Bracketsã€‰
http://xahlee.org/comp/validate_matching_brackets.html

it's taking too much time to go thru.

right now, i consider only one valid code, by Raymond Hettinger (with
minor edit from others).

right now, there's 2 other possible correct solution. One by Robert
Klemme but requires ruby19 but i only have ruby18x. One by Thomas
Jollans in Python 3 but didn't run on my machine perhaps due to some
unix/Windows issue, yet to be done.

the other 3 or 4 seems to be incomplete or just suggestion of ideas.

i haven't done extensive testing on my own code neither.
I'll revisit maybe in a few days.

Feel free to grab my report and make it nice. If you would like to fix
your code, feel free to email.

Xah

python · Jul 21, 2011

Xah,

1. Is the following string considered legal?

[ { ( ] ) }

Note: Each type of brace opens and closes in the proper sequence. But
inter-brace opening and closing does not make sense.

Or must a closing brace always balance out with the most recent opening
brace like so?

[ { ( ) } ]

2. If there are multiple unclosed braces at EOF, is the answer you're
looking for the position of the first open brace that hasn't been closed
out yet?

Malcolm

Xah Lee · Jul 21, 2011

Xah,

1. Is the following string considered legal?

[ { ( ] ) }

Note: Each type of brace opens and closes in the proper sequence. But
inter-brace opening and closing does not make sense.
nu!

Or must a closing brace always balance out with the most recent opening
brace like so?

[ { ( ) } ]
yeah!

2. If there are multiple unclosed braces at EOF, is the answer you're
looking for the position of the first open brace that hasn't been closed
out yet?

well, as many pointed out, i really haven't thought it out well.

originally, i just want to know the position of a un-matched char.

i haven't taken the time to think about what really should be the
desired behavior. For me, the problem started because i wanted to use
the script to check my 5k html files, in particular, classic novels
that involves double curly quotes and french quotes. So, the desired
behavior is one based on the question of what would best for the user
to see in order to correct a bracket mismatch error in a file. (which,
can get quite complex for nested syntax, because, usually, once you
have one missed, it's all hell from there. I think this is similar to
the problem when a compiler/interpreter encounters a bad syntax in
source code, and thus the poplar situation where error code of
computer programs are hard to understand...)

but anyway, just for this exercise, the requirement needn't be
stringent. I still think that at least the reported position should be
a matching char in the file. (and if we presume this, then only my
code works. LOL)

PS this is a warmup problem for writing a HTML tag validator. I looked
high and lo in past years, but just couldn't find a script that does
simple validation in batch. The w3c one is based on SGML, really huge
amount of un-unstandable irregular historical baggage. XML lexical
validator is much closer, but still not regular. I simply wanted one
just like the match-pair validator in our problem, except the opening
char is not a single char but string of the form <xyz â€¦> and the
*matching* closing one is of the form </xyz>, and with just one
exception: when a tag has â€œ/>â€ in ending such as <br/> thenit is
skipped (i.e. not considered as opening or closing).

I'll be writing this soon in elispâ€¦ since i haven't studied parsers, i
had hopes that parser expert would show some proper parser solutionsâ€¦
in particular i think such can be expressed in Parsing Expression
Grammar in just a few linesâ€¦ but so far no deity came forward to show
the light. lol

getting rantyâ€¦ it's funny, somehow the tech geekers all want regex to
solve the problem. Regex, regex, regex, a 40 years old deviant bastard
that by some twist of luck became a tool for matching text patterns.
One bloke came forward to show-off a perl regex obfuscation. That's
like, lol. But it might be good for the lulz if his code is actually
complete and worked. Then, you have a few who'd nonchalantly remark
â€œO, you just need push-down automataâ€. LOL, unless they show actual
working code, its Automata their asses.

folks, don't get angry with me. I'm a learner. I'm curious. I always
am eager to learn. And there's always things we can learn. Don't get
into a fit and do the troll dance in a pit with me. Nobody's gonna
give a shit if you think u knew it all. If u are not the master of one
thousand and one languages yet, you can learn with me. â˜º troll!!!!

Xah

Rouslan Korneychuk · Jul 21, 2011

Thanks for the code.

are you willing to make it complete and standalone? i.e. i can run it
like this:

perl Rouslan_Korneychuk.pl dirPath

and it prints any file that has mismatched pair and line/column number
or the char position?

Since you asked, I put up a complete program at http://pastebin.com/d8GNL0kx

I don't know if it will run on Perl earlier than version 5.10 and I'm
pretty sure it wont run below version 5.8.

Also, I realized that I had completely neglected the case of a closing
bracket that is never opened (e.g. "stuff] stuff"). The program I put on
paste bin has an updated regex that handles this case.

Terry Reedy · Jul 21, 2011

had hopes that parser expert would show some proper parser solutionsâ€¦
in particular i think such can be expressed in Parsing Expression
Grammar in just a few linesâ€¦ but so far no deity came forward to show
the light. lol

I am not a parser expert but 20 years ago, I wrote a program in C to
analyze C programs for proper fence matching. My motivation was the
often obsurity of parser error messages derived from mis-matched fences.
I just found the printed copy and an article I wrote but did not get
published.

Balance.c matches tokens, not characters (and hence can deal with /* and
*/). It properly takes into account allowed nestings. For C, {[]} is
legal, [{}] is not. Ditto for () instead of []. Nothing nests within '',
"", and /* */. (I know some C compilers do nest /* */, but not the ones
I used).

I initially started with a recursive descent parser but 1) this
hard-coded the rules for one language and make changes difficult and 2)
made the low-level parsing difficult. So I switched to a table-driven
recursive state/action machine. The tables for your challenge would be
much simpler as you did not specify any nesting rules, although they
would be needed for html checking.

A key point that simplifies things a bit is that every file is
surrounded by an unwritten BOF-EOF pair. So the machine starts with
having 'seen' BOF and is 'looking' for EOF. So it is always looking to
match *something*.

The total program is nearly four pages, but one page is mostly
declarations and command-line processing, another two pages have
typedefs, #DEFINEs, and tables. The actual main loop is about 25 lines,
and 10 lines of that is error reporting. The output is lines with file
name, row and columns of the two tokens matched (optional) or
mismatched, and what the two tokens are.

Since this program would be a useful example for my book, both
didactically and practically, I will try to brush-up a bit on C and
translate it to Python. I will use the re module for some of the
low-level token parsing, like C multibyte characters. I will then change
to tables for Python and perhaps for your challenge.

The current program assumes ascii byte input at it uses an array of
length 128 to classify ascii chars into 14 classes: 13 special for the
matching and 1 'normal' class for everything else. This could be
replaced in Python with a dict 'special' that only maps special
characters to their token class and used as "special.get(char, NORMAL)"
so that the thousands of normal characters are mapped by default to
NORMAL without a humongous array.

John O'Hagan · Jul 25, 2011

On Thu, 21 Jul 2011 05:58:48 -0700 (PDT)

[...]

[...]

That script doesn't check that the balance is zero at the end of file.

Patch:

--- ../xah-raymond-old.py Â Â Â 2011-07-19 20:05:13.000000000 +0200
+++ ../xah-raymond.py Â 2011-07-19 20:03:14.000000000 +0200
@@ -16,6 +16,8 @@
Â Â Â Â Â elif c in closers:
Â Â Â Â Â Â Â if not stack or c != stack.pop():
Â Â Â Â Â Â Â Â Â return i
+ Â Â if stack:
+ Â Â Â Â return i
Â Â Â return -1

Â def scan(directory, encoding='utf-8'):

Click to expand...

Thanks a lot for the fix Raymond.

Though, the code seems to have a minor problem.
It works, but the report is wrong.
e.g. output:

30068: c:/Users/h3/web/xahlee_org/p/time_machine\tm-ch04.html

that 30068 position is the last char in the file.
The correct should be 28319. (or at least point somewhere in the file
at a bracket char that doesn't match.)

[...]

If you want to know where brackets were opened which remain unclosed at EOF, then you have to keep the indices as well as the characters in the stack, and not return until the scan is complete, because anything still in the stack might turn out to be the earliest error. Easy enough to implement:

def checkmatch(string): #skipping the file handling
openers = {'[': ']', '(': ')', '{': '}' } #etc
closers = openers.values()
still_open, close_errors = [], []
for index, char in enumerate(string, start=1):
if char in openers:
still_open.append((index, char))
elif char in closers:
if still_open and char == openers[still_open[-1][1]]:
still_open.pop()
else:
close_errors.append((index, char))
if still_open or close_errors:
return min(still_open[:1] + close_errors[:1])[0]

although you might as well return still_open + close_errors and see them all.

Regards,

John

Programing Challenge: Constructing a Tree Given Its Edges.	0	Jan 8, 2014
a interesting Parallel Programing Problem: asciify-string	0	Mar 6, 2012
Guy Steele on Parallel Programing	1	Feb 5, 2011
emacs lisp text processing example (html5 figure/figcaption)	7	Jul 4, 2011
Problems of Symbol Congestion in Computer Languages	54	Feb 16, 2011
Emacs Form Feed (^L) Display Suggestion and Tips	6	Jun 24, 2010
which language has syntax that visually represent a tree? [was X#]	0	Feb 12, 2009
opinion: comp lang docs style	10	Jan 4, 2011

a little parsing challenge â˜º

Xah Lee

Xah Lee

Thomas Jollans

Xah Lee

Ian Kelly

Xah Lee

python

Xah Lee

Rouslan Korneychuk

Terry Reedy

John O'Hagan

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads