a little parsing challenge ☺

Xah Lee · Jul 19, 2011

2011-07-16

I gave it a shot. It doesn't do any of the Unicode delims, because
let's face it, Unicode is for goobers.

import sys, os

pairs = {'}':'{', ')':'(', ']':'[', '"':'"', "'":"'", '>':'<'}
valid = set( v for pair in pairs.items() for v in pair )

for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        stack = [' ']
        with open(os.path.join(dirpath, name), 'rb') as f:
            chars = (c for line in f for c in line if c in valid)
            for c in chars:
                if c in pairs and stack[-1] == pairs[c]:
                    stack.pop()
                else:
                    stack.append(c)
        print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name

As Ian Kelly mentioned, your script fails because it doesn't report the
position or line/column number of the first mismatched bracket. This is
a rather significant part of this small problem. Avoiding Unicode also
lessens the value of this exercise, because handling Unicode in Python
isn't trivial, at least with respect to this small exercise.

I added the other Unicode brackets to your list of brackets, but it seems
your code still fails to catch a file that has mismatched curly quotes.
(e.g. http://xahlee.org/p/time_machine/tm-ch04.html )

LOL Billy.

Xah

Xah Lee · Jul 19, 2011

Uh, okay...

Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.

Thanks to Python's expressiveness, this can be easily remedied (see below).

I also do not follow Billy's comment about Unicode. Unicode and the fact
that Python supports it *natively* cannot be appreciated enough in a
globalized world.

However, I have learned a lot about being pythonic from his posting (take
those generator expressions, for example!), and the idea of looking at the
top of a stack for reference is a really good one. Thank you, Billy!

Here is my improvement of his code, which should fill the mentioned gaps.
I have also reversed the order in the report line as I think it is more
natural this way. I have tested the code superficially with a directory
containing a single text file. Watch for word-wrap:

# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <[email protected]>, based on an idea of
Billy Mays <[email protected]>
in <'''
import sys, os

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
  for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
      stack = [' ']

      # you can use chardet etc. instead
      encoding = 'utf-8'

      with open(os.path.join(dirpath,name), 'r') as f:
        reported = False
        chars = ((c, line_no, col) for line_no, line in enumerate(f)
                 for col, c in enumerate(line.decode(encoding)) if c in valid)
        for c, line_no, col in chars:
          if c in pairs:
            if stack[-1] == pairs[c]:
              stack.pop()
            else:
              if not reported:
                first_bad = (c, line_no + 1, col + 1)
                reported = True
          else:
            stack.append(c)

      print '%s: %s' % (name, ("good" if len(stack) == 1 else "bad '%s' at %s:%s" % first_bad))

Thanks for the fix.
Though, it still seems wrong.

On the file http://xahlee.org/p/time_machine/tm-ch04.html

there is a mismatched curly double quote at 28319.

the script reports:
tm-ch04.html: bad ')' at 68:2

that doesn't seem right. Line 68 is empty. There's no opening or
closing round bracket anywhere close. Nearest are lines 11 and 127.

Maybe Billy Mays's algorithm is wrong.

Xah (fairly discouraged now, after running 3 python scripts that all
failed)

Billy Mays · Jul 19, 2011

I added the other Unicode brackets to your list of brackets, but it seems
your code still fails to catch a file that has mismatched curly quotes.
(e.g. http://xahlee.org/p/time_machine/tm-ch04.html )

LOL Billy.

Xah

I suspect it's due to the file being opened in 'rb' mode. Also, in
the dictionary of characters at the top, the closing token is the key,
while the opening one is the value. Not sure if that's obvious.

Also, returning the position of the first mismatched pair is somewhat
ambiguous. File systems store files as streams of octets (mine do
anyways) rather than as characters. When you ask for the position of
the first mismatched pair, do you mean the position as per
file.tell() or do you mean the nth character in the utf-8 stream?
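
To make the two readings of "position" concrete, here is a small Python 3 sketch. The sample string is made up for illustration; the point is only that a character index and an octet offset diverge as soon as a multi-byte character precedes the bracket.

```python
# Assumed sample text: a curly-quoted word followed by a stray ASCII closer.
text = "“hi” )"
data = text.encode("utf-8")

char_index = text.index(")")   # position counted in characters
byte_index = data.index(b")")  # position counted in octets, as a binary-mode file offset would be

print(char_index, byte_index)
```

Each curly quote takes three octets in utf-8, so the two numbers differ by four here.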

Also, you may have answered this earlier but I'll ask again anyways: You
ask for the first mismatched pair. Are you referring to the innermost
mismatched, or the outermost? For example, suppose you have this file:

foo[(])bar

Would the "(" be the first mismatched character or would the "]"?
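
One way to see both candidates at once is the Python 3 sketch below. It is not anyone's posted solution; the function name and the ASCII-only pair table are assumptions. It stops at the first closer that does not match the most recent unclosed opener and returns the index of each.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}  # closer -> opener, as in the pairs dict earlier in the thread
OPENERS = set(PAIRS.values())

def first_conflict(text):
    """Return (closer_index, opener_index) for the first closer that does not
    match the most recent unclosed opener, or None if no such closer exists.
    opener_index is None when the closer arrives with nothing open."""
    stack = []  # indices of openers still waiting for a closer
    for i, c in enumerate(text):
        if c in OPENERS:
            stack.append(i)
        elif c in PAIRS:
            if stack and text[stack[-1]] == PAIRS[c]:
                stack.pop()
            else:
                return i, (stack[-1] if stack else None)
    return None

print(first_conflict("foo[(])bar"))
```

For `foo[(])bar` this gives the pair (5, 4): the "]" at index 5 is where the conflict is detected, and the "(" at index 4 is the opener it collides with. Which of the two counts as "the first mismatched character" is exactly the open question.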

Xah Lee · Jul 19, 2011

2011-07-16

folks, this one will be interesting one.

the problem is to write a script that can check a dir of text files
(and all subdirs) and reports if a file has any mismatched matching
brackets.

• The files will be utf-8 encoded (unix style line ending).

• If a file has mismatched matching-pairs, the script will display the
file name, and the line number and column number of the first
instance where a mismatched bracket occurs. (or, just the char number
instead (as in emacs's “point”))

• the matching pairs are all single unicode chars. They are these and
nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
Note that ‘single curly quote’ is not considered a matching pair here.

• Your script must be standalone. Must not be using some parser tools.
But can call lib that's part of standard distribution in your lang.

Here's an example of mismatched bracket: ([)], (“[[”), ((, 】 etc. (and
yes, the brackets may be nested. There is usually text between these
chars.)

I'll be writing an emacs lisp solution and post in 2 days. I welcome
other lang implementations. In particular, perl, python, php, ruby,
tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
(clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
Scala, C, C++, or any others, are all welcome, but i won't be able to
eval it. javascript implementation will be very interesting too, but
please indicate which and where to install the command line version.

I hope you'll find this an interesting “challenge”. This is a parsing
problem. I haven't studied parsers except some Wikipedia reading, so
my solution will probably be naive. I hope to see and learn from your
solution too.

i hope you'll participate. Just post solution here. Thanks.

I thought I'd have some fun with multi-processing:

https://gist.github.com/1087682

hi Thomas. I ran the program, all cpu went max (i have a quad), but
after i think 3 minutes nothing happens, so i killed it.

is there something special one should know to run the script?

I'm using Python 3.2.1 on Windows 7.

Xah

Thomas Jollans · Jul 19, 2011

http://pastebin.com/7hU20NNL

just installed py3.
there seems to be a bug.
in this file

http://xahlee.org/p/time_machine/tm-ch04.html

there's a mismatched double curly quote. at position 28319.

the python code above doesn't seem to spot it?

here's the elisp script output when run on that dir:

Error file: c:/Users/h3/web/xahlee_org/p/time_machine/tm-ch04.html
["“" 28319]
Done deal!

That script doesn't check that the balance is zero at the end of file.

Patch:

--- ../xah-raymond-old.py 2011-07-19 20:05:13.000000000 +0200
+++ ../xah-raymond.py 2011-07-19 20:03:14.000000000 +0200
@@ -16,6 +16,8 @@
         elif c in closers:
             if not stack or c != stack.pop():
                 return i
+    if stack:
+        return i
     return -1

 def scan(directory, encoding='utf-8'):

Ian Kelly · Jul 19, 2011

just installed py3.
there seems to be a bug.
in this file

http://xahlee.org/p/time_machine/tm-ch04.html

there's a mismatched double curly quote. at position 28319.

the python code above doesn't seem to spot it?

It would appear that Raymond forgot to check that the stack is empty
at the end of the check_balance function. It's an easy enough thing
to fix.
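
For readers without the pastebin at hand, here is a minimal Python 3 sketch of a function with that fix applied. Only the lines visible in the patch come from the thread; the signature, the ASCII-only default pairs and the rest of the body are assumptions.

```python
def check_balance(text, openers="([{", closers=")]}"):
    """Return the index of the first problem found, or -1 if text is balanced."""
    expected = dict(zip(openers, closers))  # opener -> the closer it needs
    stack = []  # closers we still expect, innermost last
    i = -1      # stays -1 for empty input
    for i, c in enumerate(text):
        if c in expected:
            stack.append(expected[c])
        elif c in closers:
            if not stack or c != stack.pop():
                return i
    if stack:  # the end-of-input check: something was opened and never closed
        return i
    return -1
```

Note that for an unclosed opener this returns the index of the last character read, not the position of the opener itself, which is what the patch as posted does.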

Xah Lee · Jul 19, 2011

yes i haven't been precise. Thanks for bringing it up.

thinking about it now, i think it's a bit hard to define precisely. My
elisp code actually reports the “)”, so it's wrong too. LOL

Xah

Thomas Jollans · Jul 19, 2011

hi Thomas. I ran the program, all cpu went max (i have a quad), but
after i think 3 minutes nothing happens, so i killed it.

is there something special one should know to run the script?

I'm using Python 3.2.1 on Windows 7.

Xah

Well, it overdoes the multi-processing “a little”. Checking each
character in a separate process might have been overkill.

Here's a sane version:

https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

old, crazy version:
https://gist.github.com/1087682/6841c3875f7e88c23e0a053ac0d0f0565d8713e2

Thomas Jollans · Jul 19, 2011

Oh, by the way:

I ran the program, all cpu went max

Mission accomplished.

Terry Reedy · Jul 19, 2011

Also, you may have answered this earlier but I'll ask again anyways: You
ask for the first mismatched pair, Are you referring to the inner most
mismatched, or the outermost? For example, suppose you have this file:

foo[(])bar

Would the "(" be the first mismatched character or would the "]"?

yes i haven't been precise. Thanks for bringing it up.

thinking about it now, i think it's a bit hard to define precisely.

Then it is hard to code precisely.

My elisp code actually reports the “)”, so it's wrong too. LOL

This sort of exercise should start with a series of test cases, starting
with the simplest.

testpairs = (
    ('', True), # or whatever you want the OK response to be
    ('a', True),
    ('abdsdfdsdff', True),

    ('()', True), # and so on for each pair of fences
    ('(', False), # or exact error output wanted
    (')', False), # and so on

The above could be generated programmatically from the set of pairs that
should be the input to the program, so that the pairs are not hardcoded
into the logic.

    ('([)]', ???),
    ...
)
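
A sketch of that idea in Python 3, under stated assumptions: the pair list is a subset of the challenge's pairs, `make_cases` and `is_balanced` are hypothetical names, and the checker is a plain stack-based stand-in rather than any script posted in this thread. The ambiguous `'([)]'` case is left out on purpose, since its expected output has not been pinned down.

```python
PAIRS = ["()", "[]", "{}", "“”", "‹›", "«»"]  # subset of the challenge's pairs

def make_cases(pairs):
    """Build (text, expected_ok) cases from the pair list itself."""
    cases = [("", True), ("a", True), ("abdsdfdsdff", True)]
    for opener, closer in pairs:
        cases.append((opener + closer, True))   # a matched pair
        cases.append((opener, False))           # opener never closed
        cases.append((closer, False))           # closer with nothing open
        cases.append((closer + opener, False))  # right chars, wrong order
    return cases

def is_balanced(text, pairs):
    """Stand-in checker: True when every bracket in text is properly matched."""
    closer_for = {o: c for o, c in pairs}
    closers = set(closer_for.values())
    stack = []
    for ch in text:
        if ch in closer_for:
            stack.append(closer_for[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False
    return not stack

failures = [(t, want) for t, want in make_cases(PAIRS) if is_balanced(t, PAIRS) != want]
print(failures)  # an empty list means every generated case passed
```

Adding a pair to `PAIRS` extends both the checker and the test table, so nothing about the pairs is hardcoded in the logic.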

Mark Tarver · Jul 20, 2011

Parsing technology based on BNF enables an elegant solution. First
take a basic bracket balancing program which parenthesises the
contents of the input. e.g. in Shen-YACC

(defcc <br>
  "(" <br> ")" <br$> := [<br> | <br$>];
  <item> <br>;
  <e> := [];)

(defcc <br$>
  <br>;)

(defcc <item>
  -*- := (if (element? -*- ["(" ")"]) (fail) [-*-]);)

Given (compile <br> ["(" 1 2 3 ")" 4]) the program produces [[1 2 3]
4]. When this program is used to parse the input, whatever residue is
left indicates where the parse has failed. In Shen-YACC

(define tellme
  Stuff -> (let Br (<br> (@p Stuff []))
                Residue (fst Br)
             (if (empty? Residue)
                 (snd Br)
                 (error "parse failure at position ~A~%"
                        (- (length Stuff) (length Residue))))))

e.g.

(tellme ["(" 1 2 3 ")" "(" 4])
parse failure at position 5

(tellme ["(" 1 2 3 ")" "(" ")" 4])
[[1 2 3] [] 4]

The extension of this program to the case described is fairly simple.
Qi-YACC is very similar.

Nice problem.

I do not have further time to correspond right now.

Mark

Robert Klemme · Jul 20, 2011

Ok, here's my solution (pasted at bottom). I haven't tried to make it
elegant or terse, yet, seeing that many are already much more elegant than
i could possibly do so with my code.

my solution basically uses a stack. (i think all of us are doing
similar) Here's the steps:

• Go thru the file char by char, find a bracket char.
• check if the one on stack is a matching opening char. If so remove
it. Else, push the current onto the stack.
• Repeat the above till end of file.
• If the stack is not empty, then the file got mismatched brackets.
Report it.
• Do the above on all files.

Small correction: my solution works differently (although internally the
regexp engine will roughly do the same). So, my approach summarized

- traverse a directory tree
- for each found item of type "file"
- read the whole content
- throw it at a regexp which is anchored at the beginning
and does the recursive parsing
- report file if the match is shorter than the file

Note: special feature for recursive matching is used which Perl's regexp
engine likely can do as well but many others don't.

Cheers

robert

jmfauth · Jul 20, 2011

Also, you may have answered this earlier but I'll ask again anyways: You
ask for the first mismatched pair, Are you referring to the inner most
mismatched, or the outermost? For example, suppose you have this file:
foo[(])bar
Would the "(" be the first mismatched character or would the "]"?

yes i haven't been precise. Thanks for bringing it up.

thinking about it now, i think it's a bit hard to define precisely.

Then it is hard to code precisely.

Not really. The trick is to count the different openers/closers
separately.
That is what I am doing to check balanced brackets in
chemical formulas. The rules are however not the same
as in math.
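
A rough Python 3 sketch of the counting idea follows. The actual chemical-formula code was not posted, so the names and the pair list here are assumptions.

```python
from collections import Counter

PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def count_imbalance(text):
    """Map each opener to (opener count - closer count); all zeros means the totals agree."""
    counts = Counter(text)
    return {o: counts[o] - counts[c] for o, c in PAIRS}

print(count_imbalance("Ca(OH)2"))
print(count_imbalance("([)]"))
```

Both calls report all zeros. Counting each kind separately finds surplus or missing brackets, but it cannot see ordering problems such as `([)]` or `)(`, which is where the stack-based approaches differ.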

Interestingly, I fall on this "problem". enumerate() is very
nice to parse a string from left to right.
>>> s = 'abcd'
>>> for i, c in enumerate(s):
...     print i, c
...
0 a
1 b
2 c
3 d
But, if I want to parse a string from right to left,
what's the trick?
The best I found so far:
>>> for i, c in enumerate(reversed(s)):
...     print len(s) - 1 - i, c
...
3 d
2 c
1 b
0 a

Ian Kelly · Jul 20, 2011

Not really. The trick is to count the different opener/closer
separately.
That is what I am doing to check balanced brackets in
chemical formulas. The rules are however not the same
as in math.

I think the difficulty is not in the algorithm, but in adhering to the
desired output when it is ambiguously described.

But, if I want to parse a string from right to left,
what's the trick?
The best I found so far:

... print len(s) - 1 - i, c

That violates DRY, since you have reversal logic in the iterator
algebra and then again in the loop body. I prefer to keep all such
logic in the iterator algebra, if possible. This is one possibility,
if you don't mind it building an intermediate list:
....

Otherwise, here's another non-DRY solution:
....

Unfortunately, this is one space where there just doesn't seem to be a
single obvious way to do it.
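
Two common ways to get (index, char) pairs from right to left in Python 3, given here as illustrations of the trade-off being discussed. They are assumptions and not necessarily the snippets that were posted.

```python
s = "abcd"

# 1. Reversal kept entirely in the iterator expression, at the cost of
#    building an intermediate list of (index, char) pairs.
via_list = list(reversed(list(enumerate(s))))

# 2. No intermediate list of pairs, but the "go backwards" logic appears twice:
#    once in the descending range and once in reversed().
via_zip = list(zip(range(len(s) - 1, -1, -1), reversed(s)))

print(via_list)
print(via_zip)
```

Both produce the same pairs, starting with (3, 'd') and ending with (0, 'a').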

jmfauth · Jul 20, 2011

Otherwise, here's another non-DRY solution:

...

Unfortunately, this is one space where there just doesn't seem to be a
single obvious way to do it.

Well, I see. Thanks.

There is still the old, brave solution, I'm in fact using.
.... print i, s[i]
....
3 d
2 c
1 b
0 a

Steven D'Aprano · Jul 20, 2011

DRY? acronym for ?

I'd like to tell you, but I already told somebody else...

*grins*

http://en.wikipedia.org/wiki/Don't_repeat_yourself
http://c2.com/cgi/wiki?DontRepeatYourself

Xah Lee · Jul 20, 2011

i've just cleaned up my elisp code and wrote a short elisp tutorial.

Here:

〈Emacs Lisp: Batch Script to Validate Matching Brackets〉
http://xahlee.org/emacs/elisp_validate_matching_brackets.html

plain text version follows. Please let me know what you think.

am still working on going thru all code in other langs. Will get to
the ruby one, and that perl regex, and the other fixed python ones.
(possibly also the 2 common lisp codes but am not sure they are
runnable as is or just some non-working showoff. lol)

===============================================
Emacs Lisp: Batch Script to Validate Matching Brackets

Xah Lee, 2011-07-19

This page shows you how to write a elisp script that checks thousands
of files for mismatched brackets.

----------------------------------------------------------------
The Problem

------------------------------------------------
Summary

I have 5 thousand files containing many matching pairs. I want to
know if any of them contains mismatched brackets.

------------------------------------------------
Detail

The matching pairs include these: () {} [] “” ‹› «» 〈〉 《》 【】 〖〗 「」
『』.

The program should be able to check all files in a dir, and report any
file that has a mismatched bracket, and also indicate the line number or
position where a mismatch occurs.

For those curious, if you want to know what these brackets are, see:

• Syntax Design: Use of Unicode Matching Brackets as Specialized
Delimiters
• Intro to Chinese Punctuation with Computer Language Syntax
Perspectives

For other notes and conveniences about dealing with brackets in emacs,
see:

• Emacs: Defining Keys to Navigate Brackets
• “extend-selection” at A Text Editor Feature: Extend Selection by
Semantic Unit
• “select-text-in-quote” at Suggestions on Emacs's mark-word
Command

----------------------------------------------------------------
Solution

Here's outline of steps.

• Go thru the file char by char, find a bracket char.
• Check if the one on stack is a matching opening char. If so
remove it. Else, push the current onto the stack.
• Repeat the above till no more bracket char in the file.
• If the stack is not empty, then the file got mismatched
brackets. Report it.
• Do the above on all files.
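
For readers following along in Python, the per-file part of this outline can be sketched as below. This is a hedged Python 3 transcription, not the elisp code: the function name, the reduced pair table and the 1-based character positions are assumptions.

```python
PAIRS = {")": "(", "]": "[", "}": "{", "”": "“"}  # closer -> opener (subset of the full list)
BRACKETS = set(PAIRS) | set(PAIRS.values())

def leftover_stack(text):
    """Follow the outline: pop when the current char closes the entry on top
    of the stack, otherwise push (char, position). Positions are 1-based."""
    stack = []
    for pos, ch in enumerate(text, start=1):
        if ch not in BRACKETS:
            continue
        if stack and ch in PAIRS and stack[-1][0] == PAIRS[ch]:
            stack.pop()
        else:
            stack.append((ch, pos))
    return stack  # empty means the text is balanced

print(leftover_stack("foo[(])bar"))
```

Because unmatched closers are pushed as well, the most recently pushed entry for `foo[(])bar` is the ")", which is the character this outline ends up reporting for that input.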

Here's some interesting use of lisp features to implement the above.

------------------------------------------------
Define Matching Pair Chars as “alist”

We begin by defining the chars we want to check, as an “association
list” (aka “alist”). Like this:

(setq matchPairs '(
  ("(" . ")")
  ("{" . "}")
  ("[" . "]")
  ("“" . "”")
  ("‹" . "›")
  ("«" . "»")
  ("【" . "】")
  ("〖" . "〗")
  ("〈" . "〉")
  ("《" . "》")
  ("「" . "」")
  ("『" . "』")
  )
)

If you care only to check for curly quotes, you can remove elements
above. This is convenient because some files necessarily have
mismatched pairs such as the parenthesis, because that char is used
for many non-bracketing purposes (e.g. ASCII smiley).

An “alist” in lisp is basically a list of pairs (called key and value),
with the ability to search for a key or a value. The first element of
a pair is called its key, the second element is its value. Each pair
is a “cons”, like this: (cons mykey myvalue), which can also be
written using this syntax: (mykey . myvalue) for easier reading.

The purpose of lisp's “alist” is similar to Python's dictionary or
Pretty Home Page's array. It is also similar to hashmap, except that
alist can have duplicate keys, can search by values, maintains order,
and alist is not intended for massive number of elements. Elisp has a
hashmap datatype if you need that. (See: Emacs Lisp Tutorial: Hash
Table.)

(info "(elisp) Association Lists")
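
For comparison, a rough Python 3 analogue of an alist is a list of 2-tuples searched linearly. The helper names below simply mirror elisp's `assoc` and `rassoc`; they are written for this illustration and are not Python built-ins.

```python
MATCH_PAIRS = [("(", ")"), ("{", "}"), ("[", "]"), ("“", "”")]  # ordered (key, value) pairs

def assoc(key, alist):
    """First pair whose key equals `key`, or None (like elisp `assoc`)."""
    return next((pair for pair in alist if pair[0] == key), None)

def rassoc(value, alist):
    """First pair whose value equals `value`, or None (like elisp `rassoc`)."""
    return next((pair for pair in alist if pair[1] == value), None)

print(assoc("(", MATCH_PAIRS))
print(rassoc("]", MATCH_PAIRS))
print(rassoc("x", MATCH_PAIRS))
```

Unlike a dict, this keeps insertion order, tolerates duplicate keys and can be searched by value, at the cost of a linear scan per lookup.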

------------------------------------------------
Generate Regex String from alist

To search for a set of chars in emacs, we can read the buffer
char-by-char, or, we can simply use “search-forward-regexp”. To use that,
first we need to generate a regex string from our matchPairs alist.

First, we define/declare the string. Not a necessary step, but we do
it for clarity.

(setq searchRegex "")

Then we go thru the matchPairs alist. For each pair, we use “car” and
“cdr” to get the chars and “concat” it to the string. Like this:

(mapc
(lambda (mypair) ""
(setq searchRegex (concat searchRegex (regexp-quote (car mypair))
"|" (regexp-quote (cdr mypair)) "|") )
)
matchPairs)

Then we remove the ending “|”.

(setq searchRegex (substring searchRegex 0 -1)) ; remove the ending “|”

Then, change | to \\|. In elisp regex, the | is literal. The “regex
or” is \|. And if you are using regex in elisp, elisp does not have a
special regex string syntax, it only understands normal strings. So,
to feed \| to the regex engine, you need to escape the backslash. So, your
string needs to have \\|. Here's how we do it:

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t t)) ; change | to \\| for regex “or” operation

You could shorten the above into just 2 lines by using \\| in the
“mapc” step and not as an extra step of replacing | by \\|.

See also: emacs regex tutorial.
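
The same regex-building step, sketched in Python 3 for comparison. The pair list is a reduced, assumed subset and the sample text is made up. In Python's `re` syntax a bare `|` is already the alternation operator, so only the bracket characters themselves need escaping.

```python
import re

MATCH_PAIRS = [("(", ")"), ("{", "}"), ("[", "]"), ("“", "”")]

# Escape each char (parentheses and square brackets are regex metacharacters)
# and join the results with "|".
chars = [ch for pair in MATCH_PAIRS for ch in pair]
search_regex = re.compile("|".join(re.escape(ch) for ch in chars))

# Each hit is (bracket char, 0-based character index).
hits = [(m.group(), m.start()) for m in search_regex.finditer("a (b “c” d]")]
print(hits)
```

Scanning with `finditer` plays the role of the repeated “search-forward-regexp” calls: it jumps from bracket to bracket and skips the ordinary text in between.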

------------------------------------------------
Implement Stack Using Lisp List

Stack is done using lisp's list. e.g. '(1 2 3). The first element of
the list is the end we push to and pop from (called the “bottom” of the
stack here; many texts call it the top). To add to the stack, do it
like this: (setq mystack (cons newitem mystack)). To remove an item
from the stack: (setq mystack (cdr mystack)). The stack begins as an
empty list: '().

For each element in the stack, we need the char and also its position,
so that we can report the position if the file does have mismatched
pairs.

We use a vector as entries for the stack. Each entry is like this:
(vector char pos). (See: Emacs Lisp Tutorial: List & Vector.)

Here's how to fetch a char from stack bottom, check if current char
matches, push to stack, pop from stack.

; check if current char is a closing char and is in our match pairs alist.
; use “rassoc” to check alist's set of “values”.
; It returns the first key/value pair found, or nil
(rassoc char matchPairs)

; add to stack
(setq myStack (cons (vector char pos) myStack) )

; pop stack
(setq myStack (cdr myStack) )

------------------------------------------------
Complete Code

Here's the complete code.

;; -*- coding: utf-8 -*-
;; 2011-07-15
;; go thru a file, check if all brackets are properly matched.
;; e.g. good: (…{…}… “…”…)
;; bad: ( [)]
;; bad: ( ( )

(setq inputFile "xx_test_file.txt" ) ; a test file.
(setq inputDir "~/web/xahlee_org/") ; must end in slash

(defvar matchPairs '() "a alist. For each pair, the car is opening char, cdr is closing char.")
(setq matchPairs '(
  ("(" . ")")
  ("{" . "}")
  ("[" . "]")
  ("“" . "”")
  ("‹" . "›")
  ("«" . "»")
  ("【" . "】")
  ("〖" . "〗")
  ("〈" . "〉")
  ("《" . "》")
  ("「" . "」")
  ("『" . "』")
  )
)

(defvar searchRegex "" "regex string of all pairs to search.")
(setq searchRegex "")
(mapc
 (lambda (mypair) ""
   (setq searchRegex (concat searchRegex (regexp-quote (car mypair)) "|" (regexp-quote (cdr mypair)) "|") )
   )
 matchPairs)

(setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t t)) ; remove the ending “|”

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t t)) ; change | to \\| for regex “or” operation

(defun my-process-file (fpath)
  "process the file at fullpath fpath ..."
  (let (myBuffer myStack ξchar ξpos)

    (setq myStack '() ) ; each element is a vector [char position]
    (setq ξchar "") ; the current char found

    (when t
      ;; (not (string-match "/xx" fpath)) ; in case you want to skip certain files

      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fpath nil nil nil t)

      (goto-char 1)
      (setq case-fold-search t)
      (while (search-forward-regexp searchRegex nil t)
        (setq ξpos (point) )
        (setq ξchar (buffer-substring-no-properties ξpos (- ξpos 1)) )

        ;; (princ (format "-----------------------------\nfound char: %s\n" ξchar) )

        (let ((isClosingCharQ nil) (matchedOpeningChar nil) )
          (setq isClosingCharQ (rassoc ξchar matchPairs))
          (when isClosingCharQ (setq matchedOpeningChar (car isClosingCharQ) ) )

          ;; (princ (format "isClosingCharQ is: %s\n" isClosingCharQ) )
          ;; (princ (format "matchedOpeningChar is: %s\n" matchedOpeningChar) )

          (if
              (and
               (car myStack) ; not empty
               (equal (elt (car myStack) 0) matchedOpeningChar )
               )
              (progn
                ;; (princ (format "matched this bottom item on stack: %s\n" (car myStack)) )
                (setq myStack (cdr myStack) )
                )
            (progn
              ;; (princ (format "did not match this bottom item on stack: %s\n" (car myStack)) )
              (setq myStack (cons (vector ξchar ξpos) myStack) ) )
            )
          )
        ;; (princ "current stack: " )
        ;; (princ myStack )
        ;; (terpri )
        )

      (when (not (equal myStack nil))
        (princ "Error file: ")
        (princ fpath)
        (print (car myStack) )
        )
      (kill-buffer myBuffer)
      )
    ))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah match pair output*" )
  (with-output-to-temp-buffer outputBuffer
    ;; (my-process-file inputFile)
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.txt$"))
    (princ "Done deal!")
    )
  )

I added many comments and debug code for easy understanding. If you
are not familiar with the many elisp idioms such as opening file,
buffers, printing to output, see: Emacs Lisp Idioms (for writing
interactive commands) ◇ Text Processing with Emacs Lisp Batch Style.

To run the code, simply open it in emacs. Edit the line at the top for
“inputDir”. Then call “eval-buffer”.

Here's a sample output:

Error file: c:/Users/h3/web/xahlee_org/p/time_machine/Hettie_Potter_orig.txt
[")" 3625]
Error file: c:/Users/h3/web/xahlee_org/p/time_machine/Hettie_Potter.txt
[")" 2338]
Error file: c:/Users/h3/web/xahlee_org/p/arabian_nights/xx/v1fn.txt
["”" 185795]
Done deal!

The weird ξ you see in my code is the Greek letter xi. I use unicode char in
variable name for experimental purposes. You can just ignore it. (See:
Programing Style: Variable Naming: English Words Considered Harmful.)

------------------------------------------------
Advantages of Emacs Lisp

Note that the great advantage of using elisp for text processing,
instead of {perl, python, ruby, …} is that many things are taken care
of by the emacs environment.

I don't need to write code to declare file's encoding (emacs
automatically detects). No reading file is involved. Just open, save,
or move thru characters. No code needed for doing safety backup. (the
var “make-backup-files” controls that). You can easily open the files
by its path with a click or key press. I can add just 2 lines so that
clicking on the error char in the output jumps to the location in the
file.

Any elisp script you write inside emacs automatically becomes an extension
of emacs and can be used in an interactive way.

This problem is posted to a few comp.lang newsgroups as a fun
challenge. You can see several solutions in python, ruby, perl, common
lisp, at: a little parsing challenge ☺ (2011-07-17) @ Source
groups.google.com.

Xah

Uri Guttman · Jul 20, 2011

a better parsing challenge. how can you parse usenet to keep this troll
from posting on the wrong groups on usenet? first one to do so, wins the
praise of his peers. 2nd one to do it makes sure the filter stays in
place. all the rest will be rewarded by not seeing the troll anymore.

anyone who actually engages in a thread with the troll should parse
themselves out of existence.

uri

rusi · Jul 20, 2011

a better parsing challenge. how can you parse usenet to keep this troll
from posting on the wrong groups on usenet? first one to do so, wins the
praise of his peers. 2nd one to do it makes sure the filter stays in
place. all the rest will be rewarded by not seeing the troll anymore.

anyone who actually engages in a thread with the troll should parse
themselves out of existence.

Goedelian paradox: Is this thread in existence?

Randal L. Schwartz · Jul 20, 2011

Uri> a better parsing challenge. how can you parse usenet to keep this troll
Uri> from posting on the wrong groups on usenet? first one to do so, wins the
Uri> praise of his peers. 2nd one to do it makes sure the filter stays in
Uri> place. all the rest will be rewarded by not seeing the troll anymore.

Uri> anyone who actually engages in a thread with the troll should parse
Uri> themselves out of existence.

Since the newsgroups: line is not supposed to have spaces in it, that
makes both his post and your post invalid. Hence, filter on invalid
posts.
