a little parsing challenge ☺

Discussion in 'Perl Misc' started by Xah Lee, Jul 17, 2011.

  1. Xah Lee

    Xah Lee Guest

    2011-07-16

    folks, this one will be interesting one.

    the problem is to write a script that can check a dir of text files
    (and all subdirs) and reports if a file has any mismatched matching
    brackets.

    • The files will be utf-8 encoded (unix style line ending).

    • If a file has mismatched matching-pairs, the script will display the
    file name, and the line number and column number of the first
    instance where a mismatched bracket occures. (or, just the char number
    instead (as in emacs's “pointâ€))

    • the matching pairs are all single unicode chars. They are these and
    nothing else: () {} [] “†‹› «» ã€ã€‘ 〈〉 《》 「ã€ã€Žã€
    Note that ‘single curly quote’ is not consider matching pair here.

    • You script must be standalone. Must not be using some parser tools.
    But can call lib that's part of standard distribution in your lang.

    Here's a example of mismatched bracket: ([)], (“[[â€), ((, 】etc. (and
    yes, the brackets may be nested. There are usually text between these
    chars.)

    I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
    other lang implementations. In particular, perl, python, php, ruby,
    tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
    (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
    Scala, C, C++, or any others, are all welcome, but i won't be able to
    eval it. javascript implementation will be very interesting too, but
    please indicate which and where to install the command line version.

    I hope you'll find this a interesting “challengeâ€. This is a parsing
    problem. I haven't studied parsers except some Wikipedia reading, so
    my solution will probably be naive. I hope to see and learn from your
    solution too.

    i hope you'll participate. Just post solution here. Thanks.

    Xah
     
    Xah Lee, Jul 17, 2011
    #1
    1. Advertising

  2. On Jul 17, 12:47 am, Xah Lee <> wrote:
    > i hope you'll participate. Just post solution here. Thanks.


    http://pastebin.com/7hU20NNL


    Raymond
     
    Raymond Hettinger, Jul 17, 2011
    #2
    1. Advertising

  3. Robert Klemme, Jul 17, 2011
    #3
  4. Xah Lee

    mhenn Guest

    Am 17.07.2011 15:20, schrieb Robert Klemme:
    > On 07/17/2011 11:48 AM, Raymond Hettinger wrote:
    >> On Jul 17, 12:47 am, Xah Lee<> wrote:
    >>> i hope you'll participate. Just post solution here. Thanks.

    >>
    >> http://pastebin.com/7hU20NNL

    >
    > Ruby solution: https://gist.github.com/1087583


    I acutally don't know Ruby that well, but it looks like your program
    recognizes "[(])" as correct although it is not, because you translate
    "[(])" to "(())" (which is indeed correct, but does not resemble the
    input correctly anymore).

    >
    > Kind regards
    >
    > robert
     
    mhenn, Jul 17, 2011
    #4
  5. On Jul 17, 9:47 am, Xah Lee <> wrote:
    > 2011-07-16
    >
    > folks, this one will be interesting one.
    >
    > the problem is to write a script that can check a dir of text files
    > (and all subdirs) and reports if a file has any mismatched matching
    > brackets.
    >
    > • The files will be utf-8 encoded (unix style line ending).
    >
    > • If a file has mismatched matching-pairs, the script will display the
    > file name, and the  line number and column number of the first
    > instance where a mismatched bracket occures. (or, just the char number
    > instead (as in emacs's “pointâ€))
    >
    > • the matching pairs are all single unicode chars. They are theseand
    > nothing else: () {} [] “†‹› «»ã€ã€‘ 〈〉 《》 「〠『ã€
    > Note that ‘single curly quote’ is not consider matching pair here.
    >
    > • You script must be standalone. Must not be using some parser tools.
    > But can call lib that's part of standard distribution in your lang.
    >
    > Here's a example of mismatched bracket: ([)], (“[[â€), ((,】etc. (and
    > yes, the brackets may be nested. There are usually text between these
    > chars.)
    >
    > I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
    > other lang implementations. In particular, perl, python, php, ruby,
    > tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
    > (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
    > Scala, C, C++, or any others, are all welcome, but i won't be able to
    > eval it. javascript implementation will be very interesting too, but
    > please indicate which and where to install the command line version.
    >
    > I hope you'll find this a interesting “challengeâ€. This is a parsing
    > problem. I haven't studied parsers except some Wikipedia reading, so
    > my solution will probably be naive. I hope to see and learn from your
    > solution too.
    >
    > i hope you'll participate. Just post solution here. Thanks.


    I thought I'd have some fun with multi-processing:

    https://gist.github.com/1087682
     
    Thomas Jollans, Jul 17, 2011
    #5
  6. Xah Lee

    Thomas Boell Guest

    On Sun, 17 Jul 2011 02:48:42 -0700 (PDT)
    Raymond Hettinger <> wrote:

    > On Jul 17, 12:47 am, Xah Lee <> wrote:
    > > i hope you'll participate. Just post solution here. Thanks.

    >
    > http://pastebin.com/7hU20NNL


    I'm new to Python. I think I'd have done it in a similar way (in any
    language). Your use of openers/closers looks nice though. In the
    initialization of openers, I guess you implicitly create a kind of
    hash, right? Then the 'in' operator checks for the keys. That is elegant
    because you have the openers and closers right next to each other, not
    in separate lists.

    But why do you enumerate with start=1? Shouldn't you start with index 0?
     
    Thomas Boell, Jul 17, 2011
    #6
  7. On 07/17/2011 03:55 PM, mhenn wrote:
    > Am 17.07.2011 15:20, schrieb Robert Klemme:
    >> On 07/17/2011 11:48 AM, Raymond Hettinger wrote:
    >>> On Jul 17, 12:47 am, Xah Lee<> wrote:
    >>>> i hope you'll participate. Just post solution here. Thanks.
    >>>
    >>> http://pastebin.com/7hU20NNL

    >>
    >> Ruby solution: https://gist.github.com/1087583

    >
    > I acutally don't know Ruby that well, but it looks like your program
    > recognizes "[(])" as correct although it is not, because you translate
    > "[(])" to "(())" (which is indeed correct, but does not resemble the
    > input correctly anymore).


    Right you are. The optimization breaks the logic. Good catch!

    Kind regards

    robert
     
    Robert Klemme, Jul 17, 2011
    #7
  8. On 07/17/2011 06:01 PM, Robert Klemme wrote:
    > On 07/17/2011 03:55 PM, mhenn wrote:
    >> Am 17.07.2011 15:20, schrieb Robert Klemme:
    >>> On 07/17/2011 11:48 AM, Raymond Hettinger wrote:
    >>>> On Jul 17, 12:47 am, Xah Lee<> wrote:
    >>>>> i hope you'll participate. Just post solution here. Thanks.
    >>>>
    >>>> http://pastebin.com/7hU20NNL
    >>>
    >>> Ruby solution: https://gist.github.com/1087583

    >>
    >> I acutally don't know Ruby that well, but it looks like your program
    >> recognizes "[(])" as correct although it is not, because you translate
    >> "[(])" to "(())" (which is indeed correct, but does not resemble the
    >> input correctly anymore).

    >
    > Right you are. The optimization breaks the logic. Good catch!


    Turns out with a little possessiveness I can fix my original approach
    which has the added benefit of not needing three passes through the file
    (the two #tr's are obsolete now).

    https://gist.github.com/1087583

    Cheers

    robert
     
    Robert Klemme, Jul 17, 2011
    #8
  9. On Jul 17, 8:49 am, Thomas Boell <> wrote:
    > But why do you enumerate with start=1? Shouldn't you start with index 0?


    The problem specification says that the the char number should match
    the emacs goto-char function which is indexed from one, not from
    zero. This is testable by taking the output of the program and
    running it through emacs to see that the cursor gets moved exactly to
    the location of the mismatched delimiter.


    Raymond
     
    Raymond Hettinger, Jul 17, 2011
    #9
  10. Xah Lee

    rantingrick Guest

    On Jul 17, 2:47 am, Xah Lee <> wrote:
    > 2011-07-16
    >
    > folks, this one will be interesting one.
    >
    > the problem is to write a script that can check a dir of text files
    > (and all subdirs) and reports if a file has any mismatched matching
    > brackets.
    >
    >[...]
    >
    > • You script must be standalone. Must not be using some parser tools.
    > But can call lib that's part of standard distribution in your lang.


    I stopped reading here and did...

    >>> from HyperParser import HyperParser # python2.x


    ....and called it a day. ;-) This module is part of the stdlib (idlelib
    \HyperParser) so as per your statement it is legal (may not be the
    fastest solution).
     
    rantingrick, Jul 18, 2011
    #10
  11. Xah Lee

    Xah Lee Guest

    On Jul 17, 12:47 am, Xah Lee <> wrote:
    > 2011-07-16
    >
    > folks, this one will be interesting one.
    >
    > the problem is to write a script that can check a dir of text files
    > (and all subdirs) and reports if a file has any mismatched matching
    > brackets.
    > …


    Ok, here's my solution (pasted at bottom). I haven't tried to make it
    elegant or terse, yet, seeing that many are already much elegent than
    i could possibly do so with my code.

    my solution basically use a stack. (i think all of us are doing
    similar) Here's the steps:

    • Go thru the file char by char, find a bracket char.
    • check if the one on stack is a matching opening char. If so remove
    it. Else, push the current onto the stack.
    • Repeat the above till end of file.
    • If the stack is not empty, then the file got mismatched brackets.
    Report it.
    • Do the above on all files.

    Many elegant solutions. Raymond Hettinger is very quick, posted a
    solution only after a hour or so when i posted it. Many others are
    very short, very nice. Thank you all for writing them. I haven't
    studied them yet. I'll run them all and post a summary in 2 days. (i
    have few thousands files to run this test thru, many of them have
    mismatched brackets. So i have good data to test with.)

    PS we still lack a perl, Scheme lisp, tcl, lua versions. These
    wouldn't be hard and would be interesting to read. If you are picking
    up one of these lang, this would be a good exercise. Haskell too. I
    particularly would like to see a javascript version ran from command
    line. Maybe somebody can put this exercise to Google folks ... they
    are like the js gods.

    also, now that we have these home-brewed code, how'd a parser expert
    do it? Is it possible to make it even simpler by using some parser
    tools? (have no idea what those lex yacc do, or modern incarnations)
    I've also been thinking whether this can be done with Parsing
    Expression Grammar. That would make the code semantics really elegant
    (as opposed home-cooked stack logic).

    Xah

    ;; -*- coding: utf-8 -*-
    ;; 2011-07-15, Xah Lee
    ;; go thru a file, check if all brackets are properly matched.
    ;; e.g. good: (…{…}… “…â€â€¦)
    ;; bad: ( [)]
    ;; bad: ( ( )

    (setq inputDir "~/web/xahlee_org/p/") ; must end in slash

    (defvar matchPairs '() "a alist. For each air, the car is opening
    char, cdr is closing char.")

    (setq matchPairs '(
    ("(" . ")")
    ("{" . "}")
    ("[" . "]")
    ("“" . "â€")
    ("‹" . "›")
    ("«" . "»")
    ("ã€" . "】")
    ("〈" . "〉")
    ("《" . "》")
    ("「" . "ã€")
    ("『" . "ã€")
    )
    )

    (defvar searchRegex "" "regex string of all pairs to search.")
    (setq searchRegex "")
    (mapc
    (lambda (mypair) ""
    (setq searchRegex (concat searchRegex (regexp-quote (car mypair))
    "|" (regexp-quote (cdr mypair)) "|") )
    )
    matchPairs)

    (setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t
    t)) ; remove the ending “|â€

    (setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t
    t)) ; change | to \\| for regex “or†operation

    (defun my-process-file (fpath)
    "process the file at fullpath fpath ..."
    (let (myBuffer (ii 0) myStack ξchar ξpos)

    (setq myStack '() ) ; each element is a vector [char position]
    (setq ξchar "")

    (setq myBuffer (get-buffer-create " myTemp"))
    (set-buffer myBuffer)
    (insert-file-contents fpath nil nil nil t)

    (goto-char 1)
    (while (search-forward-regexp searchRegex nil t)
    (setq ξpos (point) )
    (setq ξchar (buffer-substring-no-properties ξpos (- ξpos 1)) )

    ;; (princ (format "-----------------------------\nfound char: %s
    \n" ξchar) )

    (let ((isClosingCharQ nil) (matchedOpeningChar nil) )
    (setq isClosingCharQ (rassoc ξchar matchPairs))
    (when isClosingCharQ (setq matchedOpeningChar (car
    isClosingCharQ) ) )

    ;; (princ (format "isClosingCharQ is: %s\n" isClosingCharQ) )
    ;; (princ (format "matchedOpeningChar is: %s\n"
    matchedOpeningChar) )

    (if
    (and
    (car myStack) ; not empty
    (equal (elt (car myStack) 0) matchedOpeningChar )
    )
    (progn
    ;; (princ (format "matched this bottom item on stack: %s
    \n" (car myStack)) )
    (setq myStack (cdr myStack) )
    )
    (progn
    ;; (princ (format "did not match this bottom item on
    stack: %s\n" (car myStack)) )
    (setq myStack (cons (vector ξchar ξpos) myStack) ) )
    )
    )
    ;; (princ "current stack: " )
    ;; (princ myStack )
    ;; (terpri )
    )

    (when (not (equal myStack nil))
    (princ "Error file: ")
    (princ fpath)
    (print (car myStack) )
    )
    (kill-buffer myBuffer)
    ))


    ;; (require 'find-lisp)

    (let (outputBuffer)
    (setq outputBuffer "*xah match pair output*" )
    (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
    )
     
    Xah Lee, Jul 18, 2011
    #11
  12. Xah Lee

    Guest

    Re: a little parsing challenge ?

    On Sun, 17 Jul 2011 00:47:42 -0700 (PDT), Xah Lee <> wrote:

    >2011-07-16
    >
    >folks, this one will be interesting one.
    >
    >the problem is to write a script that can check a dir of text files
    >(and all subdirs) and reports if a file has any mismatched matching
    >brackets.
    >

    [snip]
    >i hope you'll participate. Just post solution here. Thanks.
    >


    I have to hunt for a job so I'm not writing a solution for you.
    Here is a thin regex framework that may get you started.

    -sln

    ---------------------

    use strict;
    use warnings;

    my @samples = qw(
    A98(y[(np)r]x)tp[kk]a.exeb
    A98(y[(np)r]x)tp[kk]a}.exeb
    A98(‹ynprx)tpk›ka.mpeg
    ‹A98(ynprx)tpk›ka
    “A9«8(yn«pr{{[g[x].}*()+}»)tpkka».”
    “A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».”
    “A9«8(yn«pr»)tpkka».”
    “A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
    “A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
    );

    my $regex = qr/

    ^ (?&FileName) $

    (?(DEFINE)

    (?<Delim>
    \( (?&Content) \)
    | \{ (?&Content) \}
    | \[ (?&Content) \]
    | \“ (?&Content) \”
    | \‹ (?&Content) \›
    | \« (?&Content) \»
    # add more here ..
    )

    (?<Content>
    (?: (?> [^(){}\[\]“”‹›«»]+ ) # add more here ..
    | (?&Delim)
    )*
    )

    (?<FileName>
    (?&Content)
    )
    )
    /x;


    for (@samples)
    {
    print "$_ - ";
    if ( /$regex/ ) {
    print "passed \n";
    }
    else {
    print "failed \n";
    }
    }

    __END__

    Output:

    A98(y[(np)r]x)tp[kk]a.exeb - passed
    A98(y[(np)r]x)tp[kk]a}.exeb - failed
    A98(‹ynprx)tpk›ka.mpeg - failed
    ‹A98(ynprx)tpk›ka - passed
    “A9«8(yn«pr{{[g[x].}*()+}»)tpkka».” - failed
    “A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».” - passed
    “A9«8(yn«pr»)tpkka».” - passed
    “A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - passed
    “A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - failed
     
    , Jul 18, 2011
    #12
  13. Xah Lee

    Mark Tarver Guest

    On Jul 17, 8:47 am, Xah Lee <> wrote:
    > 2011-07-16
    >
    > folks, this one will be interesting one.
    >
    > the problem is to write a script that can check a dir of text files
    > (and all subdirs) and reports if a file has any mismatched matching
    > brackets.
    >
    > • The files will be utf-8 encoded (unix style line ending).
    >
    > • If a file has mismatched matching-pairs, the script will display the
    > file name, and the  line number and column number of the first
    > instance where a mismatched bracket occures. (or, just the char number
    > instead (as in emacs's “pointâ€))
    >
    > • the matching pairs are all single unicode chars. They are theseand
    > nothing else: () {} [] “†‹› «»ã€ã€‘ 〈〉 《》 「〠『ã€
    > Note that ‘single curly quote’ is not consider matching pair here.
    >
    > • You script must be standalone. Must not be using some parser tools.
    > But can call lib that's part of standard distribution in your lang.
    >
    > Here's a example of mismatched bracket: ([)], (“[[â€), ((,】etc. (and
    > yes, the brackets may be nested. There are usually text between these
    > chars.)
    >
    > I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
    > other lang implementations. In particular, perl, python, php, ruby,
    > tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
    > (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
    > Scala, C, C++, or any others, are all welcome, but i won't be able to
    > eval it. javascript implementation will be very interesting too, but
    > please indicate which and where to install the command line version.
    >
    > I hope you'll find this a interesting “challengeâ€. This is a parsing
    > problem. I haven't studied parsers except some Wikipedia reading, so
    > my solution will probably be naive. I hope to see and learn from your
    > solution too.
    >
    > i hope you'll participate. Just post solution here. Thanks.
    >
    >  Xah


    Parsing technology based on BNF enables an elegant solution. First
    take a basic bracket balancing program which parenthesises the
    contents of the input. e.g. in Shen-YACC

    (defcc <br>
    "(" <br> ")" <br$> := [<br> | <br$>];
    <item> <br>;
    <e> := [];)

    (defcc <br$>
    <br>;)

    (defcc <item>
    -*- := (if (element? -*- ["(" ")"]) (fail) [-*-]);)

    Given (compile <br> ["(" 1 2 3 ")" 4]) the program produces [[1 2 3]
    4]. When this program is used to parse the input, whatever residue is
    left indicates where the parse has failed. In Shen-YACC

    (define tellme
    Stuff -> (let Br (<br> (@p Stuff []))
    Residue (fst Br)
    (if (empty? Residue)
    (snd Br)
    (error "parse failure at position ~A~%"
    (- (length Stuff) (length Residue))))))

    e.g.

    (tellme ["(" 1 2 3 ")" "(" 4])
    parse failure at position 5

    (tellme ["(" 1 2 3 ")" "(" ")" 4])
    [[1 2 3] [] 4]

    The extension of this program to the case described is fairly simple.
    Qi-YACC is very similar.

    Nice problem.

    I do not have further time to correspond right now.

    Mark
     
    Mark Tarver, Jul 20, 2011
    #13
  14. On 18.07.2011 16:39, Xah Lee wrote:
    >
    > On Jul 17, 12:47 am, Xah Lee<> wrote:
    >> 2011-07-16
    >>
    >> folks, this one will be interesting one.
    >>
    >> the problem is to write a script that can check a dir of text files
    >> (and all subdirs) and reports if a file has any mismatched matching
    >> brackets.
    >> …

    >
    > Ok, here's my solution (pasted at bottom). I haven't tried to make it
    > elegant or terse, yet, seeing that many are already much elegent than
    > i could possibly do so with my code.
    >
    > my solution basically use a stack. (i think all of us are doing
    > similar) Here's the steps:
    >
    > • Go thru the file char by char, find a bracket char.
    > • check if the one on stack is a matching opening char. If so remove
    > it. Else, push the current onto the stack.
    > • Repeat the above till end of file.
    > • If the stack is not empty, then the file got mismatched brackets.
    > Report it.
    > • Do the above on all files.


    Small correction: my solution works differently (although internally the
    regexp engine will roughly do the same). So, my approach summarized

    - traverse a directory tree
    - for each found item of type "file"
    - read the whole content
    - throw it at a regexp which is anchored at the beginning
    and does the recursive parsing
    - report file if the match is shorter than the file

    Note: special feature for recursive matching is used which Perl's regexp
    engine likely can do as well but many others don't.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jul 20, 2011
    #14
  15. Xah Lee

    Xah Lee Guest

    i've just cleaned up my elisp code and wrote a short elisp tutorial.

    Here:

    〈Emacs Lisp: Batch Script to Validate Matching Brackets〉
    http://xahlee.org/emacs/elisp_validate_matching_brackets.html

    plain text version follows. Please let me know what you think.

    am still working on going thru all code in other langs. Will get to
    the ruby one, and that perl regex, and the other fixed python ones.
    (possibly also the 2 common lisp codes but am not sure they are
    runnable as is or just some non-working showoff. lol)

    ===============================================
    Emacs Lisp: Batch Script to Validate Matching Brackets

    Xah Lee, 2011-07-19

    This page shows you how to write a elisp script that checks thousands
    of files for mismatched brackets.

    ----------------------------------------------------------------
    The Problem

    ------------------------------------------------
    Summary

    I have 5 thousands files containing many matching pairs. I want to to
    know if any of them contains mismatched brackets.

    ------------------------------------------------
    Detail

    The matching pairs includes these: () {} [] “†‹› «» 〈〉 《》 ã€ã€‘ 〖〗 「ã€
    『ã€.

    The program should be able to check all files in a dir, and report any
    file that has mismatched bracket, and also indicate the line number or
    positon where a mismatch occurs.

    For those curious, if you want to know what these brackets are, see:

    • Syntax Design: Use of Unicode Matching Brackets as Specialized
    Delimiters
    • Intro to Chinese Punctuation with Computer Language Syntax
    Perspectives

    For other notes and conveniences about dealing with brackets in emacs,
    see:

    • Emacs: Defining Keys to Navigate Brackets
    • “extend-selection†at A Text Editor Feature: Extend Selection by
    Semantic Unit
    • “select-text-in-quote†at Suggestions on Emacs's mark-word
    Command

    ----------------------------------------------------------------
    Solution

    Here's outline of steps.

    • Go thru the file char by char, find a bracket char.
    • Check if the one on stack is a matching opening char. If so
    remove it. Else, push the current onto the stack.
    • Repeat the above till no more bracket char in the file.
    • If the stack is not empty, then the file got mismatched
    brackets. Report it.
    • Do the above on all files.

    Here's some interesting use of lisp features to implement the above.

    ------------------------------------------------
    Define Matching Pair Chars as “alistâ€

    We begin by defining the chars we want to check, as a “association
    list†(aka “alistâ€). Like this:

    (setq matchPairs '(
    ("(" . ")")
    ("{" . "}")
    ("[" . "]")
    ("“" . "â€")
    ("‹" . "›")
    ("«" . "»")
    ("ã€" . "】")
    ("〖" . "〗")
    ("〈" . "〉")
    ("《" . "》")
    ("「" . "ã€")
    ("『" . "ã€")
    )
    )

    If you care only to check for curly quotes, you can remove elements
    above. This is convenient because some files necessarily have
    mismatched pairs such as the parenthesis, because that char is used
    for many non-bracketing purposes (e.g. ASCII smiley).

    A “alist†in lisp is basically a list of pairs (called key and value),
    with the ability to search for a key or a value. The first element of
    a pair is called its key, the second element is its value. Each pair
    is a “consâ€, like this: (cons mykey myvalue), which can also be
    written using this syntax: (mykey . myvalue) for more easy reading.

    The purpose of lisp's “alist†is similar to Python's dictionary or
    Pretty Home Page's array. It is also similar to hashmap, except that
    alist can have duplicate keys, can search by values, maintains order,
    and alist is not intended for massive number of elements. Elisp has a
    hashmap datatype if you need that. (See: Emacs Lisp Tutorial: Hash
    Table.)

    (info "(elisp) Association Lists")

    ------------------------------------------------
    Generate Regex String from alist

    To search for a set of chars in emacs, we can read the buffer char-by-
    char, or, we can simply use “search-forward-regexpâ€. To usethat,
    first we need to generate a regex string from our matchPairs alist.

    First, we defines/declare the string. Not a necessary step, but we do
    it for clarity.

    (setq searchRegex "")

    Then we go thru the matchPairs alist. For each pair, we use “car†and
    “cdr†to get the chars and “concat†it to the string. Like this:

    (mapc
    (lambda (mypair) ""
    (setq searchRegex (concat searchRegex (regexp-quote (car mypair))
    "|" (regexp-quote (cdr mypair)) "|") )
    )
    matchPairs)

    Then we remove the ending “|â€.

    (setq searchRegex (substring searchRegex 0 -1)) ; remove the ending
    “|â€

    Then, change | it to \\|. In elisp regex, the | is literal. The “regex
    or†is \|. And if you are using regex in elisp, elisp does not havea
    special regex string syntax, it only understands normal strings. So,
    to feed to regex \|, you need to espace the first backslash. So, your
    regex needs to have \\|. Here's how we do it:

    (setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t
    t)) ; change | to \\| for regex “or†operation

    You could shorten the above into just 2 lines by using \\| in the
    “mapc†step and not as a extra step of replacing | by \\|.

    See also: emacs regex tutorial.

    ------------------------------------------------
    Implement Stack Using Lisp List

    Stack is done using lisp's list. e.g. '(1 2 3). The bottom of stack is
    the first element. To add to the stack, do it like this: (setq mystack
    (cons newitem mystack)). To remove a item from stack is this: (setq
    mystack (cdr mystack)). The stack begin as a empty list: '().

    For each element in the stack, we need the char and also its position,
    so that we can report the position if the file does have mismatched
    pairs.

    We use a vector as entries for the stack. Each entry is like this:
    (vector char pos). (See: Emacs Lisp Tutorial: List & Vector.)

    Here's how to fetch a char from stack bottom, check if current char
    matches, push to stack, pop from stack.

    ; check if current char is a closing char and is in our match pairs
    alist.
    ; use “rassoc†to check alist's set of “valuesâ€.
    ; It returns the first key/value pair found, or nil
    (rassoc char matchPairs)

    ; add to stack
    (setq myStack (cons (vector char pos) myStack) )

    ; pop stack
    (setq myStack (cdr myStack) )

    ------------------------------------------------
    Complete Code

    Here's the complete code.

    ;; -*- coding: utf-8 -*-
    ;; 2011-07-15
    ;; go thru a file, check if all brackets are properly matched.
    ;; e.g. good: (…{…}… “…â€â€¦)
    ;; bad: ( [)]
    ;; bad: ( ( )

    (setq inputFile "xx_test_file.txt" ) ; a test file.
    (setq inputDir "~/web/xahlee_org/") ; must end in slash

    (defvar matchPairs '() "a alist. For each pair, the car is opening
    char, cdr is closing char.")
    (setq matchPairs '(
    ("(" . ")")
    ("{" . "}")
    ("[" . "]")
    ("“" . "â€")
    ("‹" . "›")
    ("«" . "»")
    ("ã€" . "】")
    ("〖" . "〗")
    ("" . "")
    ("" . "")
    ("「" . "ã€")
    ("『" . "ã€")
    )
    )

    (defvar searchRegex "" "regex string of all pairs to search.")
    (setq searchRegex "")
    (mapc
    (lambda (mypair) ""
    (setq searchRegex (concat searchRegex (regexp-quote (car mypair))
    "|" (regexp-quote (cdr mypair)) "|") )
    )
    matchPairs)

    (setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t
    t)) ; remove the ending “|â€

    (setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t
    t)) ; change | to \\| for regex “or†operation

    (defun my-process-file (fpath)
    "process the file at fullpath fpath ..."
    (let (myBuffer myStack ξchar ξpos)

    (setq myStack '() ) ; each element is a vector [char position]
    (setq ξchar "") ; the current char found

    (when t
    ;; (not (string-match "/xx" fpath)) ; in case you want to skip
    certain files

    (setq myBuffer (get-buffer-create " myTemp"))
    (set-buffer myBuffer)
    (insert-file-contents fpath nil nil nil t)

    (goto-char 1)
    (setq case-fold-search t)
    (while (search-forward-regexp searchRegex nil t)
    (setq ξpos (point) )
    (setq ξchar (buffer-substring-no-properties ξpos (- ξpos
    1)) )

    ;; (princ (format "-----------------------------\nfound char:
    %s\n" ξchar) )

    (let ((isClosingCharQ nil) (matchedOpeningChar nil) )
    (setq isClosingCharQ (rassoc ξchar matchPairs))
    (when isClosingCharQ (setq matchedOpeningChar (car
    isClosingCharQ) ) )

    ;; (princ (format "isClosingCharQ is: %s\n"
    isClosingCharQ) )
    ;; (princ (format "matchedOpeningChar is: %s\n"
    matchedOpeningChar) )

    (if
    (and
    (car myStack) ; not empty
    (equal (elt (car myStack) 0) matchedOpeningChar )
    )
    (progn
    ;; (princ (format "matched this bottom item on stack:
    %s\n" (car myStack)) )
    (setq myStack (cdr myStack) )
    )
    (progn
    ;; (princ (format "did not match this bottom item on
    stack: %s\n" (car myStack)) )
    (setq myStack (cons (vector ξchar ξpos) myStack) ) )
    )
    )
    ;; (princ "current stack: " )
    ;; (princ myStack )
    ;; (terpri )
    )

    (when (not (equal myStack nil))
    (princ "Error file: ")
    (princ fpath)
    (print (car myStack) )
    )
    (kill-buffer myBuffer)
    )
    ))

    (require 'find-lisp)

    (let (outputBuffer)
    (setq outputBuffer "*xah match pair output*" )
    (with-output-to-temp-buffer outputBuffer
    ;; (my-process-file inputFile)
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.txt$"))
    (princ "Done deal!")
    )
    )

    I added many comments and debug code for easy understanding. If you
    are not familiar with the many elisp idioms such as opening file,
    buffers, printing to output, see: Emacs Lisp Idioms (for writing
    interactive commands) â—‡ Text Processing with Emacs Lisp Batch Style..

    To run the code, simply open it in emacs. Edit the line at the top for
    “inputDirâ€. Then call “eval-bufferâ€.

    Here's a sample output:

    Error file: c:/Users/h3/web/xahlee_org/p/time_machine/
    Hettie_Potter_orig.txt
    [")" 3625]
    Error file: c:/Users/h3/web/xahlee_org/p/time_machine/
    Hettie_Potter.txt
    [")" 2338]
    Error file: c:/Users/h3/web/xahlee_org/p/arabian_nights/xx/v1fn.txt
    ["â€" 185795]
    Done deal!

    The weird ξ you see in my code is greek x. I use unicode char in
    variable name for experimental purposes. You can just ignore it. (See:
    Programing Style: Variable Naming: English Words Considered Harmful.)

    ------------------------------------------------
    Advantages of Emacs Lisp

    Note that the great advantage of using elisp for text processing,
    instead of {perl, python, ruby, …} is that many things are taken care
    by the emacs environment.

    I don't need to write code to declare file's encoding (emacs
    automatically detects). No reading file is involved. Just open, save,
    or move thru characters. No code needed for doing safety backup. (the
    var “make-backup-files†controls that). You can easily openthe files
    by its path with a click or key press. I can add just 2 lines so that
    clicking on the error char in the output jumps to the location in the
    file.

    Any elisp script you write inside emacs automatically become extension
    of emacs and can be used in a interactive way.

    This problem is posted to a few comp.lang newsgroups as a fun
    challenge. You can see several solutions in python, ruby, perl, common
    lisp, at: a little parsing challenge ☺ (2011-07-17) @ Source
    groups.google.com.

    Xah
     
    Xah Lee, Jul 20, 2011
    #15
  16. Xah Lee

    Uri Guttman Guest

    a better parsing challenge. how can you parse usenet to keep this troll
    from posting on the wrong groups on usenet? first one to do so, wins the
    praise of his peers. 2nd one to do it makes sure the filter stays in
    place. all the rest will be rewarded by not seeing the troll anymore.

    anyone who actually engages in a thread with the troll should parse
    themselves out of existance.

    uri

    --
    Uri Guttman -- uri AT perlhunter DOT com --- http://www.perlhunter.com --
    ------------ Perl Developer Recruiting and Placement Services -------------
    ----- Perl Code Review, Architecture, Development, Training, Support -------
     
    Uri Guttman, Jul 20, 2011
    #16
  17. Xah Lee

    rusi Guest

    On Jul 20, 9:31 pm, "Uri Guttman" <> wrote:
    > a better parsing challenge. how can you parse usenet to keep this troll
    > from posting on the wrong groups on usenet? first one to do so, wins the
    > praise of his peers. 2nd one to do it makes sure the filter stays in
    > place. all the rest will be rewarded by not seeing the troll anymore.
    >
    > anyone who actually engages in a thread with the troll should parse
    > themselves out of existance.


    Goedelian paradox: Is this thread in existence?
     
    rusi, Jul 20, 2011
    #17
  18. >>>>> "Uri" == Uri Guttman <> writes:

    Uri> a better parsing challenge. how can you parse usenet to keep this troll
    Uri> from posting on the wrong groups on usenet? first one to do so, wins the
    Uri> praise of his peers. 2nd one to do it makes sure the filter stays in
    Uri> place. all the rest will be rewarded by not seeing the troll anymore.

    Uri> anyone who actually engages in a thread with the troll should parse
    Uri> themselves out of existance.

    Since the newsgroups: line is not supposed to have spaces in it, that
    makes both his post and your post invalid. Hence, filter on invalid
    posts.

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
    See http://methodsandmessages.posterous.com/ for Smalltalk discussion
     
    Randal L. Schwartz, Jul 20, 2011
    #18
  19. On Thu, 21 Jul 2011 02:31 am Uri Guttman wrote:

    >
    > a better parsing challenge. how can you parse usenet to keep this troll
    > from posting on the wrong groups on usenet? first one to do so, wins the
    > praise of his peers. 2nd one to do it makes sure the filter stays in
    > place. all the rest will be rewarded by not seeing the troll anymore.
    >
    > anyone who actually engages in a thread with the troll should parse
    > themselves out of existance.


    Somebody should make one of those "Douchebag foo" image memes. As in,


    Douchebag Poster

    - Complains about people engaging with cross-posting trolls

    - Cross-posts to five newsgroups while engaging with troll




    --
    Steven
     
    Steven D'Aprano, Jul 21, 2011
    #19
  20. Xah Lee

    Uri Guttman Guest

    >>>>> "SD" == Steven D'Aprano <> writes:

    SD> On Thu, 21 Jul 2011 02:31 am Uri Guttman wrote:
    >>
    >> a better parsing challenge. how can you parse usenet to keep this troll
    >> from posting on the wrong groups on usenet? first one to do so, wins the
    >> praise of his peers. 2nd one to do it makes sure the filter stays in
    >> place. all the rest will be rewarded by not seeing the troll anymore.
    >>
    >> anyone who actually engages in a thread with the troll should parse
    >> themselves out of existance.


    SD> Somebody should make one of those "Douchebag foo" image memes. As in,


    SD> Douchebag Poster

    SD> - Complains about people engaging with cross-posting trolls

    SD> - Cross-posts to five newsgroups while engaging with troll

    how would you then ejimicate all the groups to ignore the troll? the
    troll has been doing this for years and many ignore/killfile him. yet he
    always lands a few fish who xpost replies. he thrives on this.

    uri

    --
    Uri Guttman -- uri AT perlhunter DOT com --- http://www.perlhunter.com --
    ------------ Perl Developer Recruiting and Placement Services -------------
    ----- Perl Code Review, Architecture, Development, Training, Support -------
     
    Uri Guttman, Jul 21, 2011
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. ThaDoctor
    Replies:
    3
    Views:
    407
    Alan Woodland
    Sep 28, 2007
  2. Xah Lee

    a little parsing challenge ☺

    Xah Lee, Jul 17, 2011, in forum: Python
    Replies:
    70
    Views:
    1,316
    John O'Hagan
    Jul 25, 2011
  3. henon

    [Q] little regexp challenge

    henon, Oct 16, 2003, in forum: Ruby
    Replies:
    5
    Views:
    134
    Robert Klemme
    Oct 17, 2003
  4. Intransition
    Replies:
    21
    Views:
    639
    Josh Cheek
    Jun 11, 2011
  5. Xah Lee
    Replies:
    21
    Views:
    853
    Robert Klemme
    Jul 22, 2011
Loading...

Share This Page