regular expressions and matching delimiters

Discussion in 'Perl Misc' started by hymie!, May 21, 2014.

  1. hymie!

    hymie! Guest


    I may be asking the wrong question, so I'll start here:

    Is it possible, through regular expressions or some other method,
    to parse a string based on matching delimiters?

    The "string" that I have is actually a variable declaration for a
    Javascript program. I don't want to actually *run* Javascript. All
    I want is the data, and right now, this is the only way I can get the
    data. It looks something like this:

    var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
    "loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
    "loc":"Room 101"}];

    So I can't just look for {(.*?)} because the braces will not
    necessarily be a matched pair. I want to ensure that I can pull out an
    entire record, and then pull entire fields out of the record. I'm also
    not in a position to guarantee any specific maximum level of nesting.

    My vi clone can find matching braces and brackets and parentheses,
    so I know it's **possible**. The question is, am **I** good enough
    to do it? :)

    Can somebody give me the push I need to work this out?

    hymie!, May 21, 2014

  2. This looks very suspiciously like JSON ('Javascript Object
    Notation'). Unsurprisingly, there's a module for dealing with that (the
    first one I found),

    NB: I can't comment on the code itself.
    Rainer Weikusat, May 21, 2014
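
    A minimal sketch of the module approach, assuming the core JSON::PP
    module (bundled with Perl since 5.14), which is not necessarily the
    module linked above. The two substitutions that strip the JavaScript
    wrapper assume the declaration always has the shape `var NAME = ...;`.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: any JSON module with decode_json would do

my $js = 'var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",'
       . '"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",'
       . '"loc":"Room 101"}];';

# Strip the JavaScript wrapper so that only the JSON array remains.
(my $json = $js) =~ s/\A\s*var\s+\w+\s*=\s*//;
$json =~ s/;\s*\z//;

# decode_json returns a reference to an array of hash references.
my $records = decode_json($json);

for my $rec (@$records) {
    printf "%d: %s, %s [%s]\n",
        $rec->{item}, $rec->{day}, $rec->{loc}, join(',', @{ $rec->{tags} });
}
```

    The module does the delimiter matching itself, so the nesting depth
    no longer matters.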

  3. hymie! wrote:
    This is Usenet. Please fix that.
    The RHS is most certainly JSON (JavaScript Object Notation) data. This code
    only makes sense in client-side context; very likely it has been generated
    by server-side code. So your situation is unclear.
    Whenever it occurs to you that a single application of a single regular
    expression could be sufficient to parse a word from a context-free language,
    you should review your Chomsky hierarchy. That said, Perl supports an
    extension of regular expressions that can parse recursive structures as far
    as stack and memory permits. RTFM.
    To both questions: Improbable, but not impossible.

    Signatures are to be delimited with “-- ” (hyphen-hyphen-space)
    Thomas 'PointedEars' Lahn, May 21, 2014
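
    The extension mentioned above is presumably the recursive subpattern
    syntax available since Perl 5.10. A rough sketch follows; the variable
    names and the sample data are made up for illustration. It matches one
    balanced {...} group and skips over double-quoted strings, so that a
    brace inside a string is not counted.

```perl
use strict;
use warnings;

my $balanced = qr/
    (                             # group 1: one brace-delimited record
        \{
        (?:
            [^{}"]++              # anything that is not a brace or a quote
          | "(?:[^"\\]|\\.)*+"    # a double-quoted string, escapes included
          | (?1)                  # a nested group, matched recursively
        )*+
        \}
    )
/xs;

# Sample input (an assumption): the first record has a brace inside a string.
my $text = 'var list = [{"item":1,"note":"odd } brace","sub":{"a":[1,2]}}, {"item":2}];';

my @records;
while ($text =~ /$balanced/g) {
    push @records, $1;
}
print "$_\n" for @records;
```

    This only isolates the top-level records; pulling fields out of each
    record still needs a further pass, or a real parser.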
  4. It's attribution *line*, _not_ attribution novel. There is no crosspost, so
    there is no need to specify the newsgroup. I use Reply-To so that I am less
    spammed there; in the best case, only e-mails from real people using real
    newsreaders would go there. Thanks to clueless idiots like you, crawlers
    can now just harvest that carefully hidden address on any Web site mirroring
    this newsgroup and spam me. FOAD.
    Wrong, there is no real name. Impolite.
    Internet is the thing with cables. Usenet is the thing with *people*.
    Next time, read and post with your mind switched on, if any. TIA.
    I could not care less what pseudonymous wannabes like you call my claims.
    If you had actually *read* what I referred to (a decade of work now) you
    would have spared us reading and me replying to your stupid posting.
    Which part of “properly delimited signature” did you not understand?
    They are standards-compliant, and customary here.
    Thomas 'PointedEars' Lahn, May 22, 2014
  5. Is it? Can you explain?

    I had a use-case to parse (and then interpret) a very simple lisp-like
    language and I thought I'd give Perl's self-referential patterns a try.
    It turned out to provide a very simple solution.

    Ben Bacarisse, May 22, 2014
  6. hymie!

    Justin C Guest

    Absolutely nothing worth reading at all.

    Justin C, May 22, 2014
  7. hymie!

    Justin C Guest

    I think the suggestion of "madness" is because it's been done before,
    and the truly sensible method would be to use a module and just get
    on with what you really want to be doing. I believe Text::Balanced
    may also work for you if Parse::RecDescent hasn't solved the problem.

    Justin C, May 22, 2014
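
    A small sketch of the Text::Balanced route, using extract_bracketed.
    The sample string is invented, and the delimiter spec '{"}' (balance
    braces, skip double-quoted strings) and the prefix pattern are choices
    made for this example only.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Invented sample: the first record has a closing brace inside a string.
my $text = '{"item":1,"tags":["tag1","tag2"],"loc":"Room } 100"}, '
         . '{"item":2,"tags":["tag2","tag3"],"loc":"Room 101"}';

my @records;
while (length $text) {
    # Third argument: prefix to skip (whitespace and commas between records).
    my ($record, $rest) = extract_bracketed($text, '{"}', '[\s,]*');
    last unless defined $record && length $record;
    push @records, $record;
    $text = $rest;
}
print "$_\n" for @records;
```

    As with any pure delimiter matcher, this yields each record as a
    string; the fields inside it are still unparsed.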
  8. OK, but (forgive me) that's standard advice. I got the feeling that
    something more was being suggested, specifically aimed at Perl's
    non-regular pattern matching.
    Ben Bacarisse, May 22, 2014
  9. The given case was somewhat different from that, namely, an
    array-literal written in Javascript notation which contained 'Javascript
    object literals' (equivalent to 'Perl hashes' for this case), which, in
    turn, contained other object literals and other array literals
    containing object literals, which, in turn, contained ... and so on.

    There's the additional issue of quoting in here because there's exactly
    one (AFAIK) sensible quoting syntax on this planet, namely, the one used
    in HTML, which guarantees that 'special characters' don't appear
    literally inside quoted constructs and whose quoted strings can thus be
    analyzed by looking for the next ", but nobody uses that, likely
    because that would make too much sense.

    It is presumably possible to create a description of an automaton
    capable of analyzing this correctly using the Perl 'regex' sub-language
    but that's going to end up as an insanely complex (but surely
    'compressed'!) way to solve a relatively simple problem. Using a
    recursive-descent parser which, in turn, uses regexes for lexical
    analysis, will end up being more code but it will also be a lot more
    accessible and flexible code and while 'optimizing this to the hilt in pure
    Perl' may count as 'seriously manly deed', the result is going to be
    beaten hands down by a much less "clevificient" C implementation and it
    won't be necessary, anyway.
    Rainer Weikusat, May 22, 2014
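
    To make the recursive-descent idea concrete, here is a rough sketch
    that uses \G-anchored regexes for the lexical part and plain recursion
    for the nesting. All sub names are hypothetical, error handling is
    minimal, and \u surrogate pairs are not combined, so treat it as an
    illustration rather than a replacement for a JSON module.

```perl
use strict;
use warnings;

# $src is a reference to the input; pos($$src) is the current position.
my %ESCAPE = ('"' => '"', '\\' => '\\', '/' => '/', b => "\b",
              f => "\f", n => "\n", r => "\r", t => "\t");

sub fail {
    my ($src, $what) = @_;
    die "Expected $what at offset " . (pos($$src) // 0) . "\n";
}

sub unescape {
    my ($s) = @_;
    $s =~ s{\\(?:u([0-9A-Fa-f]{4})|(.))}{defined $1 ? chr(hex $1) : ($ESCAPE{$2} // $2)}ges;
    return $s;
}

sub parse_value {
    my ($src) = @_;
    $$src =~ /\G\s+/gc;                                   # skip whitespace
    return parse_object($src) if $$src =~ /\G\{/gc;
    return parse_array($src)  if $$src =~ /\G\[/gc;
    return unescape($1)       if $$src =~ /\G"((?:[^"\\]|\\.)*)"/gcs;
    return 0 + $1             if $$src =~ /\G(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)/gc;
    if ($$src =~ /\G(true|false|null)/gc) {
        return $1 eq 'true' ? 1 : $1 eq 'false' ? 0 : undef;
    }
    fail($src, 'a value');
}

sub parse_array {
    my ($src) = @_;
    my @items;
    return \@items if $$src =~ /\G\s*\]/gc;               # empty array
    while (1) {
        push @items, parse_value($src);
        next if $$src =~ /\G\s*,/gc;
        return \@items if $$src =~ /\G\s*\]/gc;
        fail($src, "',' or ']'");
    }
}

sub parse_object {
    my ($src) = @_;
    my %fields;
    return \%fields if $$src =~ /\G\s*\}/gc;              # empty object
    while (1) {
        $$src =~ /\G\s*"((?:[^"\\]|\\.)*)"\s*:/gcs or fail($src, 'a key');
        my $key = unescape($1);
        $fields{$key} = parse_value($src);
        next if $$src =~ /\G\s*,/gc;
        return \%fields if $$src =~ /\G\s*\}/gc;
        fail($src, "',' or '}'");
    }
}

my $js = 'var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday","loc":"Room 100"},'
       . ' {"item":2,"tags":["tag2","tag3"],"day":"Friday","loc":"Room 101"}];';

# Position the parser just after the "var NAME =" prefix.
$js =~ /\A\s*var\s+\w+\s*=/gc or die "No declaration found\n";
my $list = parse_value(\$js);

print "$_->{item}: $_->{loc} (@{ $_->{tags} })\n" for @$list;
```

    Each regex only ever recognises a single token; the matching of
    delimiters falls out of the call stack.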
  10. Yes, I was not advocating for it in this case. I thought the comment
    about madness was general and suggested something I should know about
    Perl's supra-regular expressions.

    That's a good point.
    It matters less in some contexts, which might explain the persistence of
    "traditional" quoting in, say, programming languages.

    Ben Bacarisse, May 22, 2014
  11. I thought so, too.
    Thomas 'PointedEars' Lahn, May 22, 2014
  12. How so? The JSON grammar is well-defined [1]; it is a subset of the
    ECMAScript grammar. The regular expression for JSON string literals
    therefore is rather simple and straightforward:

    my $json_string = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

    AISB, it is possible to parse such language with regular expressions; it is
    just not (reasonably) possible with only one application of one regular
    expression. Indeed, efficient parsers do support and use regular
    expressions in their *lexer*.

    [1] <>
    Thomas 'PointedEars' Lahn, May 22, 2014
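
    As a sketch of what "more than one application" can look like in a
    lexer, the loop below applies one token pattern repeatedly at the
    current position. It reuses the string pattern quoted above; the
    punctuation and integer alternatives and the sample input are
    simplifying assumptions, not a full JSON token set.

```perl
use strict;
use warnings;

my $json_string = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

# Sample input (an assumption), with escaped quotes inside a string.
my $input = '{"loc":"Room \"100\"","tags":["a","b"],"n":2}';

my @tokens;
# One match per token; \G continues where the previous match ended and
# /c keeps that position when the loop finally fails to match.
while ($input =~ /\G\s*($json_string|[\[\]{}:,]|-?\d+)/gc) {
    push @tokens, $1;
}

my $consumed = (pos($input) // 0) == length $input;
print join(' ', @tokens), "\n";
print $consumed ? "whole input tokenized\n" : "stopped early\n";
```

    A parser would then consume @tokens and do the delimiter matching
    itself.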
  13. hymie!

    hymie! Guest

    In our last episode, the evil Dr. Lacto had captured our hero,
    Thanks for the tip.

    hymie!, May 22, 2014
  14. In case \-escaping hadn't been used for quoting the delimiter, this could be
    reduced to

    $json_string = qr/"[^"]*"/;

    if the purpose was just to analyze Javascript 'object literals'.
    Rainer Weikusat, May 22, 2014
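
    For comparison, a short sketch of where the reduced pattern and an
    escape-aware one differ. The variable names and the sample text are
    made up; the second pattern is a simplified stand-in for the longer
    one posted earlier in the thread.

```perl
use strict;
use warnings;

my $simple  = qr/"[^"]*"/;               # no escape handling
my $escaped = qr/"(?:[^"\\]|\\.)*"/s;    # allows backslash escapes

# Single-quoted, so the backslashes are literally part of the text.
my $text = '"say \"hi\"" and more';

my ($short) = $text =~ /\A($simple)/;    # ends at the first escaped quote
my ($full)  = $text =~ /\A($escaped)/;   # ends at the real closing quote

print "simple:  $short\n";
print "escaped: $full\n";
```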
  15. Your point being? Even Perl recognizes the need for escape sequences like
    \" in string literals. You fail to realize that HTML’s way of "escaping"
    has a drawback, too: “&amp;”, and the frequent syntax error of “unrecognized
    entity reference” (and the requirement of an error correction in parsers to
    cope with that) when the author did not intend an entity reference in the
    first place. There is nothing sane about this way either, it is just a
    different one.
    Thomas 'PointedEars' Lahn, May 22, 2014
  16. That should be easy to gather from the text I wrote on this so far.

    BTW: Replying is pointless.
    Rainer Weikusat, May 22, 2014
  17. That should be easy to gather from the text I wrote on this so far.
    Rainer Weikusat, May 22, 2014
  18. But it is not easy because you are actually not making a point. You have
    only provided a not very convincing argument for your humble opinion.

    Programming languages are different from markup languages, and so are their
    escape mechanisms. I have explained to you why the HTML way is not “[the]
    one sensible quoting on this planet”, why it is _not_ better than the
    ECMAScript/Perl way /per se/; it is just, in your words, a different form
    of senselessness.

    If in your formal language string values must be delimited by a
    non-whitespace character (YAML, e.g., is different), you have only one out of
    three options:

    One, not to allow delimiters within the delimited string at all, thereby
    severely limiting the string values that can be expressed in your language.

    Two, to allow for delimiters within the delimited string to be escaped in an
    escape sequence that contains the delimiter (simplest case: preceded by
    another character, say backslash) if they should lose their special meaning.

    Three, to provide an escape sequence for the delimiter that does not contain
    the delimiter. HTML and XML implement this one with the entity reference
    “&…;” (whereas the trailing “;” has been made optional in HTML).

    Now, the problems with quoting by entity reference are just not as obvious
    as with quoting by prefix character. Here is an example to make it obvious
    to you, hopefully:

    <a href="/?foo=bar&baz=bla">…</a>

    is a *syntax error* in HTML because “&baz” is an “unknown entity reference”.
    But the author did not intend an entity reference in the first place, they
    just wanted to delimit parts of the query-part of the URI-reference with
    “&”. They can work around this issue if they are aware of the error (for
    example, through <>):

    <a href="/?foo=bar&amp;baz=bla">…</a>

    But if they are not, parsers would have to work around the problem; they
    would have to check against a table of entities in order to determine that
    the syntactical entity reference could not reasonably have been intended to
    be such one. And as HTML parsers in particular are built for backwards
    compatibility and robustness, and they do just that, the seemingly more simple
    approach of not allowing delimiters within the escape sequence quickly
    becomes more complicated for parsing than most people realize.
    Wanting to ignore reality is your problem, not mine.
    Thomas 'PointedEars' Lahn, May 22, 2014
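
    The workaround described above can be sketched in a few lines. The
    helper name escape_attr is hypothetical, and the sketch covers only
    the characters that matter inside a double-quoted attribute value; a
    dedicated HTML escaping module would be the more robust choice.

```perl
use strict;
use warnings;

# Hypothetical helper. "&" is replaced first so that the ampersands
# introduced by the later substitutions are not escaped a second time.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/</&lt;/g;
    return $value;
}

my $href = '/?foo=bar&baz=bla';
printf qq{<a href="%s">...</a>\n}, escape_attr($href);
```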