regular expressions and matching delimiters

hymie! · May 21, 2014

Greetings.

I may be asking the wrong question, so I'll start here:

Is it possible, through regular expressions or some other method,
to parse a string based on matching delimiters?

The "string" that I have is actually a variable declaration for a
Javascript program. I don't want to actually *run* Javascript. All
I want is the data, and right now, this is the only way I can get the
data. It looks something like this:

var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
"loc":"Room 101"}];

So I can't just look for {(.*?)} because the braces will not
necessarily be a matched pair. I want to ensure that I can pull out an
entire record, and then pull entire fields out of the record. I'm also
not in a position to guarantee any specific maximum level of nesting.

My vi clone can find matching braces and brackets and parentheses,
so I know it's **possible**. The question is, am **I** good enough
to do it?

Can somebody give me the push I need to work this out?

--hymie! http://lactose.homelinux.net/~hymie (e-mail address removed)
-------------------------------------------------------------------------------

Rainer Weikusat · May 21, 2014

I may be asking the wrong question, so I'll start here:

Is it possible, through regular expressions or some other method,
to parse a string based on matching delimiters?

The "string" that I have is actually a variable declaration for a
Javascript program. I don't want to actually *run* Javascript. All
I want is the data, and right now, this is the only way I can get the
data. It looks something like this:

var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
"loc":"Room 101"}];

This looks very suspiciously like JSON ('Javascript Object
Notation'). Unsurprisingly, there's a module for dealing with that (the
first one I found),

http://search.cpan.org/~makamaka/JSON-2.53/lib/JSON.pm

NB: I can't comment on the code itself.
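
A minimal sketch of the module approach, assuming the declaration is already in a string and always has the form `var NAME = ...;`. It uses JSON::PP, which ships with Perl since 5.14 and offers the same `decode_json` function as the JSON module linked above; the variable names are illustrative only.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Sample input in the shape shown in the original post.
my $js = <<'END';
var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
"loc":"Room 101"}];
END

# Strip the JavaScript wrapper so that only the JSON text remains.
(my $json = $js) =~ s/\A\s*var\s+\w+\s*=\s*//;
$json =~ s/;\s*\z//;

# decode_json returns a reference to an array of hash references here.
my $records = decode_json($json);

for my $rec (@$records) {
    my @names = map { $_->{name} } @{ $rec->{people} };
    printf "item %d in %s: %s\n", $rec->{item}, $rec->{loc}, join(', ', @names);
}
```

No brace matching is needed with this approach: nesting of any depth ends up as ordinary Perl arrays and hashes.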

Thomas 'PointedEars' Lahn · May 21, 2014

hymie! wrote:
^^^^^^
This is Usenet. Please fix that.

Is it possible, through regular expressions or some other method,
to parse a string based on matching delimiters?
Yes.

The "string" that I have is actually a variable declaration for a
Javascript program.

There is no Javascript.

I don't want to actually *run* Javascript. All I want is the data, and
right now, this is the only way I can get the data.
Unlikely.

It looks something like this:

var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
"loc":"Room 101"}];

The RHS is most certainly JSON (JavaScript Object Notation) data. This code
only makes sense in client-side context; very likely it has been generated
by server-side code. So your situation is unclear.

So I can't just look for {(.*?)} because the braces will not
necessarily be a matched pair. I want to ensure that I can pull out an
entire record, and then pull entire fields out of the record. I'm also
not in a position to guarantee any specific maximum level of nesting.

Whenever it occurs to you that a single application of a single regular
expression could be sufficient to parse a word from a context-free language,
you should review your Chomsky hierarchy. That said, Perl supports an
extension of regular expressions that can parse recursive structures as far
as stack and memory permits. RTFM.
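
A minimal sketch of that extension, assuming Perl 5.10 or later (recursive subpatterns and possessive quantifiers). The pattern name and the sample input are made up for illustration; it treats double-quoted strings as units so that braces inside them do not count.

```perl
use strict;
use warnings;

# Matches one balanced {...} block. Quoted strings are consumed as units,
# so braces inside them do not affect the nesting.
my $balanced = qr/
    (                                  # group 1: a complete block
        \{
        (?:
            [^{}"]++                   # text that is neither brace nor quote
          | " (?: [^"\\]++ | \\. )*+ " # a double-quoted string with escapes
          | (?1)                       # a nested block: recurse into group 1
        )*+
        \}
    )
/xs;

# Illustrative input with nesting and with braces inside strings.
my $js = 'var list = [{"a":{"b":"}"},"c":[1,2]}, {"d":"x{y"}];';

# In list context, a global match returns group 1 of every match.
my @records = $js =~ /$balanced/g;
print "$_\n" for @records;
```

This only pulls out top-level records as text; getting at individual fields still needs a second pass or a real parser.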

My vi clone can find matching braces and brackets and parentheses,
so I know it's **possible**. The question is, am **I** good enough
to do it? Can somebody give me the push I need to work this out?

To both questions: Improbable, but not impossible.

--hymie! http://lactose.homelinux.net/~hymie
(e-mail address removed)

Signatures are to be delimited with “-- ” (hyphen-hyphen-space)

Thomas 'PointedEars' Lahn · May 22, 2014

Eli said:
In comp.lang.perl.misc, Thomas 'PointedEars' Lahn […] wrote:

It's attribution *line*, _not_ attribution novel. There is no crosspost, so
there is no need to specify the newsgroup. I use Reply-To so that I am less
spammed there; in the best case, only e-mails from real people using real
newsreaders would go there. Thanks to clueless idiots like you, crawlers
can now just harvest that carefully hidden address on any Web site mirroring
this newsgroup and spam me. FOAD.

There is nothing wrong there.

Wrong, there is no real name. Impolite.

The original header was:
[…]
Which is perfectly legit.

Internet is the thing with cables. Usenet is the thing with *people*.

Regular expressions can match based on "matching delimiters" but not
on arbitrarily nested "matching delimiters". (What perl's "regular
expressions" can do is more than "regular", but it is ill-advised to
try to code for that madness.)

Next time, read and post with your mind switched on, if any. TIA.

[…]

There is no Javascript. <http://PointedEars.de/es-matrix>

:r! lynx -source -dump http://PointedEars.de/es-matrix | grep -c
:javascript
292

A quick perusal of the source of your webpages shows that you use
javascript in them, so I call bullshit on your claim. When your site
has <script type="text/[SOMETHING BESIDES JAVASCRIPT]"> tags, please
do tell us. Until then your semantic games are tiresome.

I could not care less what pseudonymous wannabes like you call my claims.
If you had actually *read* what I referred to (a decade of work now) you
would have spared us reading and me replying to your stupid posting.

Elijah
------

Which part of “properly delimited signature” did you not understand?

also not a fan of seeing two images encoded in Usenet post headers

They are standards-compliant, and customary here.

Ben Bacarisse · May 22, 2014

Eli the Bearded said:
Regular expressions can match based on "matching delimiters" but not
on arbitrarily nested "matching delimiters". (What perl's "regular
expressions" can do is more than "regular", but it is ill-advised to
try to code for that madness.)

Is it? Can you explain?

I had a use-case to parse (and then interpret) a very simple lisp-like
language and I thought I'd give Perl's self-referential patterns a try.
It turned out to provide a very simple solution.

<snip>

Justin C · May 22, 2014

Absolutely nothing worth reading at all.

Justin.

Justin C · May 22, 2014

Is it? Can you explain?

I had a use-case to parse (and then interpret) a very simple lisp-like
language and I thought I'd give Perl's self-referential patterns a try.
It turned out to provide a very simple solution.

I think the suggestion of "madness" is because it's been done before
and the truly sensible method would be to use a module and just get
on with what you really want to be doing. I believe Text::Balanced
may also work for you if Parse::RecDescent hasn't solved the problem
already.
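
A rough sketch of the Text::Balanced route, assuming the core module's `extract_bracketed` with a delimiter specifier of `{"}` (count curly braces, skip double-quoted strings) and a prefix pattern that skips ahead to the next opening brace. The sample data is shortened from the original post and the loop structure is only one way to do it.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $js = 'var list = [{"item":1,"people":[{"name":"Joe"}],"loc":"Room 100"}, '
       . '{"item":2,"people":[{"name":"Tom"}],"loc":"Room 101"}];';

my @records;
my $rest = $js;
while (1) {
    # Second argument: which brackets to balance, plus the quote character
    # whose contents should be skipped.
    # Third argument: prefix pattern, here "everything up to the next {".
    my ($record, $remainder) = extract_bracketed($rest, '{"}', '[^{]*');
    last unless defined $record && length $record;
    push @records, $record;
    $rest = $remainder;
}

print "$_\n" for @records;
```

Each element of `@records` is still text, so this covers "pull out an entire record" but not "pull fields out of the record".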

Justin.

Ben Bacarisse · May 22, 2014

Justin C said:
I think the suggestion of "madness" is because it's been done before
and the truly sensible method would be to use a module and just get
on with what you really want to be doing. I believe Text::Balanced
may also work for you if Parse::RecDescent hasn't solved the problem
already.

OK, but (forgive me) that's standard advice. I got the feeling that
something more was being suggested, specifically aimed at Perl's
non-regular pattern matching.

Rainer Weikusat · May 22, 2014

Ben Bacarisse said:
Is it? Can you explain?

I had a use-case to parse (and then interpret) a very simple lisp-like
language and I thought I'd give Perl's self-referential patterns a try.
It turned out to provide a very simple solution.

The given case was somewhat different from that, namely, an
array-literal written in Javascript notation which contained 'Javascript
object literals' (equivalent to 'Perl hashes' for this case), which, in
turn, contained other object literals and other array literals
containing object literals, which, in turn, contained ... and so on.

There's the additional issue of quoting in here because there's exactly
one (AFAIK) sensible quoting syntax on this planet, namely, the one used
in HTML, which guarantees that 'special characters' don't appear
literally inside quoted constructs and whose quoted strings can thus be
analyzed by looking for the next ", but nobody uses that, likely
because that would make too much sense.

It is presumably possible to create a description of an automaton
capable of analyzing this correctly using the Perl 'regex' sub-language
but that's going to end up as insanely complex (but surely
'compressed'!) way to solve a relatively simple problem. Using a
recursive-descent parser which, in turn, uses regexes for lexical
analysis, will end up being more code but it will also be a lot more
accessible and flexible code and while 'optimizing this to the hilt in pure
Perl' may count as 'seriously manly deed', the result is going to be
beaten hands down by a much less "clevificient" C implementation and it
won't be necessary, anyway.
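
To make the comparison concrete, here is a small sketch of a recursive-descent parser that uses regexes only for lexical analysis (via `\G` and the `/gc` flags). All function names are hypothetical, the number and escape handling are simplified (no surrogate pairs, no strict validation), and it is not meant to replace a real JSON module.

```perl
use strict;
use warnings;

my %ESC = (b => "\b", f => "\f", n => "\n", r => "\r", t => "\t");

# Turn the body of a string literal into the string it denotes.
sub unescape {
    my ($s) = @_;
    $s =~ s{\\(?:u([0-9A-Fa-f]{4})|(.))}
           { defined $1 ? chr(hex($1)) : ($ESC{$2} // $2) }ges;
    return $s;
}

# Every parse_* function reads from the string referenced by $r and
# advances its match position (pos) as tokens are consumed.
sub parse_value {
    my ($r) = @_;
    return parse_object($r) if $$r =~ /\G\s*\{/gc;
    return parse_array($r)  if $$r =~ /\G\s*\[/gc;
    return unescape($1)     if $$r =~ /\G\s*"((?:[^"\\]|\\.)*)"/gcs;
    return 0 + $1           if $$r =~ /\G\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)/gc;
    return $1 eq 'true' ? 1 : $1 eq 'false' ? 0 : undef
                            if $$r =~ /\G\s*(true|false|null)\b/gc;
    die "Unexpected input at offset " . (pos($$r) // 0) . "\n";
}

sub parse_object {
    my ($r) = @_;
    my %obj;
    return \%obj if $$r =~ /\G\s*\}/gc;            # empty object
    while (1) {
        $$r =~ /\G\s*"((?:[^"\\]|\\.)*)"\s*:/gcs
            or die "Expected a key at offset " . (pos($$r) // 0) . "\n";
        my $key = unescape($1);
        $obj{$key} = parse_value($r);
        next if $$r =~ /\G\s*,/gc;
        last if $$r =~ /\G\s*\}/gc;
        die "Expected ',' or '}' at offset " . (pos($$r) // 0) . "\n";
    }
    return \%obj;
}

sub parse_array {
    my ($r) = @_;
    my @arr;
    return \@arr if $$r =~ /\G\s*\]/gc;            # empty array
    while (1) {
        push @arr, scalar parse_value($r);
        next if $$r =~ /\G\s*,/gc;
        last if $$r =~ /\G\s*\]/gc;
        die "Expected ',' or ']' at offset " . (pos($$r) // 0) . "\n";
    }
    return \@arr;
}

my $js = 'var list = [{"item":1,"tags":["tag1","tag2"],'
       . '"people":[{"name":"Joe","id":"1"}],"loc":"Room 100"}];';

# Consume the "var NAME =" wrapper, leaving pos() at the start of the data.
$js =~ /\A\s*var\s+\w+\s*=/gc or die "Not a variable declaration\n";
my $data = parse_value(\$js);

print $data->[0]{people}[0]{name}, " is in ", $data->[0]{loc}, "\n";
```

The nesting depth is handled by the call stack of the parse functions rather than by the regexes, which is why each individual pattern stays short.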

Ben Bacarisse · May 22, 2014

Rainer Weikusat said:
The given case was somewhat different from that,

Yes, I was not advocating for it in this case. I thought the comment
about madness was general and suggested something I should know about
Perl's supra-regular expressions.

[...] there's exactly
one (AFAIK) sensible quoting syntax on this planet, namely, the one used
in HTML, which guarantees that 'special characters' don't appear
literally inside quoted constructs and whose quoted strings

That's a good point.

[...] can thus be
analyzed by looking for the next ", but nobody uses that, likely
because that would make too much sense.

It matters less in some contexts, which might explain the persistence of
"traditional" quoting in, say, programming languages.

<snip>

Thomas 'PointedEars' Lahn · May 22, 2014

Justin said:
[Absolutely nothing worth reading at all.]

I thought so, too.

Thomas 'PointedEars' Lahn · May 22, 2014

Ben said:
Rainer Weikusat said:

[...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
namely, the one used in HTML, which guarantees that 'special characters'
don't appear literally inside quoted constructs and whose quoted strings

That's a good point.

How so? The JSON grammar is well-defined [1]; it is a subset of the
ECMAScript grammar. The regular expression for JSON string literals
therefore is rather simple and straightforward:

my $json_string = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

AISB, it is possible to parse such language with regular expressions; it is
just not (reasonably) possible with only one application of one regular
expression. Indeed, efficient parsers do support and use regular
expressions in their *lexer*.

[1] <http://json.org/>

hymie! · May 22, 2014

In our last episode, the evil Dr. Lacto had captured our hero,

(e-mail address removed) (hymie!) writes:

var list = [{"item":1,"tags":["tag1","tag2"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Larry","id":"2"}],
"loc":"Room 100"}, {"item":2,"tags":["tag2","tag3"],"day":"Friday",
"people":[{"name":"Joe","id":"1"},{"name":"Tom","id":"3"}],
"loc":"Room 101"}];

This looks very suspiciously like JSON ('Javascript Object
Notation'). Unsurprisingly, there's a module for dealing with that (the
first one I found),

Thanks for the tip.

--hymie! http://lactose.homelinux.net/~hymie (e-mail address removed)

Rainer Weikusat · May 22, 2014

Thomas 'PointedEars' Lahn said:
Ben said:

Rainer Weikusat said:

[...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
namely, the one used in HTML, which guarantees that 'special characters'
don't appear literally inside quoted constructs and whose quoted strings

That's a good point.

How so? The JSON grammar is well-defined [1]; it is a subset of the
ECMAScript grammar. The regular expression for JSON string literals
therefore is rather simple and straightforward:

my $json_string = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

In case \-escaping hadn't been used for quoting the delimiter, this could be
reduced to

$json_string = qr/"[^"]*"/

if the purpose was just to analyze Javascript 'object literals'.
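
A short comparison of the two patterns quoted above, assuming an input string that does contain backslash-escaped quotes (which JSON permits). It is only meant to show why the shorter pattern depends on the "no \-escaping" condition; the variable names other than `$json_string` are made up.

```perl
use strict;
use warnings;

# The two patterns quoted in the thread.
my $json_string   = qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;
my $simple_string = qr/"[^"]*"/;

# A JSON string literal that contains backslash-escaped quotes.
my $text = '"say \"hi\""';

# Capture the whole match of each pattern.
my ($full)  = $text =~ /($json_string)/;
my ($short) = $text =~ /($simple_string)/;

print "full:   $full\n";     # full:   "say \"hi\""
print "simple: $short\n";    # simple: "say \"
```

The short pattern stops at the first quote it sees, escaped or not, so it is only safe when the delimiter can never appear inside the string.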

Thomas 'PointedEars' Lahn · May 22, 2014

Rainer said:
Thomas 'PointedEars' Lahn said:

Ben said:

[...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
namely, the one used in HTML, which guarantees that 'special characters'
don't appear literally inside quoted constructs and whose quoted strings

That's a good point.

How so? The JSON grammar is well-defined [1]; it is a subset of the
ECMAScript grammar. The regular expression for JSON string literals
therefore is rather simple and straightforward:

my $json_string =
qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

In case \-escaping hadn't been used for quoting the delimiter, this could
be reduced to

$json_string = qr/"[^"]*"/

if the purpose was just to analyze Javascript 'object literals'.

Your point being? Even Perl recognizes the need for escape sequences like
\" in string literals. You fail to realize that HTML’s way of "escaping"
has a drawback, too: “&”, and the frequent syntax error of “unrecognized
entity reference” (and the requirement of an error correction in parsers to
cope with that) when the author did not intend an entity reference in the
first place. There is nothing sane about this way either, it is just a
different one.

Rainer Weikusat · May 22, 2014

Thomas 'PointedEars' Lahn said:
Rainer said:

Thomas 'PointedEars' Lahn said:

Ben Bacarisse wrote:
[...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
namely, the one used in HTML, which guarantees that 'special characters'
don't appear literally inside quoted constructs and whose quoted strings

That's a good point.

How so? The JSON grammar is well-defined [1]; it is a subset of the
ECMAScript grammar. The regular expression for JSON string literals
therefore is rather simple and straightforward:

my $json_string =
qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

In case \-escaping hadn't been used for quoting the delimiter, this could
be reduced to

$json_string = qr/"[^"]*"/

if the purpose was just to analyze Javascript 'object literals'.

Your point being?

That should be easy to gather from the text I wrote on this so far.

BTW: Replying is pointless.

Thomas 'PointedEars' Lahn · May 22, 2014

Rainer said:
Thomas 'PointedEars' Lahn said:

Rainer said:

Ben Bacarisse wrote:
[...] there's exactly one (AFAIK) sensible quoting syntax on this planet,
namely, the one used in HTML, which guarantees that 'special characters'
don't appear literally inside quoted constructs and whose quoted strings

That's a good point.

How so? The JSON grammar is well-defined [1]; it is a subset of the
ECMAScript grammar. The regular expression for JSON string literals
therefore is rather simple and straightforward:

my $json_string =
qr/"([^"\\\p{C}]|\\["\\\/bfnrt]|\\u[\dA-Fa-f]{4})*"/a;

In case \-escaping hadn't been used for quoting the delimiter, this
could be reduced to

$json_string = qr/"[^"]*"/

if the purpose was just to analyze Javascript 'object literals'.

Your point being?

That should be easy to gather from the text I wrote on this so far.

But it is not easy because you are actually not making a point. You have
only provided a not very convincing argument for your humble opinion.

Programming languages are different from markup languages, and so are their
escape mechanisms. I have explained to you why the HTML way is not “[the]
one sensible quoting on this planet”, why it is _not_ better than the
ECMAScript/Perl way /per se/; it is just – in your words – a different form
of senselessness.

If in your formal language string values must be delimited by a
non-whitespace character (YAML e.g. is different), you have only one out of
three choices:

One, not to allow delimiters within the delimited string at all, thereby
severely limiting the string values that can be expressed in your language.

Two, to allow for delimiters within the delimited string to be escaped in an
escape sequence that contains the delimiter (simplest case: preceded by
another character, say backslash) if they should lose their special meaning.

Three, to provide an escape sequence for the delimiter that does not contain
the delimiter. HTML and XML implement this one with the entity reference
“&…;” (whereas the trailing “;” has been made optional in HTML).

Now, the problems with quoting by entity reference are just not as obvious
as with quoting by prefix character. Here is an example to make it obvious
to you, hopefully:

<a href="/?foo=bar&baz=bla">…</a>

is a *syntax error* in HTML because “&baz” is an “unknown entity reference”.
But the author did not intend an entity reference in the first place, they
just wanted to delimit parts of the query-part of the URI-reference with
“&”. They can work around this issue if they are aware of the error (for
example, through <http://validator.w3.org/>):

<a href="/?foo=bar&amp;baz=bla">…</a>

But if they are not, parsers would have to work around the problem; they
would have to check against a table of entities in order to determine that
the syntactical entity reference could not reasonably have been intended to
be such one. And as HTML parsers in particular are built for
backwards-compatibility and robustness, and they do just that, the seemingly more simple
approach of not allowing delimiters within the escape sequence quickly
becomes more complicated for parsing than most people realize.
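
For the author-side workaround, a minimal Perl sketch of escaping a value before it is placed in a double-quoted HTML attribute. The helper name `escape_attr` and the entity table are illustrative assumptions; in practice a module such as HTML::Entities would normally be used instead.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special inside a
# double-quoted HTML attribute value.
my %ENTITY = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape_attr {
    my ($value) = @_;
    $value =~ s/([&<>"])/$ENTITY{$1}/g;
    return $value;
}

my $href = escape_attr('/?foo=bar&baz=bla');
print qq{<a href="$href">link</a>\n};   # <a href="/?foo=bar&amp;baz=bla">link</a>
```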

BTW: Replying is pointless.

Wanting to ignore reality is your problem, not mine.
