confused constructing a regex

leeg

I have an input file of a format that looks something like this:

{
foo = (
{
bar = "baz";
wibble = WOBBLE;
},
{
bar = "barney";
wibble = JELLY;
}
);
someKey = someValue;
someArray = (value1, value2);
blankDict = {};
};

I've noticed (and at the time was fairly proud of said epiphany) that
this is almost a declaration of an anonymous hash and with a little
tweaking I could eval it as such. However, I need to quote it properly,
and despite a number of attempts can't construct a regex that will do it.
I want to search for a list of characters which are not the various
formatting characters [^\(\){};,=] *and* are not already surrounded by
quotes, and then surround them by quotes.

I thought of:
$line =~ s/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/"$1"/g;
but this converts the above into:
{
foo" "
{
bar = "baz";
wibble"E"
},
{
bar = "barney";
wibble"Y"
}
);
someKey"e"
someArray"1"value2);
blankDict" "
}

so isn't what I want. What I especially can't determine is why " =
someValue;" for instance would be replaced by "e". Could someone offer
some assistance?

Ta,

leeg.
 
Dave

leeg said:
I have an input file of a format that looks something like this:

{
foo = (
{
bar = "baz";
wibble = WOBBLE;
},
{
bar = "barney";
wibble = JELLY;
}
);
someKey = someValue;
someArray = (value1, value2);
blankDict = {};
};

I've noticed (and at the time was fairly proud of said epiphany) that this
is almost a declaration of an anonymous hash and with a little tweaking I
could eval it as such. However, I need to quote it properly, and despite
a number of attempts can't construct a regex that will do it.
I want to search for a list of characters which are not the various
formatting characters [^\(\){};,=] *and* are not already surrounded by
quotes, and then surround them by quotes.

I thought of:
$line =~ s/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/"$1"/g;
but this converts the above into:
{
foo" "
{
bar = "baz";
wibble"E"
},
{
bar = "barney";
wibble"Y"
}
);
someKey"e"
someArray"1"value2);
blankDict" "
}

so isn't what I want. What I especially can't determine is why " =
someValue;" for instance would be replaced by "e". Could someone offer
some assistance?

Ta,

leeg.

Your regex removes the "formatting characters" before and after the non
formatting characters block. Use lookahead/behind or capture them and put
them in. Also you are only capturing one character of the string you are
trying to quote, capture the whole string. i.e.:
$line =~ s/([\s\(\){};,=]+)([^"\(\){};,=]+)([\s\(\){};,=]+)/$1"$2"$3/g;

I'm not saying this will do what you want as I haven't looked into it in
detail, but it is clear that your original regex is deleting info that you
want to keep.
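To see both effects in isolation, here is a minimal sketch run against a single sample line. It assumes the file is processed one line at a time with the newline still attached; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $line = "someKey = someValue;\n";

# Original pattern: the + sits outside the capture group, so $1 keeps only
# the last character matched, and the delimiters on both sides are dropped.
(my $broken = $line) =~ s/[\s(){};,=]+([^"(){};,=])+[\s(){};,=]+/"$1"/g;

# Revised pattern: the + is inside the group, and the delimiters are
# captured and written back around the quoted text.
(my $fixed = $line) =~ s/([\s(){};,=]+)([^"(){};,=]+)([\s(){};,=]+)/$1"$2"$3/g;

print "original pattern: $broken\n";
print "revised pattern:  $fixed";
```

The first result is someKey"e", which matches the output in the question. The second keeps the = and ; but still leaves someKey unquoted, because nothing from the delimiter class precedes it at the start of the line.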

Dave
 
pikus

Oops, I'm thinking that should have been:

s/= ([^"].*)([;\n])/= "$1"$2/g;

I changed it to check for existing quotes... :)
 
leeg

pikus said:
would this work?

s/= (.*)([;\n])/= "$1"$2/g;

Maybe you were overthinking it?

Sadly not; my example data were too clean. For instance:
{className=PCPerson;name=PCPerson;},
would be (and indeed is) valid input, and should lead to:
{"className"="PCPerson";"name"="PCPerson";},
I've got something that works on my example data, but haven't fully
tested elsewhere, in this form:
$line =~ s/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/"$1"/g;
as I say I haven't completely tested it but it gets the job done in
simple cases.
 
Fabian Pilkowski

* leeg said:
I have an input file of a format that looks something like this:

{
foo = (
{
bar = "baz";
wibble = WOBBLE;
},
{
bar = "barney";
wibble = JELLY;
}
);
someKey = someValue;
someArray = (value1, value2);
blankDict = {};
};

I've noticed (and at the time was fairly proud of said epiphany) that
this is almost a declaration of an anonymous hash and with a little
tweaking I could eval it as such. However, I need to quote it properly,
and despite a number of attempts can't construct a regex that will do it.
I want to search for a list of characters which are not the various
formatting characters [^\(\){};,=] *and* are not already surrounded by
quotes, and then surround them by quotes.

Well, with your given example, I'd do something like

my $data = do { local $/; <DATA> };
$data =~ s/(["']?)(\w+)\1?/'$2'/g; # fix quotes
$data =~ y/();=/[],,/; # fix arrays and lists

Afterwards, you could eval() it.
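
As a rough check of that claim, the sketch below runs the two substitutions over the sample from the original post and evals the result. The trailing-comma cleanup, the leading + (so that Perl reads the outer braces as an anonymous hash rather than a block) and the name $config are additions for illustration, not part of the suggestion above. It relies on every value being a single \w+ word, so quoted strings containing spaces or punctuation would break it.

```perl
use strict;
use warnings;

my $data = <<'END';
{
foo = (
{
bar = "baz";
wibble = WOBBLE;
},
{
bar = "barney";
wibble = JELLY;
}
);
someKey = someValue;
someArray = (value1, value2);
blankDict = {};
};
END

$data =~ s/(["']?)(\w+)\1?/'$2'/g;   # single-quote every word, quoted or not
$data =~ y/();=/[],,/;               # ( ) become [ ], and ; = become commas
$data =~ s/,\s*\z//;                 # drop the comma left by the final ';'

# eval of untrusted input is dangerous; this is only sensible for files you trust.
my $config = eval "+$data";
die "eval failed: $@" if $@ || ref $config ne 'HASH';

print "$config->{foo}[1]{wibble}\n";   # second dictionary in the foo array
print "$config->{someArray}[0]\n";
```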

regards,
fabian
 
leeg

leeg said:
I've got something that works on my example data, but haven't fully
tested elsewhere, in this form:
$line =~ s/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/"$1"/g;
as I say I haven't completely tested it but it gets the job done in
simple cases.

But not in complex cases. Perhaps if I quote some real data it would help:

{
attributes = (
{
columnName = id;
externalType = INT;
name = id;
valueClassName = NSNumber;
valueType = i;
},
{
columnName = "type_code";
externalType = INT;
name = typeCode;
valueClassName = NSNumber;
valueType = i;
}
);
attributesUsedForLocking = (id);
className = PCObject;
classProperties = (typeCode, id);
fetchSpecificationDictionary = {};
internalInfo = {"_clientClassPropertyNames" = (Attribute); };
isAbstractEntity = Y;
name = PCObject;
primaryKeyAttributes = (id);
}

and now what I get after evaluating:
$line =~ s/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/"$1"/g;
$line =~ y/();=/[],,/;
and sticking a semicolon on the end:

{
"attributes" , [
{
"columnName" , id,
"externalType" , "INT",
"name" , id,
"valueClassName" , "NSNumber",
"valueType" , i,
},
{
"columnName" , "type_code",
"externalType" , "INT",
"name" , "typeCode",
"valueClassName" , "NSNumber",
"valueType" , i,
}
],
"attributesUsedForLocking" , [id],
"className" , "PCObject",
"classProperties" , ["typeCode", id],
"fetchSpecificationDictionary" , {},
"internalInfo" , {"_clientClassPropertyNames" , ["Attribute"], },
"isAbstractEntity" , Y,
"name" , "PCObject",
"primaryKeyAttributes" , [id],
}
;

so it looks like my regex isn't catching tokens that adjoin any of
( ) ; or $. :-(
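
Another possible reading of that output: every token left unquoted (id, i, Y) is shorter than three characters, and \w[^(){};=,]+\w cannot match fewer than three. The sketch below compares the pattern from the post with a relaxed one. The relaxed pattern and the helper name quote_with are assumptions for illustration; the relaxed form only suits bare tokens made of \w characters and quoted strings without spaces.

```perl
use strict;
use warnings;

# Pattern from the post: \w, then at least one non-formatting character,
# then \w, so a match needs three or more characters.
my $orig = qr/(?<![\w"])(\w[^(){};=,]+\w)(?![\w"])/;

# Hypothetical alternative: any run of word characters that does not
# touch a double quote or another word character on either side.
my $relaxed = qr/(?<![\w"])(\w+)(?![\w"])/;

# Apply one pattern to a copy of the line and return the result.
sub quote_with {
    my ($re, $line) = @_;
    $line =~ s/$re/"$1"/g;
    return $line;
}

for my $line ('name = id;', 'valueType = i;', 'columnName = "type_code";') {
    print "$line\n";
    print "  original: ", quote_with($orig, $line), "\n";
    print "  relaxed:  ", quote_with($relaxed, $line), "\n";
}
```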
 
Tad McClellan

leeg said:
I have an input file of a format that looks something like this:
^^^^^^^^^^^^^^

The devil is in the details with regexes, so "something like" is
likely not good enough to get a useable answer.

Can there be spaces in the already-quoted strings? Your example
has none like that.

Can declarations be broken across lines? eg:

someArray = (value1,
value2);

Can you have values on the RHS that you do NOT want to quote?

etc...

{
foo = (
{
bar = "baz";
wibble = WOBBLE;
},
{
bar = "barney";
wibble = JELLY;
}
);
someKey = someValue;
someArray = (value1, value2);
blankDict = {};
};


That looks pretty Formal (as in Formal Methods).

Is it a "little language"?

If so, then find the grammar for it (or write one for it).


You might be able to get the LHS(s) handled by a simple

s/ = / => /;

and let perl autoquote for you.

You'll need to change (some of?) the parens to squares for
anonymous array elements.

this is almost a declaration of an anonymous hash and with a little
tweaking I could eval it as such.
Could someone offer
some assistance?


It would become Real Easy if you had a grammar for the data, then
you could simply write a parser for the grammar.

Got a grammar?
 
Anno Siegel

leeg said:
I have an input file of a format that looks something like this:

{
foo = (
{
bar = "baz";
wibble = WOBBLE;
},
{
bar = "barney";
wibble = JELLY;
}
);
someKey = someValue;
someArray = (value1, value2);
blankDict = {};
};

I've noticed (and at the time was fairly proud of said epiphany) that
this is almost a declaration of an anonymous hash and with a little
tweaking I could eval it as such. However, I need to quote it properly,
and despite a number of attempts can't construct a regex that will do it.
I want to search for a list of characters which are not the various
formatting characters [^\(\){};,=] *and* are not already surrounded by
quotes, and then surround them by quotes.

You have more things to change before the expression above is a
Perl-parseable data definition. You'll have to change parentheses () to
brackets [], equal signs = to (fat) commas =>, and most (but not all)
semicolons to commas.
I thought of:
$line =~ s/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/"$1"/g;
^^^^ ^^^^ ^^^^
No need to escape (), they're not special in a character class.
but this converts the above into:
{
foo" "
{
bar = "baz";
wibble"E"
},
{
bar = "barney";
wibble"Y"
}
);
someKey"e"
someArray"1"value2);
blankDict" "
}

Huh? It doesn't do that for me, and it can't, though it doesn't do what
you want either.
so isn't what I want. What I especially can't determine is why " =
someValue;" for instance would be replaced by "e".

No idea.

Distinguishing quoted words from unquoted ones with a regex isn't trivial
(as you have seen). As usual, the solution is to use Perl's other features
to keep the regular expressions simple.

In this case, we could split on quoted words (recognizing *them* isn't
hard), keeping the delimiters. That splits the string into quote-free
parts and quoted words that separate them.

Next, walk through the list, leaving the quoted parts alone, but adding
quotes to *every* word in the quote-free regions. Again, this isn't hard.

Finally, join it all together again.

$text = join '',
map { s/(\w+)/"$1"/g unless /^"/; $_}
split /("\w*?")/s, $text;

It works on well-formed expressions only. Unbalanced quotes confuse it,
and quoted non-words probably do too.
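
A self-contained version of the same split/map/join idea, as a sketch. Here the split pattern is widened from "\w*?" to "[^"]*" on the assumption that quoted strings may contain spaces and punctuation but no embedded or escaped double quotes. The sub name quote_bare_words is made up for the example.

```perl
use strict;
use warnings;

# Split into quote-free parts and quoted strings (kept by the capturing
# group), quote every word in the quote-free parts, then rejoin.
sub quote_bare_words {
    my ($text) = @_;
    return join '',
        map {
            my $part = $_;
            $part =~ s/(\w+)/"$1"/g unless $part =~ /^"/;
            $part;
        }
        split /("[^"]*")/, $text;
}

my $in = '{columnName = "type_code"; name = id; note = "two words (x)";}';
print quote_bare_words($in), "\n";
```

Already-quoted strings pass through untouched, including the parentheses inside them, so any later transliteration would still need to skip those parts.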

Anno
 
leeg

Tad said:

The devil is in the details with regexes, so "something like" is
likely not good enough to get a useable answer.

No, but I can't completely define the syntax of the input data so
implementing "something like" it then fixing failures is the best I can do.
Can there be spaces in the already-quoted strings? Your example
has none like that.

Yes there can; there can even be important characters (e.g. ()) in the
quoted strings, I'll sort those out by transliterating anything that's
left after I've parsed the data.
Can declarations be broken across lines? eg:

someArray = (value1,
value2);

Yes, and the example included hashes declared thus.
Can you have values on the RHS that you do NOT want to quote?

No, as everything can be eval-ed into a string and then dealt with
'upstream', as it were.
etc...
[...]


You might be able to get the LHS(s) handled by a simple

s/ = / => /;

and let perl autoquote for you.

You'll need to change (some of?) the parens to squares for
anonymous array elements.
Yes, I've sorted that bit with some transliteration, thanks.
It would become Real Easy if you had a grammar for the data, then
you could simply write a parser for the grammar.

Yup.

Got a grammar?
Nope. :-(
 
Tad McClellan

What generates the data?

implementing "something like" it then fixing failures is the best I can do.


Not necessarily.

It is the best you *know how* to do.

Nope. :-(


Then write a grammar for it, it looks a rather simple language to me.

It will be easier to guess at a grammar and then fix failures than
to do it with pattern matching.
 
leeg

Tad said:
What generates the data?

Apple/NeXT's EOModeler application. It's an old variant of the plist
format (before XML came along).
Not necessarily.

It is the best you *know how* to do.

Actually, you'll have noticed that I don't even know how to do that :)
Then write a grammar for it, it looks a rather simple language to me.

It will be easier to guess at a grammar and then fix failures than
to do it with pattern matching.

Perhaps, I don't know how to write a grammar engine either.... :-(
 
Anno Siegel

leeg said:
Apple/NeXT's EOModeler application. It's an old variant of the plist
format (before XML came along).

Then look at the module Mac::PropertyList. It may be the solution,
but even if it isn't you may be able to steal some useful stuff from
it.

[...]

Anno
 
leeg

Anno said:
Then look at the module Mac::PropertyList. It may be the solution,
but even if it isn't you may be able to steal some useful stuff from
it.

It isn't, as it only deals with the XML format. I am in e-mail contact
with its author regarding my plists though ;-)

Cheers.
 
Anno Siegel

leeg said:
Anno Siegel wrote:

It isn't, as it only deals with the XML format. I am in e-mail contact
with its author regarding my plists though ;-)

Okay...

Each time I come across this thread I'm more convinced that the right
way to go about this is to write a real parser. The process of tweaking
things while you discover more variants of the format will be *much*
easier when you have a Parse::RecDescent (say) grammar to tweak instead
of one or more monster-regexes.

You *will* have to spend an afternoon or so acquainting yourself with
Parse::RecDescent, but it will pay. Write one or two very simple
grammars of your own before trying to tackle full property lists.
Something like parsing numeric expressions made out of + - * / ( ) and
integers is a good start.
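
For a sense of scale, a parser for this format need not be large. The following is a hand-rolled recursive-descent sketch in plain Perl (not Parse::RecDescent), built on a guessed grammar inferred only from the samples in this thread. The sub names are invented for the example, bare strings are taken to be runs of anything other than whitespace and the formatting characters, and escapes inside quoted strings are not handled.

```perl
use strict;
use warnings;

# ASSUMED grammar, guessed from the samples:
#   value  := dict | array | string
#   dict   := '{' ( string '=' value ';' )* '}'
#   array  := '(' ( value ( ',' value )* )? ')'
#   string := '"' chars-without-quote '"' | run of non-formatting chars
sub parse_plist {
    my ($text) = @_;
    my $value = _value(\$text);
    $text =~ /\G\s*;?\s*\z/gc
        or die "trailing text at offset " . (pos($text) // 0) . "\n";
    return $value;
}

# Each helper matches at the current position (\G) and, thanks to /gc,
# leaves the position untouched when a match fails.
sub _value {
    my ($t) = @_;
    if ($$t =~ /\G\s*\{/gc) {
        my %dict;
        until ($$t =~ /\G\s*\}/gc) {
            my $key = _string($t);
            $$t =~ /\G\s*=/gc or _fail($t, "'='");
            $dict{$key} = _value($t);
            $$t =~ /\G\s*;/gc or _fail($t, "';'");
        }
        return \%dict;
    }
    if ($$t =~ /\G\s*\(/gc) {
        my @array;
        return \@array if $$t =~ /\G\s*\)/gc;
        while (1) {
            push @array, _value($t);
            next if $$t =~ /\G\s*,/gc;
            last if $$t =~ /\G\s*\)/gc;
            _fail($t, "',' or ')'");
        }
        return \@array;
    }
    return _string($t);
}

sub _string {
    my ($t) = @_;
    return $1 if $$t =~ /\G\s*"([^"]*)"/gc;            # quoted string
    return $1 if $$t =~ /\G\s*([^\s(){};,="]+)/gc;     # bare word
    _fail($t, "a string");
}

sub _fail {
    my ($t, $what) = @_;
    die "expected $what at offset " . (pos($$t) // 0) . "\n";
}

my $model = parse_plist('{ a = (x, "y z"); b = {}; c = { d = e; }; };');
print "$model->{a}[1]\n";
print "$model->{c}{d}\n";
```

When a new variant of the format turns up, the change is usually confined to one helper (for example, extending _string to accept escapes) rather than spread across several interacting regexes.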

Anno
 
