confused constructing a regex

Discussion in 'Perl Misc' started by leeg, Jun 1, 2005.

  1. leeg

    leeg Guest

    I have an input file of a format that looks something like this:

    {
    foo = (
    {
    bar = "baz";
    wibble = WOBBLE;
    },
    {
    bar = "barney";
    wibble = JELLY;
    }
    );
    someKey = someValue;
    someArray = (value1, value2);
    blankDict = {};
    };

    I've noticed (and at the time was fairly proud of said epiphany) that
    this is almost a declaration of an anonymous hash and with a little
    tweaking I could eval it as such. However, I need to quote it properly,
    and despite a number of attempts can't construct a regex that will do it.
    I want to search for a list of characters which are not the various
    formatting characters [^\(\){};,=] *and* are not already surrounded by
    quotes, and then surround them by quotes.

    I thought of:
    $line =~ s/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/"$1"/g;
    but this converts the above into:
    {
    foo" "
    {
    bar = "baz";
    wibble"E"
    },
    {
    bar = "barney";
    wibble"Y"
    }
    );
    someKey"e"
    someArray"1"value2);
    blankDict" "
    }

    so isn't what I want. What I especially can't determine is why " =
    someValue;" for instance would be replaced by "e". Could someone offer
    some assistance?

    Ta,

    leeg.
    leeg, Jun 1, 2005
    #1

  2. leeg

    Dave Guest

    "leeg" <> wrote in message
    news:d7kfpe$422$...
    >I have an input file of a format that looks something like this:
    >
    > {
    > foo = (
    > {
    > bar = "baz";
    > wibble = WOBBLE;
    > },
    > {
    > bar = "barney";
    > wibble = JELLY;
    > }
    > );
    > someKey = someValue;
    > someArray = (value1, value2);
    > blankDict = {};
    > };
    >
    > I've noticed (and at the time was fairly proud of said epiphany) that this
    > is almost a declaration of an anonymous hash and with a little tweaking I
    > could eval it as such. However, I need to quote it properly, and despite
    > a number of attempts can't construct a regex that will do it.
    > I want to search for a list of characters which are not the various
    > formatting characters [^\(\){};,=] *and* are not already surrounded by
    > quotes, and then surround them by quotes.
    >
    > I thought of:
    > $line =~ s/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/"$1"/g;
    > but this converts the above into:
    > {
    > foo" "
    > {
    > bar = "baz";
    > wibble"E"
    > },
    > {
    > bar = "barney";
    > wibble"Y"
    > }
    > );
    > someKey"e"
    > someArray"1"value2);
    > blankDict" "
    > }
    >
    > so isn't what I want. What I especially can't determine is why " =
    > someValue;" for instance would be replaced by "e". Could someone offer
    > some assistance?
    >
    > Ta,
    >
    > leeg.


    Your regex removes the "formatting characters" before and after the non
    formatting characters block. Use lookahead/behind or capture them and put
    them in. Also you are only capturing one character of the string you are
    trying to quote, capture the whole string. i.e.:
    $line =~ s/([\s\(\){};,=]+)([^"\(\){};,=]+)([\s\(\){};,=]+)/$1"$2"$3/g;

    I'm not saying this will do what you want as I haven't looked into it in
    detail, but it is clear that your original regex is deleting info that you
    want to keep.
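
    To see where the lone "e" comes from, here is a minimal sketch. It assumes the substitution was applied one chomped line at a time, which the original post does not state; the variable names are only for illustration. A quantifier placed after a capture group, as in (...)+, repeats the group, and $1 keeps only the last repetition.

```perl
use strict;
use warnings;

# Original pattern: the "+" sits outside the capture group, so the group
# repeats once per character and $1 ends up holding only the last one.
my $original = qr/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/;

# Suggested pattern: the "+" is inside the group, and the delimiters on
# both sides are captured so they can be put back.
my $suggested = qr/([\s\(\){};,=]+)([^"\(\){};,=]+)([\s\(\){};,=]+)/;

my @lines = ('someKey = someValue;', 'someArray = (value1, value2);');

my (@lossy, @kept);
for my $line (@lines) {
    (my $copy = $line) =~ s/$original/"$1"/g;
    push @lossy, $copy;
    ($copy = $line) =~ s/$suggested/$1"$2"$3/g;
    push @kept, $copy;
}

print "$_\n" for @lossy;   # someKey"e"  and  someArray"1"value2);
print "$_\n" for @kept;    # someKey = "someValue";  and  someArray = ("value1", value2);
```

    Note that value2 stays unquoted even with the second pattern: the ", " in front of it was already consumed as the trailing delimiter of the previous match, which is why lookahead/lookbehind is the other option mentioned above.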

    Dave
    Dave, Jun 1, 2005
    #2

  3. leeg

    pikus Guest

    would this work?

    s/= (.*)([;\n])/= "$1"$2/g;

    Maybe you were overthinking it?
    pikus, Jun 1, 2005
    #3
  4. leeg

    pikus Guest

    Oops, I'm thinking that should have been:

    s/= ([^"].*)([;\n])/= "$1"$2/g;

    I changed it to check for existing quotes... :)
    pikus, Jun 1, 2005
    #4
  5. leeg

    leeg Guest

    pikus wrote:
    > would this work?
    >
    > s/= (.*)([;\n])/= "$1"$2/g;
    >
    > Maybe you were overthinking it?
    >

    Sadly not; my example data were too clean. For instance:
    {className=PCPerson;name=PCPerson;},
    would be (and indeed is) valid input, and should lead to:
    {"className"="PCPerson";"name"="PCPerson";},
    I've got something that works on my example data, but haven't fully
    tested elsewhere, in this form:
    $line =~ s/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/"$1"/g;
    as I say I haven't completely tested it but it gets the job done in
    simple cases.
    leeg, Jun 1, 2005
    #5
  6. * leeg schrieb:

    > I have an input file of a format that looks something like this:
    >
    > {
    > foo = (
    > {
    > bar = "baz";
    > wibble = WOBBLE;
    > },
    > {
    > bar = "barney";
    > wibble = JELLY;
    > }
    > );
    > someKey = someValue;
    > someArray = (value1, value2);
    > blankDict = {};
    > };
    >
    > I've noticed (and at the time was fairly proud of said epiphany) that
    > this is almost a declaration of an anonymous hash and with a little
    > tweaking I could eval it as such. However, I need to quote it properly,
    > and despite a number of attempts can't construct a regex that will do it.
    > I want to search for a list of characters which are not the various
    > formatting characters [^\(\){};,=] *and* are not already surrounded by
    > quotes, and then surround them by quotes.


    Well, with your given example, I'd do something like

    my $data = do { local $/; <DATA> };
    $data =~ s/(["']?)(\w+)\1?/'$2'/g; # fix quotes
    $data =~ y/();=/[],,/; # fix arrays and lists

    Afterwards, you could eval() it.
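
    For anyone who wants to try it end to end, a minimal sketch of those two substitutions followed by the eval. The heredoc stands in for the DATA section, and stripping the final ";" and prefixing "+" are additions of this sketch, not part of the suggestion above. String eval runs arbitrary code, so this is only reasonable on input you trust.

```perl
use strict;
use warnings;

my $data = <<'END';
{
    foo = (
        { bar = "baz"; wibble = WOBBLE; },
        { bar = "barney"; wibble = JELLY; }
    );
    someKey = someValue;
    someArray = (value1, value2);
    blankDict = {};
};
END

$data =~ s/;\s*\z//;                  # drop the trailing ";" (assumed to be a terminator)
$data =~ s/(["']?)(\w+)\1?/'$2'/g;    # fix quotes
$data =~ y/();=/[],,/;                # fix arrays and lists

# The leading "+" makes Perl treat the outer braces as a hash constructor
# rather than a code block.
my $struct = eval "+$data";
die "eval failed: $@" if $@;

print $struct->{foo}[1]{wibble}, "\n";    # JELLY
print $struct->{someArray}[0], "\n";      # value1
```

    This relies on every quoted string being a single word; a quoted value containing spaces or any of ( ) ; = would be mangled by the second and third steps.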

    regards,
    fabian
    Fabian Pilkowski, Jun 1, 2005
    #6
  7. leeg

    pikus Guest

    I see where I went wrong.
    /me drinks more coffee.
    ::eyes crack open a tad::
    pikus, Jun 1, 2005
    #7
  8. leeg

    leeg Guest

    leeg wrote:

    > I've got something that works on my example data, but haven't fully
    > tested elsewhere, in this form:
    > $line =~ s/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/"$1"/g;
    > as I say I haven't completely tested it but it gets the job done in
    > simple cases.


    But not in complex cases. Perhaps if I quote some real data it would help:

    {
    attributes = (
    {
    columnName = id;
    externalType = INT;
    name = id;
    valueClassName = NSNumber;
    valueType = i;
    },
    {
    columnName = "type_code";
    externalType = INT;
    name = typeCode;
    valueClassName = NSNumber;
    valueType = i;
    }
    );
    attributesUsedForLocking = (id);
    className = PCObject;
    classProperties = (typeCode, id);
    fetchSpecificationDictionary = {};
    internalInfo = {"_clientClassPropertyNames" = (Attribute); };
    isAbstractEntity = Y;
    name = PCObject;
    primaryKeyAttributes = (id);
    }

    and now what I get after evaluating:
    $line =~ s/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/"$1"/g;
    $line =~ y/();=/[],,/;
    and sticking a semicolon on the end:

    {
    "attributes" , [
    {
    "columnName" , id,
    "externalType" , "INT",
    "name" , id,
    "valueClassName" , "NSNumber",
    "valueType" , i,
    },
    {
    "columnName" , "type_code",
    "externalType" , "INT",
    "name" , "typeCode",
    "valueClassName" , "NSNumber",
    "valueType" , i,
    }
    ],
    "attributesUsedForLocking" , [id],
    "className" , "PCObject",
    "classProperties" , ["typeCode", id],
    "fetchSpecificationDictionary" , {},
    "internalInfo" , {"_clientClassPropertyNames" , ["Attribute"], },
    "isAbstractEntity" , Y,
    "name" , "PCObject",
    "primaryKeyAttributes" , [id],
    }
    ;

    so it looks like cases where something is adjoining any of ( ) ; or $ my
    regex isn't catching. :-(
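
    Looking at which values were missed (id, i, Y), another reading is that the pattern needs at least three characters: \w, then one or more from the negated class, then another \w. A short sketch comparing it with a \w+ variant; the variant is an assumption of this sketch, not something tested on the real files.

```perl
use strict;
use warnings;

# Pattern from the post: \w, then 1+ non-delimiters, then \w (3 chars minimum).
my $old = qr/(?<![\w"])(\w[^\(\){};=,]+\w)(?![\w"])/;
# Hypothetical variant: any run of word characters, however short.
my $new = qr/(?<![\w"])(\w+)(?![\w"])/;

my @samples = ('externalType = INT;', 'valueType = i;', 'primaryKeyAttributes = (id);');

my (@before, @after);
for my $line (@samples) {
    (my $copy = $line) =~ s/$old/"$1"/g;
    push @before, $copy;
    ($copy = $line) =~ s/$new/"$1"/g;
    push @after, $copy;
}

print "$_\n" for @before;   # "valueType" = i;  (the one-letter value is skipped)
print "$_\n" for @after;    # "valueType" = "i";
```

    The variant still leaves single-word quoted strings such as "type_code" alone, but a word in the middle of a longer quoted string ("a b c") would get quoted again, so it is no substitute for handling quoted strings separately.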
    leeg, Jun 1, 2005
    #8
  9. leeg <> wrote:

    > I have an input file of a format that looks something like this:

                                                      ^^^^^^^^^^^^^^

    The devil is in the details with regexes, so "something like" is
    likely not good enough to get a useable answer.

    Can there be spaces in the already-quoted strings? Your example
    has none like that.

    Can declarations be broken across lines? eg:

    someArray = (value1,
    value2);

    Can you have values on the RHS that you do NOT want to quote?

    etc...


    > {
    > foo = (
    > {
    > bar = "baz";
    > wibble = WOBBLE;
    > },
    > {
    > bar = "barney";
    > wibble = JELLY;
    > }
    > );
    > someKey = someValue;
    > someArray = (value1, value2);
    > blankDict = {};
    > };



    That looks pretty Formal (as in Formal Methods).

    Is it a "little language"?

    If so, then find the grammar for it (or write one for it).


    You might be able to get the LHS(s) handled by a simple

    s/ = / => /;

    and let perl autoquote for you.

    You'll need to change (some of?) the parens to squares for
    anonymous array elements.
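
    A minimal sketch of the autoquoting idea, assuming the right-hand sides are already quoted strings or numbers (a bare word on the right would be rejected under "use strict"). The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = '{ bar = "baz"; count = 42; }';

(my $perl = $text) =~ s/ = / => /g;   # "=>" autoquotes the bareword on its left
$perl =~ tr/;/,/;                     # ";" terminators become "," separators

my $hash = eval "+$perl";             # "+" forces a hash constructor, not a block
die "eval failed: $@" if $@;

print join(', ', map { "$_ => $hash->{$_}" } sort keys %$hash), "\n";
# bar => baz, count => 42
```

    The tr/// here would also rewrite any ";" inside a quoted string, so it is only safe while the quoted values contain none.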


    > this is almost a declaration of an anonymous hash and with a little
    > tweaking I could eval it as such.


    > Could someone offer
    > some assistance?



    It would become Real Easy if you had a grammar for the data, then
    you could simply write a parser for the grammar.

    Got a grammar?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jun 1, 2005
    #9
  10. leeg

    Anno Siegel Guest

    leeg <> wrote in comp.lang.perl.misc:
    > I have an input file of a format that looks something like this:
    >
    > {
    > foo = (
    > {
    > bar = "baz";
    > wibble = WOBBLE;
    > },
    > {
    > bar = "barney";
    > wibble = JELLY;
    > }
    > );
    > someKey = someValue;
    > someArray = (value1, value2);
    > blankDict = {};
    > };
    >
    > I've noticed (and at the time was fairly proud of said epiphany) that
    > this is almost a declaration of an anonymous hash and with a little
    > tweaking I could eval it as such. However, I need to quote it properly,
    > and despite a number of attempts can't construct a regex that will do it.
    > I want to search for a list of characters which are not the various
    > formatting characters [^\(\){};,=] *and* are not already surrounded by
    > quotes, and then surround them by quotes.


    You have more things to change before the expression above is a
    Perl-parseable data definition. You'll have to change parentheses () to
    brackets [], equal signs = to (fat) commas =>, and most (but not all)
    semicolons to commas.

    >
    > I thought of:
    > $line =~ s/[\s\(\){};,=]+([^"\(\){};,=])+[\s\(\){};,=]+/"$1"/g;

                    ^^^^           ^^^^           ^^^^
    No need to escape (), they're not special in a character class.

    > but this converts the above into:
    > {
    > foo" "
    > {
    > bar = "baz";
    > wibble"E"
    > },
    > {
    > bar = "barney";
    > wibble"Y"
    > }
    > );
    > someKey"e"
    > someArray"1"value2);
    > blankDict" "
    > }


    Huh? It doesn't do that for me, and it can't, though it doesn't do what
    you want either.

    > so isn't what I want. What I especially can't determine is why " =
    > someValue;" for instance would be replaced by "e".


    No idea.

    Distinguishing quoted words from unquoted ones with a regex isn't trivial
    (as you have seen). As usual, the solution is to use Perl's other features
    to keep the regular expressions simple.

    In this case, we could split on quoted words (recognizing *them* isn't
    hard), keeping the delimiters. That splits the string into quote-free
    parts and quoted words that separate them.

    Next, walk through the list, leaving the quoted parts alone, but adding
    quotes to *every* word in the quote-free regions. Again, this isn't hard.

    Finally, join it all together again.

    $text = join '',
    map { s/(\w+)/"$1"/g unless /^"/; $_}
    split /("\w*?")/s, $text;

    It works on well-formed expressions only. Unbalanced quotes confuse it,
    and quoted non-words probably too.
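
    The same three steps spelled out as a small function, as a sketch. The one change is an assumption: the split pattern is "[^"]*" rather than "\w*?", so that quoted strings containing spaces or punctuation are also left alone. The function name is made up for illustration.

```perl
use strict;
use warnings;

sub quote_barewords {
    my ($text) = @_;
    # Split on quoted strings, keeping them: the capturing parentheses make
    # split return the delimiters as well as the pieces between them.
    my @parts = split /("[^"]*")/, $text;
    for my $part (@parts) {
        next if $part =~ /^"/;        # already quoted: leave alone
        $part =~ s/(\w+)/"$1"/g;      # quote every word in the quote-free parts
    }
    return join '', @parts;
}

print quote_barewords('name = "type code"; valueType = i;'), "\n";
# "name" = "type code"; "valueType" = "i";
```

    Escaped quotes inside a quoted string (\") are not handled by this pattern and would still confuse it.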

    Anno
    Anno Siegel, Jun 1, 2005
    #10
  11. leeg

    leeg Guest

    Tad McClellan wrote:
    > leeg <> wrote:
    >
    >
    >>I have an input file of a format that looks something like this:

    >
    > ^^^^^^^^^^^^^^
    >
    > The devil is in the details with regexes, so "something like" is
    > likely not good enough to get a useable answer.
    >


    No, but I can't completely define the syntax of the input data so
    implementing "something like" it then fixing failures is the best I can do.

    > Can there be spaces in the already-quoted strings? Your example
    > has none like that.
    >


    Yes there can; there can even be important characters (e.g. ()) in the
    quoted strings, I'll sort those out by transliterating anything that's
    left after I've parsed the data.

    > Can declarations be broken across lines? eg:
    >
    > someArray = (value1,
    > value2);


    Yes, and the example included hashes declared thus.

    >
    > Can you have values on the RHS that you do NOT want to quote?
    >


    No, as everything can be eval-ed into a string and then dealt with
    'upstream', as it were.

    > etc...
    >

    [...]
    >
    >
    > You might be able to get the LHS(s) handled by a simple
    >
    > s/ = / => /;
    >
    > and let perl autoquote for you.
    >
    > You'll need to change (some of?) the parens to squares for
    > anonymous array elements.
    >

    Yes, I've sorted that bit with some transliteration, thanks.
    >
    >
    > It would become Real Easy if you had a grammar for the data, then
    > you could simply write a parser for the grammar.
    >


    Yup.

    > Got a grammar?
    >

    Nope. :-(
    leeg, Jun 1, 2005
    #11
  12. leeg <> wrote:
    > Tad McClellan wrote:
    >> leeg <> wrote:
    >>
    >>
    >>>I have an input file of a format that looks something like this:



    What generates the data?


    > implementing "something like" it then fixing failures is the best I can do.



    Not necessarily.

    It is the best you *know how* to do.


    >> It would become Real Easy if you had a grammar for the data, then
    >> you could simply write a parser for the grammar.
    >>

    >
    > Yup.
    >
    >> Got a grammar?
    >>

    > Nope. :-(



    Then write a grammar for it, it looks a rather simple language to me.

    It will be easier to guess at a grammar and then fix failures than
    to do it with pattern matching.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jun 1, 2005
    #12
  13. leeg

    leeg Guest

    Tad McClellan wrote:
    > leeg <> wrote:
    >
    >>Tad McClellan wrote:
    >>
    >>>leeg <> wrote:
    >>>
    >>>
    >>>
    >>>>I have an input file of a format that looks something like this:

    >
    >
    >
    > What generates the data?
    >


    Apple/NeXT's EOModeller application. It's an old variant of the plist
    format (before XML came along).

    >
    >
    >>implementing "something like" it then fixing failures is the best I can do.

    >
    >
    >
    > Not necessarily.
    >
    > It is the best you *know how* to do.
    >


    Actually, you'll have noticed that I don't even know how to do that :)

    >
    >
    >>>It would become Real Easy if you had a grammar for the data, then
    >>>you could simply write a parser for the grammar.
    >>>

    >>
    >>Yup.
    >>
    >>
    >>>Got a grammar?
    >>>

    >>
    >>Nope. :-(

    >
    >
    >
    > Then write a grammar for it, it looks a rather simple language to me.
    >
    > It will be easier to guess at a grammar and then fix failures than
    > to do it with pattern matching.
    >
    >


    Perhaps, I don't know how to write a grammar engine either.... :-(
    leeg, Jun 1, 2005
    #13
  14. leeg

    Anno Siegel Guest

    leeg <> wrote in comp.lang.perl.misc:
    > Tad McClellan wrote:
    > > leeg <> wrote:
    > >>Tad McClellan wrote:
    > >>>leeg <> wrote:


    > >>>>I have an input file of a format that looks something like this:

    > >
    > > What generates the data?
    > >

    > Apple/NeXT's EOModeller application. It's an old variant of the plist
    > format (before XML came along).


    Then look at the module Mac::PropertyList. It may be the solution,
    but even if it isn't you may be able to steal some useful stuff from
    it.

    [...]

    Anno
    Anno Siegel, Jun 2, 2005
    #14
  15. leeg

    leeg Guest

    Anno Siegel wrote:
    > leeg <> wrote in comp.lang.perl.misc:
    >
    >>Tad McClellan wrote:
    >>
    >>>leeg <> wrote:
    >>>
    >>>>Tad McClellan wrote:
    >>>>
    >>>>>leeg <> wrote:

    >
    >
    >>>>>>I have an input file of a format that looks something like this:
    >>>
    >>>What generates the data?
    >>>

    >>
    >>Apple/NeXT's EOModeller application. It's an old variant of the plist
    >>format (before XML came along).

    >
    >
    > Then look at the module Mac::PropertyList. It may be the solution,
    > but even if it isn't you may be able to steal some useful stuff from
    > it.
    >


    It isn't, as it only deals with the XML format. I am in e-mail contact
    with its author regarding my plists though ;-)

    Cheers.
    leeg, Jun 2, 2005
    #15
  16. leeg

    Anno Siegel Guest

    leeg <> wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:


    > >>Apple/NeXT's EOModeller application. It's an old variant of the plist
    > >>format (before XML came along).

    > >
    > >
    > > Then look at the module Mac::PropertyList. It may be the solution,
    > > but even if it isn't you may be able to steal some useful stuff from
    > > it.
    > >

    >
    > It isn't, as it only deals with the XML format. I am in e-mail contact
    > with its author regarding my plists though ;-)


    Okay...

    Each time I come across this thread I'm more convinced that the right
    way to go about this is to write a real parser. The process of tweaking
    things while you discover more variants of the format will be *much*
    easier when you have a Parse::RecDescent (say) grammar to tweak instead
    of one or more monster-regexes.

    You *will* have to spend an afternoon or so acquainting yourself with
    Parse::RecDescent, but it will pay. Write one or two very simple
    grammars of your own before trying to tackle full property lists.
    Something like parsing numeric expressions made out of + - * / ( ) and
    integers is a good start.
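
    To give a feel for what a "real parser" for this data might look like, here is a hand-written recursive-descent sketch in plain Perl. It does not use Parse::RecDescent, and the grammar in the comment is only a guess from the samples posted in this thread, so treat every rule and every sub name as an assumption to be adjusted as new variants of the format turn up.

```perl
use strict;
use warnings;

# Guessed grammar (an assumption, not a specification):
#   value  := dict | array | string
#   dict   := '{' ( string '=' value ';' )* '}'
#   array  := '(' ( value ( ',' value )* )? ')'
#   string := '"' chars-other-than-quote '"'  |  word characters
sub where { my ($t) = @_; return pos($$t) // 0 }

sub parse_string {
    my ($t) = @_;
    return $1 if $$t =~ /\G\s*"([^"]*)"/gc;    # quoted string
    return $1 if $$t =~ /\G\s*(\w+)/gc;        # bare word
    die "expected a string at offset " . where($t) . "\n";
}

sub parse_value {
    my ($t) = @_;
    if ($$t =~ /\G\s*\{/gc) {                  # dictionary
        my %dict;
        until ($$t =~ /\G\s*\}/gc) {
            my $key = parse_string($t);
            $$t =~ /\G\s*=/gc or die "expected '=' at offset " . where($t) . "\n";
            $dict{$key} = parse_value($t);
            $$t =~ /\G\s*;/gc or die "expected ';' at offset " . where($t) . "\n";
        }
        return \%dict;
    }
    if ($$t =~ /\G\s*\(/gc) {                  # array
        my @list;
        until ($$t =~ /\G\s*\)/gc) {
            push @list, parse_value($t);
            $$t =~ /\G\s*,/gc;                 # separator, leniently optional
        }
        return \@list;
    }
    return parse_string($t);
}

sub parse_plist {
    my ($text) = @_;
    my $value = parse_value(\$text);
    $text =~ /\G\s*;?\s*\z/gc or die "trailing text at offset " . where(\$text) . "\n";
    return $value;
}

my $model = parse_plist(<<'END');
{
    attributes = (
        { columnName = id; externalType = INT; },
        { columnName = "type_code"; name = typeCode; }
    );
    classProperties = (typeCode, id);
    fetchSpecificationDictionary = {};
    internalInfo = {"_clientClassPropertyNames" = (Attribute); };
}
END

print $model->{attributes}[1]{columnName}, "\n";    # type_code
print $model->{classProperties}[0], "\n";           # typeCode
```

    The result is an ordinary nested structure of hash and array references, with no quoting pass and no eval. Each parsing step uses \G with /gc so that a failed match leaves the current position untouched, and a new construct usually means adding one branch rather than reworking a regex.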

    Anno
    Anno Siegel, Jun 2, 2005
    #16
