Need expert help matching a line

Discussion in 'Perl Misc' started by Ramon F Herrera, Sep 8, 2009.

  1. This is really a parsing question, but I figure that nobody knows more
    about regex and pattern matching than Perl programmers.

    I have many files which contain multiple lines of variable-value pair
    assignments. I need to break down each line into its 3 constituent
    components.

    Variable Name = Variable Value

    IOW, each line contains 3 parts:

    VariableName
    Equal Sign
    VariableValue

    Unlike the variable names used by many programming languages,
    my variable names may contain embedded spaces.

    Here are some examples of the lines I am trying to match:

    My Favorite Baseball Player = George Herman "Babe" Ruth
    What did your do on Christmas = I rested, computed the % mortgage and
    visited my brother + sister.
    Favorite Curse = That umpire is a #&*%!

    What I need is a way to specify valid characters.

    VariableName: Alphanumeric (and perhaps underscore), blank space.
    VariableValue: Pretty much anything is valid on the RHS except an '='
    sign (I guess)

    Thanks for your kind assistance.

    -Ramon
    Ramon F Herrera, Sep 8, 2009
    #1
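The split described in the post above can be sketched with a single regular expression. This is only an illustration under assumptions that are not in the post: it is written in C++ with `std::regex` (ECMAScript grammar), the helper name `split_assignment` is made up, and the name is limited to word characters and blanks as the post suggests.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: splits "Name = Value" into name and value.
// Returns false when the line does not look like an assignment.
bool split_assignment(const std::string& line,
                      std::string& name, std::string& value) {
    // Name: word characters and blanks, ending in a word character so that
    // blanks before '=' are not captured. Value: the rest of the line.
    static const std::regex pattern(R"(\s*([\w ]*\w)\s*=\s*(.*))");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return false;
    }
    name = m[1].str();
    value = m[2].str();
    return true;
}
```

Because `=` is not in the name's character class, the name always ends at the first `=`, so a value such as `E = mc^2` is captured whole.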

  2. On Sep 8, 8:23 am, Ramon F Herrera <> wrote:
    <snip>


    Just to make the exercise a little harder -and fun- the assignment
    syntax should be able to support continuation lines, where the RHS is
    very long:

    Describe your summer vacation = Well, we traveled to the beach
    and to the mountains, and debated whether we
    should go to the Grand Canyon and Niagara falls.
    The GPS you gave me turned out to be very useful!

    A continuation line always starts with blank space.

    TIA,

    -Ramon
    Ramon F Herrera, Sep 8, 2009
    #2
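One way to handle continuation lines is to fold them into the preceding line before any matching is done. The sketch below assumes the whole file is already in a string and that a continuation starts with a blank or a tab; `join_continuations` is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: folds continuation lines (lines that start with a
// blank or tab) into the preceding line, separated by a single space.
std::vector<std::string> join_continuations(const std::string& text) {
    std::vector<std::string> logical;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const bool continuation =
            !line.empty() && (line[0] == ' ' || line[0] == '\t');
        if (continuation && !logical.empty()) {
            // Drop the leading blanks and append to the previous line.
            const std::string::size_type first = line.find_first_not_of(" \t");
            if (first != std::string::npos) {
                logical.back() += ' ';
                logical.back() += line.substr(first);
            }
        } else if (!line.empty()) {
            // A new assignment, or a stray indented line with nothing
            // before it, which is kept as its own line.
            logical.push_back(line);
        }
    }
    return logical;
}
```

Each element of the result is then one complete `Name = Value` line that can be matched on its own.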

  3. On Sep 8, 9:28 am, Don Piven <> wrote:
    > Ramon F Herrera wrote:
    > > This is really a parsing question, but I figure that nobody knows more
    > > about regex and pattern matching than Perl programmers.

    >
    > perlre (the manpage for Perl regular expressions) is your friend.
    > Seriously.  It will answer all the questions you raised.



    Thanks Don, seriously.

    You are essentially telling me to RTFM. I have already RTFM.

    The question remains open...

    Thx,

    -Ramon
    Lucius Sanctimonious, Sep 8, 2009
    #3
  4. >>>>> "LS" == Lucius Sanctimonious <> writes:

    LS> You are essentially telling me to RTFM. I have already RTFM.

    Your question shows no evidence of this.

    LS> The question remains open...

    Post what you've already tried, and let us know what you're having
    problems with.

    Also, review the posting guidelines that are posted here frequently, or
    online at http://www.rehabitation.com/clpmisc/clpmisc_guidelines.html --
    they're a summary of what works best if you really want to get help,
    instead of just wanting to stir up drama.

    Charlton



    --
    Charlton Wilbur
    Charlton Wilbur, Sep 8, 2009
    #4
  5. CODE:
    use strict;
    use warnings;

    my ($var, $val);
    my %variables;
    while (<DATA>)
    {
        chomp;
        if (/=/) { ($var, $val) = split /=/; }
        elsif (/^ +\w+/) { $val .= $_; }
        else { next; }
        $var =~ s/^\s+//;
        $var =~ s/\s+$//;
        $variables{$var} = $val;
    }

    foreach my $key (keys %variables) { print "$key => $variables{$key}\n"; }
    exit(0);

    __DATA__
    My Favorite Baseball Player = George Herman "Babe" Ruth
    What did your do on Christmas = I rested, computed the % mortgage and
     visited my brother + sister.
    Describe your summer vacation = Well, we traveled to the beach
     and to the mountains, and debated whether we
     should go to the Grand Canyon and Niagara falls.
     The GPS you gave me turned out to be very useful!
    Favorite Curse = That umpire is a #&*%!

    OUTPUT:
    My Favorite Baseball Player => George Herman "Babe" Ruth
    Describe your summer vacation => Well, we traveled to the beach and
    to the mountains, and debated whether we should go to the Grand
    Canyon and Niagara falls. The GPS you gave me turned out to be very
    useful!
    Favorite Curse => That umpire is a #&*%!
    What did your do on Christmas => I rested, computed the % mortgage
    and visited my brother + sister.
    ccc31807, Sep 8, 2009
    #5
  6. On Tue, 8 Sep 2009 09:58:56 -0700 (PDT), ccc31807 <> wrote:

    >CODE:
    >use strict;
    >use warnings;
    >
    >my ($var, $val);

    = ('','');
    >my %variables;
    >while (<DATA>)
    >{
    > chomp;
    > if (/=/) { ($var, $val) = split /=/; }
    > elsif (/^ +\w+/) { $val .= $_; }
    > else { next; }
    > $var =~ s/^\s+//;
    > $var =~ s/\s+$//;
    > $variables{$var} = $val;
    >}
    >
    >foreach my $key (keys %variables) { print "$key => $variables{$key}
    >\n"; }
    >exit(0);
    >

    Looks good. I like the way you did this.
    Might need initial condition check
    elsif (/^ +\w+/ and length($var)) { $val .= $_; }

    -sln
    , Sep 8, 2009
    #6
  7. On Tue, 8 Sep 2009 05:23:32 -0700 (PDT), Ramon F Herrera <> wrote:

    <snip>


    -sln

    use strict;
    use warnings;

    my $buf = '';

    while (<DATA>)
    {
        if (/=/ or eof) {
            if ($buf =~ /\s*([\w ]+)\s*=\s*((?:.+(?:\n .+)*)|)/)
            {
                my ($var,$val) = ($1,$2);
                $val =~ s/\n +/\n/g;
                print "$var => $val\n\n";
            }
            $buf = '';
        }
        $buf .= $_;
    }
    __DATA__

    My Favorite Baseball Player = George Herman =  "Babe" Ruth
    What did your do on Christmas = I rested, computed the % mortgage and
     visited my brother + sister.
     asdfasdf=
    Favorite Curse = That umpire is a #&*%!
    errnngsf
    sngdnsdg
    Describe your summer vacation = Well, we traveled to the beach
      and to the mountains, and debated whether we
      should go to the Grand Canyon and Niagara falls.
      The GPS you gave me turned out to be very useful!
    , Sep 8, 2009
    #7
  8. On Sep 8, 1:35 pm, wrote:
    <snip>



    Thank you, sln!

    I have to clarify that my program is not written in Perl (language
    that I haven't used in ages) but in C++. The reason I posted my
    question in this NG will be understood by reading this:

    http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax.html

    I am sticking with the default (Perl) Regex syntax.

    This is the relevant code that I have so far. As you can see it is
    rather simplistic. I am not implementing continuation lines yet.

    const string variable = "([\\w ]+)";
    const char equal_sign = '=';
    const string value = "([\\w ]+)";

    const string assignment = variable + equal_sign + value;

    The question that I have is this: how do I restrict the LHS to begin
    with an alphabetic character? IOW: The LHS may contain blanks but
    they cannot be the first character of the line. I will also be
    accepting digits, periods and underscores on the LHS but again, the
    variable name cannot begin with any of them.

    TIA,

    -Ramon
    Ramon F Herrera, Sep 8, 2009
    #8
  9. On Sep 8, 6:20 pm, Ramon F Herrera <> wrote:
    <snip>


    I have made some progress here:

    const string variable = "(\\w+[\\w\\d\\. ]*)";
    const char equal_sign = '=';
    const string value = "(.+)";

    I think the above will cover most real cases, but not sure what will
    happen if the RHS contains an '=' sign?

    -RFH
    Ramon F Herrera, Sep 8, 2009
    #9
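A quick way to answer the question in the post above is to run the three pieces against a line that has two `=` signs. The sketch assumes `std::regex` (ECMAScript grammar) rather than Boost.Regex, so treat the result as indicative only; `captured_value` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the captured value, or "" if the line
// does not match the assignment pattern at all.
std::string captured_value(const std::string& line) {
    // The three pieces quoted in the post above.
    const std::string variable = "(\\w+[\\w\\d\\. ]*)";
    const char equal_sign = '=';
    const std::string value = "(.+)";
    const std::regex pattern(variable + equal_sign + value);

    std::smatch m;
    // regex_match requires the whole line to fit the pattern.
    return std::regex_match(line, m, pattern) ? m[2].str() : std::string();
}
```

For the input `Name = a = b` the captured value is ` a = b`. The name class contains no `=`, so it cannot run past the first one, and `(.+)` then takes everything after it, further `=` signs and the leading blank included.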
  10. On Tue, 8 Sep 2009 15:20:04 -0700 (PDT), Ramon F Herrera <> wrote:

    >On Sep 8, 1:35 pm, wrote:
    >> On Tue, 8 Sep 2009 05:23:32 -0700 (PDT), Ramon F Herrera <> wrote:
    >>

    <snip>
    >I have to clarify that my program is not written in Perl (language
    >that I haven't used in ages) but in C++. The reason I posted my
    >question in this NG will be understood by reading this:
    >
    >http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax.html
    >
    >I am sticking with the default (Perl) Regex syntax.
    >

    This uses Perl 5.8 as a reference to describe the syntax. Is that the library default?
    http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    >This is the relevant code that I have so far. As you can see it is
    >rather simplistic. I am not implementing continuation lines yet.
    >
    >const string variable = "([\\w ]+)";
    >const char equal_sign = '=';
    >const string value = "([\\w ]+)";
    >
    >const string assignment = variable + equal_sign + value;
    >
    >The question that I have is this: how do I restrict the LHS to begin
    >with an alphabetic characters? IOW: The LHS may contain blanks but
    >they cannot be the first character of the line. I will also be
    >accepting digits, periods and underscores on the LHS but again, the
    >variable name cannot begin with any of them.
    >

    Adding those restrictions just requires this:
    const string variable = "([a-zA-Z][\\w. ]+)";

    There is nothing in the regex that is not in Perl 5.8, if
    that's what they will be using.

    -sln
    , Sep 8, 2009
    #10
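The first-character restriction can be checked in isolation. A minimal sketch, assuming `std::regex` in place of Boost.Regex and a hypothetical helper name `is_valid_name`; it uses `*` instead of `+` after the first character so that a one-letter name also passes, which is a deliberate departure from the expression in the post.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the name starts with a letter and
// continues with word characters, periods or blanks only.
bool is_valid_name(const std::string& name) {
    static const std::regex pattern("[a-zA-Z][\\w. ]*");
    // regex_match anchors the pattern at both ends of the string.
    return std::regex_match(name, pattern);
}
```

With this check `Favorite Curse` and `v1.2 beta` are accepted, while `9lives`, `_x` and a name with a leading blank are rejected.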
  11. On Sep 8, 6:44 pm, wrote:

    > This uses Perl 5.8 as a reference to describe the syntax.
    > That is its library default?


    Don't know, but frankly, my expressions will be so simple (the only
    ones I am capable of writing :) that I doubt the syntax version will
    make any difference.

    All these questions are really to refresh my memory, since all this
    stuff is coming back to me. My biggie Perl pattern matching project
    was as follows. I used to manage a multi-thousand subscriber mailing
    list at MIT. Those days e-mail traffic could really bog down a server
    and network, and many of my subscribers graduated or went on a summer
    vacation, etc., and forgot to unsubscribe. What I did was to
    pattern-match every conceivable bounce received and extract the e-mail
    address of the subscriber. There were lots of mail servers then, BITNET, UUCP,
    DECNET and many versions of sendmail. At least today e-mail addresses
    are pretty much standard.

    -Ramon
    Ramon F Herrera, Sep 9, 2009
    #11
  12. On Tue, 8 Sep 2009 15:38:21 -0700 (PDT), Ramon F Herrera <> wrote:

    >On Sep 8, 6:20 pm, Ramon F Herrera <> wrote:
    >> On Sep 8, 1:35 pm, wrote:
    >>
    >> > On Tue, 8 Sep 2009 05:23:32 -0700 (PDT), Ramon F Herrera <> wrote:

    >>

    >I have made some progress here:
    >
    >const string variable = "(\\w+[\\w\\d\\. ]*)";
    >const char equal_sign = '=';
    >const string value = "(.+)";
    >
    >I think the above will cover most real cases, but not sure what will
    >happen if the RHS contains an '=' sign?
    >
    >-RFH


    "(\\w+[\\w\\d\\. ]*)";
      ^^^^ don't you want alpha first character?
    "(\\w+[\\w\\d\\. ]*)";
              ^^^ this is redundant, \w already covers digits

    Otherwise, it looks ok. Since Boost is using Perl 5.8, you may
    be able to do some validation and trimming all in the regex components.

    // VAR Capture: alpha start char, other chars alphanumeric, space and '.',
    // Trim (do not capture) trailing white spaces before 'equal_sign'
    const string variable = "([a-zA-Z](?:(?!\\s*=)[\\w. ])*)";
    // Breakdown:
    // (                   # start capture group
    //   [a-zA-Z]          # first char, alpha
    //   (?:               # pseudo group
    //     (?!\s*=)        # IF NOT whitespace(*) followed by equal sign
    //     [\w. ]          # AND this char is in this class
    //                     #   THEN consume character
    //                     #   ELSE fail (or trim) on this character
    //   )*                # end group, do none or many times
    // )                   # finish capture, done once


    // Separator: whitespace, equal, whitespace (non-capture, considered trim)
    const string equal_sign = "\\s*=\\s*";

    // VAL Capture: any character up until a newline.
    // Trim (do not capture) trailing white spaces before either
    // equal sign (invalid separator), newline or end of string.
    const string value = "((?:(?!\\s*(?:=|\\n|$)).)+)";
    // Breakdown:
    // (                   # start capture group
    //   (?:               # pseudo group
    //     (?!             # IF NOT
    //       \s*           #   whitespace(*) followed by
    //       (?:=|\n|$)    #   equal or newline or end of string
    //     )
    //     .               # AND this char is not newline
    //                     #   THEN consume character
    //                     #   ELSE fail (or trim) on this character
    //   )+                # end group, do once or many times
    // )                   # finish capture, done once

    Combined it looks something like this -
    /([a-zA-Z](?:(?!\s*=)[\w. ])*)\s*=\s*((?:(?!\s*(?:=|\n|$)).)+)/

    I am guilty of too much info. It looks worse than it really is.
    Thanks for that Boost info.

    -sln
    , Sep 9, 2009
    #12
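The combined expression from the post above can be tried from C++ roughly as follows. Assumptions: `std::regex` (ECMAScript grammar) stands in for Boost.Regex, a raw string literal is used to avoid the doubled backslashes, and `parse_trimmed` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the combined expression and returns the
// name and value with surrounding blanks already trimmed by the regex.
bool parse_trimmed(const std::string& line,
                   std::string& name, std::string& value) {
    static const std::regex pattern(
        R"(([a-zA-Z](?:(?!\s*=)[\w. ])*)\s*=\s*((?:(?!\s*(?:=|\n|$)).)+))");
    std::smatch m;
    // regex_search, because the expression is not anchored.
    if (!std::regex_search(line, m, pattern)) {
        return false;
    }
    name = m[1].str();
    value = m[2].str();
    return true;
}
```

The negative lookaheads stop each capture just before trailing blanks, so `My Name   =   George Herman   ` yields `My Name` and `George Herman`. Note that the expression is unanchored: for `9lives = x` the search skips the digit and captures `lives`. Add `^` in front, or use `regex_match`, if the name must begin the line.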
  13. Ramon F Herrera <> wrote:
    >
    >This is really a parsing question, but I figure that nobody knows more
    >about regex and pattern matching than Perl programmers.
    >
    >I have many files which contain multiple lines of variable-value pair
    >assignments. I need to break down each lines into its 3 constituent
    >components.
    >
    >Variable Name = Variable Value
    >
    >IOW, each line contains 3 parts:
    >
    >VariableName
    >Equal Sign
    >VariableValue


    See 'perldoc -f split':

    ($variable_name, $variable_value) = split (/=/, $line);

    As long as there isn't an equal sign in either name or value this will
    work just fine.

    jue
    Jürgen Exner, Sep 16, 2009
    #13
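When the value may itself contain an equal sign, splitting at the first `=` only is enough (in Perl, a third argument to `split` limits the number of fields: `split /=/, $line, 2`). A rough C++ equivalent without any regex is sketched below; the helper names `trim` and `split_on_first_equal` are hypothetical.

```cpp
#include <string>

// Hypothetical helper: removes leading and trailing blanks and tabs.
std::string trim(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(" \t");
    if (first == std::string::npos) {
        return std::string();
    }
    const std::string::size_type last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: splits at the first '=' only, so any later '='
// stays inside the value. Returns false when the line has no '='.
bool split_on_first_equal(const std::string& line,
                          std::string& name, std::string& value) {
    const std::string::size_type pos = line.find('=');
    if (pos == std::string::npos) {
        return false;
    }
    name = trim(line.substr(0, pos));
    value = trim(line.substr(pos + 1));
    return true;
}
```

This does no validation of the name's characters; it only separates the two sides.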
