Text parsing and substitution

Discussion in 'Perl Misc' started by maheshpop1@gmail.com, May 19, 2006.

  1. Guest

    Hi guys,

    I am doing this module where I am gonna change the following sentence

    "1:action=commit:user=joe:date=2005-02-02:"
    "2:action=checkout:user=mark:date=2005-02-03:"

    to something like
    " 1. Commits by user Joe on date 2005-02-02 "
    " 2. Checkouts by user Joe on date 2005-02-03"

    making the above text a little bit more readable to the user. I started
    off with a program which finds the different key/value pairs and,
    based on the values, appends to or creates a string with appropriate
    words, like

    pseudocode only

    parse the line,
    load a hashmap with the key, value pairs
    if(hash{action}=='commit') <---this is a mandatory field
    string.="Commits"
    if(defined hash{user})
    string.="by hash{user}"
    if(defined hash{date})
    string.="on date hash{date}"
    ...................................
    ...................................
    if(hash{action}=='checkout') <---this is a mandatory field
    string.="Checkouts"
    if(defined hash{user})
    string.="by hash{user}"
    if(defined hash{date})
    string.="on date hash{date}"
    ..............................................
    .............................................
    I was thinking of this sort of logic, but I am a little apprehensive about
    how well it will scale, as I would be handling many actions with separate
    if blocks for each of them. Any suggestions or ideas on how to better
    achieve what I want to do above?

    cheers,
    pop.
    , May 19, 2006
    #1

  2. Guest Guest

    wrote:
    : Hi guys,

    : I am doing this module where I am gonna change the following sentence

    : "1:action=commit:user=joe:date=2005-02-02:"
    : "2:action=checkout:user=mark:date=2005-02-03:"

    : to something like

    : " 1. Commits by user Joe on date 2005-02-02 "
    : " 2. Checkouts by user Joe on date 2005-02-03"


    Check whether all your data follow the same pattern and obey the same
    constraints. Apparently you're doing something in fields here, so:

    $rawtext="1:action=commit:user=joe:date=2005-02-02:";
    ($no,$rawaction,$rawuser,$rawdate)=split(/:/,$rawtext);

    # Treat each raw element like this:
    ($nil,$user)=split(/=/,$rawuser);

    # Keep a hash for full user names (and for actions as well):

    %users = (
    "joe" => "Joe",
    "dan" => "Daniel",
    ...
    );

    # Build your phrase in free English, like:

    print "On $date, user $users{$user} $actions{$action}...";

    Hth,

    Oliver.


    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 19, 2006
    #2
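For reference, the outline in the post above can be fleshed out into a runnable sketch. Only "joe" and "dan" come from the post; the other table entries, the sub name describe, and the exact wording of the phrase are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lookup tables; most of the contents are illustrative assumptions.
my %users = (
    joe  => 'Joe',
    dan  => 'Daniel',
    mark => 'Mark',
);
my %actions = (
    commit   => 'committed',
    checkout => 'checked out',
);

# Turn one raw record into an English phrase.
sub describe {
    my ($rawtext) = @_;
    my ($no, $rawaction, $rawuser, $rawdate) = split /:/, $rawtext;

    # Each raw element looks like "key=value"; keep only the value.
    my (undef, $action) = split /=/, $rawaction;
    my (undef, $user)   = split /=/, $rawuser;
    my (undef, $date)   = split /=/, $rawdate;

    # Fall back to the raw value when a name is missing from a table.
    my $who  = exists $users{$user}     ? $users{$user}     : $user;
    my $what = exists $actions{$action} ? $actions{$action} : $action;

    return "$no. On $date, user $who $what.";
}

print describe($_), "\n" for
    '1:action=commit:user=joe:date=2005-02-02:',
    '2:action=checkout:user=mark:date=2005-02-03:';
```

Supporting a new action or user then means adding one table entry rather than another if block. Like the original outline, this relies on the fields always arriving in the same order.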

  3. <> wrote:

    > I am doing this module where I am gonna change the following sentence
    >
    > "1:action=commit:user=joe:date=2005-02-02:"
    > "2:action=checkout:user=mark:date=2005-02-03:"

    ^^^^
    > to something like
    > " 1. Commits by user Joe on date 2005-02-02 "
    > " 2. Checkouts by user Joe on date 2005-02-03"

    ^^^

    Why did mark's name change to Joe?

    Why a trailing space in the 1st one but not in the 2nd one?

    Why one space in the 1st one but 2 spaces in the 2nd one?

    Are those double quotes actually in your data, or are they
    meant to be "meta"?


    > pseudocode only



    Why?

    It takes only a tiny bit of effort to bypass the confusion
    caused by the pseudoness.

    The value of the answer you can expect to receive is directly
    proportional to the effort you put into forming your question...


    > if(hash{action}=='commit') <---this is a mandatory field



    if( $hash{action} eq 'commit' ) <---this is a mandatory field

    There, that wasn't very hard now was it?


    > Any suggestions or ideas on how to better
    > achieve what I want to do above.


    ----------------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    while ( <DATA> ) {
        chomp;
        chop; # don't need final colon
        my($num, %attrs) = split /[:=]/;
        $attrs{action} .= 's'; # pluralize
        s/(.)/\u$1/ for values %attrs; # upper case 1st letter
        printf "%2d. %s by user %s on date %s\n",
            $num, @attrs{ qw/action user date/ };
    }

    __DATA__
    1:action=commit:user=joe:date=2005-02-02:
    2:action=checkout:user=mark:date=2005-02-03:
    ----------------------------------


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 19, 2006
    #3
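The original question treated action as mandatory and user and date as optional. One possible way to get there from the script above, without a separate if block per action, is to build the sentence from whichever fields are present. This is only a sketch: the sub name format_record and the third sample record are hypothetical, and the records are kept in an array instead of a DATA section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a readable sentence from one record; only "action" is required.
sub format_record {
    my ($line) = @_;
    chomp $line;

    # split drops the empty trailing field left by the final colon
    my ($num, %attrs) = split /[:=]/, $line;
    return unless defined $attrs{action};

    my $text = sprintf '%2d. %ss', $num, ucfirst( $attrs{action} );
    $text .= ' by user ' . ucfirst( $attrs{user} ) if defined $attrs{user};
    $text .= " on date $attrs{date}"              if defined $attrs{date};
    return $text;
}

my @records = (
    "1:action=commit:user=joe:date=2005-02-02:\n",
    "2:action=checkout:user=mark:date=2005-02-03:\n",
    "3:action=commit:date=2005-02-04:\n",            # no user field
);

for my $line (@records) {
    my $text = format_record($line);
    print "$text\n" if defined $text;
}
```

Pluralizing by appending "s" is the same shortcut the script above takes; actions with irregular plurals would need a lookup table instead.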
  4. Dr.Ruud Guest

    schreef:

    > change the following sentence
    >
    > "1:action=commit:user=joe:date=2005-02-02:"
    > "2:action=checkout:user=mark:date=2005-02-03:"
    >
    > to something like
    > " 1. Commits by user Joe on date 2005-02-02 "
    > " 2. Checkouts by user Joe on date 2005-02-03"



    This assumes that the fields are always in the same order:

    #!/usr/bin/perl
    use strict;
    use warnings;

    while ( <DATA> )
    {
        s{ ^ ([^:]+)
           : (action) = ([^:]+)
           : (user) = ([^:]+)
           : (date) = ([^:]+)
           :
         }
         {$1. \u$3s by $4 \u$5 on $6 $7}x
            and print
    }

    __DATA__
    1:action=commit:user=joe:date=2005-02-02:
    2:action=checkout:user=mark:date=2005-02-03:

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, May 19, 2006
    #4
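If the fields can arrive in any order, one possible variation is to collect the key=value pairs with a global match in list context instead of fixing their positions in the pattern. That the order might vary is an assumption here, not something the sample data shows, and parse_record is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the record number and a hash of its key=value pairs,
# whatever order the pairs appear in.
sub parse_record {
    my ($line) = @_;
    my ($num)  = $line =~ /^([^:]+):/;          # leading record number
    my %attrs  = $line =~ /(\w+)=([^:]*)/g;     # every key=value pair
    return ($num, %attrs);
}

for my $line (
    '1:action=commit:user=joe:date=2005-02-02:',
    '2:date=2005-02-03:user=mark:action=checkout:',   # fields reordered
) {
    my ($num, %attrs) = parse_record($line);
    printf "%s. %ss by user %s on date %s\n",
        $num, ucfirst( $attrs{action} ), ucfirst( $attrs{user} ), $attrs{date};
}
```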
  5. DJ Stunks Guest

    Tad McClellan wrote:
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > while ( <DATA> ) {
    > chomp;
    > chop; # don't need final colon


    not necessary, split will not include it as empty trailing fields are
    deleted.

    > my($num, %attrs) = split /[:=]/;


    very nice, I always seem to forget that you can initialize a hash with
    a list in that way.

    > $attrs{action} .= 's'; # pluralize
    > s/(.)/\u$1/ for values %attrs; # upper case 1st letter


    how about:
    ucfirst for values %attrs;

    > printf "%2d. %s by user %s on date %s\n",
    > $num, @attrs{ qw/action user date/ };
    > }
    >
    > __DATA__
    > 1:action=commit:user=joe:date=2005-02-02:
    > 2:action=checkout:user=mark:date=2005-02-03:
    > ----------------------------------


    -jp
    DJ Stunks, May 19, 2006
    #5
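A quick check of the point about chop, as a small sketch. With no explicit limit, split removes empty trailing fields, so the final colon does not produce an extra element; a negative limit keeps it.

```perl
use strict;
use warnings;

my $line = '1:action=commit:user=joe:date=2005-02-02:';

my @default = split /[:=]/, $line;       # trailing empty field removed
my @keep    = split /[:=]/, $line, -1;   # negative limit preserves it

print scalar(@default), " fields by default\n";      # 7
print scalar(@keep),    " fields with limit -1\n";   # 8
```

Seven elements is one record number plus three key/value pairs, which is why the list can be assigned straight to ($num, %attrs) without the chop.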
  6. DJ Stunks Guest

    DJ Stunks wrote:
    > Tad McClellan wrote:
    > > s/(.)/\u$1/ for values %attrs; # upper case 1st letter

    >
    > how about:
    > ucfirst for values %attrs;


    um.....?

    $_ = ucfirst for values %attrs;

    $credibility{jpeavy1}--;

    -jp
    DJ Stunks, May 19, 2006
    #6
  7. DJ Stunks <> wrote:
    > Tad McClellan wrote:



    >> s/(.)/\u$1/ for values %attrs; # upper case 1st letter

    >
    > how about:
    > ucfirst for values %attrs;



    That is a lot better than what I had...

    .... except that it doesn't work. :)

    $_ = ucfirst for values %attrs;


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 19, 2006
    #7
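To see why the bare version has no effect, here is a minimal sketch: ucfirst returns the modified string and leaves its argument alone, so the result has to be assigned back. Because values() yields aliases to the hash elements, assigning to $_ updates the hash in place. The sample hash is made up for the demonstration.

```perl
use strict;
use warnings;

my %attrs = ( action => 'commits', user => 'joe' );

{
    no warnings 'void';          # silence the "useless use" warning
    ucfirst for values %attrs;   # return value discarded; %attrs unchanged
}
my $after_bare = $attrs{user};   # still "joe"

$_ = ucfirst for values %attrs;  # values() aliases the elements, so this sticks
my $after_assign = $attrs{user}; # now "Joe"

print "bare ucfirst: $after_bare, with assignment: $after_assign\n";
```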
  8. Guest Guest

    Tad McClellan <> wrote:

    : s/(.)/\u$1/ for values %attrs; # upper case 1st letter

    Couldn't this be simplified to:

    : s/./\u$&/ for values %attrs; # upper case 1st letter

    ?

    Oliver.
    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 19, 2006
    #8
  9. <-berlin.de> <-berlin.de> wrote:
    > Tad McClellan <> wrote:
    >
    >: s/(.)/\u$1/ for values %attrs; # upper case 1st letter
    >
    > Couldn't this be simplified to:
    >
    >: s/./\u$&/ for values %attrs; # upper case 1st letter
    >
    > ?



    Yes, but cycles are a terrible thing to waste.

    (See $& in perlvar.pod and elsewhere.)


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 19, 2006
    #9
  10. Dr.Ruud Guest

    -berlin.de schreef:
    > Tad McClellan:


    >> s/(.)/\u$1/ for values %attrs; # upper case 1st letter

    >
    > Couldn't this be simplified to:
    >
    > s/./\u$&/ for values %attrs; # upper case 1st letter


    It is not simpler. It might be a tad slower.

    Alternatives:

    $_ = "\u$_" for values %attrs ;

    $_ = ucfirst for values %attrs ;

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, May 19, 2006
    #10
  11. Guest Guest

    Tad McClellan <> wrote:
    : >
    : >: s/(.)/\u$1/ for values %attrs; # upper case 1st letter
    : >
    : > Couldn't this be simplified to:
    : >
    : >: s/./\u$&/ for values %attrs; # upper case 1st letter
    : >

    : Yes, but cycles are a terrible thing to waste.

    : (See $& in perlvar.pod and elsewhere.)

    I thought this is only the case with "use English;", at least that's
    how I understood the "Bugs" section in perlvar (of Perl 5.8.6, that is):

    <quote>
    Due to an unfortunate accident of Perl's implementation, "use English"
    imposes a considerable performance penalty on all regular expression
    matches in a program, regardless of whether they occur in the scope of
    "use English".
    </quote>

    I attributed the penalty to "use English" rather than to the regex
    implementation. I stand corrected.

    Nonetheless, one question may be allowed here: The OP's task was not
    very complicated. Let the quantity of his data be 10,000 lines, on
    anything faster than an x386 processor the performance penalty in this
    simple regex will be unnoticeable, or not?

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 19, 2006
    #11
  12. Guest Guest

    Dr.Ruud <> wrote:

    : >> s/(.)/\u$1/ for values %attrs; # upper case 1st letter
    : >
    : > Couldn't this be simplified to:
    : >
    : > s/./\u$&/ for values %attrs; # upper case 1st letter

    : It is not simpler. It might be a tad slower.

    Taking your and Tad's hint to perlvar with regard to performance
    penalties I kludged a small script and ran it on my Mac mini:

    use strict;
    use warnings;
    # use English;
    for (my $i=1; $i<1000000; $i++) {
        $_='undecided';
        s/./\U$&/;
        # s/(.)/\U$1/;
    }

    which I ran with time, getting the following result:

    $ time perl testscript.pl

    real 0m7.609s
    user 0m7.372s
    sys 0m0.032s

    Then I modified the script:

    use strict;
    use warnings;
    # use English;
    for (my $i=1; $i<1000000; $i++) {
        $_='undecided';
        # s/./\U$&/;
        s/(.)/\U$1/;
    }

    and I get:

    $ time perl testscript.pl

    real 0m7.801s
    user 0m7.549s
    sys 0m0.030s

    I repeated the runs a number of times; the deviations between runs
    were on the order of 1/100 of a second.

    I then tried "use English;" and replaced $& with $MATCH, but the results
    were only insignificantly slower than in the (.)/$<digit>-version.

    Is there anything where I have a fundamental misunderstanding, or has the
    severe performance penalty of which perlvar warns been weeded out in the
    perl code while never being purged from the documentation? Or is my example
    just a trivial exception?

    Oliver.
    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 19, 2006
    #12
  13. <-berlin.de> <-berlin.de> wrote:
    > Tad McClellan <> wrote:
    >: >
    >: >: s/(.)/\u$1/ for values %attrs; # upper case 1st letter
    >: >
    >: > Couldn't this be simplified to:
    >: >
    >: >: s/./\u$&/ for values %attrs; # upper case 1st letter
    >: >
    >
    >: Yes, but cycles are a terrible thing to waste.
    >
    >: (See $& in perlvar.pod and elsewhere.)
    >
    > I thought this is only the case with "use English;", at least that's
    > how I understood the "Bugs" section in perlvar (of Perl 5.8.6, that is):
    >
    ><quote>
    > Due to an unfortunate accident of Perl's implementation, "use English"
    > imposes a considerable performance penalty on all regular expression
    > matches in a program, regardless of whether they occur in the scope of
    > "use English".
    ></quote>



    That _is_ misleading... until it leads to:

    There's a global variable in the perl source, called sawampersand.
    It gets set to true in that moment in which the parser sees one
    of $`, $', and $&. It never can be set to false again. Trying to
    set it to false breaks the handling of the $`, $&, and $'
    completely.

    If the global variable sawampersand is set to true, all subsequent
    RE operations will be accompanied by massive in-memory copying,
    because there is nobody in the perl source who could predict,
    when the (necessary) copy for the ampersand family will be
    needed. So all subsequent REs are considerable slower than
    necessary.

    There are at least three impacts for developers:

    * never use $& and friends in a library.
    * Don't "use English" in a library, because it contains the
    three bad fellows.

    ..... by virtue of the 2nd sentence following your quote above.


    > Nonetheless, one question may be allowed here: The OP's task was not
    > very complicated. Let the quantity of his data be 10,000 lines, on
    > anything faster than an x386 processor the performance penalty in this
    > simple regex will be unnoticeable, or not?



    Even the primary docs for $& can dispatch that:

    The use of this variable anywhere in a program imposes a considerable
    performance penalty on all regular expression matches.
    ^^^
    ^^^

    Assuming that this is part of a significant program, then there are
    lots of pattern matchings going on, and *every one* of them (not
    just this 1 regex that actually makes use of it) gets slower.

    If you mention any of the 3 match variables anywhere in your program,
    *all* of your pattern matches get slower (because perl cannot safely
    apply the optimization of not maintaining the 3 of them).


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 19, 2006
    #13
  14. Ben Morrow Guest

    Quoth <-berlin.de>:
    > Tad McClellan <> wrote:
    > : Yes, but cycles are a terrible thing to waste.
    >
    > : (See $& in perlvar.pod and elsewhere.)
    >
    > I thought this is only the case with "use English;", at least that's
    > how I understood the "Bugs" section in perlvar (of Perl 5.8.6, that is):
    >

    <snip>
    > I attributed the penalty to "use English" rather than to the regex
    > implementation. I stand corrected.


    See perlre, the paragraph beginning

    WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
    the program, it has to provide them for every pattern match. This may
    substantially slow your program.

    English.pm used to cause a general Rx slowdown as it made a use of $&
    (to alias it to $MATCH). As this is not generally useful, current
    versions don't do that if you ask them not to (with -no_match_vars).

    [side issue: my version of perldoc (Pod::perldoc v3.14), in my locale
    (en_GB.UTF-8), transforms the above quote variables to "\$\x{2018}" and
    "\$\x{2019}". In text marked (explicitly or implicitly by perldoc) with
    C<>, this is less than useful. Is it worth filing a bug?]

    > Nonetheless, one question may be allowed here: The OP's task was not
    > very complicated. Let the quantity of his data be 10,000 lines, on
    > anything faster than an x386 processor the performance penalty in this
    > simple regex will be unnoticeable, or not?


    The point is not that it slows that regex down (indeed, s/(.)/\u$1/ has
    the same penalty) but that it slows down *every other regex in the
    program*. This can be significant, so using $& is a bad habit to get
    into, except for one-liners where it can really simplify some things.

    Ben

    --
    I've seen things you people wouldn't believe: attack ships on fire off
    the shoulder of Orion; I watched C-beams glitter in the dark near the
    Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
    Time to die.
    Ben Morrow, May 19, 2006
    #14
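Two ways to sidestep the problem, sketched under the assumption of a reasonably recent perl. The -no_match_vars import is the one mentioned in the post above. The /p modifier with ${^MATCH} is not from this thread: it arrived later, in perl 5.10, and gives access to the matched text without mentioning $&. The sample hash is made up.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);   # long names, minus $PREMATCH, $MATCH, $POSTMATCH

my %attrs = ( action => 'commits', user => 'joe' );

# ${^MATCH} holds the matched text when the regex carries /p (perl 5.10+).
s/./\u${^MATCH}/p for values %attrs;

{
    local $LIST_SEPARATOR = ' by user ';   # English name for $"
    my @parts = @attrs{ qw/action user/ };
    print "@parts\n";                      # Commits by user Joe
}
```

For this particular job ucfirst is still the simpler tool; the sketch only shows that the long English names stay available without importing the three match variables.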
  15. Ben Morrow Guest

    Quoth <-berlin.de>:
    > Taking your and Tad's hint to perlvar with regard to performance
    > penalties I kludged a small script and ran it on my Mac mini:
    >
    > use strict;
    > use warnings;
    > # use English;
    > for (my $i=1; $i<1000000; $i++) {
    > $_='undecided';
    > s/./\U$&/;
    > # s/(.)/\U$1/;
    > }
    >
    > which I ran with time, getting the following result:


    I would suggest Benchmark.pm for benchmarking :). It is easier and more
    flexible than using time(1).

    <results snipped>

    > Then I modified the script:
    >
    > use strict;
    > use warnings;
    > # use English;
    > for (my $i=1; $i<1000000; $i++) {
    > $_='undecided';
    > # s/./\U$&/;
    > s/(.)/\U$1/;
    > }


    > I repeated the runs a number of times; the deviations between runs
    > were on the order of 1/100 of a second.
    >
    > I then tried "use English;" and replaced $& with $MATCH, but the results
    > were only insignificantly slower than in the (.)/$<digit>-version.
    >
    > Is there anything where I have a fundamental misunderstanding, or has the
    > severe performance penalty of which perlvar warns been weeded out in the
    > perl code while never being purged from the documentation? Or is my example
    > just a trivial exception?


    Any match which uses capturing parens has the same penalty as using $&.
    It's the ones which *don't* which suffer if you use $&. See my post
    cross-thread, and perlre.

    Ben

    --
    And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
    * *
    Ben Morrow, May 19, 2006
    #15
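A sketch of the same comparison with Benchmark.pm, as suggested in the post above. The labels and the iteration count are arbitrary choices. Keep the caveat from this thread in mind: mentioning $& anywhere in the file affects every regex in it, so the variants are not measured in isolation. On newer perls (5.20 and later, as far as I know) the penalty has largely been removed, so the numbers may show little difference.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Each sub uppercases the first letter of the same input and returns it.
my %variants = (
    ampersand => sub { my $s = 'undecided'; $s =~ s/./\u$&/;   $s },
    capture   => sub { my $s = 'undecided'; $s =~ s/(.)/\u$1/; $s },
    ucfirst   => sub { my $s = 'undecided'; ucfirst $s },
);

# Run each variant 1,000,000 times and print a comparison table.
cmpthese( 1_000_000, \%variants );
```

To measure what the thread is really about, the cost imposed on unrelated regexes, one would time a pattern without capturing parens in two separate programs, one that mentions $& and one that does not.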
  16. Guest Guest

    Tad McClellan <> wrote:
    : (Oliver quoted:)
    : ><quote>
    : > Due to an unfortunate accident of Perl's implementation, "use English"
    : > imposes a considerable performance penalty on all regular expression
    : > matches in a program, regardless of whether they occur in the scope of
    : > "use English".
    : ></quote>


    : That _is_ misleading... until it leads to:

    [substantial information snipped]

    Did you quote this verbatim from perlvar? Or from perlre? I ask because my
    copy of perlvar (Perl 5.8.6) ends the annotation on bugs with the phrase:

    <quote>
    See the Devel::SawAmpersand module documentation
    from CPAN ( http://www.cpan.org/modules/by-module/Devel/ ) for more
    information.
    </quote>

    I _must_ confess I was too tired yesterday night to look that document up.

    : * never use $& and friends in a library.
    : * Don't "use English" in a library, because it contains the
    : three bad fellows.

    : .... by virtue of the 2nd sentence following your quote above.


    : Even the primary docs for $& can dispatch that:

    : The use of this variable anywhere in a program imposes a considerable
    : performance penalty on all regular expression matches.
    : ^^^
    : ^^^

    : Assuming that this is part of a significant program, then there are
    : lots of pattern matchings going on, and *every one* of them (not
    : just this 1 regex that actually makes use of it) gets slower.

    So I have to craft a little test script myself in order to see the magnitude
    of penalty.

    Thank you very much for the insight!

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 20, 2006
    #16
  17. Guest Guest

    Ben Morrow <> wrote:

    : > use strict;
    : > use warnings;
    : > # use English;
    : > for (my $i=1; $i<1000000; $i++) {
    : > $_='undecided';
    : > s/./\U$&/;
    : > # s/(.)/\U$1/;
    : > }
    : >
    : > which I ran with time, getting the following result:

    : I would suggest Benchmark.pm for benchmarking :). It is easier and more
    : flexible than using time(1).

    Next time I'll do it. Using time(1) is just a die-hard habit of mine, born
    in the days when there was no Benchmark.pm module.

    : > I then tried "use English;" and replaced $& with $MATCH, but the results
    : > were only insignificantly slower than in the (.)/$<digit>-version.
    : >
    : Any match which uses capturing parens has the same penalty as using $&.
    : It's the ones which *don't* which suffer if you use $&. See my post
    : cross-thread, and perlre.

    Now I understand. It is not $& vs. $<digit>, but $& et collegae vs. rest
    of the world. Thank you!

    Oliver.
    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 20, 2006
    #17
  18. Anno Siegel Guest

    Tad McClellan <> wrote in comp.lang.perl.misc:
    > <> wrote:


    [...]

    > > pseudocode only

    >
    >
    > Why?
    >
    > It takes only a tiny bit of effort to bypass the confusion
    > caused by the pseudoness.


    Unfortunately, the label "pseudocode" is often used as a license to
    write anything that comes to mind and let the reader figure out how
    the parts fit together.

    Unless you are acquainted with a specific pseudo-language you use, writing
    decent pseudocode is *harder*, not easier, than using an existing language.
    You'll find yourself inventing the language as you go along. Language
    design is serious business, pseudo or not. You won't come up with anything
    consistent that way.

    Pseudocode is for books, not for casual communication.

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
    Anno Siegel, May 20, 2006
    #18
  19. Guest

    Anno Siegel ha escrito:

    > Tad McClellan <> wrote in comp.lang.perl.misc:
    > > <> wrote:

    >
    > [...]
    >
    > > > pseudocode only

    > >
    > >
    > > Why?
    > >
    > > It takes only a tiny bit of effort to bypass the confusion
    > > caused by the pseudoness.

    >
    > Unfortunately, the label "pseudocode" is often used as a license to
    > write anything that comes to mind and let the reader figure out how
    > the parts fit together.
    >
    > Unless you are acquainted with a specific pseudo-language you use, writing
    > decent pseudocode is *harder*, not easier, than using an existing language.
    > You'll find yourself inventing the language as you go along. Language
    > design is serious business, pseudo or not. You won't come up with anything
    > consistent that way.
    >
    > Pseudocode is for books, not for casual communication.
    >
    > Anno


    As Tad McClellan and you have mentioned, framing my question with
    a little more effort would have been good. I agree.


    Thanks for the info folks
    cheers
    pop.
    , May 20, 2006
    #19
  20. <-berlin.de> <-berlin.de> wrote:
    > Tad McClellan <> wrote:
    >: (Oliver quoted:)
    >: ><quote>
    >: > Due to an unfortunate accident of Perl's implementation, "use English"
    >: > imposes a considerable performance penalty on all regular expression
    >: > matches in a program, regardless of whether they occur in the scope of
    >: > "use English".
    >: ></quote>
    >
    >
    >: That _is_ misleading... until it leads to:
    >
    > [substantial information snipped]
    >
    > Did you quote this verbatim from perlvar? Or from perlre? I ask because my
    > copy of perlvar (Perl 5.8.6) ends the annotation on bugs with the phrase:
    >
    ><quote>
    > See the Devel::SawAmpersand module documentation
    > from CPAN ( http://www.cpan.org/modules/by-module/Devel/ ) for more
    > information.
    ></quote>
    >
    > I _must_ confess I was too tired yesterday night to look that document up.



    Then I think you can probably guess where I quoted it from, eh?

    :)


    Anyway, _I_ think the issue deserves the "good treatment" in the
    std docs rather than by reference to something else that you have
    to go get...


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 20, 2006
    #20