Text parsing and substitution

maheshpop1 · May 19, 2006

Hi guys,

I am doing this module where I am gonna change the following sentence

"1:action=commit:user=joe:date=2005-02-02:"
"2:action=checkout:user=mark:date=2005-02-03:"

to something like
" 1. Commits by user Joe on date 2005-02-02 "
" 2. Checkouts by user Joe on date 2005-02-03"

making the above text a little bit more readable to the user. I started
off with a program which finds the different key/value pairs and,
based on the values, appends to or creates a string with appropriate words,
like

pseudocode only

parse the line,
load a hashmap with the key, value pairs
if(hash{action}=='commit') <---this is a mandatory field
    string.="Commits"
if(defined hash{user})
    string.="by hash{user}"
if(defined hash{date})
    string.="on date hash{date}"
...................................
...................................
if(hash{action}=='checkout') <---this is a mandatory field
    string.="Checkouts"
if(defined hash{user})
    string.="by hash{user}"
if(defined hash{date})
    string.="on date hash{date}"
..............................................
.............................................
I was thinking of this sort of logic, but I am a little apprehensive about how
elastic it can be, as I would be addressing so many actions with separate if
blocks for all of them. Any suggestions or ideas on how to better
achieve what I want to do above.

cheers,
pop.

Guest · May 19, 2006

(e-mail address removed) wrote:
: Hi guys,

: I am doing this module where I am gonna change the following sentence

: "1:action=commit:user=joe:date=2005-02-02:"
: "2:action=checkout:user=mark:date=2005-02-03:"

: to something like

: " 1. Commits by user Joe on date 2005-02-02 "
: " 2. Checkouts by user Joe on date 2005-02-03"

Check whether all your data follow the same pattern and obey the same
constraints. Apparently you're working with fields here, so:

$rawtext="1:action=commit:user=joe:date=2005-02-02:";
($no,$rawaction,$rawuser,$rawdate)=split(/:/,$rawtext);

# Treat each raw element like this:
($nil,$user)=split(/=/,$rawuser);

# Keep a hash for full user names (and for actions as well):

%users = (
    "joe" => "Joe",
    "dan" => "Daniel",
    # ...
);

# Build your phrase in free English, like:

print "On $date, user $users{$user} $actions{$action}...";
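A minimal runnable sketch of the lookup-table idea above, which also covers the OP's worry about one if block per action. The table contents, the sub name `describe`, and the fallback to `ucfirst` for users missing from the table are assumptions for illustration, not part of the original post. It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Hypothetical lookup tables; extend them as new actions and users appear.
my %actions = ( commit => 'Commits', checkout => 'Checkouts' );
my %users   = ( joe => 'Joe', dan => 'Daniel' );

sub describe {
    my ($rawtext) = @_;

    # First field is the line number, the rest are key=value pairs.
    # split drops the empty field after the final colon.
    my ( $no, @pairs ) = split /:/, $rawtext;

    my %field;
    for my $pair (@pairs) {
        my ( $key, $value ) = split /=/, $pair, 2;
        $field{$key} = $value;
    }

    # action is mandatory; refuse lines without a known one
    my $action = $actions{ $field{action} // '' }
        or die "unknown or missing action in '$rawtext'\n";

    my $phrase = "$no. $action";
    if ( defined $field{user} ) {
        # fall back to capitalising the raw name when it is not in the table
        my $name = $users{ $field{user} } // ucfirst $field{user};
        $phrase .= " by user $name";
    }
    $phrase .= " on date $field{date}" if defined $field{date};
    return $phrase;
}

print describe($_), "\n"
    for '1:action=commit:user=joe:date=2005-02-02:',
        '2:action=checkout:user=mark:date=2005-02-03:';
```

Adding a new action then means adding one entry to `%actions` rather than another if block, and the field order in the input no longer matters.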

Hth,

Oliver.

Tad McClellan · May 19, 2006

I am doing this module where I am gonna change the following sentence

"1:action=commit:user=joe:date=2005-02-02:"
"2:action=checkout:user=mark:date=2005-02-03:"
                        ^^^^
to something like
" 1. Commits by user Joe on date 2005-02-02 "
" 2. Checkouts by user Joe on date 2005-02-03"
                       ^^^

Why did mark's name change to Joe?

Why a trailing space in the 1st one but not in the 2nd one?

Why one space in the 1st one but 2 spaces in the 2nd one?

Are those double quotes actually in your data, or are they
meant to be "meta"?

pseudocode only

Why?

It takes only a tiny bit of effort to bypass the confusion
caused by the pseudoness.

The value of the answer you can expect to receive is directly
proportional to the effort you put into forming your question...

if(hash{action}=='commit') <---this is a mandatory field

if( $hash{action} eq 'commit' ) <---this is a mandatory field

There, that wasn't very hard now was it?

Any suggestions or ideas on how to better
achieve what I want to do above.

----------------------------------
#!/usr/bin/perl
use warnings;
use strict;

while ( <DATA> ) {
    chomp;
    chop; # don't need final colon
    my($num, %attrs) = split /[:=]/;
    $attrs{action} .= 's'; # pluralize
    s/(.)/\u$1/ for values %attrs; # upper case 1st letter
    printf "%2d. %s by user %s on date %s\n",
        $num, @attrs{ qw/action user date/ };
}

__DATA__
1:action=commit:user=joe:date=2005-02-02:
2:action=checkout:user=mark:date=2005-02-03:

Dr.Ruud · May 19, 2006

(e-mail address removed) wrote:

change the following sentence

"1:action=commit:user=joe:date=2005-02-02:"
"2:action=checkout:user=mark:date=2005-02-03:"

to something like
" 1. Commits by user Joe on date 2005-02-02 "
" 2. Checkouts by user Joe on date 2005-02-03"

This assumes that the fields are always in the same order:

#!/usr/bin/perl
use strict;
use warnings;

while ( <DATA> )
{
    s{ ^ ([^:]+)
       : (action) = ([^:]+)
       : (user)   = ([^:]+)
       : (date)   = ([^:]+)
       :
     }
     {$1. \u$3s by $4 \u$5 on $6 $7}x
        and print
}

__DATA__
1:action=commit:user=joe:date=2005-02-02:
2:action=checkout:user=mark:date=2005-02-03:

DJ Stunks · May 19, 2006

Tad said:
#!/usr/bin/perl
use warnings;
use strict;

while ( <DATA> ) {
chomp;
chop; # don't need final colon

not necessary, split will not include it as empty trailing fields are
deleted.
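A small sketch of that split behaviour, for anyone who wants to check it. The variable names are made up for the illustration. Note that the chomp is still needed: a newline after the final colon would count as a non-empty last field.

```perl
use strict;
use warnings;

my $line = '1:action=commit:user=joe:date=2005-02-02:';

# With no LIMIT argument, split discards empty trailing fields,
# so the final colon does not produce an extra element.
my @fields = split /[:=]/, $line;
print scalar(@fields), " fields, last is '$fields[-1]'\n";

# A negative LIMIT keeps the empty trailing field.
my @kept = split /[:=]/, $line, -1;
print scalar(@kept), " fields when trailing empties are kept\n";
```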

my($num, %attrs) = split /[:=]/;

very nice, I always seem to forget that you can initialize a hash with
a list in that way.

$attrs{action} .= 's'; # pluralize
s/(.)/\u$1/ for values %attrs; # upper case 1st letter

how about:
ucfirst for values %attrs;

printf "%2d. %s by user %s on date %s\n",
$num, @attrs{ qw/action user date/ };
}

__DATA__
1:action=commit:user=joe:date=2005-02-02:
2:action=checkout:user=mark:date=2005-02-03:
----------------------------------

-jp

DJ Stunks · May 19, 2006

DJ said:
how about:
ucfirst for values %attrs;

um.....?

$_ = ucfirst for values %attrs;

$credibility{jpeavy1}--;

-jp

Tad McClellan · May 19, 2006

DJ Stunks said:
Tad McClellan wrote:

how about:
ucfirst for values %attrs;

That is a lot better than what I had...

.... except that it doesn't work.

$_ = ucfirst for values %attrs;
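A short sketch of why the assignment is needed: `ucfirst` returns a capitalised copy and leaves its argument alone, while `for` over `values` aliases `$_` to each hash value, so assigning to `$_` writes back into the hash. The sample hash and the `$before` variable are assumptions for the illustration.

```perl
use strict;
use warnings;

my %attrs = ( action => 'commits', user => 'joe' );

# ucfirst only returns the capitalised copy; the hash is not modified here.
my @copies = map { ucfirst } values %attrs;
my $before = $attrs{user};    # still lower case at this point

# Inside the loop $_ is an alias for each hash value,
# so assigning to it updates the hash in place.
$_ = ucfirst for values %attrs;

print "$attrs{action} by $attrs{user}\n";
```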

Guest · May 19, 2006

: s/(.)/\u$1/ for values %attrs; # upper case 1st letter

Couldn't this be simplified to:

: s/./\u$&/ for values %attrs; # upper case 1st letter

?

Oliver.

Tad McClellan · May 19, 2006

: s/(.)/\u$1/ for values %attrs; # upper case 1st letter

Couldn't this be simplified to:

: s/./\u$&/ for values %attrs; # upper case 1st letter

?

Yes, but cycles are a terrible thing to waste.

(See $& in perlvar.pod and elsewhere.)

Dr.Ruud · May 19, 2006

(e-mail address removed)-berlin.de wrote:

Tad McClellan:

Couldn't this be simplified to:

s/./\u$&/ for values %attrs; # upper case 1st letter

It is not simpler. It might be a tad slower.

Alternatives:

$_ = "\u$_" for values %attrs ;

$_ = ucfirst for values %attrs ;

Guest · May 19, 2006

: >
: >: s/(.)/\u$1/ for values %attrs; # upper case 1st letter
: >
: > Couldn't this be simplified to:
: >
: >: s/./\u$&/ for values %attrs; # upper case 1st letter
: >

: Yes, but cycles are a terrible thing to waste.

: (See $& in perlvar.pod and elsewhere.)

I thought this is only the case with "use English;", at least that's
how I understood the "Bugs" section in perlvar (of Perl 5.8.6, that is):

<quote>
Due to an unfortunate accident of Perl's implementation, "use English"
imposes a considerable performance penalty on all regular expression
matches in a program, regardless of whether they occur in the scope of
"use English".
</quote>

I attributed the penalty to "use English" rather than to the regex
implementation. I stand corrected.

Nonetheless, one question may be allowed here: the OP's task was not
very complicated. Say his data runs to 10,000 lines; on
anything faster than a x386 processor, won't the performance penalty of this
simple regex be unnoticeable?

Oliver.

Guest · May 19, 2006

: >> s/(.)/\u$1/ for values %attrs; # upper case 1st letter
: >
: > Couldn't this be simplified to:
: >
: > s/./\u$&/ for values %attrs; # upper case 1st letter

: It is not simpler. It might be a tad slower.

Taking your and Tad's hint to perlvar with regard to performance
penalties, I kludged a small script and ran it on my Mac mini:

use strict;
use warnings;
# use English;
for (my $i=1; $i<1000000; $i++) {
$_='undecided';
s/./\U$&/;
# s/(.)/\U$1/;
}

which I ran with time, getting the following result:

$ time perl testscript.pl

real 0m7.609s
user 0m7.372s
sys 0m0.032s

Then I modified the script:

use strict;
use warnings;
# use English;
for (my $i=1; $i<1000000; $i++) {
$_='undecided';
# s/./\U$&/;
s/(.)/\U$1/;
}

and I get:

$ time perl testscript.pl

real 0m7.801s
user 0m7.549s
sys 0m0.030s

I repeated the runs a number of times; the deviations between runs
were on the order of 1/100 of a second.

I then tried "use English;" and replaced $& with $MATCH, but the results
were only insignificantly slower than in the (.)/$<digit>-version.

Is there anything where I have a fundamental misunderstanding, or has the
severe performance penalty of which perlvar warns been weeded out in the
perl code while never being purged from the documentation? Or is my example
just a trivial exception?

Oliver.

Tad McClellan · May 19, 2006

: >
: >: s/(.)/\u$1/ for values %attrs; # upper case 1st letter
: >
: > Couldn't this be simplified to:
: >
: >: s/./\u$&/ for values %attrs; # upper case 1st letter
: >

: Yes, but cycles are a terrible thing to waste.

: (See $& in perlvar.pod and elsewhere.)

I thought this is only the case with "use English;", at least that's
how I understood the "Bugs" section in perlvar (of Perl 5.8.6, that is):

<quote>
Due to an unfortunate accident of Perl's implementation, "use English"
imposes a considerable performance penalty on all regular expression
matches in a program, regardless of whether they occur in the scope of
"use English".
</quote>

That _is_ misleading... until it leads to:

There's a global variable in the perl source, called sawampersand.
It gets set to true in that moment in which the parser sees one
of $`, $', and $&. It never can be set to false again. Trying to
set it to false breaks the handling of the $`, $&, and $'
completely.

If the global variable sawampersand is set to true, all subsequent
RE operations will be accompanied by massive in-memory copying,
because there is nobody in the perl source who could predict,
when the (necessary) copy for the ampersand family will be
needed. So all subsequent REs are considerable slower than
necessary.

There are at least three impacts for developers:

* never use $& and friends in a library.
* Don't "use English" in a library, because it contains the
three bad fellows.

..... by virtue of the 2nd sentence following your quote above.

Nonetheless, one question may be allowed here: the OP's task was not
very complicated. Say his data runs to 10,000 lines; on
anything faster than a x386 processor, won't the performance penalty of this
simple regex be unnoticeable?

Even the primary docs for $& can dispatch that:

The use of this variable anywhere in a program imposes a considerable
performance penalty on all regular expression matches.
                       ^^^
                       ^^^

Assuming that this is part of a significant program, then there are
lots of pattern matchings going on, and *every one* of them (not
just this 1 regex that actually makes use of it) gets slower.

If you mention any of the 3 match variables anywhere in your program,
*all* of your pattern matches get slower (because perl cannot safely
apply the optimization of not maintaining the 3 of them).

Ben Morrow · May 19, 2006

Quoth said:
: Yes, but cycles are a terrible thing to waste.

: (See $& in perlvar.pod and elsewhere.)

I thought this is only the case with "use English;", at least that's
how I understood the "Bugs" section in perlvar (of Perl 5.8.6, that is):

I attributed the penalty to "use English" rather than to the regex
implementation. I stand corrected.

See perlre, the paragraph beginning

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
the program, it has to provide them for every pattern match. This may
substantially slow your program.

English.pm used to cause a general Rx slowdown as it made a use of $&
(to alias it to $MATCH). As this is not generally useful, current
versions don't do that if you ask them not to (with -no_match_vars).

[side issue: my version of perldoc (Pod::Perldoc v3.14), in my locale
(en_GB.UTF-8), transforms the above quote variables to "\$\x{2018}" and
"\$\x{2019}". In text marked (explicitly or implicitly by perldoc) with

Nonetheless, one question may be allowed here: the OP's task was not
very complicated. Say his data runs to 10,000 lines; on
anything faster than a x386 processor, won't the performance penalty of this
simple regex be unnoticeable?

The point is not that it slows that regex down (indeed, s/(.)/\u$1/ has
the same penalty) but that it slows down *every other regex in the
program*. This can be significant, so using $& is a bad habit to get
into, except for one-liners where it can really simplify some things.
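For readers on a newer perl than the 5.8.6 discussed in this thread: perlvar for Perl 5.10 and later documents the `/p` modifier and `${^MATCH}` as a way to get the matched text for one match only, and perlvar for 5.20 and later says copy-on-write removed the program-wide cost of `$&` and friends. A minimal sketch, assuming Perl 5.10 or later; the sub name `capitalise` is hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # /p and ${^MATCH} need Perl 5.10 or later

# Upper-case the first character using the per-match variable
# ${^MATCH} rather than the program-wide match variable.
sub capitalise {
    my ($text) = @_;
    $text =~ s/./ucfirst(${^MATCH})/pe;
    return $text;
}

print capitalise('joe'), "\n";
```

On older perls the capturing form `s/(.)/\u$1/`, or plain `ucfirst`, remains the safe choice.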

Ben

Ben Morrow · May 19, 2006

Quoth said:
Taking your and Tad's hint to perlvar with regard to performance
penalties, I kludged a small script and ran it on my Mac mini:

use strict;
use warnings;
# use English;
for (my $i=1; $i<1000000; $i++) {
$_='undecided';
s/./\U$&/;
# s/(.)/\U$1/;
}

which I ran with time, getting the following result:

I would suggest Benchmark.pm for benchmarking. It is easier and more
flexible than using time(1).

Then I modified the script:

use strict;
use warnings;
# use English;
for (my $i=1; $i<1000000; $i++) {
$_='undecided';
# s/./\U$&/;
s/(.)/\U$1/;
}

I repeated the runs a number of times; the deviations between runs
were on the order of 1/100 of a second.

I then tried "use English;" and replaced $& with $MATCH, but the results
were only insignificantly slower than in the (.)/$<digit>-version.

Is there anything where I have a fundamental misunderstanding, or has the
severe performance penalty of which perlvar warns been weeded out in the
perl code while never being purged from the documentation? Or is my example
just a trivial exception?

Any match which uses capturing parens has the same penalty as using $&.
It's the ones which *don't* which suffer if you use $&. See my post
cross-thread, and perlre.
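A hedged sketch of what a Benchmark.pm comparison could look like. It compares a match without capturing parens against one with them; the candidate names and the test string are assumptions. Because the penalty described above is program-wide, a variant that mentions the match variables would have to be timed in a separate program, or it would slow both candidates here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two hypothetical candidates: a plain match and one with capturing
# parentheses. Neither uses the special match variables, so the
# program-wide penalty is not triggered in this file.
my %candidates = (
    no_capture => sub { my $s = 'undecided'; my $hit = $s =~ /./;   $hit },
    capture    => sub { my $s = 'undecided'; my $hit = $s =~ /(.)/; $hit },
);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, \%candidates );
```

The numbers will vary by perl version and machine, so treat the table as a relative comparison only.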

Ben

Guest · May 20, 2006

: (Oliver quoted

: ><quote>
: > Due to an unfortunate accident of Perl's implementation, "use English"
: > imposes a considerable performance penalty on all regular expression
: > matches in a program, regardless of whether they occur in the scope of
: > "use English".
: ></quote>

: That _is_ misleading... until it leads to:

[substantial information snipped]

Did you quote this verbatim from perlvar? Or from perlre? I ask because my
copy of perlvar (Perl 5.8.6) ends the annotation on bugs with the phrase:

<quote>
See the Devel::SawAmpersand module documentation
from CPAN ( http://www.cpan.org/modules/by-module/Devel/ ) for more
information.
</quote>

I _must_ confess I was too tired last night to look that document up.

: * never use $& and friends in a library.
: * Don't "use English" in a library, because it contains the
: three bad fellows.

: .... by virtue of the 2nd sentence following your quote above.

: Even the primary docs for $& can dispatch that:

: The use of this variable anywhere in a program imposes a considerable
: performance penalty on all regular expression matches.
:                        ^^^
:                        ^^^

: Assuming that this is part of a significant program, then there are
: lots of pattern matchings going on, and *every one* of them (not
: just this 1 regex that actually makes use of it) gets slower.

So I have to craft a little test script myself in order to see the magnitude
of penalty.

Thank you very much for the insight!

Oliver.

Guest · May 20, 2006

: > use strict;
: > use warnings;
: > # use English;
: > for (my $i=1; $i<1000000; $i++) {
: > $_='undecided';
: > s/./\U$&/;
: > # s/(.)/\U$1/;
: > }
: >
: > which I ran with time, getting the following result:

: I would suggest Benchmark.pm for benchmarking. It is easier and more
: flexible than using time(1).

Next time I'll do it. Using time(1) is just a die-hard habit of mine, born
in the days when there was no Benchmark.pm module.

: > I then tried "use English;" and replaced $& with $MATCH, but the results
: > were only insignificantly slower than in the (.)/$<digit>-version.
: >
: Any match which uses capturing parens has the same penalty as using $&.
: It's the ones which *don't* which suffer if you use $&. See my post
: cross-thread, and perlre.

Now I understand. It is not $& vs. $<digit>, but $& et collegae vs. rest
of the world. Thank you!

Oliver.

Anno Siegel · May 20, 2006

Tad McClellan said:
[...]

pseudocode only

Why?

It takes only a tiny bit of effort to bypass the confusion
caused by the pseudoness.

Unfortunately, the label "pseudocode" is often used as a license to
write anything that comes to mind and let the reader figure out how
the parts fit together.

Unless you are acquainted with a specific pseudo-language you use, writing
decent pseudocode is *harder*, not easier, than using an existing language.
You'll find yourself inventing the language as you go along. Language
design is serious business, pseudo or not. You won't come up with anything
consistent that way.

Pseudocode is for books, not for casual communication.

Anno

maheshpop1 · May 20, 2006

Anno Siegel wrote:

Tad McClellan said:

[...]

pseudocode only

Why?

It takes only a tiny bit of effort to bypass the confusion
caused by the pseudoness.

Unfortunately, the label "pseudocode" is often used as a license to
write anything that comes to mind and let the reader figure out how
the parts fit together.

Unless you are acquainted with a specific pseudo-language you use, writing
decent pseudocode is *harder*, not easier, than using an existing language.
You'll find yourself inventing the language as you go along. Language
design is serious business, pseudo or not. You won't come up with anything
consistent that way.

Pseudocode is for books, not for casual communication.

Anno

As Tad McClellan and you have mentioned, framing my question with
a little more effort would have been good. I agree.

Thanks for the info folks
cheers
pop.

Tad McClellan · May 20, 2006

: (Oliver quoted
: ><quote>
: > Due to an unfortunate accident of Perl's implementation, "use English"
: > imposes a considerable performance penalty on all regular expression
: > matches in a program, regardless of whether they occur in the scope of
: > "use English".
: ></quote>

: That _is_ misleading... until it leads to:

[substantial information snipped]

Did you quote this verbatim from perlvar? Or from perlre? I ask because my
copy of perlvar (Perl 5.8.6) ends the annotation on bugs with the phrase:

<quote>
See the Devel::SawAmpersand module documentation
from CPAN ( http://www.cpan.org/modules/by-module/Devel/ ) for more
information.
</quote>

I _must_ confess I was too tired last night to look that document up.

Then I think you can probably guess where I quoted it from, eh?

Anyway, _I_ think the issue deserves the "good treatment" in the
std docs rather than by reference to something else that you have
to go get...
