parsing out all data between two words with multiple instances in a file.

KP · Feb 12, 2004

I'm trying to find a way to extract all the data between two words
in a file. The file contains multiple instances of these blocks of
important data. The data file looks like this:

<blah>
<junk>
<pre>
....importantdata1
....importantdata2
....importantdata3
</pre>
<blah>
<junk>
<morejunk>
<pre>
....importantdata1
....importantdata2
....importantdata3
</pre>
</morejunk>
<blah>

Afterwards, the script would print something like this to the
console:

....importantdata1
....importantdata2
....importantdata3
;
....importantdata1
....importantdata2
....importantdata3

Note: I would separate those instances with a semicolon, so that later
on I could parse this data into separate files.

Thanks

Ben Morrow · Feb 12, 2004

KP said:
I'm trying to find a way to parse out all data between two words
within a file that contains multiple instances where important data
would be extracted out. The data file would look like such.

Note: I would separate those instances with a semi-colon. So later down
the road I could parse this data into separate files.

Why don't you have a go, then we can help you improve it? Start by
reading up on the .. operator in perldoc perlop.
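
As a starting point, here is a minimal sketch of the scalar-context ".." operator described in perldoc perlop. The sub name lines_in_range and the in-memory sample text are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Return every line from a line matching $from through the next line
# matching $to, for each such block in $text (delimiter lines included).
# lines_in_range is a hypothetical helper name used only for this sketch.
sub lines_in_range {
    my ($text, $from, $to) = @_;
    my @kept;
    open my $in, '<', \$text or die "Cannot open in-memory handle: $!";
    while (my $line = <$in>) {
        # In scalar context ".." is false until the left test succeeds,
        # then stays true up to and including the line where the right
        # test succeeds.
        push @kept, $line if $line =~ $from .. $line =~ $to;
    }
    close $in;
    return @kept;
}

my $sample = "<junk>\n<pre>\ndata1\n</pre>\n<blah>\n";
print lines_in_range($sample, qr/<pre>/, qr{</pre>});
# prints:
# <pre>
# data1
# </pre>
```

Note that the delimiter lines themselves are part of the range, so they have to be filtered out separately if they are unwanted.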

Alternatively, if your data actually is an XML file, you may find it
easier to use one of the XML parsing modules (I'd recommend
XML::LibXML for this sort of thing).

Ben

gnari · Feb 12, 2004

KP said:
I'm trying to find a way to parse out all data between two words
within a file that contains multiple instances where important data
would be extracted out. The data file would look like such.

[snipped problem]

you forgot to tell us what you have tried, and why it failed.

gnari

KP · Feb 13, 2004

my $file;
my $fileinfo;

open (FILE, $FileHandle) || die;

while ($file = <FILE> )
{
    if ($file =~ /.*<junk>.*/)
    {
        while ($fileinfo = <FILE>)
        {
            if ($fileinfo =~ /.*<pre>(.*)<\/pre>/)
            {
                $FileList = $FileList .';'. $1;
            }
        }
        last;
    }
}
print $FileList;


Ben Morrow · Feb 13, 2004

KP said:
my $file;
my $fileinfo;

open (FILE, $FileHandle) || die;

Put $! and the name of the file in the error message.
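
A minimal sketch of that advice, assuming a three-argument open with a lexical filehandle. The sub name open_for_reading and the file name data.txt are placeholders, not anything from the posted script.

```perl
use strict;
use warnings;

# Open $path for reading, or die with a message that names the file and
# includes the operating system's reason for the failure (from $!).
# open_for_reading is a hypothetical helper name.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Cannot open '$path' for reading: $!\n";
    return $fh;
}

# 'data.txt' is a placeholder file name used only for illustration.
my $fh = eval { open_for_reading('data.txt') };
print $@ unless $fh;
```

With the file name and $! in the message, a failure says which file could not be opened and why, instead of a bare "Died".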

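KP said:
while ($file = <FILE> )

This reads FILE a line at a time. This means you will get at most one
of your <pre> or </pre> tags in any given line, so a pattern that needs
both on the same line can never match.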
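
Because the loop sees one line at a time, a pattern that needs <pre> and </pre> in the same string cannot match this data. One alternative, which is not the approach developed in this thread, is to read the whole file into a single string and match across lines with the /s modifier. The sketch below uses a hypothetical sub name, pre_blocks, and made-up sample text.

```perl
use strict;
use warnings;

# Return the text found between each <pre> ... </pre> pair in $text.
# pre_blocks is a hypothetical helper name used only for this sketch.
sub pre_blocks {
    my ($text) = @_;
    # /s lets "." match newlines, ".*?" stops at the nearest </pre>,
    # and /g in list context returns every captured block.
    return $text =~ m{<pre>\n?(.*?)</pre>}gs;
}

my $sample = "<junk>\n<pre>\ndata1\ndata2\n</pre>\n<blah>\n<pre>\ndata3\n</pre>\n";
print join(";\n", pre_blocks($sample));
# prints:
# data1
# data2
# ;
# data3
```

To apply this to a real file, the contents could be read in one go, for example with "local $/;" in effect before reading from the filehandle. This trades memory for simplicity, so it suits small files better than large ones.
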

KP said:
{
    if ($file =~ /.*<junk>.*/)
    {
        while ($fileinfo = <FILE>)
        {
            if ($fileinfo =~ /.*<pre>(.*)<\/pre>/)
            {
                $FileList = $FileList .';'. $1;
            }
        }
        last;
    }
}
print $FileList;

You want something more like (untested):

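my $semi;
while (<FILE>) {

    if (/<pre>/ .. m|</pre>|) {
        if ($semi) {
            print ";\n";
            undef $semi;
        }
        print;
    }

    # Only set the flag here; assigning the match result on every line
    # would clear it again on the first line after </pre>.
    $semi = 1 if m|</pre>|;
}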

Hmmm... I feel it should be possible to make that more elegant. Ah
well.

Ben

Anno Siegel · Feb 13, 2004

Ben Morrow said:
Put $! and the name of the file in the error message.

This reads FILE a line at a time. This means you will get at most one

You want something more like (untested):

my $semi;
while (<FILE>) {

    if (/<pre>/ .. m|</pre>|) {
        if ($semi) {
            print ";\n";
            undef $semi;
        }
        print;
    }

    $semi = 1 if m|</pre>|;
}

Hmmm... I feel it should be possible to make that more elegant. Ah
well.

Well, for one it prints the delimiting "<pre>" and "</pre>", which
is unwanted.

If the ";" lines are allowed to follow every block (instead of appearing
only between blocks), there is no need for an auxiliary variable. So
I'd rewrite your solution like this:

my $from = qr/<pre>/;
my $to = qr|</pre>|;

while ( <FILE> ) {
    if ( /$from/ .. /$to/ ) {
        print unless /$from/ or /$to/;
        print ";\n" if /$to/;
    }
}

Or

/$from/ .. /$to/ and (/$from/ or print /$to/ ? ";\n" : $_) while <FILE>;

Anno

Uri Guttman · Feb 13, 2004

AS> my $from = qr/<pre>/;
AS> my $to = qr|</pre>|;

AS> while ( <FILE> ) {
AS> if ( /$from/ .. /$to/ ) {
AS> print unless /$from/ or /$to/;
AS> print ";\n" if /$to/;
AS> }
AS> }

bah!! how many of you have ever seen or used the RETURN value from
scalar range?

while ( <FILE> ) {
    if ( my $range = /$from/ .. /$to/ ) {
        print ";\n" and next if $range == 1 ;
        print unless $range =~ /e/i ;
    }
}

you can then put the regexes back in the .. line as you don't need them
again.

while ( <FILE> ) {
    if ( my $range = /<pre>/ .. m|</pre>|) {
        print ";\n" and next if $range == 1 ;
        print unless $range =~ /e/i ;
    }
}
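
To make the return value concrete, here is a small sketch that records what scalar ".." returns for each line inside a block. Per perldoc perlop, it is a sequence number starting at 1, with "E0" appended on the last line of the range. The sub name range_values and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Pair each line inside a <pre> ... </pre> block with the value that the
# scalar range operator returned for it.
# range_values is a hypothetical helper name used only for this sketch.
sub range_values {
    my ($text) = @_;
    my @seen;
    open my $in, '<', \$text or die "Cannot open in-memory handle: $!";
    while (my $line = <$in>) {
        # Assigning to a scalar puts ".." in scalar (flip-flop) context.
        my $range = ($line =~ /<pre>/ .. $line =~ m{</pre>});
        next unless $range;    # empty string outside a block
        chomp $line;
        push @seen, "$range $line";
    }
    close $in;
    return @seen;
}

print "$_\n" for range_values("<junk>\n<pre>\na\nb\n</pre>\n<blah>\n");
# prints:
# 1 <pre>
# 2 a
# 3 b
# 4E0 </pre>
```

The value 1 therefore marks the opening line and the "E0" suffix marks the closing line, which is what the tests on $range above rely on.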

uri

Ben Morrow · Feb 13, 2004

Uri Guttman said:
while ( <FILE> ) {
    if ( my $range = /<pre>/ .. m|</pre>|) {
        print ";\n" and next if $range == 1 ;
        print unless $range =~ /e/i ;
    }
}

Ah... thank you! I had just been reading about the return of .., and
was sure it could be used here... this prints an extra semi at the
start though.
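
One possible way to avoid the leading semicolon, sketched under the assumption that the separator should appear only between blocks: count the blocks seen so far and skip the separator before the first one. The sub name extract_blocks, the counter $blocks_seen, and the sample text are all hypothetical additions, not part of the code posted above.

```perl
use strict;
use warnings;

# Return the lines between <pre> and </pre> (delimiter lines dropped),
# with a ";" line placed only between consecutive blocks.
# extract_blocks and $blocks_seen are hypothetical names for this sketch.
sub extract_blocks {
    my ($text) = @_;
    my $out         = '';
    my $blocks_seen = 0;
    open my $in, '<', \$text or die "Cannot open in-memory handle: $!";
    while (my $line = <$in>) {
        my $range = ($line =~ /<pre>/ .. $line =~ m{</pre>});
        next unless $range;             # outside any block
        if ($range == 1) {              # the opening <pre> line
            # Emit the separator before every block except the first.
            $out .= ";\n" if $blocks_seen++;
        }
        elsif ($range !~ /E0$/) {       # not the closing </pre> line
            $out .= $line;
        }
    }
    close $in;
    return $out;
}

print extract_blocks("<blah>\n<pre>\na\nb\n</pre>\n<junk>\n<pre>\nc\n</pre>\n");
```

A block whose <pre> and </pre> sit on the same line gets the value "1E0" and is treated as an opening line here, so nothing from it is kept. That case does not occur in the sample data from the original post.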

Ben

Uri Guttman · Feb 13, 2004

BM> Ah... thank you! I had just been reading about the return of .., and
BM> was sure it could be used here... this prints an extra semi at the
BM> start though.

i wasn't sure of the requirements and i didn't check carefully. i just
wanted to show the use of the range value. it is dreadfully under
utilized. i have written many line by line parsers with similar logic.

uri

Anno Siegel · Feb 13, 2004

Uri Guttman said:
AS> my $from = qr/<pre>/;
AS> my $to = qr|</pre>|;

AS> while ( <FILE> ) {
AS> if ( /$from/ .. /$to/ ) {
AS> print unless /$from/ or /$to/;
AS> print ";\n" if /$to/;
AS> }
AS> }

bah!! how many of you have ever seen or used the RETURN value from
scalar range?

while ( <FILE> ) {
    if ( my $range = /<pre>/ .. m|</pre>|) {
        print ";\n" and next if $range == 1 ;
        print unless $range =~ /e/i ;
    }
}

Now you mention it, yes, there's that behavior, obviously meant to
cover cases like this.

".." is a bit like "split" in that it has a lot of special cases and
DWIMish behavior, to a degree that makes it hard to keep up with
everything. Thanks for pointing it out.
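
One more of those special cases, documented in perldoc perlop: when an operand of scalar ".." is a constant, it is compared against the current input line number in $. rather than being tested for truth. A minimal sketch follows; the sub name lines_two_to_three and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Return lines 2 through 3 of $text. With constant operands, scalar ".."
# compares each operand against $. (the current input line number).
# lines_two_to_three is a hypothetical helper name.
sub lines_two_to_three {
    my ($text) = @_;
    my @kept;
    open my $in, '<', \$text or die "Cannot open in-memory handle: $!";
    while (my $line = <$in>) {
        push @kept, $line if 2 .. 3;
    }
    close $in;
    return @kept;
}

print lines_two_to_three("a\nb\nc\nd\n");
# prints:
# b
# c
```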

Anno
