parsing out all data between two words with multiple instances in a file.

KP

I'm trying to find a way to extract all of the data between two words
in a file that contains multiple instances of that important data.
The data file looks like this:

<blah>
<junk>
<pre>
....importantdata1
....importantdata2
....importantdata3
</pre>
<blah>
<junk>
<morejunk>
<pre>
....importantdata1
....importantdata2
....importantdata3
</pre>
</morejunk>
<blah>

Note: Afterwards the script would print something like this to the
console:

....importantdata1
....importantdata2
....importantdata3
;
....importantdata1
....importantdata2
....importantdata3

Note: I would separate those instances with a semicolon, so that later
on I could split this data into separate files.

Thanks
 
Ben Morrow

KP said:
I'm trying to find a way to parse out all data between two words
within a file that contains multiple instances where important data
would be extracted out. The data file would look like such.
[...]
Note: I would separate those instances with a semi-colon. So later down
the road I could parse this data into separate files.

Why don't you have a go, then we can help you improve it? Start by
reading up on the .. operator in perldoc perlop.

Alternatively, if your data actually is an XML file, you may find it
easier to use one of the XML parsing modules (I'd recommend
XML::LibXML for this sort of thing).

Ben
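
For readers who have not met the .. operator from perldoc perlop in scalar context (the "flip-flop"), here is a minimal sketch of what it does. The sample lines and the helper name lines_in_range are made up for illustration and are not from the thread. Note that the flip-flop includes the delimiter lines themselves.

```perl
use strict;
use warnings;

# Sample input modelled on the data in the question (illustrative only).
my @lines = map { "$_\n" } (
    '<blah>', '<junk>', '<pre>', '....importantdata1', '</pre>', '<blah>',
);

# Hypothetical helper: collect the lines the flip-flop selects.
# In scalar context, `..` becomes true on a line matching <pre> and stays
# true up to and including the next line matching </pre>.
sub lines_in_range {
    my @out;
    for (@_) {
        push @out, $_ if /<pre>/ .. m|</pre>|;
    }
    return @out;
}

# Prints the <pre> line, the data line and the </pre> line.
print lines_in_range(@lines);
```

The flip-flop keeps its state between calls, so this sketch assumes every <pre> in the input is eventually closed by a </pre>.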
 
gnari

KP said:
I'm trying to find a way to parse out all data between two words
within a file that contains multiple instances where important data
would be extracted out. The data file would look like such.

[snipped problem]

you forgot to tell us what you have tried, and why it failed.


gnari
 
KP

my $file;
my $fileinfo;

open (FILE, $FileHandle) || die;

while ($file = <FILE> )
{
    if ($file =~ /.*<junk>.*/)
    {
        while ($fileinfo = <FILE>)
        {
            if ($fileinfo =~ /.*<pre>(.*)<\/pre>/)
            {
                $FileList = $FileList .';'. $1;
            }
        }
        last;
    }
}
print $FileList;

 
Ben Morrow

KP said:
my $file;
my $fileinfo;

open (FILE, $FileHandle) || die;

Put $! and the name of the file in the error message.

KP said:
while ($file = <FILE> )

This reads FILE a line at a time. This means you will get at most one
of your <pre> and </pre> tags in any given line, so a pattern that
needs both on the same line can never match.

KP said:
{
if ($file =~ /.*<junk>.*/)
{
while ($fileinfo = <FILE>)
{
if ($fileinfo =~ /.*<pre>(.*)<\/pre>/)
{
$FileList = $FileList .';'. $1;
}
}last;
}
}
print $FileList;

You want something more like (untested):

my $semi;
while (<FILE>) {

    if (/<pre>/ .. m|</pre>|) {
        if ($semi) {
            print ";\n";
            undef $semi;
        }
        print;
    }

    # Remember that a block has ended until the next one starts.
    $semi ||= m|</pre>|;
}

Hmmm... I feel it should be possible to make that more elegant. Ah
well.

Ben
 
Anno Siegel

Ben Morrow said:
You want something more like (untested):

[snipped code]

Hmmm... I feel it should be possible to make that more elegant. Ah
well.

Well, for one it prints the delimiting "<pre>" and "</pre>", which
is unwanted.

If the ";" lines are allowed to follow every block (instead of appearing
only between blocks), there is no need for an auxiliary variable. So
I'd rewrite your solution like this:

my $from = qr/<pre>/;
my $to = qr|</pre>|;

while ( <FILE> ) {
    if ( /$from/ .. /$to/ ) {
        print unless /$from/ or /$to/;
        print ";\n" if /$to/;
    }
}

Or

/$from/ .. /$to/ and (/$from/ or print /$to/ ? ";\n" : $_) while <FILE>;

:)

Anno
 
Uri Guttman

AS> my $from = qr/<pre>/;
AS> my $to = qr|</pre>|;

AS> while ( <FILE> ) {
AS> if ( /$from/ .. /$to/ ) {
AS> print unless /$from/ or /$to/;
AS> print ";\n" if /$to/;
AS> }
AS> }

bah!! how many of you have ever seen or used the RETURN value from
scalar range?

while ( <FILE> ) {
    if ( my $range = /$from/ .. /$to/ ) {
        print ";\n" and next if $range == 1 ;
        print unless $range =~ /e/i ;
    }
}

you can then put the regexes back in the .. line as you don't need them
again.

while ( <FILE> ) {
    if ( my $range = /<pre>/ .. m|</pre>|) {
        print ";\n" and next if $range == 1 ;
        print unless $range =~ /e/i ;
    }
}
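
To make the return value concrete, here is a small sketch that records what the scalar range operator yields for each line. The helper name range_values and the sample lines are hypothetical. Per perldoc perlop, the value is the empty string outside the range, a sequence number starting at 1 inside it, and the last number of a range has "E0" appended.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of the scalar range operator
# for every input line, in order.
sub range_values {
    my @vals;
    for (@_) {
        my $range = /<pre>/ .. m|</pre>|;
        push @vals, $range;
    }
    return @vals;
}

my @lines = map { "$_\n" } (
    '<junk>', '<pre>', '....importantdata1', '....importantdata2',
    '</pre>', '<blah>',
);

# Prints: [],[1],[2],[3],[4E0],[]
print join(',', map { "[$_]" } range_values(@lines)), "\n";
```

This is why the test $range == 1 picks out the opening line and the match against /e/i picks out the closing one.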

uri
 
Ben Morrow

Uri Guttman said:
while ( <FILE> ) {
if ( my $range = /<pre>/ .. m|</pre>|) {
print ";\n" and next if $range == 1 ;
print unless $range =~ /e/i ;
}
}

Ah... thank you! I had just been reading about the return of .., and
was sure it could be used here... this prints an extra semi at the
start though.

Ben
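
One possible way to drop the leading semicolon while still using the range value is sketched below. It is not code from the thread. The helper name extract_blocks and the $seen_block flag are made up for illustration, and the sketch assumes each tag sits on a line of its own, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text inside every <pre>...</pre> block,
# with ";\n" placed between blocks only (none before the first block,
# none after the last).
sub extract_blocks {
    my @out;
    my $seen_block = 0;    # set once a complete block has been collected
    for (@_) {
        my $range = /<pre>/ .. m|</pre>|;
        next unless $range;              # outside any block
        if ($range =~ /E0$/) {           # closing </pre> line
            $seen_block = 1;
            next;
        }
        if ($range == 1) {               # opening <pre> line
            push @out, ";\n" if $seen_block;
            next;
        }
        push @out, $_;                   # a data line
    }
    return join '', @out;
}

my @lines = map { "$_\n" } qw(
    <blah> <junk> <pre> ....importantdata1 ....importantdata2 </pre>
    <blah> <junk> <pre> ....importantdata3 </pre> <blah>
);
print extract_blocks(@lines);
```

To run it against a real file, the lines could come from a lexical filehandle instead, for example open my $fh, '<', $path or die "Cannot open $path: $!"; followed by print extract_blocks(<$fh>);, where $path is a placeholder for the input file name.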
 
Uri Guttman

BM> Ah... thank you! I had just been reading about the return of .., and
BM> was sure it could be used here... this prints an extra semi at the
BM> start though.

i wasn't sure of the requirements and i didn't check carefully. i just
wanted to show the use of the range value. it is dreadfully under
utilized. i have written many line by line parsers with similar logic.

uri
 
Anno Siegel

Uri Guttman said:
AS> my $from = qr/<pre>/;
AS> my $to = qr|</pre>|;

AS> while ( <FILE> ) {
AS> if ( /$from/ .. /$to/ ) {
AS> print unless /$from/ or /$to/;
AS> print ";\n" if /$to/;
AS> }
AS> }

bah!! how many of you have ever seen or used the RETURN value from
scalar range?

while ( <FILE> ) {
if ( my $range = /<pre>/ .. m|</pre>|) {
print ";\n" and next if $range == 1 ;
print unless $range =~ /e/i ;
}
}

Now you mention it, yes, there's that behavior, obviously meant to
cover cases like this.

".." is a bit like "split" in that it has a lot of special cases and
DWIMish behavior, to a degree that makes it hard to keep up with
everything. Thanks for pointing it out.

Anno
 
