parsing out all data between two words with multiple instances in a file.

Discussion in 'Perl Misc' started by KP, Feb 12, 2004.

  1. KP

    KP Guest

    I'm trying to find a way to extract all the data between two words
    in a file, where the pair of words occurs several times and each
    occurrence encloses data I need. The data file looks like this:

    <blah>
    <junk>
    <pre>
    ....importantdata1
    ....importantdata2
    ....importantdata3
    </pre>
    <blah>
    <junk>
    <morejunk>
    <pre>
    ....importantdata1
    ....importantdata2
    ....importantdata3
    </pre>
    </morejunk>
    <blah>

    Afterwards, the script would print something like this to the
    console:

    ....importantdata1
    ....importantdata2
    ....importantdata3
    ;
    ....importantdata1
    ....importantdata2
    ....importantdata3

    Note: I would separate those instances with a semicolon, so that later
    I could split this data into separate files.

    Thanks
     
    KP, Feb 12, 2004
    #1

  2. KP

    Ben Morrow Guest

    (KP) wrote:
    > I'm trying to find a way to parse out all data between two words
    > within a file that contains multiple instances where important data
    > would be extracted out. The data file would look like such.

    <snip>
    > Note: I would separate those instances with a semi-colon. So later down
    > the road I could parse this data into separate files.


    Why don't you have a go, then we can help you improve it? Start by
    reading up on the .. operator in perldoc perlop.
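
    As a rough illustration of the .. operator mentioned above, here is a
    minimal sketch. The sample lines and the @inside array are made up for
    the example; only the <pre> and </pre> markers come from the question.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample input, modelled on the data shown in the first post.
    my @lines = map { "$_\n" } qw(<blah> <pre> data1 data2 </pre> <junk> <pre> data3 </pre>);

    my @inside;
    for (@lines) {
        # In scalar context, .. is false until the left operand matches, then
        # stays true up to and including the line where the right operand matches.
        push @inside, $_ if /<pre>/ .. m|</pre>|;
    }

    print @inside;
    ```

    Note that the delimiter lines themselves count as part of the range, so
    they are collected too.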

    Alternatively, if your data actually is an XML file, you may find it
    easier to use one of the XML parsing modules (I'd recommend
    XML::LibXML for this sort of thing).

    Ben

    --
    I've seen things you people wouldn't believe: attack ships on fire off the
    shoulder of Orion; I've watched C-beams glitter in the darkness near the
    Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
    Time to die. |-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-|
     
    Ben Morrow, Feb 12, 2004
    #2

  3. KP

    gnari Guest

    "KP" <> wrote in message
    news:...
    > I'm trying to find a way to parse out all data between two words
    > within a file that contains multiple instances where important data
    > would be extracted out. The data file would look like such.


    [snipped problem]

    you forgot to tell us what you have tried, and why it failed.


    gnari
     
    gnari, Feb 12, 2004
    #3
  4. KP

    KP Guest

    my $file;
    my $fileinfo;

    open (FILE, $FileHandle) || die;

    while ($file = <FILE> )
    {
        if ($file =~ /.*<junk>.*/)
        {
            while ($fileinfo = <FILE>)
            {
                if ($fileinfo =~ /.*<pre>(.*)<\/pre>/)
                {
                    $FileList = $FileList .';'. $1;
                }
            }
            last;
        }
    }
    print $FileList;

     
    KP, Feb 13, 2004
    #4
  5. KP

    Ben Morrow Guest

    (KP) wrote:
    > my $file;
    > my $fileinfo;
    >
    > open (FILE, $FileHandle) || die;


    Put $! and the name of the file in the error message.
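
    A sketch of what such an error message might look like. The helper name
    open_for_reading and the path are hypothetical, and the three-argument
    open with a lexical handle is a substitution for the bareword FILE used
    in the post.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: open a file for reading or die with a useful message.
    sub open_for_reading {
        my ($path) = @_;
        # On failure, name the file and include the system error text from $!.
        open(my $fh, '<', $path)
            or die "Cannot open '$path' for reading: $!\n";
        return $fh;
    }

    # Usage: a missing file produces a message naming the file and the cause.
    my $fh    = eval { open_for_reading('/nonexistent/dir/no-such-file.txt') };
    my $error = $@;
    print "open failed: $error" unless $fh;
    ```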

    > while ($file = <FILE> )


    This reads FILE a line at a time. This means you will get at most one
    of your <pre> tags at once... so it isn't going to work.
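
    To illustrate the point: a pattern that needs <pre> and </pre> in the
    same string can only match once both are in one string. The sketch below
    shows that alternative (reading everything into one scalar), which is not
    the approach suggested further down. The sample text and variable names
    are invented for the example.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample standing in for the contents of the data file.
    my $text = join("\n", qw(<junk> <pre> ....importantdata1 ....importantdata2
                             </pre> <blah> <pre> ....importantdata3 </pre>)) . "\n";

    # Line by line: no single line holds both tags, so nothing matches.
    my $per_line = grep { $_ =~ m|<pre>(.*)</pre>| } split /^/, $text;

    # Whole string: /s lets . cross newlines, .*? stops at the first </pre>,
    # and /g in list context returns every captured block.
    my @blocks = $text =~ m|<pre>\n(.*?)</pre>|sg;

    print "lines matching: $per_line\n";
    print join(";\n", @blocks);
    ```

    Slurping is only reasonable when the file fits comfortably in memory.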

    > {
    > if ($file =~ /.*<junk>.*/)
    > {
    > while ($fileinfo = <FILE>)
    > {
    > if ($fileinfo =~ /.*<pre>(.*)<\/pre>)
    > {
    > $FileList = $FileList .';'. $1;
    > }
    > }last;
    > }
    > }
    > print $FileList;


    You want something more like (untested):

    my $semi;
    while (<FILE>) {

        if (/<pre>/ .. m|</pre>|) {
            if ($semi) {
                print ";\n";
                undef $semi;
            }
            print;
        }

        # Remember that a block has ended until the next one starts.
        $semi = 1 if m|</pre>|;
    }

    Hmmm... I feel it should be possible to make that more elegant. Ah
    well.

    Ben

    --
    Like all men in Babylon I have been a proconsul; like all, a slave ... During
    one lunar year, I have been declared invisible; I shrieked and was not heard,
    I stole my bread and was not decapitated.
    ~ ~ Jorge Luis Borges, 'The Babylon Lottery'
     
    Ben Morrow, Feb 13, 2004
    #5
  7. KP

    Anno Siegel Guest

    Ben Morrow <> wrote in comp.lang.perl.misc:
    >
    > (KP) wrote:
    > > my $file;
    > > my $fileinfo;
    > >
    > > open (FILE, $FileHandle) || die;

    >
    > Put $! and the name of the file in the error message.
    >
    > > while ($file = <FILE> )

    >
    > This reads FILE a line at a time. This means you will get at most one
    > of your <pre> tags at once... so it isn't going to work.
    >
    > > {
    > > if ($file =~ /.*<junk>.*/)
    > > {
    > > while ($fileinfo = <FILE>)
    > > {
    > > if ($fileinfo =~ /.*<pre>(.*)<\/pre>)
    > > {
    > > $FileList = $FileList .';'. $1;
    > > }
    > > }last;
    > > }
    > > }
    > > print $FileList;

    >
    > You want something more like (untested):
    >
    > my $semi;
    > while (<FILE>) {
    >
    > if (/<pre>/ .. m|</pre>|) {
    > if ($semi) {
    > print ";\n";
    > undef $semi;
    > }
    > print;
    > }
    >
    > $semi = m|</pre>|;
    > }
    >
    > Hmmm... I feel it should be possible to make that more elegant. Ah
    > well.


    Well, for one it prints the delimiting "<pre>" and "</pre>", which
    is unwanted.

    If the ";" lines are allowed to follow every block (instead of appearing
    only between blocks), there is no need for an auxiliary variable. So
    I'd rewrite your solution like this:

    my $from = qr/<pre>/;
    my $to = qr|</pre>|;

    while ( <FILE> ) {
        if ( /$from/ .. /$to/ ) {
            print unless /$from/ or /$to/;
            print ";\n" if /$to/;
        }
    }

    Or

    /$from/ .. /$to/ and (/$from/ or print /$to/ ? ";\n" : $_) while <FILE>;

    :)

    Anno
     
    Anno Siegel, Feb 13, 2004
    #7
  8. KP

    Uri Guttman Guest

    Re: parsing out all data between two words with multiple instances in a file.

    >>>>> "AS" == Anno Siegel <-berlin.de> writes:

    AS> my $from = qr/<pre>/;
    AS> my $to = qr|</pre>|;

    AS> while ( <FILE> ) {
    AS> if ( /$from/ .. /$to/ ) {
    AS> print unless /$from/ or /$to/;
    AS> print ";\n" if /$to/;
    AS> }
    AS> }

    bah!! how many of you have ever seen or used the RETURN value from
    scalar range?

    while ( <FILE> ) {
        if ( my $range = /$from/ .. /$to/ ) {
            print ";\n" and next if $range == 1 ;
            print unless $range =~ /e/i ;
        }
    }

    you can then put the regexes back in the .. line as you don't need them
    again.

    while ( <FILE> ) {
        if ( my $range = /<pre>/ .. m|</pre>|) {
            print ";\n" and next if $range == 1 ;
            print unless $range =~ /e/i ;
        }
    }
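
    For readers who have not seen the return value of scalar .. before, a
    small sketch of what it looks like on each line. The sample lines and the
    @seen array are made up for the illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample lines; only the <pre> ... </pre> markers matter.
    my @lines = qw(<blah> <pre> data1 data2 </pre> <junk>);

    my @seen;
    for (@lines) {
        # Scalar .. returns '' (false) outside the range, a sequence number
        # starting at 1 inside it, and appends "E0" on the range's last line.
        my $range = /<pre>/ .. m|</pre>|;
        push @seen, $_ . '=' . ($range || 'outside');
    }

    print join(' ', @seen), "\n";
    # prints: <blah>=outside <pre>=1 data1=2 data2=3 </pre>=4E0 <junk>=outside
    ```

    That is why the loops above test for 1 (the opening line) and for an "E"
    (the closing line).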

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Feb 13, 2004
    #8
  9. KP

    Ben Morrow Guest

    Re: parsing out all data between two words with multiple instances in a file.

    Uri Guttman <> wrote:
    > while ( <FILE> ) {
    > if ( my $range = /<pre>/ .. m|</pre>|) {
    > print ";\n" and next if $range == 1 ;
    > print unless $range =~ /e/i ;
    > }
    > }


    Ah... thank you! I had just been reading about the return of .., and
    was sure it could be used here... this prints an extra semi at the
    start though.
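
    One possible way to avoid the extra leading semicolon, still using the
    return value of .., is to count the blocks seen so far. This is only a
    sketch: the sample lines, the $blocks counter and the $out buffer are
    assumptions made for the example, not part of the code quoted above.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample with two blocks, modelled on the first post.
    my @lines = map { "$_\n" } qw(<blah> <pre> a1 a2 </pre> <junk> <pre> b1 </pre> <blah>);

    my $blocks = 0;    # number of <pre> blocks started so far
    my $out    = '';
    for (@lines) {
        my $range = /<pre>/ .. m|</pre>| or next;   # skip lines outside a block
        if ($range == 1) {                          # opening <pre> line
            $out .= ";\n" if $blocks++;             # separator before 2nd, 3rd, ... block
            next;
        }
        $out .= $_ unless $range =~ /E0$/;          # drop the closing </pre> line
    }

    print $out;
    ```

    With the sample above this prints a1 and a2, a line holding only a
    semicolon, and then b1.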

    Ben

    --
    And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
    * *
     
    Ben Morrow, Feb 13, 2004
    #9
  10. KP

    Uri Guttman Guest

    Re: parsing out all data between two words with multiple instances in a file.

    >>>>> "BM" == Ben Morrow <> writes:

    BM> Uri Guttman <> wrote:
    >> while ( <FILE> ) {
    >> if ( my $range = /<pre>/ .. m|</pre>|) {
    >> print ";\n" and next if $range == 1 ;
    >> print unless $range =~ /e/i ;
    >> }
    >> }


    BM> Ah... thank you! I had just been reading about the return of .., and
    BM> was sure it could be used here... this prints an extra semi at the
    BM> start though.

    i wasn't sure of the requirements and i didn't check carefully. i just
    wanted to show the use of the range value. it is dreadfully
    underutilized. i have written many line by line parsers with similar logic.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Feb 13, 2004
    #10
  11. KP

    Anno Siegel Guest

    Re: parsing out all data between two words with multiple instances in a file.

    Uri Guttman <> wrote in comp.lang.perl.misc:
    > >>>>> "AS" == Anno Siegel <-berlin.de> writes:

    >
    > AS> my $from = qr/<pre>/;
    > AS> my $to = qr|</pre>|;
    >
    > AS> while ( <FILE> ) {
    > AS> if ( /$from/ .. /$to/ ) {
    > AS> print unless /$from/ or /$to/;
    > AS> print ";\n" if /$to/;
    > AS> }
    > AS> }
    >
    > bah!! how many of you have ever seen or used the RETURN value from
    > scalar range?
    >
    > while ( <FILE> ) {
    > if ( my $range = /<pre>/ .. m|</pre>|) {
    > print ";\n" and next if $range == 1 ;
    > print unless $range =~ /e/i ;
    > }
    > }


    Now you mention it, yes, there's that behavior, obviously meant to
    cover cases like this.

    ".." is a bit like "split" in that it has a lot of special cases and
    DWIMish behavior, to a degree that makes it hard to keep up with
    everything. Thanks for pointing it out.
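
    One example of those special cases, as a sketch: when an operand of
    scalar .. is a constant, it is compared against $. (the current input
    line number) instead of being used as a plain true/false value. The
    five-line input and the @picked array below are invented for the
    illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical five-line input read through an in-memory filehandle.
    my $text = join '', map { "line $_\n" } 1 .. 5;
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

    my @picked;
    while (<$fh>) {
        # With constant operands, scalar .. compares each against $. (the
        # current input line number), so this selects lines 2 through 4.
        push @picked, $_ if 2 .. 4;
    }
    close $fh;

    print @picked;
    ```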

    Anno
     
    Anno Siegel, Feb 13, 2004
    #11