Search/Replace text in XML file

Discussion in 'Perl Misc' started by Lax, Jan 9, 2008.

  1. Lax

    Lax Guest

    Hello all,
    I'm trying to search and replace the value of a tag in an xml file.
    I'm not in a position to use the usual XML parsers as the version of
    Perl I'm required to use doesn't contain any of the XML libraries.
    I can use Text::Balanced, but I want to deal with the xml file on a
    line-by-line basis, as the value of my tag could stretch over
    multiple lines.

    Perl Version:
    This is perl, v5.8.7 built for sun4-solaris

    Sample xml file:
    -------------------------

    <project xmlns="xml:header">

    <version>1.0.0</version>

    <SomeTag>
    <version>invalid version</version>
    </SomeTag>


    <SomeAnotherTagNested1>
    <SomeAnotherTagNested2>
    <SomeAnotherTagNested3>
    <version>invalid version</version>
    </SomeAnotherTagNested3>
    </SomeAnotherTagNested2>
    </SomeAnotherTagNested1>

    <version>stand-alone, but not valid either</version>

    </project>

    -------------------------

    I only want the version tags that are not enclosed in any other
    tags.
    I want to replace the 1.0.0 (an example value) with 2.0.0 on a
    stand-alone "version"'s first occurrence.
    I came up with the following:

    --------------------

    #!/usr/local/bin/perl

    use strict ;
    use File::Copy ;

    die "Usage: replace.pl <xml file>!\n" unless ( $#ARGV == 0 ) ;
    my $file = shift ;

    open(IN,"$file") or die "Cant open file: $!\n" ;
    chomp(my @arr = <IN> ) ;
    close(IN) ;

    open(OUT,"> bak") or die "Cant open file: $!\n" ;

    # Two flags,
    # $tag_flag -- to check if we're inside a tag
    # $version_flag -- to check if we've replaced version tag already.

    my $tag_flag = "off" ;
    my $version_flag = "off" ;

    foreach my $line ( @arr )
    {
        # Don't consider the open and close of top-level <project> tag.
        if ( $line =~ /^\s*\<(\/)?project/ )
        {
            print OUT "$line\n" ;
            next ;
        }

        # Found <version>, replace version string if tag_flag is on and
        # version_flag is off.
        elsif ( ($line =~ /^\s*\<version\>/) && ( $tag_flag eq "off" ) &&
                ( $version_flag eq "off" ) )
        {
            # print "Flag: $flag\n" ;
            print OUT "<version>2.0.0</version>\n" ;
            $tag_flag = "on" ;
            $version_flag = "on" ;
        }

        # Inside an open tag "<", tag_flag on.
        elsif ( ( $line =~ /^\s*\<.*\>/ ) && ( $line !~ /^\s*\<\/.*\>/ ) )
        {
            print OUT "$line\n" ;
            $tag_flag = "on" ;
        }

        # Inside a close tag "</", tag_flag on.
        elsif ( $line =~ /^\s*\<\/.*\>/ )
        {
            print OUT "$line\n" ;
            $tag_flag = "off" ;
        } else {
            print OUT "$line\n" ;
        }
    }
    close(OUT) ;

    # Move bak file to original

    ------------------------------------------

    The above script works, and a "diff bak <xml-file>" gives me the
    expected result when the stand-alone <version> is all on one line, but I
    can't get this working when it's extended over multiple lines.

    Could anyone give me some pointers, please?

    Thanks,
    Lax
     
    Lax, Jan 9, 2008
    #1

  2. Lax

    Lax Guest

    On Jan 9, 2:21 pm, Lax wrote:
    >         # Found <version>, replace version string if tag_flag is on and
    > version_flag is off.
    >         # Inside an open tag "<", tag_flag on.
    >         # Inside a close tag "</", tag_flag on.


    Please ignore the inaccurate values for off/on in the comments, the
    code has proper values for the flags, sorry.

    Thanks,
    Lax
     
    Lax, Jan 9, 2008
    #2

  3. Jim Gibson wrote:
    > Lax wrote:
    >
    >> Hello all,
    >> I'm trying to search and replace the value of a tag in an xml file.
    >> I'm not in a position to use the usual XML parsers as the version of
    >> Perl I'm required to use
    >> doesn't contain any of the XML libraries. I can use Text::Balanced, but
    >> I want to deal with the xml file on a
    >> line-by-line basis, as the value of my tag could stretch over multiple
    >> lines.

    >
    > [data, program snipped]
    >
    >> ------------------------------------------
    >>
    >> The above script works, and a "diff bak <xml-file>" gives me the
    >> expected result when the stand-alone <version> is all on one line, but I
    >> can't get this working when it's extended over multiple lines.
    >>
    >> Could anyone give me some pointers, please?

    >
    > Read the entire file into a single scalar:
    >
    > my $contents = do { local $/; <IN> };
    >
    > Then add the /s modifier to your regular expression so that the '.'
    > special character will match the newlines embedded in your string.
    >
    > See 'perldoc 'q entire' and 'perldoc perlre'.

    ITYM: perldoc -q entire
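
A minimal sketch of the slurp-and-/s approach described above, assuming the file has already been read into a scalar. The helper name replace_first_version and the sample string are made up for illustration. Note that this replaces the first <version> anywhere in the text, nested or not:

```perl
use strict;
use warnings;

# Hypothetical helper: replace the content of the first <version> element
# found anywhere in $xml. With /s, '.' also matches newlines, so a value
# that spans several lines is still matched.
sub replace_first_version {
    my ($xml, $new) = @_;
    $xml =~ s{(<version\s*>).*?(</version\s*>)}{$1$new$2}s;
    return $xml;
}

# In a real script the string would come from the file, e.g.
#   my $xml = do { local $/; <IN> };
my $xml = "<project>\n<version>\n1.0.0\n</version>\n</project>\n";
print replace_first_version($xml, '2.0.0');
```

The non-greedy .*? matters here: with a greedy .* and /s, the match would run to the last </version> in the file.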



    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
     
    John W. Krahn, Jan 9, 2008
    #3
  4. Lax wrote:

    > I'm trying to search and replace the value of a tag in an xml file.



    No you're not.

    You are trying to search and replace the value of an element in an xml file.

    See the XML FAQ:

    http://xml.silmaril.ie/authors/makeup/


    > Sample xml file:
    > -------------------------
    >
    ><project xmlns="xml:header">
    >
    > <version>1.0.0</version>
    >
    > <SomeTag>
    > <version>invalid version</version>
    > </SomeTag>
    >
    >
    > <SomeAnotherTagNested1>
    > <SomeAnotherTagNested2>
    > <SomeAnotherTagNested3>
    > <version>invalid version</version>
    > </SomeAnotherTagNested3>
    > </SomeAnotherTagNested2>
    > </SomeAnotherTagNested1>
    >
    > <version>stand-alone, but not valid either</version>
    >
    ></project>
    >
    > -------------------------
    >
    > I only want the version tags that are not enclosed in any other
    > tags.



    It is not legal in XML for a tag to enclose any other tag.

    (tags start with a '<' and end with a '>')


    You must have meant "element" where you said "tag".

    In that case, there ARE NO version elements that are not enclosed
    in any other elements!


    > I want to replace the 1.0.0 (an example value) with 2.0.0



    That element is enclosed in the project element.


    > on a stand-alone "version"'s first occurrence.



    You want to replace the 1.0.0 with 2.0.0 on the first version element
    that is a child of the document element (the project element in this case).

    (in which case you have a poor example input, as a solution that
    operates on the first <version> anywhere in the file will work
    for that input...)
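
One way to restrict the replacement to a child of the document element without an XML parser is to track element depth while walking over the tags. The following is only a sketch under strong assumptions: no comments, CDATA sections, or attribute values containing '>'. The helper name replace_top_level_version is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: replace the content of the first <version> element
# that is a direct child of the document element (depth 2).
sub replace_top_level_version {
    my ($xml, $new) = @_;
    my ($depth, $done, $skip, $out) = (0, 0, 0, '');

    # The capturing group makes split return the tags as well as the text.
    foreach my $tok (split /(<[^>]*>)/, $xml) {
        if ($tok =~ m{^</}) {                          # end tag
            $skip = 0 if $depth == 2;   # leaving a child of the document element
            $depth--;
            $out .= $tok unless $skip;
        }
        elsif ($tok =~ m{^<[^?!]} && $tok !~ m{/>\z}) { # start tag
            $depth++;
            $out .= $tok unless $skip;
            if (!$done && $depth == 2 && $tok =~ /^<version[\s>]/) {
                $out .= $new;           # emit the new value ...
                ($done, $skip) = (1, 1); # ... and drop the old content
            }
        }
        else {                          # text, <?...?>, <!...>, <empty/>
            $out .= $tok unless $skip;
        }
    }
    return $out;
}

my $xml = "<project>\n<SomeTag>\n<version>invalid</version>\n</SomeTag>\n"
        . "<version>\n1.0.0\n</version>\n</project>\n";
print replace_top_level_version($xml, '2.0.0');
```

Because the input here puts a nested <version> first, a "first <version> anywhere" solution would change the wrong element, while the depth check leaves it alone.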

    > The above script works, and a "diff bak <xml-file>" gives me the
    > expected result when the stand-alone <version> is all on one line, but I
    > can't get this working when it's extended over multiple lines.



    Extended over multiple lines in what manner? Like this:

    <version
    >1.0.0</version>


    or like

    <version>
    1.0.0</version>

    or like

    <version>
    1.0.0
    </version>


    ??

    Those are all legal XML, but none of them are equivalent; they each
    have different content.
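
To make the difference concrete, here is a small sketch that extracts the character content of each of the three forms and prints it with the newlines made visible. The helper name version_content is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw character content of the first
# <version> element in $xml, or undef if there is none.
sub version_content {
    my ($xml) = @_;
    return $xml =~ m{<version\s*>(.*?)</version\s*>}s ? $1 : undef;
}

my @forms = (
    "<version\n>1.0.0</version>",      # newline inside the start tag
    "<version>\n1.0.0</version>",      # newline before the value
    "<version>\n1.0.0\n</version>",    # newline before and after the value
);

foreach my $form (@forms) {
    my $content = version_content($form);
    $content =~ s/\n/\\n/g;            # show newlines as a literal \n
    print "[$content]\n";
}
# prints:
# [1.0.0]
# [\n1.0.0]
# [\n1.0.0\n]
```

If the surrounding whitespace is not significant for your data, trim it before comparing or replacing the value.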


    > Could anyone give me some pointers, please?



    If I could unambiguously figure out what you really want I probably could...


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad J McClellan, Jan 10, 2008
    #4
  5. Lax wrote:

    > #!/usr/local/bin/perl
    >
    > use strict ;



    You should always enable warnings when developing Perl code:

    use warnings;


    > die "Usage: replace.pl <xml file>!\n" unless ( $#ARGV == 0 ) ;



    That is more clearly written as:

    die "Usage: replace.pl <xml file>!\n" unless @ARGV == 1;


    > my $file = shift ;
    >
    > open(IN,"$file") or die "Cant open file: $!\n" ;



    perldoc -q vars

    What's wrong with always quoting "$vars"?

    open(IN, $file) or die "Cant open file: $!\n" ;

    (and nowadays you should use the 3-argument form of open() instead.)
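
For reference, a short sketch of the three-argument form with a lexical filehandle. The helper name slurp_file is hypothetical; both features are available in the perl 5.8 series:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar using the
# three-argument open and a lexical filehandle.
sub slurp_file {
    my ($file) = @_;
    # The mode is a separate argument, so odd characters in $file
    # (leading '>', trailing '|', spaces) cannot change how it is opened.
    open(my $in, '<', $file) or die "Can't open '$file': $!\n";
    my $contents = do { local $/; <$in> };
    close($in);
    return $contents;
}

print slurp_file($ARGV[0]) if @ARGV;
```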


    > chomp(my @arr = <IN> ) ;



    Here you remove the newline from every line, and below you add a
    newline to every line.

    Why remove them only to put them back?


    > foreach my $line ( @arr )



    If you are going to process the file line-by-line anyway, then why
    bother reading the entire file into memory when one line at a time
    in memory will work?

    while ( my $line = <IN> )
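
A minimal sketch of that loop, assuming lexical filehandles. The helper name copy_lines is made up, and the in-memory file stands in for the real input. The newlines are left in place, so nothing has to be chomped and re-added:

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out one line at a time and return
# the number of lines copied. Only one line is held in memory at once.
sub copy_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while ( my $line = <$in> ) {
        # $line still carries its trailing newline, so print it as is.
        print {$out} $line;
        $count++;
    }
    return $count;
}

# Demonstration on an in-memory file instead of a file on disk.
my $text = "<project>\n<version>1.0.0</version>\n</project>\n";
open(my $in, '<', \$text) or die "Can't open in-memory file: $!\n";
copy_lines($in, \*STDOUT);
close($in);
```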


    > if ( $line =~ /^\s*\<(\/)?project/ )



    The parentheses in that pattern serve no purpose, so why include them?

    Angle brackets are not special in regular expressions, so they
    do not need backslashing.

    If you choose some other delimiter for your match operator, then
    the slash will not need backslashing either:

    if ( $line =~ m#^\s*</?project# )


    > I
    > can't get this working when it's extended over multiple lines.



    Then don't process the file line-by-line.


    > Could anyone give me some pointers, please?



    perldoc -q match

    I'm having trouble matching over more than one line. What's wrong?


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad J McClellan, Jan 10, 2008
    #5
