Regular Expression for XML Parsing

Discussion in 'Perl Misc' started by tushar.saxena@gmail.com, Dec 27, 2007.

  1. Guest

    Hi,

    I have a set of XML files from which I need to extract some data. The
    format of the file is as follows:

    <tag1>
    <tag3>DATA1</tag3>
    </tag1>

    <tag2>
    <tag3>DATA2</tag3>
    </tag2>

    I need to extract the DATA part of the XML structure.

    Note: tag3 can be contained within either tag1 or tag2, but I need to
    extract data only from tag1, i.e. DATA1 should be extracted, but not
    DATA2.

    If I want to get both DATA1 and DATA2 I can use a simple regex like:

    if (($_ =~ /<tag3>(\w+)<\/tag3>/g))
    {
        print $1
    }

    But when I try to get only DATA1 (embedded within tag1) using
    something like this, I am unable to get it to work:

    if (($_ =~ /<tag1>[\n\s\S\w\W]*<tag2>(\w+)<\/tag2>[\n\s\S\w\W]*<\/tag1>/g))
    {
        print $1
    }

    In this second case, the match itself fails.

    Any help would be appreciated !
    , Dec 27, 2007
    #1
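
A sketch of two likely reasons the second pattern never matches. First, the pattern looks for `<tag2>` inside `<tag1>`, while the sample data nests `<tag3>` there. Second, assuming the file is read one line at a time (the post does not show the read loop), each `$_` holds a single line, so nothing can span from `<tag1>` to `</tag1>`. The helper name `capture` and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample input copied from the post above.
my $xml = <<'END_XML';
<tag1>
<tag3>DATA1</tag3>
</tag1>

<tag2>
<tag3>DATA2</tag3>
</tag2>
END_XML

# The pattern from the post, unchanged apart from the delimiters.
my $original = qr{<tag1>[\n\s\S\w\W]*<tag2>(\w+)</tag2>[\n\s\S\w\W]*</tag1>};
# The same pattern with the inner tag changed to tag3.
my $retagged = qr{<tag1>[\n\s\S\w\W]*<tag3>(\w+)</tag3>[\n\s\S\w\W]*</tag1>};

# Hypothetical helper: return the captured value, or undef on no match.
sub capture {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

# Whole file in one string: only the retagged pattern can match.
printf "original, whole file: %s\n", capture($xml, $original) // 'no match';
printf "retagged, whole file: %s\n", capture($xml, $retagged) // 'no match';

# One line at a time: no single line holds both <tag1> and </tag1>.
my @hits = grep { defined capture($_, $retagged) } split /^/, $xml;
printf "retagged, line by line: %d match(es)\n", scalar @hits;
```

Even the retagged pattern is fragile: the greedy character classes can run across several `<tag1>` blocks in a larger file, which is part of why the replies below recommend a real parser.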

  2. On wrote:
    >I have a set of XML files
    >I need to extract the DATA part of the xml structure
    >If I want to get both DATA1 and DATA2 I can use a simple regex like :


    It's a bad idea in the first place. XML is not a regular language, so why
    would you use regular expressions to parse it?

    >Any help would be appreciated !


    Use a tool that is designed to parse XML, e.g. any of the XML parser
    modules on CPAN.

    jue
    Jürgen Exner, Dec 27, 2007
    #2

  3. Guest

    On 27 Dec, 20:59, wrote:
    > Hi,
    >
    > I have a set of XML files from which I need to extract some data. The
    > format of the file is as follows :
    >
    > <tag1>
    > <tag3>DATA1</tag3>
    > </tag1>
    >
    > <tag2>
    > <tag3>DATA2</tag3>
    > </tag2>
    >
    > I need to extract the DATA part of the xml structure
    >
    > Note : tag3 can be contained either within tag1 or tag2, but I need to
    > extract data only from tag1. i.e. DATA1 should be extracted, but not
    > DATA2
    >
    > If I want to get both DATA1 and DATA2 I can use a simple regex like :
    >
    > if (($_ =~ /<tag3>(\w+)<\/tag3>/g))
    > {
    > print $1
    >
    > }
    >
    > But if I try to get only DATA1 (embedded within tag1) I try using
    > something like this, but am unable to get it to work
    >
    > if (($_ =~ /<tag1>[\n\s\S\w\W]*<tag2>(\w+)<\/tag2>[\n\s\S\w\W]*<\/tag1>/g))
    > {
    > print $1
    >
    > }
    >
    > In this second case, the match itself fails.
    >
    > Any help would be appreciated !


    $/ = "";    # paragraph mode: read blank-line-separated chunks

    while (<>) {
        if ( m{<tag1>.*?<tag3>(\w+)</tag3>.*?</tag1>}gs )
        {
            print "$1\n";
        }
    }
    , Dec 27, 2007
    #3
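
The snippet above sets `$/` to the empty string, which puts Perl in paragraph mode: each read returns one blank-line-separated chunk, so the `<tag1>` and `<tag2>` blocks of the sample arrive as separate records. A variation, sketched here under the assumptions that the whole file fits in memory and that `<tag1>` elements are never nested and carry no attributes, first isolates each `<tag1>` block and then collects every `<tag3>` inside it. The function name `tag3_values_in_tag1` is hypothetical.

```perl
use strict;
use warnings;

# Return every tag3 value that sits inside a tag1 block.
sub tag3_values_in_tag1 {
    my ($xml) = @_;
    my @values;
    # Step 1: isolate each <tag1>...</tag1> block; /s lets "." cross newlines.
    while ($xml =~ m{<tag1>(.*?)</tag1>}gs) {
        my $block = $1;
        # Step 2: in list context, /g returns every captured group.
        push @values, $block =~ m{<tag3>(\w+)</tag3>}g;
    }
    return @values;
}

my $sample = "<tag1>\n<tag3>DATA1</tag3>\n</tag1>\n\n"
           . "<tag2>\n<tag3>DATA2</tag3>\n</tag2>\n";
print "$_\n" for tag3_values_in_tag1($sample);    # prints DATA1
```

Matching in two steps keeps a `<tag1>` block with no `<tag3>` from pulling in a `<tag3>` that belongs to a later element. It still breaks on comments, CDATA sections, attributes, or nesting.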
  4. <> wrote:

    > I have a set of XML files from which I need to extract some data. The
    > format of the file is as follows :
    >
    ><tag1>
    > <tag3>DATA1</tag3>
    ></tag1>
    >
    ><tag2>
    > <tag3>DATA2</tag3>
    ></tag2>



    I thought you said you had an XML file.

    That is not a valid XML file (an XML document must have exactly one
    root element)...


    > I need to extract the DATA part of the xml structure
    >
    > Note : tag3 can be contained either within tag1 or tag2, but I need to
    > extract data only from tag1. i.e. DATA1 should be extracted, but not
    > DATA2
    >
    > If I want to get both DATA1 and DATA2 I can use a simple regex like :



    Using a regular expression to "parse" a non-regular language is
    fraught with peril, and nearly always a Bad Idea.

    Use a module that understands XML for processing XML data.


    > Any help would be appreciated !



    Assuming that you have actual valid XML in $xml, then:

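    use XML::Simple;

    # ForceArray keeps tag1 an array ref even when only one <tag1> is present;
    # without it XMLin folds a lone element into a hash and the loop dies.
    my $ref = XMLin($xml, ForceArray => ['tag1']);
    foreach my $child ( @{ $ref->{tag1} } ) {
        print "$child->{tag3}\n";
    }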


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Dec 28, 2007
    #4
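
For comparison, a sketch of the same extraction with XML::LibXML, another CPAN parser (not the module used in the post above). It assumes the module is installed and that the fragments are wrapped in a single root element, here the made-up `<root>`, since a well-formed XML document has exactly one. The function name `tag1_data` is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return the text of every tag3 that is a direct child of a tag1.
sub tag1_data {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    # XPath: tag3 elements whose parent is a tag1, anywhere in the document.
    return map { $_->textContent } $doc->findnodes('//tag1/tag3');
}

my $xml = <<'END_XML';
<root>
  <tag1>
    <tag3>DATA1</tag3>
  </tag1>
  <tag2>
    <tag3>DATA2</tag3>
  </tag2>
</root>
END_XML

print "$_\n" for tag1_data($xml);    # prints DATA1
```

The XPath expression states the parent/child requirement directly, so comments, attributes, and whitespace in the input need no special handling.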
  5. On Thu, 27 Dec 2007 12:59:12 -0800 (PST),
    wrote:

    >Subject: Regular Expression for XML Parsing


    Nope. Perhaps a Regex for XML Parsing, in the Perl 6 acceptation of a
    "Regex" which is not assumed to be a "Regular Expression" any more.
    You will have to wait for quite a while, though...


    Michele
    --
    {$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
    (($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
    ..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
    256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
    Michele Dondi, Dec 28, 2007
    #5