Get rid of non-XML-compliant lines from a file


Mr_Noob

Hi all,

I am trying to write a Perl script that deletes all non-XML-compliant
lines (i.e. keeps only lines beginning with "<" and ending with ">").
Here is what I have managed to put together so far:


sub delete_non_xml_lines
{
    my $search = new File::List($xmldir);
    my @files = @{ $search->find("textfile") };

    foreach (@files)
    {
        my $file = $_;
        open(FILE, "< $file") or die "Can't open $file : $!";
        while(<FILE>)
        {
            print if $_ =~ />$/;
        }
        close FILE;
    }
}


But how can I redirect the output for each processed file into an XML
file?

Thanks in advance for your help.

Regards
 

RedGrittyBrick

Mr_Noob said:
I am trying to write a Perl script that deletes all non-XML-compliant
lines (i.e. keeps only lines beginning with "<" and ending with ">").

<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
<article>
<sect1>
<title>Observations on XML structure</title>
<para>This is a valid XML document.
Most of the lines don't start with an &lt; symbol.
Some of the lines don't end with an &gt; symbol.
Yet it is still valid XML.</para>
</sect1>
</article>
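
To make the point concrete, here is a minimal sketch (not part of the original post) that applies the filter described in the question to the document above. The pattern /^<.*>$/ is an assumption about what "beginning with < and ending with >" means; the script from the question only tests />$/.

```perl
use strict;
use warnings;

# The sample document from the post above, held in memory for the demonstration.
my @lines = split /\n/, <<'XML';
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
<article>
<sect1>
<title>Observations on XML structure</title>
<para>This is a valid XML document.
Most of the lines don't start with an &lt; symbol.
Some of the lines don't end with an &gt; symbol.
Yet it is still valid XML.</para>
</sect1>
</article>
XML

# Assumed filter: keep only lines that begin with "<" and end with ">".
my @kept    = grep {  /^<.*>$/ } @lines;
my @dropped = grep { !/^<.*>$/ } @lines;

printf "kept %d lines, dropped %d lines\n", scalar @kept, scalar @dropped;
print "dropped: $_\n" for @dropped;
```

All four lines of the paragraph are dropped, including the ones carrying its opening and closing tags, so the filtered output is no longer well-formed.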
 

Ben Morrow

Mr_Noob said:
I am trying to write a Perl script that deletes all non-XML-compliant
lines (i.e. keeps only lines beginning with "<" and ending with ">").
Here is what I have managed to put together so far:

sub delete_non_xml_lines
{
my $search = new File::List($xmldir);

Indirect object syntax (new Foo) is unreliable and can parse
incorrectly. Use

my $search = File::List->new($xmldir);

instead.
my @files = @{ $search->find("textfile") };

foreach (@files)
{
my $file = $_;

This is silly. Use

foreach my $file (@files) {

instead.
open(FILE, "< $file") or die "Can't open $file : $!";

It is safer to use lexical filehandles and three-arg open.

open(my $FILE, '<', $file) or die ...;
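
As an illustration of the advice above, here is a self-contained sketch of reading a file through a lexical filehandle opened with three-arg open. The temporary file and the names $in and @matched are assumptions made for the demo only. With the three-arg form the file name is never parsed for a mode, so a name that starts with ">" or ends with "|" cannot change what open does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway input file so the example is self-contained.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
print {$tmp_fh} "<a>\nplain text\n<b/>\n";
close $tmp_fh or die "Can't write to $file: $!";

my @matched;
{
    # Three-arg open: the mode is a separate argument from the file name.
    open(my $in, '<', $file) or die "Can't open $file: $!";
    while (my $line = <$in>) {
        push @matched, $line if $line =~ />$/;
    }
}   # $in goes out of scope here, which closes the file implicitly

print @matched;
```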

[...from below the code...]
But how can I redirect the output for each processed file into an XML
file?

To write the output to a new file, you need

open(my $XML, '>', "$file.xml") or die ...;
select $XML;

Note that this will leave $XML selected as your default output
filehandle. If you are expecting to write to STDOUT later, you will need
to select it again. Alternatively, you could use SelectSaver:

my $ss = SelectSaver->new($XML);

which will re-select the previously selected filehandle (normally
STDOUT) when $ss goes out of scope.
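
A small sketch of the SelectSaver approach, for illustration. It writes to an in-memory filehandle (an assumption made so the demo needs no files on disk); the same pattern applies to a handle opened on "$file.xml".

```perl
use strict;
use warnings;
use SelectSaver;

my $buffer = '';
open(my $XML, '>', \$buffer) or die "Can't open in-memory file: $!";

{
    my $ss = SelectSaver->new($XML);   # $XML is now the default output handle
    print "<kept/>\n";                 # goes to $XML, not STDOUT
}   # $ss is destroyed here and the previous default handle is selected again

close $XML or die "Can't close in-memory file: $!";
print "captured: $buffer";             # goes to STDOUT as usual
```

Keeping $ss in its own block limits the redirection to exactly the prints that belong in the output file.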
while(<FILE>)
{
print if $_ =~ />$/;

$_ is the default target of a pattern match, so this can be shortened to

print if />$/;
}
close FILE;

If you use lexical filehandles, there's no need to explicitly close
files opened for reading. Files opened for writing should be explicitly
closed, and the return value of close checked, to catch errors writing
(such as a full disk). close will return an error if any of the writes
failed, so there's no need to check each print (unless you are expecting
errors and want to abort early).

close $XML or die "can't write to $file.xml: $!";
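
Putting the pieces together, one possible shape for the per-file step is sketched below. It is not the code from the thread: it prints to an explicit filehandle instead of using select, the name filter_xml_like_lines and the sample file are hypothetical, and the File::List lookup is left out so that only core modules are needed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy the lines of $file that end in ">" to "$file.xml"; returns the output name.
sub filter_xml_like_lines {
    my ($file) = @_;
    my $out = "$file.xml";

    open(my $in,  '<', $file) or die "Can't open $file: $!";
    open(my $xml, '>', $out)  or die "Can't create $out: $!";

    while (my $line = <$in>) {
        print {$xml} $line if $line =~ />$/;
    }

    # Check close on the output handle to catch write errors such as a full disk.
    close $xml or die "Can't write to $out: $!";
    return $out;
}

# Demonstration on a throwaway file in a temporary directory.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sample.txt";
open(my $fh, '>', $file) or die "Can't create $file: $!";
print {$fh} "<a>\nnot markup\n<b/>\n";
close $fh or die "Can't write to $file: $!";

my $out = filter_xml_like_lines($file);
open(my $check, '<', $out) or die "Can't open $out: $!";
my @result = <$check>;
print @result;
```

Printing to an explicit handle avoids changing the default output handle at all, at the cost of naming the handle in every print.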

Ben
 

szr

Ben said:
Indirect object syntax (new Foo) is unreliable and can parse
incorrectly. Use

I don't deny that that is good advice, though I personally have never had
any problems creating an object using "my $o = new Foo(...);" as opposed
to "my $o = Foo->new(...);"... as long as you know the potential
problems, they are easy to avoid. Namely, watch those parens :)
 
