another try

Darius

Hi,
Here goes again. Please excuse the repeat question. But then, I got no
response that could be 'matched' as appropriate :)

I have many lines in a bad xml file that are like the long one below:
<word_word1 string="start" date="2004-09-02 07:33:22" id="2033878"
word_id="2000589" get_id="8647" ><word name="MOVIE"><film
title="S"things Gotta Give" the_number="531780"
/></word></word_word1><film title="S'&quote Gotta Give"
the_number="531780" />

I don't want to try XML::Parser yet, so, not caring about whether this
is an XML file or not: there are 2 occurrences of "Gotta" in this
string, and for each of them the text between the nearest preceding ="
and the "Gotta" needs to be replaced by "Somethings".

So: ="S'&quote Gotta Give" should become ="Somethings Gotta Give"
and ="S"things Gotta Give" should become ------ditto-----
etc.
e.g. ="Something's Gotta Give" should become ="Somethings Gotta Give"

The characters between the Gotta and the =" just before it are not
fixed, so I can't use lookbehinds :( or can I?

I tried this so far:
while( $line=~/(.*)(=\")(.*?)(Gotta)/g ){
print "\n$2$3$4\n";
}

which gave me:

="S'&quote Gotta

But I couldn't get the Gotta previous to this in the string :( and so I
am not able to repeat the while loop successfully.

Can anyone help me with this ? Thx
D
 
Anno Siegel

Darius said:
Hi,
Here goes again. Please excuse the repeat question. But then, I got no
response that could be 'matched' as appropriate :)

I have many lines in a bad xml file that are like the long one below:
<word_word1 string="start" date="2004-09-02 07:33:22" id="2033878"
word_id="2000589" get_id="8647" ><word name="MOVIE"><film
title="S"things Gotta Give" the_number="531780"
/></word></word_word1><film title="S'&quote Gotta Give"
the_number="531780" />

I don't want to try XML::Parser yet,

That is a good reason to ask the newsgroup to go the hard way and
solve problems that have been solved elsewhere? Give a better one.

so, not caring about whether this
is an xml file or not,

We're programmers. If this is XML, that is valuable structure that
can be exploited. Ignoring it makes life harder than it has to be.

there are 2 occurrences of "Gotta" in this
string and these 2
occurrences need to be pre-fixed by "Somethings" between the first "=\"

See? It's the two occurrences that give you trouble. If the titles
were parsed out in an orderly way, each would have only one
occurrence. Problem gone, right?

just
before the "Gotta", and, the "Gotta"

So: ="S'&quote Gotta Give" should become ="Somethings Gotta Give"
and ="S"things Gotta Give" should become ------ditto-----
etc.
e.g ="Something's Gotta Give" should become ="Somethings Gotta Give"

The characters before the Gotta and first =\" just before it, are not
static
so I can't use lookbehinds :( or can I?

No, not easily. Probably not at all.

I tried this so far:
while( $line=~/(.*)(=\")(.*?)(Gotta)/g ){
print "\n$2$3$4\n";
}

which gave me:

="S'&quote Gotta

But I couldn't get the Gotta previous to this in the string :( and so I
am not able to repeat the while loop successfully.

Can anyone help me with this ? Thx

You must describe in some more detail just what can come between
/="/ and /Gotta/. As long as you don't, the problem is ill-defined
and you won't solve it. Hint: Try a string that doesn't contain
another /=/.

Anno
 
Mark Clements

Darius said:
I have many lines in a bad xml file that are like the long one below:
<word_word1 string="start" date="2004-09-02 07:33:22" id="2033878"
word_id="2000589" get_id="8647" ><word name="MOVIE"><film
title="S"things Gotta Give" the_number="531780"
/></word></word_word1><film title="S'&quote Gotta Give"
the_number="531780" />

Is this merely a learning exercise or do you in fact have a corrupted
xml file that you are trying to fix?

If the former then you need to read the other postings: there are
many(!) xml tools available to make your life easier. Messing around
with xml is not a good way of teaching yourself about regular expressions.

If the latter, is the correct film title always "Somethings gotta give"?
If the film title varies then how are you expecting to tell in each case
with what text the &quote or whatever needs to be replaced? Have you
considered restoring from backup?

A more specific subject may help with future postings.

Mark
 
Darius

Mark Clements said:
Is this merely a learning exercise or do you in fact have a corrupted
xml file that you are trying to fix?

This is merely a learning exercise. I have solved the problem with
arrays, but I just couldn't with regexps.

If the former then you need to read the other postings: there are
many(!) xml tools available to make your life easier. Messing around
with xml is not a good way of teaching yourself about regular expressions.

I guess...

If the latter, is the correct film title always "Somethings gotta give"?
If the film title varies then how are you expecting to tell in each case
with what text the &quote or whatever needs to be replaced? Have you
considered restoring from backup?

Yes, it is always that, and it occurs just once in a line. I added it
twice because it's merely a learning exercise for me now. When it was
urgent, I used arrays to solve it shamelessly :) but then, looking at
this example, and also as per Anno, I don't think it's possible to use
a regex anyway.

I tried this out using some junk xml type lines:
first line has 2 occurrences, second line has 1.

line 1:<word_word1 string="start" date="2004-09-02 07:33:22"
id="2033878" word_id="2000589" get_id="8647" ><word name="MOVIE"><film
title="S"things Gotta Give" the_number="531780"
/></word></word_word1><film title="S'&quote Gotta Give"
the_number="531780" />

line 2:<one name="S'things Gotta Give" something="true" type="demand
xyz" system_number="531780"/>

my $line;
while(chomp($line=<>)){
$line=~/(.*?)(="S)(.*?)(Gotta)/gc;
$line=~s/$3/omethings /g;
$line=~/\G(.*?)(="S)(.*?)(Gotta)/gc;
$line=~s/$3/omethings /;
print "FINAL:$line\n";
}

gives me:
FINAL:<word_word1 string="start" date="2004-09-02 07:33:22"
id="2033878" word_id="2000589" get_id="8647" ><word name="MOVIE"><film
title="Somethings Gotta Give" the_number="531780"
/></word></word_word1><film title="S'&quote Gotta Give"
the_number="531780" />

FINAL:<one name="Somethings Gotta Give" something="true" type="demand
xyz" system_number="531780"/>

The second occurrence on line 1 didn't get replaced. I have decided to
give up and read more now :)
Thanks
- Darius
Thanks Mark..
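
A likely reason the second occurrence is skipped in the attempt above: per perlop, modifying the target string resets the search position that m//g remembers, so after the first s/// the \G match starts over from the beginning of the line and finds the title that has already been repaired. The following is only a minimal sketch of that effect, using a shortened version of line 1; the variable names are made up for illustration, and \Q...\E is added defensively so the captured text is not treated as a pattern.

```perl
use strict;
use warnings;

# Shortened version of "line 1" from the post above.
my $line = q{<film title="S"things Gotta Give" /><film title="S'&quote Gotta Give" />};

# A scalar-context match with /g remembers where it stopped.
$line =~ /="S(.*?)Gotta/gc;
my $junk            = $1;            # the text to replace: '"things '
my $pos_after_match = pos($line);    # defined: just past the first "Gotta"

# The substitution modifies $line, which resets pos() to undef.
$line =~ s/\Q$junk\E/omethings /;
my $pos_after_subst = pos($line);

# With no saved position, \G anchors at the start of the string again,
# so this finds the already repaired first title, not the second one.
$line =~ /\G.*?="S(.*?)Gotta/gc;
my $second_capture = $1;

printf "pos after match: %s\n", $pos_after_match // 'undef';
printf "pos after s///:  %s\n", $pos_after_subst // 'undef';
print  "second capture:  [$second_capture]\n";
```

The second capture comes out as "omethings " rather than the "'&quote " that was wanted, so the second s/// replaces that text with itself and changes nothing.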
 
Anno Siegel

Darius said:

[text snipped and re-formatted for legibility. please keep your line
length below 72 characters]
When it was urgent, I used arrays to solve it shamelessly :) but then

How is "using arrays" an alternative to a regex solution?

looking at this example, and also as per Anno, I don't think it's
possible to use a regex anyway.

That's not what I said. In fact, if you had followed the hint
"try a string without another '=' in it" you might have found that

s/="[^=]*?Gotta/="Somethings Gotta/g;

does the right thing with your examples.

The problem is that it is an unreliable ad-hoc solution that works
in this case, but may not in reasonably similar cases.
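
To make both halves of that concrete, here is a minimal sketch that applies the substitution to shortened versions of the two sample lines from this thread, plus one made-up line in which the garbage itself contains an '='. The helper name fix_title and the third sample line are assumptions for illustration only.

```perl
use strict;
use warnings;

# Apply the substitution from above to a copy of the line.
# [^=]*? cannot cross an '=', so a match can only start at the '="'
# that is nearest to the following "Gotta".
sub fix_title {
    my ($line) = @_;
    $line =~ s/="[^=]*?Gotta/="Somethings Gotta/g;
    return $line;
}

my @lines = (
    # Shortened versions of the two sample lines from the thread.
    q{<film title="S"things Gotta Give" the_number="531780" /><film title="S'&quote Gotta Give" the_number="531780" />},
    q{<one name="S'things Gotta Give" something="true" system_number="531780"/>},
    # Hypothetical: the damaged part of the title contains an '='.
    q{<film title="S=things Gotta Give" the_number="531780" />},
);

print fix_title($_), "\n" for @lines;
```

Both titles in the first line and the one in the second come out as "Somethings Gotta Give". The third line is printed unchanged, because the '=' inside the damaged title stops [^=]*? before it can reach "Gotta". That is the kind of reasonably similar case where the ad-hoc pattern breaks down.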

Anno
 
