finding string matches, in order, in a file

Peter Bailey

Hi,
I've got files I want to parse. I'm using a string scan routine that
populates an array. I need to pull the entries of that array out, in
order, eventually. I'm getting an array all right, but, I don't
understand its order. The first instance in the string, meaning the
whole file, is way down the list in the array. The first entry in the
array is an entry that's 300 lines deep into the file. Why isn't the
first instance in the string, the file, the first entry in the array?

xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/) do |match|
  codes = $1
  puts codes
end

Thanks,
Peter
 
William James

The file may have multiple copies of some entries, and
your regexp may be botched.
xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/)

I don't like the looks of that regular expression. Try this one.

/<issueList>\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m
 
Peter Bailey

Thanks, William. I tried your regex, but, I'm still getting the first
entry as one that's 300 lines deep into the file. In fact, the results
look exactly the same to me.
 
Robert Klemme

Still William's regexp is significantly better than the original one.
You seem to be processing XML files. It may be that there is some
white space between <issueList> and <issue> that you are not prepared
for. You can handle that by replacing \n with \s*.

A completely different approach is to use REXML or another XML tool
and use XPath search. This is way less error prone - but usually also
slower. If you just want to extract these codes then a SAX parser
approach might still be pretty fast.
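
As a rough illustration of the XPath route, here is a minimal sketch using REXML on a small made-up document. The `filing` wrapper element and the two sample entries are assumptions for the example, not the real file, and on recent Ruby versions REXML may need to be available as the `rexml` gem.

```ruby
require "rexml/document"

# Hypothetical sample; the real file's root element is not shown in this thread.
xml = <<~XML
  <filing>
    <issueList>
      <issue code="TRD">Trade (Domestic &amp; Foreign)</issue>
    </issueList>
    <issueList>
      <issue code="IMM">Immigration</issue>
    </issueList>
  </filing>
XML

doc = REXML::Document.new(xml)

# XPath.match returns the matching elements in document order.
titles = REXML::XPath.match(doc, "//issueList/issue").map { |el| el.text }

puts titles
```

Because the parser handles whitespace and attribute spacing itself, none of the `\n` versus `\s*` questions come up here, and entities such as `&amp;` come back decoded.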

Kind regards

robert
 
Peter Bailey

Same old output. I'll look into REXML. I downloaded it. But, it's enough
for me to just learn Ruby. I don't know if I can handle yet another
scripting language. Anyway, thanks a lot.
-Peter
 
William James

Don't give up yet. A regular expression is a very concentrated
piece of code, and it very often requires tweaking.

Can you show us the first entry in the file that should
be matched? That would enable us to test our reg.exps.

Some tricky points. A . won't match a newline unless
the m modifier is at the end of the regexp.
.* will often match too much unless you make it
non-greedy by appending ? (i.e., .*?).
Sometimes it's best to make the regexp case-insensitive
by using the i modifier.
You may assume that your text will always have
<issue code=
but perhaps it has
<issue code =

Try this:

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

}.scan(
/<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
p $1
}
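
The points above about the m modifier and greedy matching can be seen side by side in a small sketch. The sample string and the variable names are made up for illustration; only the regexp features themselves come from the discussion.

```ruby
# Two entries, the second of which spans a line break.
text = "<issue>A</issue>\n<issue>B\nC</issue>"

# Greedy .* with /m runs from the first <issue> to the last </issue>.
greedy = text.scan(/<issue>(.*)<\/issue>/m).flatten

# Non-greedy .*? with /m stops at the nearest </issue>.
lazy = text.scan(/<issue>(.*?)<\/issue>/m).flatten

# Without /m, . cannot cross the newline, so the second entry is missed.
no_m = text.scan(/<issue>(.*?)<\/issue>/).flatten

p greedy
p lazy
p no_m
```

Only the non-greedy pattern with the m modifier returns both entries as separate strings.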
 
Robert Klemme

2007/9/18, Peter Bailey said:
Same old output. I'll look into REXML. I downloaded it.

It's part of the standard distribution.

But, it's enough
for me to just learn Ruby. I don't know if I can handle yet another
scripting language. Anyway, thanks a lot.

Well, as William said: can you show a piece of the document you are
trying to match?

Kind regards

robert
 
Peter Bailey

Believe me, I haven't given up. I need this to work! I really appreciate
your perseverance, though. Here's what I have now:

xmlfile.scan(/<issueList>\s*<issue +code *="[A-Z]{3}
*">(.*?)<\/issue>\n?/mi) do |match|
codes = $1
puts codes
end

My xml file that I'm testing is 2087 lines deep. The first entry in this
file is on lines 21-23. Here they are:

<issueList>
<issue code="TRD">Trade (Domestic &amp; Foreign)</issue>
</issueList>

So, these words, "Trade (Domestic &amp; Foreign)" should be my first
entry in my array. But, it continues to come up with the word
"Immigration" as the first entry in the array, and that's way down on
line 358.

Thanks,
Peter
 
William James

During the posting process, your regexp was broken into
2 lines; when I corrected that, it worked.

Here I've slightly shortened it.

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

<issueList>
<issue code="TRD">Trade (Domestic &amp; Foreign)</issue>
</issueList>

}.scan(
/<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m){
p $1
}

==== output ====
"\nI'm Issue XX, are you?\n"
"Trade (Domestic &amp; Foreign)"
==== end of output ====

If this still won't work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren't?
Perhaps this would be worth a try:
/<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m
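
One way to check for such characters is to list everything outside plain printable ASCII, newline, and tab. This is only a sketch: the sample string, with a non-breaking space planted after the opening tag, is an assumption for illustration. A carriage return from Windows line endings would be flagged as well.

```ruby
# Hypothetical sample with a non-breaking space (U+00A0) after <issueList>.
text = "<issueList>\u00A0\n<issue code=\"TRD\">Trade</issue>"

# Keep every character that is not printable ASCII, newline, or tab,
# together with its character offset in the string.
suspects = text.each_char.with_index.reject do |ch, _index|
  ch.match?(/[\x20-\x7E\n\t]/)
end

suspects.each do |ch, index|
  puts format("offset %d: U+%04X", index, ch.ord)
end
```

Running the same check on the string read from the real file would show whether anything unexpected sits between the tags.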
 
Peter Bailey

Still no go, William. I tried your last phrase there, too.
 
William James

You've got to track down what's going on.
Copy and paste the code below into a file.
(Don't even think about typing it in.)
Run the file. Is this the output?

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic &amp; Foreign)"

If it is, open both the Ruby file and the xml
file with the same editor; copy the first desired
entry from the xml file and paste it at the bottom
of the big string in the Ruby file. Even though
it looks like an exact duplicate of what's already
in the string, maybe it differs somehow.
The output should now be:

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic &amp; Foreign)"
"Trade (Domestic &amp; Foreign)"

If it isn't, edit the entry that you just pasted
into the big Ruby string; delete spaces and
line-endings and replace them with new spaces and
line-endings. (Perhaps the xml file has some
bizarre invisible characters.)

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

<issueList>
<issue code="TRD">Trade (Domestic &amp; Foreign)</issue>
</issueList>


}.scan(
# Using extended-mode regular expression for clarity.
# Whitespace and comments are ignored.
%r{
<issuelist>
\s*
<issue[ \t]+code[ \t]*=[ \t]*"[^"]*"[ \t]*>
(.*?)
</issue>
}xmi
){ p $1 }
 
Peter Bailey

I'm getting exactly what you predict. And, . . ., perhaps I haven't made
this clear, but, I am getting healthy output from my script. I'm getting
208 lines of data. Each line is an entry between the <issue> entries
I've described. They're just totally out of order! That's my problem.
Here are the first 5 entries I get back:
Immigration
Health Issues
Copyright/Patent/Trademark
Budget/Appropriations
Health Issues
...
 
William James

Very odd. "scan" will return the strings in the order that
they are found.

How did your program read the file? Could its contents have
been disordered somehow? After your program reads the file
into a string, have it write the string to a temp file and
then use fc or diff to compare the 2 files.

"Health Issues" appears twice; I presume the file contains
two entries.

You've probably had your editor do a string search to verify
that "Immigration" isn't the first entry in the file.
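
To see where each match actually sits in the string, the line number can be printed next to the captured text. This is a minimal sketch with a made-up four-line string; in the real script `text` would be whatever was read from the file.

```ruby
# Hypothetical stand-in for the file contents.
text = "line one\n" \
       "<issue code=\"TRD\">Trade</issue>\n" \
       "filler\n" \
       "<issue code=\"IMM\">Immigration</issue>\n"

hits = []
text.scan(/<issue code="[A-Z]{3}">(.*?)<\/issue>/m) do
  m = Regexp.last_match
  # Count the newlines before the match start to get a 1-based line number.
  line = text[0...m.begin(0)].count("\n") + 1
  hits << [line, m[1]]
end

hits.each { |line, title| puts "#{line}: #{title}" }
```

If the line numbers come out ascending, scan is behaving as expected and the surprise lies in what the string contains.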
 
Peter Bailey

Yes, as I said yesterday, the word "Immigration" is on line 300+, and,
the first entry that should be seen is the "Trade" one.

Before I do this scan, I do a sweep where I delete all the extra white
space at the beginning of each line. There's a lot of it there. I've
tried this without that sweep, but, I get the same results, especially
after, with your help, I was more generic in my definition of the white
space around these entries.

I did a write of this array as a string to a file. When I pulled up the
file, it looks exactly like the output of my script.
 
William James

"Health Issues" appears twice in your output. Returning to my
original hypothesis, are you certain "Trade" doesn't occur twice in the
file? If it occurs twice, and the reg.ex. fails to match the first
instance but matches the second, it may appear to you that "Trade" is
being output out of order, when in fact the first occurrence is simply
missing.
If there are more copies of each entry in the file than you suspect,
a faulty reg.ex. will make it seem that the output is out of order.
 
William James

A way to see if the reg.ex. is matching all entries.
grep -c '<issue ' thefile.xml
If this is more than the 208 lines output by the Ruby program,
then the reg.ex. is probably failing in some cases.
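
The same comparison can be made inside Ruby. The sketch below uses a made-up three-entry sample in which one entry has spaces around the equals sign, so the stricter pattern misses it. Like grep -c, the first count is of lines containing the tag, not of individual occurrences.

```ruby
# Hypothetical sample; the first entry uses 'code = ' with extra spaces.
xml = <<~XML
  <issueList>
  <issue code = "BCD" >Budget</issue>
  </issueList>
  <issueList>
  <issue code="TRD">Trade</issue>
  </issueList>
  <issueList><issue code="IMM">Immigration</issue></issueList>
XML

strict = /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m

# Lines that contain an opening issue tag, as grep -c would count them.
tag_lines = xml.lines.count { |line| line.include?("<issue ") }

# Entries the regular expression actually captures.
matched = xml.scan(strict).size

puts "lines with <issue : #{tag_lines}"
puts "regexp matches    : #{matched}"
```

A gap between the two numbers points at entries whose markup differs from what the pattern expects.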
 
Peter Bailey

Yes, when I grep my original xml file, I see 229 entries. But, with my
Ruby script, I only see 208. But, yes, many of these entries do repeat.
I'll look more closely at that first entry, the Trade one, to see what
might be different about it. Thanks.
 
Peter Bailey

William,
So, I found out what my problem was. And, yes, it's kind of embarrassing.
It turns out that, at the top of my script, which I hadn't looked at in
days, I was actually parsing through multiple files, not one file. I was
looking at multiple xml files, not just the one.
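
When a script does loop over several files, recording the file name with each match makes this kind of mix-up visible at once. The following is only a sketch: the file names `a.xml` and `b.xml` and their contents are invented, and the files are written to a temporary directory so the example is self-contained.

```ruby
require "tmpdir"

# Hypothetical sample files standing in for the real xml files.
samples = {
  "a.xml" => "<issueList>\n<issue code=\"IMM\">Immigration</issue>\n</issueList>\n",
  "b.xml" => "<issueList>\n<issue code=\"TRD\">Trade</issue>\n</issueList>\n"
}

results = []
Dir.mktmpdir do |dir|
  samples.each { |name, body| File.write(File.join(dir, name), body) }

  # Sort the paths so the files are processed in a predictable order.
  Dir.glob(File.join(dir, "*.xml")).sort.each do |path|
    File.read(path).scan(/<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m) do
      results << [File.basename(path), $1]
    end
  end
end

results.each { |file, title| puts "#{file}: #{title}" }
```

Here "Immigration" comes first simply because it lives in the file that is read first.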

So, I apologize to you. I really appreciate your doggedness in helping
me. You have the patience of Job. This forum's generosity astounds me,
and, you're a perfect example of why.

Cheers,
Peter
 
