finding string matches, in order, in a file

Peter Bailey · Sep 18, 2007

Hi,
I've got files I want to parse. I'm using a string scan routine that
populates an array. I need to pull the entries of that array out, in
order, eventually. I'm getting an array all right, but, I don't
understand its order. The first instance in the string, meaning the
whole file, is way down the list in the array. The first entry in the
array is an entry that's 300 lines deep into the file. Why isn't the
first instance in the string, the file, the first entry in the array?

xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/) do |match|
  codes = $1
  puts codes
end

Thanks,
Peter

William James · Sep 18, 2007

The file may have multiple copies of some entries, and
your regexp may be botched.

xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/)

I don't like the looks of that regular expression. Try this one.

/<issueList>\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m

Peter Bailey · Sep 18, 2007

Thanks, William. I tried your regex, but, I'm still getting the first
entry as one that's 300 lines deep into the file. In fact, the results
look exactly the same to me.

Robert Klemme · Sep 18, 2007

Still, William's regexp is significantly better than the original one.
You seem to be processing XML files. It may be that there is some
white space between <issueList> and <issue> that you are not prepared
for. You can handle that by replacing \n with \s*.

A completely different approach is to use REXML or another XML tool
and use XPath search. This is way less error prone - but usually also
slower. If you just want to extract these codes then a SAX parser
approach might still be pretty fast.

Kind regards

robert
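
For anyone following Robert's suggestion: REXML is a Ruby library (it ships with Ruby, as a bundled gem in recent versions), so it is used from ordinary Ruby code. Here is a minimal sketch of the XPath approach. It assumes the document nests <issue> elements inside <issueList>, as Peter's regexp implies. The sample XML and the issue_titles helper are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample; the real file's layout is assumed, not known.
SAMPLE_XML = <<~XML
  <filing>
    <issueList>
      <issue code="TRD">Trade (Domestic &amp; Foreign)</issue>
    </issueList>
    <issueList>
      <issue code="IMM">Immigration</issue>
    </issueList>
  </filing>
XML

# Returns the text of every <issue> directly under an <issueList>,
# in document order.
def issue_titles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//issueList/issue').map(&:text)
end

puts issue_titles(SAMPLE_XML)
```

Because the parser deals with whitespace and attribute spacing, none of the \s* and " *" tweaks discussed in this thread are needed with this approach.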

Peter Bailey · Sep 18, 2007

Same old output. I'll look into REXML. I downloaded it. But, it's enough
for me to just learn Ruby. I don't know if I can handle yet another
scripting language. Anyway, thanks a lot.
-Peter

William James · Sep 18, 2007

Don't give up yet. A regular expression is a very concentrated
piece of code, and it very often requires tweaking.

Can you show us the first entry in the file that should
be matched? That would enable us to test our reg.exps.

Some tricky points. A . won't match a newline unless
the m modifier is at the end of the regexp.
.* will often match too much unless you make it
non-greedy by appending ? (i.e., .*?).
Sometimes it's best to make the regexp case-insensitive
by using the i modifier.
You may assume that your text will always have
<issue code=
but perhaps it has
<issue code =

Try this:

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

}.scan(
/<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
p $1
}
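
The first two tricky points can be seen on a tiny made-up string (not Peter's data). A sketch:

```ruby
two = "<issue>A</issue>\n<issue>B</issue>"

# Greedy .* with /m runs across the newline to the LAST </issue>.
greedy = two.scan(/<issue>(.*)<\/issue>/m).flatten

# Non-greedy .*? stops at the first </issue> it can reach.
lazy = two.scan(/<issue>(.*?)<\/issue>/m).flatten

# Without /m, . cannot cross a newline, so multi-line content never matches.
multi_line = "<issue>A\nB</issue>"
no_m   = multi_line.scan(/<issue>(.*?)<\/issue>/).flatten
with_m = multi_line.scan(/<issue>(.*?)<\/issue>/m).flatten

p greedy  # ["A</issue>\n<issue>B"]
p lazy    # ["A", "B"]
p no_m    # []
p with_m  # ["A\nB"]
```

This is also why the original pattern, which has no m modifier, can still return single-line entries correctly: without m, .* cannot run past the end of the line.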

Robert Klemme · Sep 18, 2007

Peter said:
Same old output. I'll look into REXML. I downloaded it.

It's part of the standard distribution.

Peter said:
But, it's enough
for me to just learn Ruby. I don't know if I can handle yet another
scripting language. Anyway, thanks a lot.

Well, as William said: can you show a piece of the document you are
trying to match?

Kind regards

robert

Peter Bailey · Sep 18, 2007

Believe me, I haven't given up. I need this to work! I really appreciate
your perseverance, though. Here's what I have now:

xmlfile.scan(/<issueList>\s*<issue +code *="[A-Z]{3}
*">(.*?)<\/issue>\n?/mi) do |match|
codes = $1
puts codes
end

My xml file that I'm testing is 2087 lines deep. The first entry in this
file is on lines 21-23. Here they are:

<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>

So, these words, "Trade (Domestic & Foreign)" should be my first
entry in my array. But, it continues to come up with the word
"Immigration" as the first entry in the array, and that's way down on
line 358.

Thanks,
Peter

William James · Sep 18, 2007

During the posting process, your regexp was broken into
2 lines; when I corrected that, it worked.

Here I've slightly shortened it.

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>

}.scan(
/<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m){
p $1
}

==== output ====
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
==== end of output ====

If this still won't work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren't?
Perhaps this would be worth a try:
/<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m
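
One way to test the invisible-character theory is to list every character that is not printable ASCII, a tab, or a newline. A sketch, with a hypothetical helper name and a made-up sample that contains a no-break space. Ruby's \s matches only ASCII whitespace, so a no-break space between the tags would defeat \s*.

```ruby
# Hypothetical helper: maps line number => code points of suspicious characters.
def suspicious_chars(text)
  report = {}
  text.each_line.with_index(1) do |line, number|
    # Anything that is not printable ASCII, tab, or newline is suspicious.
    odd = line.scan(/[^\x20-\x7E\t\n]/)
    report[number] = odd.map { |ch| format('U+%04X', ch.ord) } unless odd.empty?
  end
  report
end

# Made-up sample with a no-break space (U+00A0) after <issueList>.
sample = "<issueList>\u00A0\n<issue code=\"TRD\">Trade</issue>\n"

suspicious_chars(sample).each do |number, codes|
  puts "line #{number}: #{codes.join(' ')}"
end
```

A carriage return from Windows line endings would show up here as U+000D.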

Peter Bailey · Sep 18, 2007

Still no go, William. I tried your last phrase there, too.

William James · Sep 19, 2007

You've got to track down what's going on.
Copy and paste the code below into a file.
(Don't even think about typing it in.)
Run the file. Is this the output?

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"

If it is, open both the Ruby file and the xml
file with the same editor; copy the first desired
entry from the xml file and paste it at the bottom
of the big string in the Ruby file. Even though
it looks like an exact duplicate of what's already
in the string, maybe it differs somehow.
The output should now be:

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
"Trade (Domestic & Foreign)"

If it isn't, edit the entry that you just pasted
into the big Ruby string; delete spaces and
line-endings and replace them with new spaces and
line-endings. (Perhaps the xml file has some
bizarre invisible characters.)

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>

}.scan(
# Using extended-mode regular expression for clarity.
# Whitespace and comments are ignored.
%r{
<issuelist>
\s*
<issue[ \t]+code[ \t]*=[ \t]*"[^"]*"[ \t]*>
(.*?)
</issue>
}xmi
){ p $1 }

Peter Bailey · Sep 19, 2007

I'm getting exactly what you predict. And perhaps I haven't made
this clear, but I am getting healthy output from my script. I'm getting
208 lines of data. Each line is the text between the <issue> tags
I've described. They're just totally out of order! That's my problem.
Here are the first 5 entries I get back:
Immigration
Health Issues
Copyright/Patent/Trademark
Budget/Appropriations
Health Issues
...

William James · Sep 19, 2007

Very odd. "scan" will return the strings in the order that
they are found.

How did your program read the file? Could its contents have
been disordered somehow? After your program reads the file
into a string, have it write the string to a temp file and
then use fc or diff to compare the 2 files.

"Health Issues" appears twice; I presume the file contains
two entries.

You've probably had your editor do a string search to verify
that "Immigration" isn't the first entry in the file.
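
Since scan yields matches in the order they occur, printing where each match starts is a quick way to check whether the input is what you think it is. A sketch on made-up data; inside the block, Regexp.last_match gives the MatchData for the current match.

```ruby
text = <<~XML
  <root>
  <issueList>
  <issue code="TRD">Trade</issue>
  </issueList>
  <issueList>
  <issue code="IMM">Immigration</issue>
  </issueList>
  </root>
XML

pattern = /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m

located = []
text.scan(pattern) do
  m = Regexp.last_match
  # Line number = newlines before the start of the match, plus one.
  line_no = text[0...m.begin(0)].count("\n") + 1
  located << [line_no, m[1]]
end

located.each { |line_no, title| puts "line #{line_no}: #{title}" }
# line 2: Trade
# line 5: Immigration
```

If the reported line numbers do not agree with what the editor shows for the file, the string being scanned is not the file you are looking at.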

Peter Bailey · Sep 19, 2007

Yes, as I said yesterday, the word "Immigration" is on line 300+, and,
the first entry that should be seen is the "Trade" one.

Before I do this scan, I do a sweep where I delete all the extra white
space at the beginning of each line. There's a lot of it there. I've
tried this without that sweep, but, I get the same results, especially
after, with your help, I was more generic in my definition of the white
space around these entries.

I did a write of this array as a string to a file. When I pulled up the
file, it looks exactly like the output of my script.

William James · Sep 19, 2007

"Health Issues" appears twice in your output. Returning to my
original hypothesis, are you certain "Trade" doesn't occur twice in the file?
If it occurs twice, and the reg.ex. fails to match the first instance
but matches the second, it may appear to you that "Trade" is being
output out of order, when in fact the first occurrence is simply missing.
If there are more copies of each entry in the file than you suspect,
a faulty reg.ex. will make it seem that the output is out of order.

William James · Sep 19, 2007

Here's a way to see if the reg.ex. is matching all entries:
grep -c '<issue ' thefile.xml
If this is more than the 208 lines output by the Ruby program,
then the reg.ex. is probably failing in some cases.
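
The same check can be done inside Ruby, which also shows which opening tags look different from what the pattern expects. A sketch on made-up data. Note that grep -c counts matching lines while scan counts occurrences, so the two numbers can differ if one line holds more than one <issue> tag. The tag-shape test for "missed" is only a heuristic; it does not check the <issueList> prefix.

```ruby
text = <<~XML
  <issueList>
  <issue code="TRD">Trade</issue>
  </issueList>
  <issueList>
  <issue code = "IMM">Immigration</issue>
  </issueList>
XML

pattern = /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m

# Every opening <issue ...> tag (the \s keeps <issueList> out).
all_tags = text.scan(/<issue\s[^>]*>/)
# Texts captured by the full pattern.
matched = text.scan(pattern).flatten
# Opening tags whose shape the pattern would not accept.
missed = all_tags.reject { |tag| tag =~ /\A<issue +code="[A-Z]{3}">\z/ }

puts "opening tags: #{all_tags.size}, matched: #{matched.size}"
missed.each { |tag| puts "not matched: #{tag}" }
```

Here the second tag has spaces around the = sign, so it is counted among the opening tags but skipped by the pattern.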

Peter Bailey · Sep 19, 2007

Yes, when I grep my original xml file, I see 229 entries. But, with my
Ruby script, I only see 208. But, yes, many of these entries do repeat.
I'll look more closely at that first entry, the Trade one, to see what
might be different about it. Thanks.

Peter Bailey · Sep 20, 2007

William,
So, I found out what my problem was. And, yes, it's kind of embarrassing.
It turns out that, at the top of my script, which I hadn't looked at in
days, I was actually parsing through multiple files, not one file. I was
looking at multiple xml files, not just the one.

So, I apologize to you. I really appreciate your doggedness in helping
me. You have the patience of Job. This forum's generosity astounds me,
and, you're a perfect example of why.

Cheers,
Peter
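
For anyone who lands here with the same symptom: when a script loops over several files, tagging each match with its file name makes the interleaving obvious. A sketch with a hypothetical helper and a hypothetical *.xml glob; the pattern is one of those posted earlier in the thread.

```ruby
ISSUE_PATTERN = /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m

# Hypothetical helper: returns [file name, issue text] pairs,
# file by file in the order given, matches in file order.
def issues_by_file(paths)
  paths.flat_map do |path|
    File.read(path).scan(ISSUE_PATTERN).flatten.map do |title|
      [File.basename(path), title]
    end
  end
end

# Sort the glob so the file order is predictable.
issues_by_file(Dir.glob('*.xml').sort).each do |file, title|
  puts "#{file}: #{title}"
end
```

With the file name in front of every line, an "Immigration" entry that comes from a different file than the "Trade" entry is immediately visible.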

William James · Sep 20, 2007

Good to hear. I hate to see bugs like this unsquashed.
