String Matching Problem

M

Matt Brooks

I need a way to accomplish the following. I can't figure out an elegant
way, beating my head against the wall thinking about it.

I receive strings into a variable and want to check that string to see
if it contains at least the beginning of another string.


say original string is "CHLE,231,1", and i receive string I want to
check against it.

I guess you could say the strings are comma separated within each string
that I receive.

I need to know it matched if I receive "CHLE,231" or "CHLE,231,1" but if
I get "CHLE,23" or "CHLE,231,2" these should not be matches, and should
say nil.



For Example

blah = "CHLE,231,1"
=> "CHLE,231,1"

blah["CHLE,231,1"]
=> "CHLE,231,1" * GOOD

blah["CHLE,231"]
=> "CHLE,231" * GOOD

blah["CHLE,232"}
=> nil * GOOD

blah["CHLE,231,2"]
=> nil * GOOD

blah["CHLE,23"]
=> "CHLE,23" * BAD, I want this to be NIL


Thanks for the help!!!
Matt
 
M

Marnen Laibow-Koser

Matt said:
I need a way to accomplish the following. I can't figure out an elegant
way, beating my head against the wall thinking about it.

I receive strings into a variable and want to check that string to see
if it contains at least the beginning of another string.


say original string is "CHLE,231,1", and i receive string I want to
check against it.

I guess you could say the strings are comma separated within each string
that I receive.

I need to know it matched if I receive "CHLE,231" or "CHLE,231,1" but if
I get "CHLE,23" or "CHLE,231,2" these should not be matches, and should
say nil.



For Example

blah = "CHLE,231,1"
=> "CHLE,231,1"

blah["CHLE,231,1"]
=> "CHLE,231,1" * GOOD

blah["CHLE,231"]
=> "CHLE,231" * GOOD

blah["CHLE,232"}
=> nil * GOOD

blah["CHLE,231,2"]
=> nil * GOOD

blah["CHLE,23"]
=> "CHLE,23" * BAD, I want this to be NIL


Thanks for the help!!!
Matt

A possible solution (untested, and probably susceptible to lots of
improvement):

class String
def match_prefix(other)
SEPARATOR = ','
chunks = self.split SEPARATOR
input = other.split SEPARATOR
if input.size > chunks.size
# can't match if we have more chunks in the input
return false
else
# test the chunks we've got
input.size.each_with_index do |chunk, i|
if chunk != chunks
return false
end
end
return true
end
end

Best,
 
R

Robert Klemme

I need a way to accomplish the following. I can't figure out an elegant
way, beating my head against the wall thinking about it.

I receive strings into a variable and want to check that string to see
if it contains at least the beginning of another string.


say original string is "CHLE,231,1", and i receive string I want to
check against it.

I guess you could say the strings are comma separated within each string
that I receive.

I need to know it matched if I receive "CHLE,231" or "CHLE,231,1" but if
I get "CHLE,23" or "CHLE,231,2" these should not be matches, and should
say nil.



For Example

blah = "CHLE,231,1"
=> "CHLE,231,1"

blah["CHLE,231,1"]
=> "CHLE,231,1" * GOOD

blah["CHLE,231"]
=> "CHLE,231" * GOOD

blah["CHLE,232"}
=> nil * GOOD

blah["CHLE,231,2"]
=> nil * GOOD

blah["CHLE,23"]
=> "CHLE,23" * BAD, I want this to be NIL


Thanks for the help!!!

You can split your original string into an Array and start testing from
the first element on.

irb(main):001:0> check = 'CHLE,231,1'
=> "CHLE,231,1"
irb(main):002:0> prepared = check.split /,+/
=> ["CHLE", "231", "1"]

Cheers

robert
 
R

Richard Wiltshire

Robert said:
I guess you could say the strings are comma separated within each string
blah = "CHLE,231,1"

blah["CHLE,231,2"]
=> nil * GOOD

blah["CHLE,23"]
=> "CHLE,23" * BAD, I want this to be NIL


Thanks for the help!!!

You can split your original string into an Array and start testing from
the first element on.

irb(main):001:0> check = 'CHLE,231,1'
=> "CHLE,231,1"
irb(main):002:0> prepared = check.split /,+/
=> ["CHLE", "231", "1"]

Cheers

robert

Here is a quick implementation. I did not spend that much time on it so
there could be some holes. Looks like you could do it in one line if you
wanted. Enjoy - Richard

def check(s2)
s1 = "CHLE,231,1"
return nil if s2 == ""
a1 = s1.split /,/
a2 = s2.split /,/
a1[0...a2.size] == a2 ? s2 : nil
end

puts check("CHLE,231,1")
puts check("CHLE,231")
puts check("CHLE,23")
puts check("CHLE")
puts check("")

Output:
CHLE,231,1
CHLE,231
nil
CHLE
nil
 
T

Thairuby TH

Does this work?

irb(main):001:0> blah = "CHLE,231,1"
=> "CHLE,231,1"
irb(main):002:0> def blah.[](s)
irb(main):003:1> a = self.split(",")
irb(main):004:1> b = s.split(",")
irb(main):005:1> a[0, b.size] == b ? s : nil
irb(main):006:1> end
=> nil
irb(main):007:0> blah["CHLE,231,1"]
=> "CHLE,231,1"
irb(main):008:0> blah["CHLE,231"]
=> "CHLE,231"
irb(main):009:0> blah["CHLE,232"]
=> nil
irb(main):010:0> blah["CHLE,231,2"]
=> nil
irb(main):011:0> blah["CHLE,23"]
=> nil
 
B

Brian Candler

Matt said:
say original string is "CHLE,231,1", and i receive string I want to
check against it.

I guess you could say the strings are comma separated within each string
that I receive.

I need to know it matched if I receive "CHLE,231" or "CHLE,231,1" but if
I get "CHLE,23" or "CHLE,231,2" these should not be matches, and should
say nil.

blah = "CHLE,231,1"

incoming = "CHLE,231,1"
blah[/\A#{incoming}(\z|,)/]

incoming = "CHLE,231"
blah[/\A#{incoming}(\z|,)/]

incoming = "CHLE,232"
blah[/\A#{incoming}(\z|,)/]

incoming = "CHLE,231,2"
blah[/\A#{incoming}(\z|,)/]

incoming = "CHLE,23"
blah[/\A#{incoming}(\z|,)/]

(The result of the match sometimes include a trailing comma, but that
won't matter if it's just the match or no-match you require)
 
M

Matt Brooks

I appreciate everyone's input, It seems like they should all work, and
do, initially, but for my application each one seems to stop working for
another reason.... I am checking tons of messages, actually I am
searching a buffer of messages, about 600 messages, every time a message
is received, so if it takes to long to search the buffer, the new
received messages start piling up I guess and don't ever get in the
buffer to be searched, and so on, it is a vicious cycle I guess. I
think each of these ways takes too many resources, compared to
line[cmd], where line is my line received and cmd is the initial match I
am looking for. I suppose line[cmd] is really optimized for quick
checking.

I tried every way you all listed, and it works for a few messages, then
it seems to get bogged down and miss messages or something as the buffer
fills up and gets larger and larger to check each one in the buffer.

Even the one liner, line[/\A#{cmd}(\z|,)/], stops working pretty quick.

Odd problem, because when I throw my regular line[cmd] code in there, it
works like a charm continuously, except at the corner cases that
provoked this thread to start with, "the unavoidable partial match".

Hate to ask again, any more less intensive ways to do it? It will be
ran up to 600 times every tenth of a second probably. Keep in mind
sometimes, it only has to search the buffer for the first 10 or so
messages before it finds a match, but sometimes it must go all the way
deep into the buffer 500 or so to get a match, then it starts on the
next one... etc..

the code, with check and match_prefix methods defined as above posts...



i = 0
#iterate through buffer looking for message
while(i < @message_buffer.num_in_buf)
if found_message == false
line = @message_buffer.get(i)
#if check(line,cmd) #resource hog
#if line[/\A#{cmd}(\z|,)/] #resource hog
#if(line.match_prefix(cmd)) #resource hog
if line[cmd]
if ((cmd != "SDRI") and (cmd != "SDRL") and (cmd !=
"CHLL"))
write_log("<- #{@message_buffer.get(i)}",1)#DEBUG
write_log("="*15 + "Ascii Response Message: #{cmd}
Received" + "="*10 + "\n")
found_message = true
break #message was found, don't search buffer
anymore
end
end # if this line has the cmd
end
i += 1 #check next message in buffer
end # while whole buffer
 
R

Robert Klemme

2009/10/29 Matt Brooks said:
I appreciate everyone's input, It seems like they should all work, and
do, initially, but for my application each one seems to stop working for
another reason.... I am checking tons of messages, actually I am
searching a buffer of messages, about 600 messages, every time a message
is received, so if it takes to long to search the buffer, the new
received messages start piling up I guess and don't ever get in the
buffer to be searched, and so on, it is a vicious cycle I guess.

What does "received" mean? Is this some kind of network interface?
Why are you searching all the messages again as soon as a new message
arrives? Can you describe the real problem that you are trying to
solve?
=A0 I
think each of these ways takes too many resources, compared to
line[cmd], where line is my line received and cmd is the initial match I
am looking for. =A0I suppose line[cmd] is really optimized for quick
checking.

I tried every way you all listed, and it works for a few messages, then
it seems to get bogged down and miss messages or something as the buffer
fills up and gets larger and larger to check each one in the buffer.

Even the one liner, line[/\A#{cmd}(\z|,)/], stops working pretty quick.

That is a too foggy formulation for me. What does that mean? Any
errors, exceptions?
Odd problem, because when I throw my regular line[cmd] code in there, it
works like a charm continuously, except at the corner cases that
provoked this thread to start with, "the unavoidable partial match".

Hate to ask again, any more less intensive ways to do it? =A0It will be
ran up to 600 times every tenth of a second probably. =A0Keep in mind
sometimes, it only has to search the buffer for the first 10 or so
messages before it finds a match, but sometimes it must go all the way
deep into the buffer 500 or so to get a match, then it starts on the
next one... etc..

This sounds as having a linear list of entries is not an appropriate
data structure for the problem at hand. With the partial matching
needed terms "tree" and "trie" come to mind.

Please give more information about the problem you are trying to solve.

Cheers

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Matt Brooks

Okay, thanks for asking.

I don't really know why it stops working with line[/\A#{cmd}(\z|,)/],
basically my program timesout on a message when it thinks it doesn't
receive it. And that is what happened after about 6 messages when I
used that approach.

Some background.

It is a network interface of sending ascii messages to a machine which
then sends certain ascii messages back, with some fields of the message
known, some unknown till it returns.

I can't simply send the message and wait for it to return, because I
have multiple threads sending different messages at the same time,
therefore, the order in which response messages return is unknown, so
they just go into an array of size 600, one message per index.

Each message is terminated with a newline, \n.

The order of operations it usually psuedo code, SEND(CHLE,23,4)
GETMESSAGE(CHLE,23). Because, CHLE,23 is all I am certain that will
return for that message, so i look for that.

Therefore this pseudo GETMESSAGE function reads from the tcp port I
opened up earlier, 1 character at a time, loading each into a temp
string, until a \n is seen, then it loads that string into the buffer[0]
spot, then checks if it happens to match the message I was waiting for
anyway, sometimes it does, in that case it returns that message.(Perfect
scenario, done, move on to sending next message...)

Other times it does not match, like when another thread sent a different
message and it was that threads return message, therefore it is just
thrown in the buffer, so later it can be found by that other thread. And
Immediately I search the whole buffer for a match o my message I am
looking for(what if the response was actually already in the buffer from
the other thread writing it previously into the buffer), so i look
through the buffer using the line[cmd] way. If it does NOT find it, I
go back to reading one char at a time until a \n is seen. If it does
find it, it returns it(like the perfect scenario earlier).

Since after all the above happens, Then presumably the other thread gets
it chance to find its return message, it reads till a \n, throw it in
buffer, checks if matches, if does awesome good timing, and would be
done. However, usually not a match right away so then looks through
buffer, because chances are the 1st thread wrote it in the buffer
already while it was looking for its specific response message.

This all works perfectly, except from time to time I get a false
positive on a match when using line[cmd]. When for example I am looking
for specifically "CHLE,23" and "CHLE,231" actually is returned.

I want CHLE,23,3,2,3,4,5,3,2 to say YES a match if I was looking for
"CHLE,23" but if CHLE,231,3,2,3,4,5,3,2 is returned and I was looking
for "CHLE,23" I want it to say NO not a match.



Also, incase something screws up with machine not returning a message,
after 10 seconds of not responding, it just returns an error message
that says Response never received for 10 seconds. This was what was
happening when I tried each of the methods suggested in this thread...
Therefore I was thinking maybe they are hogging resources and the
network buffer is filling up and missing messages while I am doing a
resource intensive operation on a buffer of 600 items... all theory
though, I just know line[cmd] way works, again except for false
positive.


Hopefully that makes sense, It is a little complicated which is why
initially I didn't go into why I have a buffer and all...

Ideas on how to prevent my false positive situation when using line[cmd]
way?

Thanks Again!!!
-Matt
 
R

Robert Klemme

2009/10/29 Matt Brooks said:
Hopefully that makes sense, It is a little complicated which is why
initially I didn't go into why I have a buffer and all...

Um, I have to say I'm a bit stumped. First of all, from your
explanation it is not clear to me about which of the two (?) processes
we are talking. It's easier (at least for me) if you use variables,
e.g. "I have a process A which opens a server socket. Process B
connects to A with 1 connection ... Process B is the one I implement
in Ruby and...".
Ideas on how to prevent my false positive situation when using line[cmd]
way?

Maybe. A few things strike me as odd:

- Several different threads seem to be reading from the same
connection. Not sure whether that's a good idea because that makes
synchronization necessary.

- Since several threads seem to be reading from the socket, any thread
may read another thread's answer which is why you need the buffer.

- No need to reach character by character until \n, you can use #each
or #gets for that which should be significantly faster.

I would start by only having one thread writing to the socket and one
thread reading from it. The writer reads messages from a queue where
they are placed by worker threads. How we connect the reader with the
worker threads so that each worker thread gets it's reply depends on
things that aren't clear to me yet.

Kind regards

robert
 
M

Matt Brooks

Robert said:
2009/10/29 Matt Brooks said:
Hopefully that makes sense, It is a little complicated which is why
initially I didn't go into why I have a buffer and all...

Um, I have to say I'm a bit stumped. First of all, from your
explanation it is not clear to me about which of the two (?) processes
we are talking. It's easier (at least for me) if you use variables,
e.g. "I have a process A which opens a server socket. Process B
connects to A with 1 connection ... Process B is the one I implement
in Ruby and...".
Ideas on how to prevent my false positive situation when using line[cmd]
way?

Maybe. A few things strike me as odd:

- Several different threads seem to be reading from the same
connection. Not sure whether that's a good idea because that makes
synchronization necessary.

- Since several threads seem to be reading from the socket, any thread
may read another thread's answer which is why you need the buffer.

- No need to reach character by character until \n, you can use #each
or #gets for that which should be significantly faster.

I would start by only having one thread writing to the socket and one
thread reading from it. The writer reads messages from a queue where
they are placed by worker threads. How we connect the reader with the
worker threads so that each worker thread gets it's reply depends on
things that aren't clear to me yet.

Kind regards

robert


I tried each, gets, and readline, none of which returned anything...
weird huh. It just sits there forever. Im still thinking about it...



You seem to understand the situation so that is good, none the less for
others.... to clarify on threads... a redo... with A and B




I can't simply send the message and wait for it to return, because I
have multiple threads, A and B, sending different messages at the same
time,
therefore, the order in which response messages return is unknown, so
they just go into an array of size 600, one message per index.

Each message is terminated with a newline, \n.

The order of operations it usually psuedo code, SEND(CHLE,23,4)
GETMESSAGE(CHLE,23). Because, CHLE,23 is all I am certain that will
return for that message, so i look for that.

Therefore this pseudo GETMESSAGE function reads from the tcp port I
opened up earlier, 1 character at a time, loading each into a temp
string, until a \n is seen, then it loads that string into the buffer[0]
spot, then checks if it happens to match the message I was waiting for
anyway, sometimes it does, in that case it returns that message.(Perfect
scenario, done, move on to sending next message...)

Other times it does not match, like when THREAD B sent a different
message and it was THREAD B's return message, therefore it is just
thrown in the buffer, so later it can be found by THREAD B. And
Immediately I, THREAD A, search the whole buffer for a match to my
message I am
looking for(what if the response was actually already in the buffer from
the THREAD B writing it previously into the buffer), so i look
through the buffer using the line[cmd] way. If it does NOT find it, I
go back to reading one char at a time until a \n is seen. If it does
find it, it returns it(like the perfect scenario earlier).

Since after all the above happens, Then presumably the THREAD B gets
its chance to find THREAD B's return message, it reads till a \n, throws
it in
buffer, checks if matches, if it does awesome good timing, and would be
done. However, usually not a match right away so then looks through
buffer, because chances are THREAD A wrote it in the buffer
already while it was looking for THREAD A's specific response message.

This all works perfectly, except from time to time I get a false
positive on a match when using line[cmd]. When for example I am looking
for specifically "CHLE,23" and "CHLE,231" actually is returned.

I want CHLE,23,3,2,3,4,5,3,2 to say YES a match if I was looking for
"CHLE,23" but if CHLE,231,3,2,3,4,5,3,2 is returned and I was looking
for "CHLE,23" I want it to say NO not a match.
 
M

Matt Brooks

Actually .each works, I am trying this now to see if speed is little
better at least...

Earlier I forgot to pass in the variable using ||

thanks for that tip! Still got my false positive problem...in the
meantime
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top