Parsing a serial stream too slowly

M

M.Pekala

Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...

When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:

Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$

I am fairly certain this means that my code is running too slow to
catch all the '$' markers. Below is the snippet of code I believe is
the cause of this problem...

def OnSerialRead(self, event):
text = event.data
self.sensorabuffer = self.sensorabuffer + text
self.sensorbbuffer = self.sensorbbuffer + text
self.sensorcbuffer = self.sensorcbuffer + text

if sensoraenable:
sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
if sensorresult:
s = sensorresult.group(0)
s = s[2:-1]
if self.sensor_enable_chkbox.GetValue():
self.SensorAValue = s
self.sensorabuffer = ''

if sensorbenable:
sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
if sensorresult:
s = sensorresult.group(0)
s = s[2:-1]
if self.sensor_enable_chkbox.GetValue():
self.SensorBValue = s
self.sensorbenable= ''

if sensorcenable:
sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
if sensorresult:
s = sensorresult.group(0)
s = s[2:-1]
if self.sensor_enable_chkbox.GetValue():
self.SensorCValue = s
self.sensorcenable= ''

self.DisplaySensorReadings()

I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.
 
J

Jon Clements

Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...

When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:

Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$

I am fairly certain this means that my code is running too slow to
catch all the '$' markers. Below is the snippet of code I believe is
the cause of this problem...

def OnSerialRead(self, event):
        text = event.data
        self.sensorabuffer = self.sensorabuffer + text
        self.sensorbbuffer = self.sensorbbuffer + text
        self.sensorcbuffer = self.sensorcbuffer + text

        if sensoraenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
                        if sensorresult:
                                s = sensorresult.group(0)
                                s = s[2:-1]
                                if self.sensor_enable_chkbox.GetValue():
                                        self.SensorAValue = s
                                self.sensorabuffer = ''

        if sensorbenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
                        if sensorresult:
                                s = sensorresult.group(0)
                                s = s[2:-1]
                                if self.sensor_enable_chkbox.GetValue():
                                        self.SensorBValue = s
                                self.sensorbenable= ''

        if sensorcenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
                        if sensorresult:
                                s = sensorresult.group(0)
                                s = s[2:-1]
                                if self.sensor_enable_chkbox.GetValue():
                                        self.SensorCValue = s
                                self.sensorcenable= ''

        self.DisplaySensorReadings()

I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.

You sure that's your code? Your re.search()'s are all the same.
 
M

M.Pekala

Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:
Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$
I am fairly certain this means that my code is running too slow to
catch all the '$' markers. Below is the snippet of code I believe is
the cause of this problem...
def OnSerialRead(self, event):
        text = event.data
        self.sensorabuffer = self.sensorabuffer + text
        self.sensorbbuffer = self.sensorbbuffer + text
        self.sensorcbuffer = self.sensorcbuffer + text
        if sensoraenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
                        if sensorresult:
                                s = sensorresult.group(0)
                                s = s[2:-1]
                                if self..sensor_enable_chkbox.GetValue():
                                       self.SensorAValue = s
                                self.sensorabuffer = ''
        if sensorbenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
                        if sensorresult:
                                s = sensorresult.group(0)
                                s = s[2:-1]
                                if self..sensor_enable_chkbox.GetValue():
                                       self.SensorBValue = s
                                self.sensorbenable= ''
        if sensorcenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
                        if sensorresult:
                                s = sensorresult.group(0)
                                s = s[2:-1]
                                if self..sensor_enable_chkbox.GetValue():
                                       self.SensorCValue = s
                                self.sensorcenable= ''
        self.DisplaySensorReadings()
I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.

You sure that's your code? Your re.search()'s are all the same.

Whoops you are right. the search for the second should be re.search(r'\
$B.*\$.*', self.sensorbbuffer ), for the third re.search(r'\$C.*\$.*',
self.sensorcbuffer )
 
T

Thomas Rachel

Am 23.01.2012 22:48 schrieb M.Pekala:
Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...

When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:

Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$

I am fairly certain this means that my code is running too slow to
catch all the '$' markers.

This would just result in the receive buffer constantly growing.

Probably the thing with the RE which has been mentionned by Jon is the
cause.

But I have some remarks to your code.

First, you have code repetition. You could use functions to avoid this.

Second, you have discrepancies between your 3 blocks: with A, you work
with sensorabuffer, the others have sensor[bc]enable.

Third, if you have a buffer content of '$A1234$$B-10$$C987$', your "A
code" will match the whole buffer and thus do

# s = sensorresult.group(0) ->
s = '$A1234$$B-10$$C987$'
# s = s[2:-1]
s = '1234$$B-10$$C987'
# maybe put that into self.SensorAValue
self.sensorabuffer = ''


I suggest the following way to go:

* Process your data only once.
* Do something like

[...]
theonebuffer = '$A1234$$B-10$$C987$' # for now

while True:
sensorresult = re.search(r'\$(.)(.*?)\$(.*)', theonebuffer)
if sensorresult:
sensor, value, rest = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
theonebuffer = rest
else: break # out of the while

If you execute this code, you'll end with a self.sensorvalues of

{'A': '1234', 'C': '987', 'B': '-10'}

and a theonebuffer of ''.


Let's make another test with an incomplete sensor value.

theonebuffer = '$A1234$$B-10$$C987$$D65'

[code above]

-> the dict is the same, but theonebuffer='$D65'.

* Why did I do this? Well, you are constantly receiving data. I do this
with the hope that the $ terminating the D value is yet to come.

* Why does this happen? The regex does not match this incomplete packet,
the while loop terminates (resp. breaks) and the buffer will contain the
last known value.


But you might be right - speed might become a concern if you are
processing your data slower than they come along. Then your buffer fills
up and eventually kills your program due to full memory. As the buffer
fills up, the string copying becomes slower and slower, making things
worse. Whether this becomes relevant, you'll have to test.

BTW, as you use this one regex quite often, a way to speed up could be
to compile the regex. This will change your code to

sensorre = re.compile(r'\$(.)(.*?)\$(.*)')
theonebuffer = '$A1234$$B-10$$C987$' # for now

while True:
sensorresult = sensorre.search(theonebuffer)
if sensorresult:
sensor, value, rest = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
theonebuffer = rest
else: break # out of the while


And finally, you can make use of re.finditer() resp.
sensorre.finditer(). So you can do

sensorre = re.compile(r'\$(.)(.*?)\$') # note the change
theonebuffer = '$A1234$$B-10$$C987$' # for now

sensorresult = None # init it for later
for sensorresult in sensorre.finditer(theonebuffer):
sensor, value = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
# and now, keep the rest
if sensorresult is not None:
# the for loop has done something - cut out the old stuff
# and keep a possible incomplete packet at the end
theonebuffer = theonebuffer[sensorresult.end():]

This removes the mentionned string copying as source of increased slowness.

HTH,

Thomas
 
N

Nick Dokos

M.Pekala said:
Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:
Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$
I am fairly certain this means that my code is running too slow to
catch all the '$' markers. Below is the snippet of code I believe is
the cause of this problem...
def OnSerialRead(self, event):
        text = event.data
        self.sensorabuffer = self.sensorabuffer+ text
        self.sensorbbuffer = self.sensorbbuffer+ text
        self.sensorcbuffer = self.sensorcbuffer+ text
        if sensoraenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
                       if sensorresult:
                               s = sensorresult.group(0)
                               s = s[2:-1]
                               if self.sensor_enable_chkbox.GetValue():
                                       self..SensorAValue = s
                               self.sensorabuffer = ''
        if sensorbenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
                       if sensorresult:
                               s = sensorresult.group(0)
                               s = s[2:-1]
                               if self.sensor_enable_chkbox.GetValue():
                                       self..SensorBValue = s
                               self.sensorbenable= ''
        if sensorcenable:
                sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
                       if sensorresult:
                               s = sensorresult.group(0)
                               s = s[2:-1]
                               if self.sensor_enable_chkbox.GetValue():
                                       self..SensorCValue = s
                               self.sensorcenable= ''
        self.DisplaySensorReadings()
I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.

You sure that's your code? Your re.search()'s are all the same.

Whoops you are right. the search for the second should be re.search(r'\
$B.*\$.*', self.sensorbbuffer ), for the third re.search(r'\$C.*\$.*',
self.sensorcbuffer )

The regex is probably still wrong: r'\$A.*\$.*' will e.g. match all of
your initial example "$A1234$$B-10$$C987$", so s will lose the initial
and final '$' and end up as "1234$$B-10$$C987" - I doubt that's what you
want:
sensor_result = "$A123$$B456$$C789$$A456$"
r = re.search(r'\$A.*\$.*', sensor_result)
s = r.group(0)
s = s[2:-1]
s
'123$$B456$$C789$$A456'

Is this perhaps closer to what you want?
r = re.search(r'\$A[^$]+\$', sensor_result)
r.group(0) '$A123$'

I'm sure there are more problems too - e.g. why are there three buffers?
If they all start empty and get modified the same way, they will all
contain the same string - are they modified differently in the part of
the program you have not shown? They will presumably need to be trimmed
appropriately to indicate which part has been consumed already. And, as
somebody pointed out already, the searches should probably be against
the *buffer* variables rather than the *enable* variables.

Nick
 
C

Cameron Simpson

| Hello, I am having some trouble with a serial stream on a project I am
| working on. I have an external board that is attached to a set of
| sensors. The board polls the sensors, filters them, formats the
| values, and sends the formatted values over a serial bus. The serial
| stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
| value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
|
| When one sensor is running my python script grabs the data just fine,
| removes the formatting, and throws it into a text control box. However
| when 3 or more sensors are running, I get output like the following:
|
| Sensor 1: 373
| Sensor 2: 112$$M-160$G373
| Sensor 3: 763$$A892$
|
| I am fairly certain this means that my code is running too slow to
| catch all the '$' markers. Below is the snippet of code I believe is
| the cause of this problem...

Your code _is_ slow, but as you can see above you're not missing data,
you're gathering too much data.

Some point by point remarks below. The actual _bug_ is your use of ".*"
in your regexps. Some change suggestions below the code.

| def OnSerialRead(self, event):
| text = event.data
| self.sensorabuffer = self.sensorabuffer + text
| self.sensorbbuffer = self.sensorbbuffer + text
| self.sensorcbuffer = self.sensorcbuffer + text

Slow and memory wasteful. Supposing a sensor never reports? You will
accumulate an ever growing buffer string. And extending a string gets
expensive as it grows.

| if sensoraenable:
| sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )

Slow and buggy.

The slow: You're compiling the regular expression _every_ time you come
here (unless the re module caches things, which I seem to recall it may.
But that efficiency is only luck.

The bug: supposing you get multiple sensor reports, like this:

$A1$$B2$$C3$

Your regexp matches the whole thing! Because ".*" is greedy.
You want "[^$]*" - characters that are not a "$".


| if sensorresult:
| s = sensorresult.group(0)
| s = s[2:-1]
| if self.sensor_enable_chkbox.GetValue():
| self.SensorAValue = s
| self.sensorabuffer = ''

What if there are multiple values in the buffer? After fixing your
regexp you will now be throwing them away. Better to go:

self.sensorabuffer = self.sensorabuffer[sensorresult.end():]

[...]
| I think that regex is too slow for this operation, but I'm uncertain
| of another method in python that could be faster. A little help would
| be appreciated.

Regex _is_ slow. It is good for flexible lexing, but generally Not
Fast. It can be faster than in-Python lexing because the inner
interpreation of the regex is C code, but is often overkill when speed
matters. (Which you may find it does not for your app - fix the bugs
first and see how it behaves).

I would be making the following changes if it were me:

- keep only one buffer, and parse it into sensor "tokens"
pass each token to the right sensor as needed

- don't use regexps
this is a speed thing; if you code is more readable with regexps and
still faster enough you may not do this

To these ends, untested attempt 1 (one buffer, lex into tokens, still
using regexps):

re_token = re.compile( r'\$([A-Z])([^$]*)\$' )

def OnSerialRead(self, event):
# accessing a local var is quicker and more readable
buffer = self.buffer

text = event.data
buffer += text

m = re_token.search(buffer)
while m:
sensor, value = m.group(1), m.group(2)
buffer = buffer[m.end():]
if sensor == 'A':
# ...
elif sensor == 'B':
# ...
else:
warning("unsupported sensor: %s", sensor)

# stash the updated buffer for later
self.buffer = buffer

I'm assuming here that you can get noise in the serial stream. If you
are certain to get only clean "$Ax$" sequences and nothing else you can
make the code much simpler. And faster again.

Pretending clean data and no regexps:

def OnSerialRead(self, event):
# accessing a local var is quicker and more readable
buffer = self.buffer

text = event.data
buffer += text

while buffer:
if not buffer.startswith('$'):
raise ValueError("bad data in buffer! code rewrite needed!")
mark2 = buffer.find('$', 1)
if mark2 < 0:
# end of token not present
# look again later
break
token = buffer[1:mark2]
buffer = buffer[mark2+1:]

if not token:
raise ValueError("no data in packet!")
sensorm value = token[1], token[1:]

... hand off to sensors as above ...

Cheers,
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

If your new theorem can be stated with great simplicity, then there
will exist a pathological exception. - Adrian Mathesis
 
M

M.Pekala

| Hello, I am having some trouble with a serial stream on a project I am
| working on. I have an external board that is attached to a set of
| sensors. The board polls the sensors, filters them, formats the
| values, and sends the formatted values over a serial bus. The serial
| stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
| value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
|
| When one sensor is running my python script grabs the data just fine,
| removes the formatting, and throws it into a text control box. However
| when 3 or more sensors are running, I get output like the following:
|
| Sensor 1: 373
| Sensor 2: 112$$M-160$G373
| Sensor 3: 763$$A892$
|
| I am fairly certain this means that my code is running too slow to
| catch all the '$' markers. Below is the snippet of code I believe is
| the cause of this problem...

Your code _is_ slow, but as you can see above you're not missing data,
you're gathering too much data.

Some point by point remarks below. The actual _bug_ is your use of ".*"
in your regexps. Some change suggestions below the code.

| def OnSerialRead(self, event):
|       text = event.data
|       self.sensorabuffer = self.sensorabuffer + text
|       self.sensorbbuffer = self.sensorbbuffer + text
|       self.sensorcbuffer = self.sensorcbuffer + text

Slow and memory wasteful. Supposing a sensor never reports? You will
accumulate an ever growing buffer string. And extending a string gets
expensive as it grows.

|       if sensoraenable:
|               sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )

Slow and buggy.

The slow: You're compiling the regular expression _every_ time you come
here (unless the re module caches things, which I seem to recall it may.
But that efficiency is only luck.

The bug: supposing you get multiple sensor reports, like this:

  $A1$$B2$$C3$

Your regexp matches the whole thing! Because ".*" is greedy.
You want "[^$]*" - characters that are not a "$".

|                       if sensorresult:
|                               s = sensorresult.group(0)
|                               s = s[2:-1]
|                               if self.sensor_enable_chkbox.GetValue():
|                                      self.SensorAValue = s
|                               self.sensorabuffer = ''

What if there are multiple values in the buffer? After fixing your
regexp you will now be throwing them away. Better to go:

  self.sensorabuffer = self.sensorabuffer[sensorresult.end():]

[...]
| I think that regex is too slow for this operation, but I'm uncertain
| of another method in python that could be faster. A little help would
| be appreciated.

Regex _is_ slow. It is good for flexible lexing, but generally Not
Fast. It can be faster than in-Python lexing because the inner
interpreation of the regex is C code, but is often overkill when speed
matters. (Which you may find it does not for your app - fix the bugs
first and see how it behaves).

I would be making the following changes if it were me:

  - keep only one buffer, and parse it into sensor "tokens"
    pass each token to the right sensor as needed

  - don't use regexps
    this is a speed thing; if you code is more readable with regexps and
    still faster enough you may not do this

To these ends, untested attempt 1 (one buffer, lex into tokens, still
using regexps):

    re_token = re.compile( r'\$([A-Z])([^$]*)\$' )

    def OnSerialRead(self, event):
        # accessing a local var is quicker and more readable
        buffer = self.buffer

        text = event.data
        buffer += text

        m = re_token.search(buffer)
        while m:
            sensor, value = m.group(1), m.group(2)
            buffer = buffer[m.end():]
            if sensor == 'A':
                # ...
            elif sensor == 'B':
                # ...
            else:
                warning("unsupported sensor: %s", sensor)

        # stash the updated buffer for later
        self.buffer = buffer

I'm assuming here that you can get noise in the serial stream. If you
are certain to get only clean "$Ax$" sequences and nothing else you can
make the code much simpler. And faster again.

Pretending clean data and no regexps:

    def OnSerialRead(self, event):
        # accessing a local var is quicker and more readable
        buffer = self.buffer

        text = event.data
        buffer += text

        while buffer:
            if not buffer.startswith('$'):
                raise ValueError("bad data in buffer! code rewrite needed!")
            mark2 = buffer.find('$', 1)
            if mark2 < 0:
                # end of token not present
                # look again later
                break
            token = buffer[1:mark2]
            buffer = buffer[mark2+1:]

            if not token:
                raise ValueError("no data in packet!")
            sensorm value = token[1], token[1:]

            ... hand off to sensors as above ...

Cheers,
--
Cameron Simpson <[email protected]> DoD#743http://www.cskk.ezoshosting.com/cs/

If your new theorem can be stated with great simplicity, then there
will exist a pathological exception.    - Adrian Mathesis

Okay I altered my code a bit based on all of your suggestions:

self.sensor_re = re.compile(r'\$([A-E])([^$]*)\$')

def OnSerialRead(self, event):
text = event.data
buffer = self.sensorbuffer + text

if self.SensorLogging:
result = self.sensor_re.search(buffer)
if result:
sensor,value = result.groups()
buffer = buffer[result.end():]
if sensor == 'A' and self.sensora_chkbox.GetValue():
self.SensorAValue = value
elif sensor == 'B' and self.sensorb_chkbox.GetValue():
self.SensorBValue = value
elif sensor == 'C' and self.sensorc_chkbox.GetValue():
self.SensorCValue = value
elif sensor == D' and self.sensord_chkbox.GetValue():
self.SensorDValue = value
elif sensor == 'E' and self.sensore_chkbox.GetValue():
self.SensorEValue = value
self.LogSensor()
self.sensorbuffer = buffer

....and it works far better, in so much as the problems I was having
before seem to be gone. I'm probably going to change this to something
without re, though for the time being this will work for me. The
reason there was a whole lot of junk, such as the three buffers,
around this code was because I am somewhat of a novice python
programmer and I was just doing random stuff to see if it would stick.
The reason I was using regex in the first place was that I thought it
would be faster than the python code due its c origins, but apparently
that may not be the case. Regardless thank you all for your help.

Cheers,
Miles Pekala
 
S

Steven D'Aprano

| def OnSerialRead(self, event):
| text = event.data
| self.sensorabuffer = self.sensorabuffer + text
| self.sensorbbuffer = self.sensorbbuffer + text
| self.sensorcbuffer = self.sensorcbuffer + text

Slow and memory wasteful. Supposing a sensor never reports? You will
accumulate an ever growing buffer string. And extending a string gets
expensive as it grows.

I admit I haven't read this entire thread, but one thing jumps out at me.
It looks like the code is accumulating strings by repeated +
concatenation. This is risky.

In general, you should accumulate strings into a list buffer, then join
them into a single string in one call:

buffer = []
while something:
buffer.append(text)
return ''.join(buffer)


Use of repeated string addition risks slow quadratic behaviour. The OP is
reporting slow behaviour... alarms bells ring.

For anyone who doesn't understand what I mean about slow quadratic
behaviour, read this:

http://www.joelonsoftware.com/articles/fog0000000319.html

Recent versions of CPython includes an optimization which *sometimes* can
avoid this poor performance, but it can be defeated easily, and does not
apply to Jython and IronPython, so it is best to not rely on it.

I don't know whether this is the cause of the OP's slow behaviour, but it
is worth investigating. Especially since it is likely to not just be
slow, but SLLLLLOOOOOOWWWWWWWWW -- a bad quadratic algorithm can be tens
of thousands or millions of times slower than it need be.



[...]
The slow: You're compiling the regular expression _every_ time you come
here (unless the re module caches things, which I seem to recall it may.

It does.

But that efficiency is only luck.

More deliberate design than good luck :)

Nevertheless, good design would have you compile the regex once, and not
rely on the re module's cache.


[...]
Regex _is_ slow. It is good for flexible lexing, but generally Not Fast.

I hope I will never be mistaken for a re fanboy, but credit where credit
is due: if you need the full power of a regex, you almost certainly can't
write anything in Python that will beat the re module.

However, where regexes become a trap is that often people use them for
things which are best coded as simple Python tests that are much faster,
such as using a regex where a simple str.startswith() would do.
 
C

Cameron Simpson

| On Tue, 24 Jan 2012 10:49:41 +1100, Cameron Simpson wrote:
|
| > | def OnSerialRead(self, event):
| > | text = event.data
| > | self.sensorabuffer = self.sensorabuffer + text
| > | self.sensorbbuffer = self.sensorbbuffer + text
| > | self.sensorcbuffer = self.sensorcbuffer + text
| >
| > Slow and memory wasteful. Supposing a sensor never reports? You will
| > accumulate an ever growing buffer string. And extending a string gets
| > expensive as it grows.
|
| I admit I haven't read this entire thread, but one thing jumps out at me.
| It looks like the code is accumulating strings by repeated +
| concatenation. This is risky.
|
| In general, you should accumulate strings into a list buffer, then join
| them into a single string in one call:
|
| buffer = []
| while something:
| buffer.append(text)
| return ''.join(buffer)

Yeah, but the OP needs to examine the string after every packet arrival,
so he doesn't get to defer the joining of the strings.

| Use of repeated string addition risks slow quadratic behaviour. The OP is
| reporting slow behaviour... alarms bells ring.

He's _inferring_ slow behaviour from the wrong results he was getting.
In fact he doesn't have slow behaviour (in that python isn't slow enough
in his case to cause trouble).

[...snip references to more depth on string concatenation...]

| > The slow: You're compiling the regular expression _every_ time you come
| > here (unless the re module caches things, which I seem to recall it may.
|
| It does.
|
| > But that efficiency is only luck.
|
| More deliberate design than good luck :)

The luck is in his program benefiting from it, not the decision of the
re authors to cache the source->compiled mapping.

| > Regex _is_ slow. It is good for flexible lexing, but generally Not Fast.
|
| I hope I will never be mistaken for a re fanboy, but credit where credit
| is due: if you need the full power of a regex, you almost certainly can't
| write anything in Python that will beat the re module.

Indeed not. But many many uses of the re module are for trivial lexing
stuff that is way simpler than the complexities involved in regexps
themselves (parse regexp, compiler from parse, parse string from regexp
compilation).

And perhaps equally important, regexps are both cryptic and powerful, a
combination that makes it very easy to write either highly expensive
regexps or bighly buggy regexps - in his case a small regexp error was
his bug, and programs speed had nothing to do with it.

His program did have inefficiencies though, which were worth discussing.

Cheers,
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

There are two ways of dealing with this problem: one is complicated and
messy, and the other is simple and very elegant. We don't have much time
left, so I'll just show you the complicated and messy way.
- Richard Feynman, 1981
 
U

Ulrich Eckhardt

Am 23.01.2012 22:48, schrieb M.Pekala:
I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.

Regardless of the outcome here, I would say that your code is still a
bit wonky on the handling of partial data and gaps due to buffer
overflows in the serial port buffers.

I'd start with something like this:

import unittest

class SensorReader(object):
def __init__(self):
pass
def parse(self, buffer):
"""store and parse given input"""
pass
def get_sample(self):
"""retrieve a single sample parsed from input data

This returns a tuple containing the sensor name and the
value. In case some input data was not parseable, it
returns a tuple with an empty string and the unrecognized
data.

"""
pass

class Test(unittest.TestCase):
def setUp(self):
self.handler = SensorReader()

def test_empty_buffer(self):
self.handler.parse(buffer='')
r = self.handler.get_sample()
self.assertEqual(r, None)

def test_input_1(self):
self.handler.parse(buffer='$A1234$')
r = self.handler.get_sample()
self.assertEqual(r, ('A', 1234))

def test_input_2(self):
self.handler.parse(buffer='$A1234$$B5678$')
r = self.handler.get_sample()
self.assertEqual(r, ('A', 1234))
r = self.handler.get_sample()
self.assertEqual(r, ('B', 5678))

def test_invalid_input(self):
self.handler.parse(buffer='234$$B5678$')
r = self.handler.get_sample()
self.assertEqual(r, ('', '234$'))
r = self.handler.get_sample()
self.assertEqual(r, ('B', 5678))

def test_partial_input(self):
self.handler.parse(buffer='$B56')
r = self.handler.get_sample()
self.assertEqual(r, None)
self.handler.parse(buffer='78$$C90')
r = self.handler.get_sample()
self.assertEqual(r, ('B', 5678))

Concerning the parsing itself, there is a "find" function in the str
class which you can use. Just search for dollar signs and take the data
in between. If it's empty, the first one is the end and the second one
the start of the next sample, so you can discard the content. Having
tests as above will make it a breeze for you to figure out actual
implementation. You could also start off with regexes and then switch to
linear scanning for speed.


Uli
 
T

Thomas Rachel

Am 24.01.2012 00:13 schrieb Thomas Rachel:

[sorry, my Thunderbird kills the indentation]
And finally, you can make use of re.finditer() resp.
sensorre.finditer(). So you can do

sensorre = re.compile(r'\$(.)(.*?)\$') # note the change
theonebuffer = '$A1234$$B-10$$C987$' # for now

sensorresult = None # init it for later
for sensorresult in sensorre.finditer(theonebuffer):
sensor, value = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
# and now, keep the rest
if sensorresult is not None:
# the for loop has done something - cut out the old stuff
# and keep a possible incomplete packet at the end
theonebuffer = theonebuffer[sensorresult.end():]

This removes the mentionned string copying as source of increased slowness.

But it has one major flaw: If you lose synchronization, it may happen
that only the data *between* the packets is returned - which are mostly
empty strings.

So it would be wise to either change the firmware of the device to use
different characters for starting end ending a packet, or to return
every data between "$"s and discarding the empty strings.

As regexes might be overkill here, we could do

def splitup(theonebuffer):
l = theonebuffer.split("$")
for i in l[:-1]: yield i + "$"
if l: yield l[-1]


sensorvalues = {}
theonebuffer = '1garbage$A1234$$B-10$2garbage$C987$D3' # for now
for part in splitup(theonebuffer):
if not part.endswith("$"):
theonebuffer = part
break # it is the last one which is probably not a full packet
part = part[:-1] # cut off the $
if not part: continue # $$ -> gap between packets
# now part is the contents of one packet which may be valid or not.
# TODO: Do some tests - maybe for valid sensor names and values.
sensor = part[0]
value = part[1:]
sensorvalues[sensor] = value # add the "self." in your case -
# for now, it is better without

Now I get sensorvalues, theonebuffer as ({'1': 'garbage', 'A': '1234',
'2': 'garbage', 'B': '-10', 'C': '987'}, 'D3').

D3 is not (yet) a value; it might come out as D3, D342 or whatever, as
the packet is not complete yet.


Thomas
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,012
Latest member
RoxanneDzm

Latest Threads

Top