Optimize a discovery algorithm

pozz · Jan 1, 2014

I'm programming a small 8-bit microcontroller (AVR from Atmel). This is
the single master in a 2-wires RS485 network where N slaves (AVR based)
are connected. In RS485 networks, only one transmitter can be active at
a time, otherwise the transmitted characters will arrive corrupted to
the receivers. Moreover, the characters transmitted from the active
transmitter are received by all receivers (all other nodes in the
network that are listening if they aren't transmitting).

I want to design and implement a discovery technique: the master should
be able to understand how many slaves are connected and assign them a
different address for further communication.
I can use the 32-bits serial number stored in a non volatile memory in
the slaves and master (each node on the bus has a different serial number).

The discovery algorithm I was thinking of is based on the following
consideration: if two or more slaves answer to a query originated by the
master, two cases could happen.

1. the answer messages are separated in time, so they arrive
uncorrupted to the master;
2. two or more answer messages overlap, so they could arrive
corrupted to the master.

The master is able to detect a corrupted message (through some
mechanisms, such as CRC).

The algorithm I'm implementing is based on binary search: the master
sends a broadcast query with an interval of serial numbers. All the
slaves with a serial number in the interval answers to the master. As
soon as the master receives a valid answer, it assign a new address to
the slave. When a slave receives a new address, it stops responding to
discovery queries from the master.

A test code for the algorithm is:

uint32_t rx_sn;

int
autodetect(void)
{
uint32_t mask = 0;
uint32_t value = 0;
uint32_t vbit = 0;
uint32_t mbit = 0;
int result = 0;
while (result == 0) {
int r = query(mask, value);
if (r == ANSWER_OK) {
assign_new_address(rx_sn);
mask = value = 0;
vbit = 0;
mbit = 0;
continue;
} else if (r == NO_ANSWER) {
if (vbit == 0) {
if (mask == 0) {
result = 1; /* No more slaves to detect */
} else {
value |= mbit;
vbit = 1;
continue;
}
} else {
result = -1; /* Error */
}
} else { /* r == GARBAGE */
if (mask == UINT32_MAX) {
result = -2; /* Error */
} else {
if (mbit) mbit <<= 1; else mbit = 1;
mask |= mbit;
vbit = 0;
}
}
}
return result;
}

The function query() simulates the transmitting on the bus of the
message "Are there any slave with serial numbers in the range defined by
this mask and this value?". The return value of query() could be
ANSWER_OK (just one slave has responded with a correct answer),
NO_ANSWER (no slave has responded at all), GARBAGE (two or more slave
have responded so a corrupted answer is received).
If the result is ANSWER_OK, the serial number of the slave that has
responded is stored in global variable rx_sn.

The function assign_new_address() simulates the transmitting on the bus
of the message "The slave with this serial number will use this address
for further communication. From now on, it won't answer anymore to
discovery queries".

The interval of serial numbers is defined by a 32-bits MASK and a
32-bits VALUE: a serial number SN is in the range if

(SN & MASK) == VALUE

If garbage is received, the algorithm tries to reduce the interval
fixing the value of the least significant bit, first to 0 and after to
1. If more than one slave answer, the successive left bit is fixed.

I think this algorithm works, but it is slow and not optimized. As you
can note, when a new slave is discovered, it starts from the beginning.
This approach will loose some time to sends the same query many times.

Suppose to have three slaves on the network with serial numbers
00000001, 00000003 and 00000007. The previous algorithm will produce 12
queries:

1. mask=00000000 value=00000000: GARBAGE
2. mask=00000001 value=00000000: NO ANSWER
3. mask=00000001 value=00000001: GARBAGE
4. mask=00000003 value=00000001: ANSWER OK
Assigning new address for SN 0x00000001
5. mask=00000000 value=00000000: GARBAGE
6. mask=00000001 value=00000000: NO ANSWER
7. mask=00000001 value=00000001: GARBAGE
8. mask=00000003 value=00000001: NO ANSWER
9. mask=00000003 value=00000003: GARBAGE
10. mask=00000007 value=00000003: ANSWER OK
Assigning new address for SN 0x00000003
11. mask=00000000 value=00000000: ANSWER OK
Assigning new address for SN 0x00000007
12. mask=00000000 value=00000000: NO ANSWER

The query 2 has no answer, so it's a lost of time to send again during
the algorithm (see query 6): it will never be answered.

I think this could be solved avoiding to restart the algorithm from the
beginning every time a new slave is found. I have the impression it
could be optimized with recursion, but I can't use recursion on AVR
small microcontroller (stack space is very small). I know every
recursive algorithm can be written without recursion, so it could be my
solution.

Is someone suggest an optimization of my technique?

Thank you.

pozz · Jan 2, 2014

Il 02/01/2014 03:05, Robert Wessel ha scritto:

I've posted this before in comp.arch.embedded,

Yes, I remember it. The thread you wrote in was originated always from
me. I learned many things reading that thread in comp.arch.embedded,
but now I've problem in implementing the algorithm in C.

but don't use a mask
and value, use a range instead:

With mask and value you don't have a mathematical range (min..max), but
always a group of serial numbers (serial numbers that have certain bits
in certain positions). Both tecnique are identical, I think.

I hope binary operations (AND) on 32-bits variables will be much more
simple for 8-bit core than mathematical comparisons..

It's only necessary to be able to reasonably reliably detect that
someone has answered. The proper way to do the query is to specify a
range in which you'd like devices to respond. For example, let's say
you have a sixteen bit serial number (too short, but it keeps the
example short). The first "Are You There" query would specify
0x0000-0xffff, so any device would respond. If the host got no
response, it's done. If it got a good response* (say 0x1234: "I'm
Here"), then it would record that and then re-query for the ranges
left and right of the response (IOW, 0x000-0x1233 and 0x1235-0xffff).
If it got a gibberish response (a collision), it would simply split
the range in two (IOW, a binary search) and re-query those subranges
(IOW, 0x0000-0x7fff and 0x8000-0xffff). The process repeats until all
queried subranges return nothing.

It's very similar to my algorithm except for two things.
You use a numerical min..max range and I use bitmask, but it is
equivalent (see above).
You don't repeat the entire process from the beginning, but you continue
querying remaining subranges until the last one. This is the complex
job I'm not able to make and it is my true question of my original post:
how to implement the discovery algorithm without needing to restart from
the beginning (and without using recursion). It's strictly a
programming problem, related to C language.

The performance is usually only slightly worse than linear with the
number of devices to find, although you can construct pathological
cases where it's O(n*lg(m)), where m is the number of possible serial
numbers. So much longer serial numbers have only modest impact.

Repeating the whole process several times can help compensate for less
than perfect collision detection.

If you don't have a reasonable chance of detecting a collision, the
problem is much harder.

*Either because that was the only device in the subrange, or it
"overpowered" all the other devices in the subrange, and the host got
a good message from it.

I agree with you. Indeed I have chosen to use your tecnique, but now I
have to implement it

pozz · Jan 2, 2014

Il 02/01/2014 02:51, Sam ha scritto:

FYI there was a thread in comp.arch.embedded in mid-December (e.g.
around 18th) called: "Re: RS485 bus and auto-addressing". You might
find some value in reading that one - even if that thread doesn't
completely match your intended algorithm (e.g. that was discussing
address allocation, whereas your slave devices seem to have pre-defined
addresses), some of the issues described there might be relevant.

Yes, I know very well that thread, because it was originated by me.

I learned many things by reading the answers to my post, and now I want
to implement the technique described here, inspired from the thread in
comp.arch.embedded.

Now the problem is strictly related to programming: how to optimize the
algorithm I have chosen, so reducing the number of queries on the bus
that are time consuming, without using recursion.

pozz · Jan 2, 2014

Il 02/01/2014 03:05, Robert Wessel ha scritto:

I've posted this before in comp.arch.embedded, but don't use a mask
and value, use a range instead:

Just for your curiosity. I was in doubt between your technique
(transformed with bitmask operations as I'm trying to do) and another
approach.

Because collision detection is problematic in RS485 bus, slaves answer
with a break character to queries and not with a data message. If there
are two or more slaves that answer, the break character is normally
detected by the master as there was just one slave.

This approach is more reliable for sure, but it costs much more queries
to detect the exact serial number. For example, if I have just one
slave on the bus, with your (and mine) approach a single query is
sufficient. With break characters I need about 32 queries, until I
remain with a one-element range.

I think our approach is good if auto-detection is controlled by the
user: a new slave is installed, the user push a button to start
auto-detection and check the result, otherwise it could restart again
the process.

The second approach could be better in situations where the
auto-detection isn't initiated by a user but automatically with a
regular frequency, it isn't user controlled (the result isn't checked by
a user) and the time needed to discover the new slave isn't critical.

Johannes Bauer · Jan 2, 2014

I hope binary operations (AND) on 32-bits variables will be much more
simple for 8-bit core than mathematical comparisons..

Definitely not. I'd guess exactly the same amount of clock cycles per
operation (four instructions and one conditional branch). You'll see it
if you look at the assembly output via objdump.

Regards,
Johannes

--

Zumindest nicht öffentlich!

Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <[email protected]>

pozz · Jan 2, 2014

Il 02/01/2014 10:23, Johannes Bauer ha scritto:

Definitely not. I'd guess exactly the same amount of clock cycles per
operation (four instructions and one conditional branch). You'll see it
if you look at the assembly output via objdump.

You are right... maybe bitmask operations are even less efficient.

Anyway the optimization problem is the same... or similar.

Keith Thompson · Jan 2, 2014

Johannes Bauer said:
Definitely not. I'd guess exactly the same amount of clock cycles per
operation (four instructions and one conditional branch). You'll see it
if you look at the assembly output via objdump.

I'm not so sure about that.

If the CPU can only perform 8-bit operations (I have no idea whether
that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
operations. For arithmetic operations, it has to deal with carry bits
for addition and subtraction, and more complex interactions for
multiplication and division.

Bitwise operations are, on a small scale, "embarassingly parallel".
https://en.wikipedia.org/wiki/Embarassingly_parallel

Willem · Jan 2, 2014

Keith Thompson wrote:
)> On 02.01.2014 08:51, pozz wrote:
)>> I hope binary operations (AND) on 32-bits variables will be much more
)>> simple for 8-bit core than mathematical comparisons..
)>
)> Definitely not. I'd guess exactly the same amount of clock cycles per
)> operation (four instructions and one conditional branch). You'll see it
)> if you look at the assembly output via objdump.
)
) I'm not so sure about that.
)
) If the CPU can only perform 8-bit operations (I have no idea whether
) that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
) operations. For arithmetic operations, it has to deal with carry bits
) for addition and subtraction, and more complex interactions for
) multiplication and division.

The addition and subtraction operations on 8-bit CPU's (or any CPU for that
matter) already takes the carry bits into account, so 'it has to deal with
carry bits' has been built into the hardware.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

glen herrmannsfeldt · Jan 2, 2014

I'm not so sure about that.

If the CPU can only perform 8-bit operations (I have no idea whether
that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
operations. For arithmetic operations, it has to deal with carry bits
for addition and subtraction, and more complex interactions for
multiplication and division.

This is why some people like litte-endian systems. In this particular
case, of you execute four successive operations, each using the carry
generated by the previous (you also have a carry-in if needed)
You only need to increment an address, never decrement.

But past the 6502, I haven't been convinced. Note that the 6502
doesn't push the address of the following instruction on the stack
in the case of a subroutine call. (It hasn't computes the address
yet.)

But for anything past the 6502, and pretty much anything with
multiply, the advantage pretty much goes away. Well, if you
want the full description "Blaauw and Brooks" in their book
on computer architecture, explain it.

Bitwise operations are, on a small scale, "embarassingly parallel".
https://en.wikipedia.org/wiki/Embarassingly_parallel

If you are doing the operations sequentially, though, they aren't
parallel anyway.

-- glen

pozz · Jan 2, 2014

Il 02/01/2014 17:38, Keith Thompson ha scritto:

I'm not so sure about that.

If the CPU can only perform 8-bit operations (I have no idea whether
that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
operations. For arithmetic operations, it has to deal with carry bits
for addition and subtraction, and more complex interactions for
multiplication and division.

Bitwise operations are, on a small scale, "embarassingly parallel".
https://en.wikipedia.org/wiki/Embarassingly_parallel

Those are similar arguments I have to use bitwise operations. I tried
the following instructions with avr-gcc compiler (it's a compiler for
AVR8 core) and the result is the following.

eecfg.x, eecfg.y and eecfg.z are defined as uint32_t.

In the "bitwise method", the test is:

if ((eecfg.x & eecfg.y) == eecfg.z) {
13e: 40 91 da 00 lds r20, 0x00DA
142: 50 91 db 00 lds r21, 0x00DB
146: 60 91 dc 00 lds r22, 0x00DC
14a: 70 91 dd 00 lds r23, 0x00DD
14e: 80 91 d6 00 lds r24, 0x00D6
152: 90 91 d7 00 lds r25, 0x00D7
156: a0 91 d8 00 lds r26, 0x00D8
15a: b0 91 d9 00 lds r27, 0x00D9
15e: 48 23 and r20, r24
160: 59 23 and r21, r25
162: 6a 23 and r22, r26
164: 7b 23 and r23, r27
166: 80 91 de 00 lds r24, 0x00DE
16a: 90 91 df 00 lds r25, 0x00DF
16e: a0 91 e0 00 lds r26, 0x00E0
172: b0 91 e1 00 lds r27, 0x00E1
176: 48 17 cp r20, r24
178: 59 07 cpc r21, r25
17a: 6a 07 cpc r22, r26
17c: 7b 07 cpc r23, r27
17e: 09 f0 breq .+2 ; 0x182 <__stack+0x23>

In the "mathematical method", the test is:

if ((eecfg.x >= eecfg.y) && (eecfg.x <= eecfg.z)) {
13e: 40 91 d6 00 lds r20, 0x00D6
142: 50 91 d7 00 lds r21, 0x00D7
146: 60 91 d8 00 lds r22, 0x00D8
14a: 70 91 d9 00 lds r23, 0x00D9
14e: 80 91 da 00 lds r24, 0x00DA
152: 90 91 db 00 lds r25, 0x00DB
156: a0 91 dc 00 lds r26, 0x00DC
15a: b0 91 dd 00 lds r27, 0x00DD
15e: 48 17 cp r20, r24
160: 59 07 cpc r21, r25
162: 6a 07 cpc r22, r26
164: 7b 07 cpc r23, r27
166: 08 f4 brcc .+2 ; 0x16a <__stack+0xb>
168: a4 c0 rjmp .+328 ; 0x2b2 <__stack+0x153>
16a: 80 91 de 00 lds r24, 0x00DE
16e: 90 91 df 00 lds r25, 0x00DF
172: a0 91 e0 00 lds r26, 0x00E0
176: b0 91 e1 00 lds r27, 0x00E1
17a: 84 17 cp r24, r20
17c: 95 07 cpc r25, r21
17e: a6 07 cpc r26, r22
180: b7 07 cpc r27, r23
182: 08 f4 brcc .+2 ; 0x186 <__stack+0x27>

21 instructions for bitwise comparison versus 23 instructions for
mathematical comparison. There isn't a good reason to choose one method
for reducing instructions.

Keith Thompson · Jan 2, 2014

Willem said:
Keith Thompson wrote:
)> On 02.01.2014 08:51, pozz wrote:
)>> I hope binary operations (AND) on 32-bits variables will be much more
)>> simple for 8-bit core than mathematical comparisons..
)>
)> Definitely not. I'd guess exactly the same amount of clock cycles per
)> operation (four instructions and one conditional branch). You'll see it
)> if you look at the assembly output via objdump.
)
) I'm not so sure about that.
)
) If the CPU can only perform 8-bit operations (I have no idea whether
) that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
) operations. For arithmetic operations, it has to deal with carry bits
) for addition and subtraction, and more complex interactions for
) multiplication and division.

The addition and subtraction operations on 8-bit CPU's (or any CPU for that
matter) already takes the carry bits into account, so 'it has to deal with
carry bits' has been built into the hardware.

Fair enough; multi-word addition and subtraction (where a "word" is the
size that can be handled by a CPU instruction) are likely to be as fast
as multi-word bitwise operations, assuming that single-word ADD and XOR,
for example, take the same number of clock cycles.

Multiplication and division are another story.

pozz · Jan 2, 2014

Il 02/01/2014 23:05, Robert Wessel ha scritto:

The advantage of using ranges is that you can eliminate a found device
from a range fairly easily, and the continue checking on either side
of that found device.

I think it's the same for bitwise ranges.

I agree with you. Indeed I have chosen to use your tecnique, but now I
have to implement it

Click to expand...

Well, in pseudocode:

struct {first, last} stack[33];
// (stack levels need to match number of bits in serial #s + 1)
[...]

But you are using a "custom stack", 33 elements wide! It's a lot of
memory space lost (64 * 33 = 2kB!!). I can't statically allocate that
space in my AVR8 microcontroller. It's better to shoot a discovered
slave and restart the algorithm from the beginning: more queries so more
time to complete, but much less RAM requirements.

I think your code is very similar, in terms of memory requirements, to
an analog recursive algorithm that is much shorter and simpler.

As I mentioned, doing this several times to increase reliability may
be a good idea if your collision detection is not 100% reliable.

Yes, it is a possibility. Anyway the discovery is manually started by
the user that can check the result. If a problem appears, the user can
restart the detection.

pozz · Jan 3, 2014

Il 02/01/2014 23:55, Robert Wessel ha scritto:

No, it's a stack of 33 pairs of 32 bit words. IOW, 264 bytes.

Yes, you're right. I was calculating bits and not bytes. Anyway it's a
big space to allocate (I'm using ATmega32A that has 2kB of RAM).

Well, it is the basic explicit stack version of the inherently
recursive routine. A "real" recursive routine would probably use
considerably more than 264 bytes of stack, though (because of the need
to save registers, local variables, and return addresses in addition
to the parameters). The real recursive implementation would be less
code, though.

You're right again. If I had RAM space, I would use a "real" recursive
routine: much more simple.

You could reduce stack usage by iteratively looking at smaller ranges
than normally are generated for each recursion (implicit or explicit).
For example, you might subdivide the incoming range into 256 sub
parts, and then issue queries for those 1/256th subranges until to get
a response (valid or invalid), and then recurse (again implicitly or
explicitly) on that range and the remaining area to the right. That
would reduce the maximum smaller ranged recursed on, limiting the
stack size (and if you used 256, you'd need no more than 5* stack
levels). Of course that would increase the number of probes by about
a factor of 256 as well. An intermediate approach (say 16 subranges),
would reduce stack usage to 9, at an 16 fold increase in work.

*It might be four (or 32, in the original), but I'm too lazy to figure
out if that's true with the simpler empty range discarding method I'm
using.

RAM space requirements is important as well as number of probes on the
bus: the probe could take about 100ms-200ms (the timeout of the answer).
If the number of probes increase, the overall time increases as well.

I think my original implementation could be a good compromise: I will
send more probes on the bus (not too much more) than the recursive
algorithm, but I don't need memory space for stack.

Aleksandar Kuktin · Jan 3, 2014

Il 02/01/2014 23:05, Robert Wessel ha scritto:

Click to expand...

[snip]

I agree with you. Indeed I have chosen to use your tecnique, but now
I have to implement it

Well, in pseudocode:

struct {first, last} stack[33];
// (stack levels need to match number of bits in serial #s + 1)
[...]

Click to expand...

But you are using a "custom stack", 33 elements wide! It's a lot of
memory space lost (64 * 33 = 2kB!!). I can't statically allocate that
space in my AVR8 microcontroller. It's better to shoot a discovered
slave and restart the algorithm from the beginning: more queries so more
time to complete, but much less RAM requirements.

Click to expand...

No, it's a stack of 33 pairs of 32 bit words. IOW, 264 bytes.

I think your code is very similar, in terms of memory requirements, to
an analog recursive algorithm that is much shorter and simpler.

Click to expand...

Well, it is the basic explicit stack version of the inherently recursive
routine. A "real" recursive routine would probably use considerably
more than 264 bytes of stack, though (because of the need to save
registers, local variables, and return addresses in addition to the
parameters). The real recursive implementation would be less code,
though.

[snip]

Well... IMHO, this problem does not really have a good solution.
Optimization (AKA, political decision making) is in order.

Using a large amount of memory is mandatory to achieve the minimum
possible packet traffic on the net^H^H^Hbus. Memory is needed to store
the intermediate state of the probing algorithm. Discarding it (AKA using
less memory) requires regenerating the state at a subsequent point in
execution, AKA sending more packets.

Below, I propose a solution (which has not been tested and may or may not
contain fence-post errors) that uses the absolute minimum amount of
memory and discards as little of the intermediate state as possible,
thereby sending less packets than it would otherwise need to. It's also
non-recursive (in a way) and short.

void alloc_addr(void) {
uint32_t top, bottom, reusable;

#define DEFAULT_BOTTOM (~0) /* Assuming the compiler will fill the
entire field with ones. */
top = 0;
bottom = DEFAULT_BOTTOM;

do {
reusable = do_querry_bus(top, bottom);
switch (reusable) {
case ANSWER_OK:
/* Assuming do_assign_address() returns the point in the address
space where the device was detected. */
top = do_assign_address(top, bottom) +1;
bottom = DEFAULT_BOTTOM;
break;

case ANSWER_NONE:
/* Maybe it's in the other half. */
reusable = (bottom - top) /2;

top = bottom;
bottom = bottom + reusable; /* May or may not contain
fence-post errors. */
break;

case ANSWER_COLLISION:
default:
bottom = bottom - ((bottom - top) /2); /* May or may not contain
fence-post errors. */
};
} while (bottom > top);
#undef DEFAULT_BOTTOM
}

Aleksandar Kuktin · Jan 3, 2014

void alloc_addr(void)

God, I'm an idiot. Or not. The unpatched code will not incur any
additional cost when more than one slave is on the bus, but it /will/
send an unneccessary packet if there is only one slave on the bus.

@@ -10,9 +10,8 @@
reusable = do_querry_bus(top, bottom);
switch (reusable) {
case ANSWER_OK:
- /* Assuming do_assign_address() returns the point in the address
- space where the device was detected. */
- top = do_assign_address(top, bottom) +1;
+ do_assign_address(top, bottom);
+ top = bottom +1;
bottom = DEFAULT_BOTTOM;
break;

Aleksandar Kuktin · Jan 3, 2014

God, I'm an idiot. Or not. The unpatched code will not incur any
additional cost when more than one slave is on the bus, but it /will/
send an unneccessary packet if there is only one slave on the bus.

And it also has a halting problem when either the bus is empty or all the
slaves are very close to the beggining of the address space.

Optimize a serial protocol for data exchange between two C applications	16	Dec 20, 2012
VHDL help.	4	Feb 24, 2009
[Q] ctypes callbacks with Delphi	8	Oct 17, 2003
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Mar 1, 2008
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Dec 15, 2007
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Nov 1, 2007
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Aug 1, 2007
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Jul 1, 2007

Optimize a discovery algorithm

pozz

pozz

pozz

pozz

Johannes Bauer

pozz

Keith Thompson

Willem

glen herrmannsfeldt

pozz

Keith Thompson

pozz

pozz

Aleksandar Kuktin

Aleksandar Kuktin

Aleksandar Kuktin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads