Now: there are two methods:

- One can simply subtract ten from the number, and counts that until a carry happens. With small numbers up to 100 this is manageable, but with large numbers (16 bit or even larger) this a method to send the controller into useless and endless loops. Do that only if your controller has nothing else to do.
- With the second method it is irrelevant how large the numbers are: 16 bit are only twice as time-consuming as 8 bits are, not 256 times like in the first case. Also, the calculation lasts always the same time and it doesn't matter how large or small the number is. You'll get this exclusively here, designed by the website's owner personally.

Each run through the loop needs four clock cycles. The last, extra cycle, needs only three. So the total number of cycles is

With 255 these are 103 clock cycles.

With 16 bit wide numbers you'll need a 16 bit number storage and a 16 bit counter. The counter can be any two registers. If you'll select R25:R24 or XH:XL or YH:YL or ZH:ZL for those, you can count the counter up with

With ADIW and SBIW the loop lasts six clock cycles per execution, the last extra loop needs five. For dividing 65,535 by 10 you'll need

and your controller needs several milliseconds to do that. With 24 bit numbers you'll be completely lost in loops.

So there is a need for a different and faster approach.

Now two additional result registers

The determined carry from above is now rotated into those two registers from the right side to the left. If the most significant bit that is rotated to the carry is clear, the whole is repeated until all bits in the number have been shifted into the

Register

The source code for this is here. The results for those numbers show that rounding is exact. The execution times are depending from the numbers of ones in the result, because those process a little bit faster (no undo necessary, one jump less). But all are in the range between 204 and 218 clock cycles. This is more than the first method, if eight bits only are considered, but extremely faster than the respective 16-bit-method with repeated subtraction shown above.

The first approach is to divide the number by 8. If you divide the number by eight instead of ten, you'll already have a good first neighbor to the correct result. So e.g. 100 / 8 = 12 (more exact: 12.5), already near to the result of 10. Which further dividers (by 16, by 32, by 64, etc.) have to be subtracted from that to get closer to the result?

We see on that row, that we are already at ten if we subtract Number / 64. If the number would be 255, we would need to additionally subtract Number / 128 to be at the correct result. Larger numbers need more right shifts and subtractions.

Interesting is the row of numbers in column 3 of the table. These say if the number divided by the two's power has to be subtracted (1) or not (0). Starting from 16 two divided numbers are not subtracted (16 and 32), but the two following (64 and 128) have to be subtracted. This

To prove that here is the same for 55,000. Now we'll have to divide the number eight times more often, in total 23 times.

The same

To divide the number by the powers of two needs an extra register at 8 bit and two additional registers at 16 bit. Your divider register adds further two resp. four registers to that. In total you'll need three times the number of registers that your number has.

Because dividing the number by additional two's and subtracting are always the same, execution times are also the same, no matter how small or large a number is. If you use a loop instead of divisioning all by single instructions, your dependancy on number sizes increases a little bit.

Bits | Method | Number | Time (cycles) |
---|---|---|---|

8 | Repeated subtraction | 100 | 56 |

255 | 116 | ||

Binary division | 100 | 99 | |

255 | 101 | ||

Schmidt | doesn't matter | 59 | |

16 | Repeated subtraction | 10.000 | 6.025 |

65.535 | 39.344 | ||

Binary division | 10.000 | 220 | |

65.535 | 220 | ||

Schmidt | doesn't matter | 144 |

When simulating I have saved all additional registers on the stack and restored their content after dividing. This is for clear comparison only, you do not need that if you have plenty registers available.

A simple and cheap method would be to add 5 to the number. As this would lead to an overflow if the 16-bit number is 65,531 or larger, the 5 can only be added after the number has already been divided by 8. Therefore the adder at this point is 5 / 8, which is hexadecimal 0x0000A000 and 0xA0 is added to byte 1 of the divided result.

By experiments I found that up to an adder between 5.0 and 5.9988 the rounding works correct. A value of 5.5 gives the best distance to the limits, which is hexadecimal 0x0000AFFB. This is also applied in the source code.

Here you'll find the source code for this method for 16 bit numbers. Because the row

The code consists of only 54 instruction words and is very compact. The reason why I decided to check that all three lower registers are empty comes from the fact that in very rare cases the two least significant registers can be empty, while the third still isn't empty. That can happen with very special rounding adders. The priorization of the three registers is oriented on their relevance.

The source code for this is here. On the top you can add the number to be divided as constant, the result comes in register pair R17:R16.

The code has been optimized in that it uses the loop, just like in version 1. Note that the relative branching at the end of the loop is already at its limit, so do not add more than a few words in between the loop.

The basic decision to add the source code for dividing by four without subtraction and by another four with subtraction the consequence that smaller numbers need less execution clock cycles. This can be seen from the table: the shortest execution time is 182 clock cycles for dividing zero, the longest are 364 clock cycles for dividing 65,535. Each loop repetition adds roughly 47 cycles.

What can also be seen from the table: rounding is absolutely correct, no matter how small or large the numbers are!

Due to the optimization the source code is only 90 words long, slightly longer than the previous version, of course. That fits also into very small flash memory. Note that saving and re-storing needed registers is not included here. If you need it, add some pushes on top and the same number of pops at the end (and, of course, initiation of the stack on top).

Note that execution times now are longer than for the respective binary division and with the previous version 1. So there is a lot of optimization potential.

- For numbers smaller than 100 and 8 bit wide only the repeated subtraction of 10 is the fastest.
- In all other cases my new method is the fastest, binary division is a little bit slower than my method.

To the page top

©2021 by http://www.avr-asm-tutorial.net