Fast ARMv6M memcpy - Part 2 - Options & Results
This is the second part of "Fast ARMv6M memcpy". See the other parts you may have missed:
- Part 1 - ASM code
- Part 2 - Options & Results
- Part 3 - Replace SDK memcpy
- Part 4 - Automated case generation
- Part 5 - Test function
- Part 6 - Benchmark function
Full code is available on GitHub (test project, fast ARMv6M memcpy only).
Configuration options
The code is designed to be customizable. These are the available options (size differences are only rough approximations):
- MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS (miwo)
It chooses the loop used for misaligned copy.
Value | loop used | Code size vs. reference | Description |
---|---|---|---|
0 | src_aligned | 0 (reference) | Slowest |
1 | miwo1 | +144 | Fast |
2 | miwo2 | +209 | Fastest |
- MEMCPY_ARMV6M_MED_SIZE_SPEED (mssp)
It chooses the copy loop used for misaligned medium-size copies.
Irrelevant if miwo = 0, because that setting uses the "src_aligned" loop for all misaligned sizes >= 8.
Value | loop used | Code size vs. reference | Description |
---|---|---|---|
0 | mssp_0 | 0 (reference) | Slowest copy. |
1 | src_aligned | +45 / +66 (opxip = 1) | Best all round. |
2 | Lmemcpy_short | -25 / +100 (opxip = 1) | Fastest code for RAM, slow for uncached XIP FLASH if "opxip" = 0. |
- MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ (opxip)
Enables 2 extra optimizations only for misaligned XIP FLASH copy:
- Uses the "src_aligned" code because it is faster with XIP FLASH.
- Enables the use of a different threshold around 60 bytes for switching between medium ("src_aligned") and big size code ("miwo").
Irrelevant if any of these conditions is met:
- miwo = 0, because it forces the use of "src_aligned" loop for all misaligned sizes >= 8.
- mssp = 0, because that combination makes no sense: "mssp_0_opxip_1" would be bigger and slower (for RAM) than "mssp_1_opxip_1".
Value | loop used for XIP | Code size vs. reference | Description |
---|---|---|---|
0 | same as RAM | 0 | Slow for small / medium size uncached XIP FLASH copy. |
1 | src_aligned | +20 (mssp = 1) / +128 | Best speed even for unaligned XIP FLASH. |
- MEMCPY_ARMV6M_OPTIMIZE_SIZE (opsz)
Saves some code size at the cost of some speed.
Value | Code size vs. reference | Description |
---|---|---|
0 | 0 (reference) | Maximize speed. |
1 | [-25, -50] * | Less code size with a very small penalty in speed (recommended to save some size). |
2 | [-50, -85] * | Even less code size, with a noticeable speed decrease. |
Note
* Exact code savings depend on the other options used.
- MEMCPY_ARMV6M_MED_SIZE_UPWARDS (msup)
Enables the use of an upwards copy loop (increasing pointer) for medium sizes when MEMCPY_ARMV6M_MED_SIZE_SPEED = 2.
I created this code only to check whether uncached XIP FLASH reads are any faster when reading increasing addresses (instead of decreasing ones, as in "Lmemcpy_short").
It turns out there is no difference, so it is recommended to leave it at 0, because it's slower and bigger than "Lmemcpy_short".
Irrelevant if mssp != 2.
Value | loop used | Code size vs. reference | Description |
---|---|---|---|
0 | Lmemcpy_short | 0 (reference) | Fastest and smallest (recommended). |
1 | no name | +45 / -7 (opxip = 1) | Slower, with similar or bigger size (not recommended, only for testing). |
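The "loop used" columns above can be condensed into a small lookup. This is only a sketch in C of how I read the tables, not code from the project: the struct `memcpy_opts`, the function `medium_misaligned_loop` and the `src_is_xip` flag are hypothetical names introduced for illustration, and the string "upwards (unnamed)" stands in for the loop listed as "no name".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Option values, named after the abbreviations used in the tables. */
struct memcpy_opts {
    int miwo;   /* MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS: 0..2 */
    int mssp;   /* MEMCPY_ARMV6M_MED_SIZE_SPEED: 0..2 */
    int opxip;  /* MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ: 0..1 */
    int opsz;   /* MEMCPY_ARMV6M_OPTIMIZE_SIZE: 0..2 (does not select a loop) */
    int msup;   /* MEMCPY_ARMV6M_MED_SIZE_UPWARDS: 0..1 */
};

/* Hypothetical helper: name of the loop that handles a medium-size
 * misaligned copy, following the "loop used" columns of the tables. */
static const char *medium_misaligned_loop(struct memcpy_opts o, bool src_is_xip)
{
    assert(o.miwo >= 0 && o.miwo <= 2 && o.mssp >= 0 && o.mssp <= 2);

    if (o.miwo == 0)
        return "src_aligned";      /* used for all misaligned sizes >= 8 */
    if (o.mssp == 0)
        return "mssp_0";           /* opxip is not used when mssp = 0 */
    if (o.opxip == 1 && src_is_xip)
        return "src_aligned";      /* opxip = 1 switches XIP reads to this loop */
    if (o.mssp == 1)
        return "src_aligned";
    /* mssp = 2: downwards loop by default, upwards test loop if msup = 1 */
    return o.msup ? "upwards (unnamed)" : "Lmemcpy_short";
}
```

For example, with the default options (miwo 2, mssp 2, opxip 1, opsz 0, msup 0) the helper returns "Lmemcpy_short" for a RAM source and "src_aligned" for an XIP FLASH source.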
Limiting valid configurations
Some combinations do not make sense or produce the same code output.
These "invalid" combinations can still be used, but they are not unique: several combinations lead to the same code.
For documentation and testing, it is better to know and test only the unique combinations.
In the test code, some cases are grouped:
// Grouped cases
=: miwo 0; opsz 0
=: miwo 0; opsz 1
=: miwo 0; opsz 2
This means that all combinations using a set of options matching one of these lines can be reduced to one case (more details about how the test system works later).
So, for "miwo_0" there are only 3 unique combinations.
Other cases are flagged to be skipped:
// Skip cases
// msup only used when mssp is 2
-: mssp 0,1; msup 1
Skip all cases where "msup" is 1 and "mssp" is not 2.
// opxip not used when mssp is 0
-: mssp 0; opxip 1
Skip cases where opxip is not used.
With all these exceptions, the number of unique cases is reduced to 45 (from the 108 yielded by all possible combinations of options).
Eliminating also the cases used only for testing, like "msup" = 1, we get 33 unique cases (34 entries in the list below, counting the original ROM code).
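The counting above can be checked with a brute-force enumeration. The following C sketch is not the test system from the project (that one is driven by the "=:" and "-:" rule lines). It simply hard-codes the same grouping and skip rules, and the names `combo`, `canonical`, `skipped` and `count_cases` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

struct combo { int miwo, mssp, opxip, opsz, msup; };

/* "Grouped cases": with miwo = 0 only opsz changes the generated code,
 * so every such combination is reduced to mssp = opxip = msup = 0. */
static struct combo canonical(struct combo c)
{
    if (c.miwo == 0)
        c.mssp = c.opxip = c.msup = 0;
    return c;
}

/* "Skip cases": msup is only used when mssp = 2, and opxip is not used
 * when mssp = 0. */
static bool skipped(struct combo c)
{
    return (c.mssp != 2 && c.msup == 1) || (c.mssp == 0 && c.opxip == 1);
}

static bool same(struct combo a, struct combo b)
{
    return a.miwo == b.miwo && a.mssp == b.mssp && a.opxip == b.opxip &&
           a.opsz == b.opsz && a.msup == b.msup;
}

struct counts { int all, unique, unique_msup0; };

/* Walks all option values and counts the combinations that survive. */
static struct counts count_cases(void)
{
    struct counts n = {0, 0, 0};
    for (int miwo = 0; miwo <= 2; miwo++)
    for (int mssp = 0; mssp <= 2; mssp++)
    for (int opxip = 0; opxip <= 1; opxip++)
    for (int opsz = 0; opsz <= 2; opsz++)
    for (int msup = 0; msup <= 1; msup++) {
        struct combo c = {miwo, mssp, opxip, opsz, msup};
        n.all++;
        if (!same(c, canonical(c)) || skipped(c))
            continue;               /* duplicate of another case, or unused */
        n.unique++;
        if (c.msup == 0)
            n.unique_msup0++;       /* drop the test-only msup = 1 cases */
    }
    assert(n.unique_msup0 <= n.unique && n.unique <= n.all);
    return n;
}
```

Calling `count_cases()` gives 108 combinations in total, 45 unique ones, and 33 once msup = 1 is left out, which matches the 33 configuration rows in the size tables (the original ROM code is a separate entry).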
This is the list of all those cases and their size (including original ROM code for reference):
Case | Code size |
---|---|
Original ROM code | 154 |
Case (miwo_0) | Code size |
---|---|
miwo_0_mssp_0_opxip_0_opsz_0_msup_0 | 246 |
miwo_0_mssp_0_opxip_0_opsz_1_msup_0 | 220 |
miwo_0_mssp_0_opxip_0_opsz_2_msup_0 | 190 |
Case (miwo_1) | Code size |
---|---|
miwo_1_mssp_0_opxip_0_opsz_0_msup_0 | 338 |
miwo_1_mssp_0_opxip_0_opsz_1_msup_0 | 322 |
miwo_1_mssp_0_opxip_0_opsz_2_msup_0 | 316 |
miwo_1_mssp_1_opxip_0_opsz_0_msup_0 | 384 |
miwo_1_mssp_1_opxip_0_opsz_1_msup_0 | 342 |
miwo_1_mssp_1_opxip_0_opsz_2_msup_0 | 306 |
miwo_1_mssp_1_opxip_1_opsz_0_msup_0 | 404 |
miwo_1_mssp_1_opxip_1_opsz_1_msup_0 | 362 |
miwo_1_mssp_1_opxip_1_opsz_2_msup_0 | 326 |
miwo_1_mssp_2_opxip_0_opsz_0_msup_0 | 312 |
miwo_1_mssp_2_opxip_0_opsz_1_msup_0 | 296 |
miwo_1_mssp_2_opxip_0_opsz_2_msup_0 | 290 |
miwo_1_mssp_2_opxip_1_opsz_0_msup_0 | 440 |
miwo_1_mssp_2_opxip_1_opsz_1_msup_0 | 384 |
miwo_1_mssp_2_opxip_1_opsz_2_msup_0 | 358 |
Case (miwo_2) | Code size |
---|---|
miwo_2_mssp_0_opxip_0_opsz_0_msup_0 | 404 |
miwo_2_mssp_0_opxip_0_opsz_1_msup_0 | 384 |
miwo_2_mssp_0_opxip_0_opsz_2_msup_0 | 378 |
miwo_2_mssp_1_opxip_0_opsz_0_msup_0 | 450 |
miwo_2_mssp_1_opxip_0_opsz_1_msup_0 | 404 |
miwo_2_mssp_1_opxip_0_opsz_2_msup_0 | 368 |
miwo_2_mssp_1_opxip_1_opsz_0_msup_0 | 470 |
miwo_2_mssp_1_opxip_1_opsz_1_msup_0 | 424 |
miwo_2_mssp_1_opxip_1_opsz_2_msup_0 | 388 |
miwo_2_mssp_2_opxip_0_opsz_0_msup_0 | 378 |
miwo_2_mssp_2_opxip_0_opsz_1_msup_0 | 358 |
miwo_2_mssp_2_opxip_0_opsz_2_msup_0 | 352 |
miwo_2_mssp_2_opxip_1_opsz_0_msup_0 | 506 |
miwo_2_mssp_2_opxip_1_opsz_1_msup_0 | 446 |
miwo_2_mssp_2_opxip_1_opsz_2_msup_0 | 420 |
Recommended options (an easy decision table, showing misaligned copy speed and size):
Case | RAM Speed | Medium size XIP FLASH Speed | Size | Comments |
---|---|---|---|---|
miwo_0_mssp_0_opxip_0_opsz_2_msup_0 | + | ++ | 190 | Smallest. |
miwo_1_mssp_1_opxip_1_opsz_0_msup_0 | ++ | ++ | 404 | Fast. |
miwo_2_mssp_2_opxip_0_opsz_0_msup_0 | +++ | + | 378 | Fastest, slow with uncached flash. |
miwo_2_mssp_2_opxip_1_opsz_0_msup_0 | +++ | ++ | 506 | Fastest (default). |
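As an illustration of how the decision table can be used, here is a small C sketch that picks a row from a code size budget and a minimum XIP FLASH speed. The data comes from the table above, with the "+" ratings encoded as counts. The struct `recommended`, the function `pick` and the tie-breaking rule (prefer the smaller code) are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct recommended {
    const char *name;
    int ram_speed;  /* number of '+' in the RAM Speed column */
    int xip_speed;  /* number of '+' in the XIP FLASH Speed column */
    int size;       /* code size from the table */
};

static const struct recommended table[] = {
    {"miwo_0_mssp_0_opxip_0_opsz_2_msup_0", 1, 2, 190},
    {"miwo_1_mssp_1_opxip_1_opsz_0_msup_0", 2, 2, 404},
    {"miwo_2_mssp_2_opxip_0_opsz_0_msup_0", 3, 1, 378},
    {"miwo_2_mssp_2_opxip_1_opsz_0_msup_0", 3, 2, 506},
};

/* Returns the row with the best RAM speed that fits max_size and reaches
 * min_xip, or NULL if nothing fits. Ties go to the smaller code size. */
static const struct recommended *pick(int max_size, int min_xip)
{
    const struct recommended *best = NULL;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        const struct recommended *r = &table[i];
        if (r->size > max_size || r->xip_speed < min_xip)
            continue;
        if (best == NULL || r->ram_speed > best->ram_speed ||
            (r->ram_speed == best->ram_speed && r->size < best->size))
            best = r;
    }
    return best;
}
```

With a generous budget and `min_xip = 2`, `pick` returns the default case (size 506). With a budget of 400 and `min_xip = 1`, it returns the 378 case that is slow with uncached flash.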
Results
The following graphs show the copy throughput using real hardware tests.
Methodology:
- The test for each size is repeated 100 times.
- For each test, memcpy is called 10 times in case of a RAM test and only once for a XIP FLASH test.
- For aligned copy, the 4 possible different offsets are tested (0, 1, 2, 3).
- For misaligned data, all offset combinations are tested (src-dst: 0-1, 0-2, 0-3, 1-0, 1-2, 1-3, 2-0, 2-1, 2-3, 3-0, 3-1, 3-2).
Data processing:
- Each graph curve shows the average of all offset combinations tested.
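The loop structure described above can be sketched in C as follows. This is not the benchmark function from the project (that one is covered in Part 6). It is a host-side skeleton that uses the standard `memcpy`, leaves out the timing because it is hardware specific, and only shows how the offsets, repetitions and averaging fit together. The names `run_size`, `mean` and the constants are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REPEATS = 100, CALLS_RAM = 10, CALLS_XIP = 1, MAX_SIZE = 512 };

/* Runs one copy size over every src/dst offset pair and returns the number
 * of memcpy calls made. Aligned tests use equal offsets (4 pairs),
 * misaligned tests use all different offsets (12 pairs). */
static long run_size(size_t size, bool misaligned, int calls_per_test)
{
    /* Word arrays give 4-byte aligned bases, so the offsets are exact. */
    static uint32_t src_words[MAX_SIZE / 4 + 2], dst_words[MAX_SIZE / 4 + 2];
    unsigned char *src = (unsigned char *)src_words;
    unsigned char *dst = (unsigned char *)dst_words;
    long calls = 0;

    assert(size <= MAX_SIZE);
    for (size_t i = 0; i < sizeof src_words; i++)
        src[i] = (unsigned char)(i * 7u + 1u);   /* recognizable pattern */

    for (int s_off = 0; s_off < 4; s_off++)
        for (int d_off = 0; d_off < 4; d_off++) {
            if ((s_off != d_off) != misaligned)
                continue;                         /* wrong kind of pair */
            for (int rep = 0; rep < REPEATS; rep++)
                for (int c = 0; c < calls_per_test; c++) {
                    memcpy(dst + d_off, src + s_off, size);
                    calls++;
                }
            /* Sanity check: the copy must match for this offset pair. */
            assert(memcmp(dst + d_off, src + s_off, size) == 0);
        }
    return calls;
}

/* Average of the per-offset results, as plotted in each curve. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return n ? sum / (double)n : 0.0;
}
```

With these numbers, a misaligned RAM test makes 12 x 100 x 10 = 12000 calls per size, and a misaligned XIP FLASH test makes 12 x 100 x 1 = 1200.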
Colors and styles:
- Solid lines: new code.
- Dashed lines: original ROM code.
- Green lines: aligned copy.
- Red lines: misaligned copy.
- Highlighted: range where medium size copy is used.
The horizontal limit line is the theoretical throughput for the best aligned copy loop.
Each graph shows the effect of one or two options on the copy throughput.
- RAM copy throughput vs "miwo" option (X axis is logarithmic):
Options used: miwo_X_mssp_1_opxip_0_opsz_0_msup_0
Aligned copy | Misaligned copy |
---|---|
Small improvement thanks to the 8 cycles saved. | Speed is slightly faster than the original code starting at 8 bytes. The difference in speed increases with each "miwo" option above 16 bytes. Miwo0: 1.5x faster around 50 bytes and 1.6x faster around 500 bytes. Miwo1: 2x faster around 50 bytes and 2.7x faster around 500 bytes. Miwo2: 2.3x faster around 50 bytes and 3.5x faster around 500 bytes. |
The speed achieved at 512 bytes is close to the maximum possible.
- Uncached XIP FLASH copy throughput vs "miwo" option (X axis is logarithmic):
Options used: miwo_X_mssp_1_opxip_1_opsz_0_msup_0
Aligned copy | Misaligned copy |
---|---|
Almost the same speed; the 8 cycles difference is very small at XIP speed. | Speed is 2 to 3 times faster than the original code starting at 8 bytes. The difference in speed increases with each "miwo" option above 60 bytes. Miwo0: 3.2x faster around 500 bytes. Miwo1: 3.8x faster around 500 bytes. Miwo2: 4x faster around 500 bytes. |
The speed achieved at 512 bytes is almost the maximum possible.
- RAM copy throughput vs "mssp" & "opxip" options:
Options used: miwo_2_mssp_X_opxip_Y_opsz_0_msup_0
Aligned copy | Misaligned copy |
---|---|
At this zoom level, the effect of the 8 cycles saved can be clearly seen. | Speed between 8 and 16 bytes increases with each "mssp" option. There is a small penalty due to the extra checks needed when opxip = 1. |
- Uncached XIP FLASH copy throughput vs "mssp" & "opxip" options:
Options used: miwo_2_mssp_X_opxip_Y_opsz_0_msup_0
Aligned copy | Misaligned copy |
---|---|
Almost the same speed; the 8 cycles difference is very small at XIP speed. | Speed between 8 and 16 bytes is only slightly above the original ROM code for "mssp0" and "mssp2_opxip0". Mssp1_opxip0 achieves good performance in this range because it uses the "src_aligned" loop. In the 16 to 60 bytes range, the best results are obtained with opxip = 1. |