Fast ARMv6M memcpy - Part 1 - ASM code
This is the 1st part of "Fast ARMv6M memcpy":
- Part 1 - ASM code
- Part 2 - Options & Results
- Part 3 - Replace SDK memcpy
- Part 4 - Automated case generation
- Part 5 - Test function
- Part 6 - Benchmark function
Full code is available on GitHub (test project, or just the fast ARMv6M memcpy).
Introduction
While working on a video application with a Raspberry Pi Pico (RP2040), I observed that memcpy performance dropped seemingly at random.
It turned out that the slowdown happened when the data was misaligned (not aligned to 32-bit boundaries).
Digging into the ROM code, I located the misaligned copy routine, "__memcpy_slow_lp", in bootrom_misc.S:
__memcpy_slow_lp:
sub r2, #1
ldrb r3, [r1, r2]
strb r3, [r0, r2]
bne __memcpy_slow_lp
It takes 7 cycles per byte, which is very slow compared to the copy loop used for aligned memory in the same ROM code:
// 16 byte loop
1:
ldmia r1!, {r3, r4, r5, r6}
stmia r0!, {r3, r4, r5, r6}
subs r2, #16
bcs 1b
That takes only 0.81 cycles per byte (13 cycles / 16 bytes).
It is so efficient thanks to "ldmia"/"stmia", which load and store 4 words in one go, taking only 1 cycle per word plus 1 cycle for the instruction itself.
In contrast, the misaligned loop uses "ldrb"/"strb", which load and store 1 byte at a time and take 2 cycles each.
All this means that misaligned copies were about 8.6 times slower than aligned copies.
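As a quick sanity check of these numbers (assuming the usual Cortex-M0+ timings: 1 cycle for ALU operations, 2 cycles for single loads/stores and for taken branches, and 1+N cycles for an ldmia/stmia of N registers): the byte loop costs 1 + 2 + 2 + 2 = 7 cycles per byte, while the word loop costs 5 + 5 + 1 + 2 = 13 cycles per 16 bytes, i.e. about 0.81 cycles/byte, hence the ~8.6x ratio.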
For programs handling graphics this can be a real problem: they move big chunks of data that can easily be misaligned (for example, when moving graphics in 1-pixel steps).
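As a minimal illustration (a hypothetical 8-bit, double-buffered framebuffer; the names and sizes are mine, not from the project), a 1-pixel horizontal scroll makes every row copy start one byte past a word boundary, so memcpy hits the misaligned path on every call:

#include <stdint.h>
#include <string.h>

#define FB_WIDTH  320
#define FB_HEIGHT 240

// Present a frame scrolled 1 pixel to the left.
// Even if both buffers are word aligned, the source of every row copy
// (&src[y][1]) is misaligned by 1 byte, so the whole frame goes through
// the slow misaligned memcpy path.
static void present_scrolled_1px(uint8_t dst[FB_HEIGHT][FB_WIDTH],
                                 uint8_t src[FB_HEIGHT][FB_WIDTH])
{
    for (int y = 0; y < FB_HEIGHT; y++) {
        memcpy(&dst[y][0], &src[y][1], FB_WIDTH - 1);
    }
}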
Is this important for any ARMv6-M?
Devices built around ARMv6-M cores (M0, M0+ & M1) normally have very small amounts of RAM:
92% of them have <=32KB of RAM.
A few exceptions in the top range:
| Brand | Device | RAM | FLASH | PRICE (Eur.) |
|---|---|---|---|---|
| ST | STM32G0B1xB | 144KB | 512KB | 2.5 |
| Renesas | R01DS0363EJ0110 | 144KB | 512KB | 16.5 |

With 0 wait states in flash @32MHz.
So, applications are not expected to rely on big amounts of RAM either for data or program execution (FLASH is usually fast enough for most use cases).
But RP2040 is a little bit different:
- Dual ARM Cortex-M0+.
- 264KB RAM (no on-chip Flash).
- SPI flash (max. 16MB), very slow compared to the on-chip Flash present in other microcontrollers.
- 16KB of cache memory to accelerate program execution from the SPI flash.
- All this for just 0.7 Eur. (plus about 0.5 Eur. for the external SPI flash).
So, for most devices, ARMv6-M performance with unaligned data was not a big deal, because of the small RAM sizes and target applications.
But with the RP2040, a 320x240 8-bit display can be double buffered in approximately 154KB (2 x 320 x 240 = 153,600 bytes), still leaving 110KB for code and application data.
That, coupled with the PIO and a powerful DMA, convinced me that it is well suited for some video/graphics applications.
(More on the video subject in another article.)
So I thought an improved memcpy version could be useful.
Starting point
Important
Be ready to see some insane levels of optimization. Optimizing can be a fun puzzle, but some may say that I've gone too far!
I invite you to follow me down the rabbit hole!
I thought that this must have been solved before for other similar ARM cores, and indeed it had been: I found code for ARMv7-M in the newlib "memcpy-armv7m.S" file, available at memcpy-armv7m.S @ eblot, memcpy-armv7m.S @ bminor and memcpy-armv7m.S @ sourceware.org.
The useful code was after the ".Lmisaligned_copy" label, but I had to replace all the ARMv7-M instructions not available in ARMv6-M and optimize the resulting code.
And after changing:
- RSB to NEG.
- ITT blocks to branches.
- Loads and stores with post-indexed addressing to load/store multiple with post-increment.
- Some register usage.
- And more things I can't remember.
I had working code much better than the RP2040 ROM version, even using fewer cycles than the ARMv7-M code (I saved 1 instruction in the copy loop).
The code basically does the copy using word loads and stores ("ldr"/"ldmia" - "str"/"stmia"), fixing the misalignment with shifts between load and store.
This is the original loop from "memcpy-armv7m.S":
/* Pre-load on word */
ldr r4, [r1], #4
...
1:
lsrs r4, r4, \shift
ldr r3, [r1], #4
lsls r5, r3, 32 - \shift
orr r4, r4, r5
str r4, [r0], #4
mov r4, r3
subs r2, #4
bhs 1b
And this is the new code for what I call the "miwo1" loop:
/* Pre-load first word */
ldmia r1!, {r4}
...
1:
lsrs r3, r4, \shift // Prepare data to be combined with next word.
ldmia r1!, {r4} // Read new word.
lsls r5, r4, 32 - \shift // Shift new data.
orrs r3, r3, r5 // Combine it with previous remaining data.
stmia r0!, {r3} // Store word.
subs r2, #4 // 4 bytes each loop.
bhs 1b
The optimization I did was removing the "wasteful" "mov" inside the loop by using the first "lsrs" to do its job, getting the shift and move ("r4" to "r3") with one instruction.
To accomplish that, the roles of "r3" and "r4" have been swapped after the "lsrs", so at the end of the loop the value from the last "ldmia" is ready in "r4" to be shifted in the next iteration.
This code takes 10 cycles per 4 bytes = 2.5 cycles/byte.
That is almost 3x faster than the 7 cycles/byte of the ROM "__memcpy_slow_lp" (and +10% over "memcpy-armv7m.S").
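For reference, this is roughly what the "miwo1" loop does, written as a C sketch (little-endian assumed; the names are mine, and the pointer/size bookkeeping the assembly does before and after the loop is omitted). On entry the destination is word aligned, the source pointer has been backed down to a word boundary, "shift" is 8 times the source misalignment (8, 16 or 24), and the first source word is pre-loaded:

#include <stdint.h>
#include <stddef.h>

// Rough C model of the "miwo1" loop body.
static void miwo1_model(uint32_t *dst, const uint32_t *src_w,
                        unsigned shift, size_t words)
{
    uint32_t prev = *src_w++;                  // Pre-load first word.
    while (words--) {
        uint32_t out  = prev >> shift;         // lsrs r3, r4, shift
        uint32_t next = *src_w++;              // ldmia r1!, {r4}
        out |= next << (32u - shift);          // lsls r5 + orrs r3
        *dst++ = out;                          // stmia r0!, {r3}
        prev = next;                           // The removed "mov": the value
                                               // simply stays in r4 for the
                                               // next iteration.
    }
}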
Doing 8 bytes per loop iteration
But I thought: can we go further? What if I unroll the loop a little bit to try to double the speed? After some trial and error to reduce register usage, I got the following code (using only 1 extra register).
Code for "miwo2" loop:
/* Pre-load first word */
ldmia r1!, {r4}
...
1:
lsrs r5, r4, \shift // Prepare data to be combined with next word.
ldmia r1!, {r3, r4} // Get 2 new words.
lsls r6, r3, 32 - \shift // Shift new data in 1st word.
orrs r5, r5, r6 // Combine with previous remaining data.
lsrs r3, \shift // Prepare data to be combined with the 2nd word.
lsls r6, r4, 32 - \shift // Shift new data in 2nd word.
orrs r6, r3, r6 // Combine with remaining data from 1st word.
stmia r0!, {r5, r6} // Store the 2 words.
subs r2, #8 // 8 bytes each loop.
bhs 1b
This results in 15 cycles per 8 bytes = 1.88 cycles/byte.
Not doubled, of course, but a good +33% increase in speed (and 3.7x in total vs "__memcpy_slow_lp").
Note
In both cases ("miwo1" and "miwo2") there is extra code before and after the fast copy loop to handle the remaining bytes as efficiently as possible.
Filling the gaps
The original ROM code uses a very fast path for sizes < 8:
__memcpy:
mov ip, r0
cmp r2, #8
blo _memcpy_short
...
_memcpy_short:
adr r3, _memcpy_short_end
lsl r2, #2
sub r3, r2
add r3, #1
bx r3
.align 2
ldrb r3, [r1, #6]
strb r3, [r0, #6]
ldrb r3, [r1, #5]
strb r3, [r0, #5]
ldrb r3, [r1, #4]
strb r3, [r0, #4]
ldrb r3, [r1, #3]
strb r3, [r0, #3]
ldrb r3, [r1, #2]
strb r3, [r0, #2]
ldrb r3, [r1, #1]
strb r3, [r0, #1]
ldrb r3, [r1, #0]
strb r3, [r0, #0]
_memcpy_short_end:
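The computed jump is essentially a hand-made jump table: the "bx" lands on the ldrb/strb pair for the highest offset to copy, and execution falls through the remaining pairs down to offset 0. In C it is roughly equivalent to a switch with deliberate fall-through (just a sketch to show the idea, not code from the SDK):

#include <stdint.h>
#include <stddef.h>

// C analog of _memcpy_short for n < 8: jump to the entry for n bytes
// and fall through, copying from the highest offset down to 0.
static void memcpy_short_model(uint8_t *dst, const uint8_t *src, size_t n)
{
    switch (n) {               // "bx r3" lands on one of these cases.
    case 7: dst[6] = src[6];   // fall through
    case 6: dst[5] = src[5];   // fall through
    case 5: dst[4] = src[4];   // fall through
    case 4: dst[3] = src[3];   // fall through
    case 3: dst[2] = src[2];   // fall through
    case 2: dst[1] = src[1];   // fall through
    case 1: dst[0] = src[0];   // fall through
    case 0: break;
    }
}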
For the rest it calls the original aligned code or the misaligned code.
But my new misaligned code only works for sizes >= 8 ("miwo1") or >= 12 ("miwo2"), so an intermediate version was needed to handle the misaligned cases in the size range [8..11].
In "memcpy-armv7m.S" this was handled by the following code:
.Lbyte_copy:
subs r2, #4
blo .Lcopy_less_than_4
.Lbyte_copy_loop:
subs r2, #1
ldrb r3, [r1], #1
strb r3, [r0], #1
bhs .Lbyte_copy_loop
ldrb r3, [r1]
strb r3, [r0]
ldrb r3, [r1, #1]
strb r3, [r0, #1]
ldrb r3, [r1, #2]
strb r3, [r0, #2]
.Lcopy_less_than_4:
adds r2, #4
beq .Ldone
lsls r2, r2, #31
itt ne
ldrbne r3, [r1], #1
strbne r3, [r0], #1
bcc .Ldone
ldrb r3, [r1]
strb r3, [r0]
ldrb r3, [r1, #1]
strb r3, [r0, #1]
.Ldone:
That would be in between "__memcpy_slow_lp" and "_memcpy_short" in terms of speed.
An obvious way to handle these medium sizes was to extend the existing "_memcpy_short" to more than 7 bytes:
_memcpy_short:
adr r3, _memcpy_short_end
lsl r2, #2
sub r3, r2
add r3, #1
bx r3
.align 2
.irp offset,15,14,13,12,11,10,9,8,7
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
.irp offset,6,5,4,3,2,1,0
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
_memcpy_short_end:
This longer version can handle up to 16 bytes and adds "just" 18 instructions, fewer than the code in "memcpy-armv7m.S".
(The listing looks very short, but it uses ".irp" to let the assembler generate the repeated instructions.)
I put the extra instructions in a separate ".irp" so that conditional compilation can be added to that block later.
Compared to the code in "memcpy-armv7m.S", these are the cycles taken by the copy for each size after the label indicated:
Copy size | "memcpy-armv7m.S", label: ".Lbyte_copy" | New code, label: "_memcpy_short" |
---|---|---|
8 | 2 + 5 x 7 - 1 + 3 x 4 = 48 | 6 + 8 * 4 = 38 |
9 | 2 + 6 x 7 - 1 + 3 x 4 = 55 | 6 + 9 * 4 = 42 |
10 | 2 + 7 x 7 - 1 + 3 x 4 = 62 | 6 + 10 * 4 = 46 |
16 | 2 + 13 x 7 - 1 + 3 x 4 = 104 | 6 + 16 * 4 = 70 |
The original idea was to use it only for the [8..11] range of misaligned data, but I knew that the extra code required before and after the "miwo" loops makes them less efficient for these medium sizes.
After some real tests I observed that the "miwo" and "_memcpy_short" code paths had comparable execution speed at 16-18 bytes.
I decided that 16 was a good threshold for a smooth transition:
.Lmemcpy_not_aligned:
cmp r2, #16 //14-16 - Threshold for using misaligned copy, see Lmisaligned_copy comments
bhi .Lmisaligned_copy
// Fall through to _memcpy_short
Another alternative for medium sizes
But what about an 8-byte unrolled loop version? It could be shorter than the new "_memcpy_short" while having a similar speed.
Code for "mssp_0":
/* ---------------------------------------------------------------
Low speed, but quite compact version - upwards copy.
Any size can be copied using an 8 byte unrolled loop.
---------------------------------------------------------------*/
.Lmemcpy_med_size:
// Get the offset up to the 8 byte boundary
movs r3, #7
ands r3, r2
beq .Lmemcpy_med_size_loop_start
// Calculate offset for loop entry point
negs r3, r3
adds r3, #8 // r3 = 8 - unaligned bytes
// Offset pointers by the same amount negated, to start at the right address
subs r0, r3
subs r1, r3
// Compute a jump setting PC = PC + 4 * entry point
lsls r3, #2
add pc, r3 // ADD PC and MOV PC branch within Thumb state without interworking
// This part is skipped in first iteration, because entry point is always after these instructions
.Lmemcpy_med_size_loop:
// Move src pointer to next 8 byte block
adds r1, #8 // Using an instruction here we avoid the need to add a fixed negative offset before the "add pc"
// Move the data
.Lmemcpy_med_size_loop_start:
.irp offset,0,1,2,3,4,5,6,7
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
adds r0, #8
subs r2, #8
bhi .Lmemcpy_med_size_loop
This is 8 instructions shorter (29 vs 37) and can handle any copy size.
Of course, some speed is lost: 37/8 = 4.63 cycles/byte vs 4 cycles/byte for "_memcpy_short" (not counting the instructions before the loop, of which this new version has 4 more).
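As a worked example of the entry-point math (for an arbitrary 13-byte copy): r3 = 13 & 7 = 5, so the entry offset is 8 - 5 = 3; both pointers are moved back by 3 bytes and the "add pc" skips the first 3 ldrb/strb pairs, so the first pass copies bytes 0-4. The loop then advances the pointers by 8 and copies bytes 5-12, and "subs r2, #8" going below zero ends the loop.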
Micro optimization for "_memcpy_short"
Having used "add pc, reg" instruction to do the computed jump in the code above, I realized that a similar strategy could be used to optimize one instruction in the original "_memcpy_short":
_memcpy_short:
adr r3, _memcpy_short_end
lsl r2, #2
sub r3, r2
add r3, #1
bx r3
.align 2
.irp offset,6,5,4,3,2,1,0
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
_memcpy_short_end:
New "_memcpy_short":
.Lmemcpy_short:
// Compute a jump setting PC = PC + FIXED_OFFSET - 4 * entry point
// Use "add pc", instead of the classic "adr" followed by "bx rx" or "mov pc":
// This is shorter than the "bx rx" version and avoids the align requirement.
// As a bonus, all the math can be done with only 1 register.
lsls r2, #2
subs r2, #1 + (.Lmemcpy_short_end - .Lmemcpy_short_jump_start - 4) // +1 should not be needed, but it doesn't hurt to use it
negs r2, r2
// From ARMv6-M reference manual:
// - ADD PC and MOV PC branch within Thumb state without interworking.
.Lmemcpy_short_jump_start:
add pc, r2
.irp offset,6,5,4,3,2,1,0
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
.Lmemcpy_short_end:
Testing with the hardware confirmed the saved cycle.
Optimizing misaligned copy when reading from XIP FLASH of RP2040
After more testing, I realized how slow reads from XIP FLASH can be, especially if the copy is done without cache and one byte at a time.
Disabling the cache for data read from flash can sometimes be a good decision: cache memory is a scarce resource, and overall performance can improve if the application handles data caching itself, leaving more of the cache available for instructions.
With the cache disabled, the RP2040 seems to perform one SPI read operation for each load instruction, no matter if it is a word, halfword or byte load.
Each of these reads takes approximately 50 cycles, compared to only 2 cycles for a word read from RAM.
It is clear that read operations must be minimized, so aligning the reads at the source and using word loads plus shifts seems the best option.
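To put rough numbers on that (using the ~50-cycle figure above): a 64-byte misaligned copy done with byte loads needs 64 flash reads, roughly 64 x 50 = 3200 cycles spent just on the SPI accesses, while the same copy done with aligned word loads needs at most 17 reads, roughly 850 cycles, before even counting the instructions themselves.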
This strategy is implemented in the "memcpy_copy_src_aligned" macro ("src_aligned" used from now on for brevity):
// Source is aligned, read using ldmia, store with strb.
src_align_done:
subs r2, #4
copy_word_loop:
ldmia r1!, {r3}
strb r3, [r0]
.irp offset,1,2,3
lsrs r3, #8
strb r3, [r0, #\offset]
.endr
adds r0, #4 // Next word address, r1 has already been incremented by ldmia.
subs r2, #4 // Sub size.
bhs copy_word_loop
Cycles per byte comparison:
| Code | XIP FLASH uncached | RAM |
|---|---|---|
| __memcpy_slow_lp | 57 | 7 |
| _memcpy_short | 54 | 4 |
| mssp_0 | 54.63 | 4.63 |
| src_aligned | 16.75 | 4.25 |
It achieves a decent speed for XIP FLASH, and its performance is between "_memcpy_short" and "mssp_0" for RAM.
So, it's a good all-rounder for aligned or unaligned data.
Before the loop there is code to copy 1-3 bytes before reaching alignment in the most efficient way, minimizing the number of reads.
Same thing after the loop, handling the 1-3 bytes that may be left.
All in all, the required pre-post code is much faster than that of the "miwo" loops, so it is better suited for medium sizes because of the reduced overhead.
Code path selection for RAM - XIP FLASH
To use the new "src_aligned" code for XIP FLASH, the source address must be checked so that the best code path can be selected.
Additionally, 2 different thresholds must be used for switching between the medium size copy code and the "miwo" code.
The optimal threshold for RAM is around 16; for XIP FLASH it is around 60 in most cases, and it is calculated at compile time depending on the options used (the equations are based on experimental tests):
#if MEMCPY_ARMV6M_MED_SIZE_SPEED <= 1 && MEMCPY_ARMV6M_OPTIMIZE_SIZE > 0
#define MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD (64 - (MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS - 1) * 9 - (MEMCPY_ARMV6M_OPTIMIZE_SIZE) * 6)
#else
#if MEMCPY_ARMV6M_OPTIMIZE_SIZE >= 2
#define MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD (20 - (MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS - 1))
#else
#define MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD (73 - (MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS - 1) * 13 - (MEMCPY_ARMV6M_OPTIMIZE_SIZE) * 3 - (MEMCPY_ARMV6M_MED_SIZE_UPWARDS) * 15)
#endif
#endif
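To make the formulas concrete (with illustrative option values, not necessarily the defaults, which are a Part 2 topic): with MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS = 2 ("miwo2"), MEMCPY_ARMV6M_OPTIMIZE_SIZE = 0 and MEMCPY_ARMV6M_MED_SIZE_UPWARDS = 0, the last formula gives 73 - 13 = 60, which is where the "around 60" figure comes from; with MEMCPY_ARMV6M_MED_SIZE_SPEED <= 1 and MEMCPY_ARMV6M_OPTIMIZE_SIZE = 1, the first formula gives 64 - 9 - 6 = 49 instead.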
The following code is added when needed to use the "src_aligned" copy loop when the source is uncached XIP FLASH.
It also uses a different size threshold to switch between the medium size code and the big size loops ("miwo", at the ".Lmisaligned_copy" label here):
.Lmemcpy_not_aligned:
#if MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ && (MEMCPY_ARMV6M_MED_SIZE_SPEED >= 1) \
&& \
(MEMCPY_ARMV6M_MED_SIZE_ONLY_FOR_XIP_MEMORY || (MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD > 16))
// Extra checks for XIP memory
lsrs r3, r1, #29
bne 1f // Address >= 0x20000000
bcc 1f // Address < 0x10000000
// Address is in XIP memory range: 0x1xxxxxxx
// Check if it is cached, if it is, it is better to use the code path used for RAM-RAM copy
lsls r3, r1, #6
bcs 1f // Check bit 26: XIP_SRAM 0x15XXXXXX, XIP_CTRL_BASE 0x14XXXXXX
lsrs r3, r3, #30
beq 1f // Check bits 25-24: 0x11XXXXXX-0x13XXXXXX -> no cache, 0x10XXXXXX -> cache
// A different threshold is used for XIP memory,
cmp r2, MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD
bls .Lmemcpy_med_size
b .Lmisaligned_copy
1:
#endif
cmp r2, #16 //14-16 - Threshold for using misaligned copy, see Lmisaligned_copy comments
bhi .Lmisaligned_copy
// Fallthrough to "Lmemcpy_med_size" or "Lmemcpy_short"
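Tracing these checks with an example source address, say r1 = 0x13001234 (one of the uncached XIP aliases): "lsrs r3, r1, #29" leaves r3 = 0 with the carry holding bit 28 = 1, so neither "bne" nor "bcc" is taken and the address is in the 0x1xxxxxxx range; "lsls r3, r1, #6" puts bit 26 (= 0) into the carry, so it is not XIP_SRAM / XIP_CTRL; and "lsrs r3, r3, #30" leaves bits 25:24 = 0b11, which is not zero, so the address is an uncached alias and the XIP threshold and code path are used.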
All these checks (when enabled) add a small penalty in cycles for the misaligned copy case:
| | RAM | ROM | FLASH - WITH CACHE | FLASH - NO CACHE |
|---|---|---|---|---|
| Extra cycles | 3 | 4 | 8 | 8 - 9 |
The highest count is for FLASH, but XIP FLASH is so slow that these amounts are negligible.
Far more important is the fact that, with these checks enabled, the code used for uncached XIP FLASH can be up to 2.5 times faster, saving lots of cycles:
| Size | [8..16] | [17..60] |
|---|---|---|
| Cycles saved | [225..560] | [200..0] |
Copy loops comparison
The following graph shows the maximum theoretical loop throughput (@125MHz) of all the copy loops analyzed so far.
- RAM: Green axis and bars.
- XIP FLASH: Red axis and red/yellow bars.
Note
The throughput shown in this graph does not take into account any of the required pre-post code, only the loop cycles.
"memcpy_short" has no loop, so its per-byte cycle count is used instead.
Optimization of aligned code
There was very little room for improvement in the original code for aligned copy.
But I found one way of saving 8 cycles when memory was word aligned and 1 or 2 cycles otherwise.
This is the original code that copies the non-word-aligned bytes before the fast word-aligned copy loop:
sub r3, r0, r1
lsl r3, #30
bne __memcpy_slow_lp
// r0 and r1 are co-aligned
push {r4-r6}
sub r1, r0
mov r5, r0
lsr r3, r0, #1
bcc 1f
// byte at odd address
ldrb r4, [r0, r1]
strb r4, [r0]
add r0, #1
1:
lsr r3, r0, #2
bcc 1f
// halfword on non word boundary
ldrh r4, [r0, r1]
strh r4, [r0]
add r0, #2
1:
// adjust length
add r1, r0
sub r5, r0
add r2, r5
.Lmemcpy_word_aligned:
This code uses several checks and jumps to copy the required bytes or halfword before starting the word-aligned copy loop.
Following the code for the word-aligned case, it can be seen that after the push instruction, 11 cycles are required to reach the ".Lmemcpy_word_aligned" label, including 2 jumps and several unneeded operations.
Taking advantage of the fact that "r3" is already 0 after "bne __memcpy_slow_lp", it can be used as the offset for the "ldrh" / "strh" instructions, avoiding the math with "r1".
All the math required for the checks can be done with "r4" with a couple of shifts and branches.
With these changes, at most 1 jump is taken in any case (after the push instruction), and the math for incrementing the pointers and counter is skipped when not needed.
This is the improved code:
subs r3, r0, r1
lsls r3, #30
bne .Lmemcpy_not_aligned
// r0 and r1 are co-aligned
push {r4-r6}
// r3 must be 0 here
lsls r4, r0, #30 // Get the 2 lsbs of r0
beq .Lmemcpy_word_aligned // If all 0s, we are word aligned
lsls r4, #1 // Bit0 of r0 in r4, Bit1 goes to carry
beq 2f // Check the Bit0, eq means Bit0 = 0
// Instructions before the carry check must not affect carry
// Copy byte at odd address
ldrb r4, [r1]
strb r4, [r0]
movs r3, #1 // Add bytes done (used also as ldrh offset)
bcs .Lalign_done // Check Bit1: if it is 1 the original offset was 3, and 3 + 1 byte = word aligned, so we are done
//nop // Optional, this avoids an extra nop at next ".align 2" (16 byte copy loop),
2:
// Copy halfword on non word boundary
ldrh r4, [r1, r3]
strh r4, [r0, r3]
adds r3, #2 // Add bytes done
.Lalign_done:
// Adjust pointers and length
add r0, r3
add r1, r3
subs r2, r3
.Lmemcpy_word_aligned:
With this modification, the code path for the word-aligned case takes only 3 cycles instead of 11.
These are the cycles saved as measured in real tests:
| Alignment | word | half-word | others |
|---|---|---|---|
| Cycles saved | 8 | 2 | 1 |
But this is not the whole story: to achieve these results, instruction alignment must be tweaked, otherwise 1 cycle is lost.
This is because there is an ".align" directive before the copy loop (used to minimize instruction fetches from memory during the loop):
.Lmemcpy_word_aligned:
subs r2, #16
bcc 5f
.align 2 // Make sure the loop is word aligned, so, the 4 instructions are fetched only with 2 RAM reads (32bits each one).
..... Copy loop ...
The original code up to this point was word aligned, so the ".align 2" directive did not add any padding.
But the new code has 1 instruction less, so the assembler would add an extra "nop" there.
To avoid wasting a cycle on this padding, I added a "nop" before the function label, shifting the function entry point by 1 halfword:
.align 2
.global memcpy_armv6m
.type memcpy_armv6m, %function
.thumb_func
nop // Optional, this shifts function entry point by one halfword,
// avoiding an extra nop at next ".align 2" (16 byte copy loop)
// If we want to keep the function aligned, instead of this nop, another nop can be used, see below
memcpy_armv6m:
mov ip, r0
cmp r2, #8
Note
This wastes 2 bytes of space. If anyone knows a better way of doing this without adding the "nop", I'm all ears!
Note about endianness
Caution
The code assumes little-endianness; I don't know of any big-endian M0 or M0+.
Anyway, it should be easy to add support for it; as the ARMv7-M code shows, it's just a matter of changing the shift directions:
1:
#ifdef __ARM_BIG_ENDIAN
lsls r4, r4, \shift
#else
lsrs r4, r4, \shift
#endif
ldr r3, [r1], #4
#ifdef __ARM_BIG_ENDIAN
lsrs r5, r3, 32-\shift
#else
lsls r5, r3, 32-\shift
#endif
orr r4, r4, r5
str r4, [r0], #4
mov r4, r3
subs r2, #4
bhs 1b
At least these parts should be changed:
- "miwo" copy loops.
- "src_aligned" copy loops.