
Fast ARMv6M memcpy - Part 2 - Options & Results

By Visenri

January 20, 2024

This is the 2nd part of "Fast ARMv6M memcpy". Links to the other parts are at the end of this post.

The full code is available on GitHub (test project, or only the fast ARMv6M memcpy).

Configuration options

The code has been designed to allow customization. These are the available options (size differences are only rough approximations):

MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS (miwo)

It chooses the loop used for misaligned copy.

Value	loop used	Code size vs. reference	Description
0	src_aligned	0 (reference)	Slowest
1	miwo1	+144	Fast
2	miwo2	+209	Fastest

MEMCPY_ARMV6M_MED_SIZE_SPEED (mssp)

It chooses the copy loop used for misaligned medium size copy.

Irrelevant if miwo = 0, because that setting uses the "src_aligned" loop for all misaligned sizes >= 8.

Value	loop used	Code size vs. reference	Description
0	mssp_0	0 (reference)	Slowest copy.
1	src_aligned	+45 / +66 (opxip = 1)	Best all round.
2	Lmemcpy_short	-25 / +100 (opxip = 1)	Fastest code for RAM, slow for uncached XIP FLASH if "opxip" = 0.

MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ (opxip)

Enables 2 extra optimizations only for misaligned XIP FLASH copy:

Uses the "src_aligned" code because it is faster with XIP FLASH.
Enables the use of a different threshold around 60 bytes for switching between medium ("src_aligned") and big size code ("miwo").

Irrelevant if any of these conditions is met:

miwo = 0, because it forces the use of "src_aligned" loop for all misaligned sizes >= 8.
mssp = 0, because that combination makes no sense: "mssp_0_opxip_1" would be bigger and slower (for RAM) than "mssp_1_opxip_1".

Value	loop used for XIP	Code size vs. reference	Description
0	same as RAM	0	Slow for small / medium size uncached XIP FLASH copy.
1	src_aligned	+20 (mssp = 1) / +128	Best speed even for unaligned XIP FLASH.

MEMCPY_ARMV6M_OPTIMIZE_SIZE (opsz)

Saves some code size at the cost of some speed.

Value	Code size vs. reference	Description
0	0 (reference)	Maximize speed.
1	[-25, -50] *	Less code size with a very small penalty in speed (recommended to save some size).
2	[-50, -85] *	Even less code size, with a noticeable speed decrease.

Note

* Exact code savings depend on the other options used.

MEMCPY_ARMV6M_MED_SIZE_UPWARDS (msup)

Enables the use of an upwards copy loop (increasing pointer) for medium sizes when MEMCPY_ARMV6M_MED_SIZE_SPEED = 2.

I created this code only to check whether uncached XIP flash memory reads are any faster when reading increasing addresses (instead of decreasing addresses, as in "Lmemcpy_short").
It turns out there is no difference, so it is recommended to leave this option at 0: the upwards loop is slower and generally bigger than "Lmemcpy_short".

Irrelevant if mssp != 2.

Value	loop used	Code size vs. reference	Description
0	Lmemcpy_short	0 (reference)	Fastest and smallest (recommended).
1	no name	+45 / -7 (opxip = 1)	Slower, and about the same size or bigger (not recommended, only for testing).

Limiting valid configurations

Some combinations do not make sense or result in the same code output.
These "invalid" combinations can still be used, but they are not unique: several of them lead to the same code.
For documentation and testing, it is better to know and test only the unique combinations.

In the test code, some cases are grouped:

// Grouped cases
=: miwo 0; opsz 0
=: miwo 0; opsz 1
=: miwo 0; opsz 2

This means that all combinations using a set of options matching one of these lines can be reduced to one case (more details about how the test system works later).
So, for "miwo_0" there are only 3 unique combinations.

Other cases are flagged to be skipped:

// Skip cases
// msup only used when mssp is 2
-: mssp 0,1; msup 1

Skip all cases where "msup" is 1 and "mssp" is not 2.

// opxip not used when mssp is 0
-: mssp 0; opxip 1

Skip cases where opxip is not used.

With all these exceptions, the number of unique cases is reduced to 45 (from the 108 yielded by all possible combinations of options).
Eliminating also the cases used only for testing, like "msup" = 1, we get 33 unique cases (34 rows in the list below, counting the original ROM code).
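The counting above can be reproduced with a short enumeration. This is only a sketch: the option ranges come from the tables in this post, and the two functions mirror the "=:" and "-:" rules quoted above. The names `OPTIONS`, `is_skipped`, `unique_key` and `unique_cases` are made up for this example, and the real test system may process the rules differently.

```python
from itertools import product

# Value ranges taken from the option tables above.
OPTIONS = {
    "miwo": (0, 1, 2),
    "mssp": (0, 1, 2),
    "opxip": (0, 1),
    "opsz": (0, 1, 2),
    "msup": (0, 1),
}


def is_skipped(case):
    """Mirror the two "-:" skip rules quoted above."""
    if case["msup"] == 1 and case["mssp"] != 2:
        return True  # msup is only used when mssp is 2
    if case["opxip"] == 1 and case["mssp"] == 0:
        return True  # opxip is not used when mssp is 0
    return False


def unique_key(case):
    """Mirror the "=:" groups: with miwo 0, only opsz distinguishes cases."""
    if case["miwo"] == 0:
        return ("miwo_0", case["opsz"])
    return tuple(case[name] for name in OPTIONS)


def unique_cases(include_test_only=True):
    """Return the set of distinct cases left after skipping and grouping."""
    keys = set()
    for values in product(*OPTIONS.values()):
        case = dict(zip(OPTIONS, values))
        if not include_test_only and case["msup"] == 1:
            continue  # msup = 1 exists only for testing
        if is_skipped(case):
            continue
        keys.add(unique_key(case))
    return keys


total = len(list(product(*OPTIONS.values())))
print(total, len(unique_cases()), len(unique_cases(include_test_only=False)))
```

By this count there are 108 combinations in total, 45 after applying the rules, and 33 configurations once "msup" = 1 is also dropped. The last figure counts configurations only; the original ROM code row in the list is not one of them.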

This is the list of all those cases and their code size in bytes (including the original ROM code for reference):

Case	Code size
Original ROM code	154

Case (miwo_0)	Code size
miwo_0_mssp_0_opxip_0_opsz_0_msup_0	246
miwo_0_mssp_0_opxip_0_opsz_1_msup_0	220
miwo_0_mssp_0_opxip_0_opsz_2_msup_0	190

Case (miwo_1)	Code size
miwo_1_mssp_0_opxip_0_opsz_0_msup_0	338
miwo_1_mssp_0_opxip_0_opsz_1_msup_0	322
miwo_1_mssp_0_opxip_0_opsz_2_msup_0	316
miwo_1_mssp_1_opxip_0_opsz_0_msup_0	384
miwo_1_mssp_1_opxip_0_opsz_1_msup_0	342
miwo_1_mssp_1_opxip_0_opsz_2_msup_0	306
miwo_1_mssp_1_opxip_1_opsz_0_msup_0	404
miwo_1_mssp_1_opxip_1_opsz_1_msup_0	362
miwo_1_mssp_1_opxip_1_opsz_2_msup_0	326
miwo_1_mssp_2_opxip_0_opsz_0_msup_0	312
miwo_1_mssp_2_opxip_0_opsz_1_msup_0	296
miwo_1_mssp_2_opxip_0_opsz_2_msup_0	290
miwo_1_mssp_2_opxip_1_opsz_0_msup_0	440
miwo_1_mssp_2_opxip_1_opsz_1_msup_0	384
miwo_1_mssp_2_opxip_1_opsz_2_msup_0	358

Case (miwo_2)	Code size
miwo_2_mssp_0_opxip_0_opsz_0_msup_0	404
miwo_2_mssp_0_opxip_0_opsz_1_msup_0	384
miwo_2_mssp_0_opxip_0_opsz_2_msup_0	378
miwo_2_mssp_1_opxip_0_opsz_0_msup_0	450
miwo_2_mssp_1_opxip_0_opsz_1_msup_0	404
miwo_2_mssp_1_opxip_0_opsz_2_msup_0	368
miwo_2_mssp_1_opxip_1_opsz_0_msup_0	470
miwo_2_mssp_1_opxip_1_opsz_1_msup_0	424
miwo_2_mssp_1_opxip_1_opsz_2_msup_0	388
miwo_2_mssp_2_opxip_0_opsz_0_msup_0	378
miwo_2_mssp_2_opxip_0_opsz_1_msup_0	358
miwo_2_mssp_2_opxip_0_opsz_2_msup_0	352
miwo_2_mssp_2_opxip_1_opsz_0_msup_0	506
miwo_2_mssp_2_opxip_1_opsz_1_msup_0	446
miwo_2_mssp_2_opxip_1_opsz_2_msup_0	420
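The per-option size differences quoted in the option tables can be cross-checked against this list. The sketch below copies a few miwo_1 rows from the list above and computes their difference to the "mssp_0" case; `SIZES`, `REFERENCE` and `size_delta` are names made up for this example.

```python
# Code sizes copied from the miwo_1 rows (opsz_0, msup_0) of the list above.
SIZES = {
    "miwo_1_mssp_0_opxip_0_opsz_0_msup_0": 338,
    "miwo_1_mssp_1_opxip_0_opsz_0_msup_0": 384,
    "miwo_1_mssp_1_opxip_1_opsz_0_msup_0": 404,
    "miwo_1_mssp_2_opxip_0_opsz_0_msup_0": 312,
    "miwo_1_mssp_2_opxip_1_opsz_0_msup_0": 440,
}

# The mssp table uses mssp = 0 as its reference.
REFERENCE = "miwo_1_mssp_0_opxip_0_opsz_0_msup_0"


def size_delta(case, reference=REFERENCE):
    """Return the size difference of `case` relative to `reference`."""
    return SIZES[case] - SIZES[reference]


for case in SIZES:
    print(f"{case}: {size_delta(case):+d}")
```

The computed differences for mssp = 1 (+46, and +66 with opxip = 1) and mssp = 2 (-26, and +102 with opxip = 1) sit close to the +45 / +66 and -25 / +100 quoted in the "mssp" table, which is consistent with those figures being rough approximations.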

Recommended options (an easy decision table, showing misaligned copy speed and size):

Case	RAM Speed	Medium size XIP FLASH Speed	Size	Comments
miwo_0_mssp_0_opxip_0_opsz_2_msup_0	+	++	190	Smallest.
miwo_1_mssp_1_opxip_1_opsz_0_msup_0	++	++	404	Fast.
miwo_2_mssp_2_opxip_0_opsz_0_msup_0	+++	+	378	Fastest, slow with uncached flash.
miwo_2_mssp_2_opxip_1_opsz_0_msup_0	+++	++	506	Fastest (default).
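A case name from these tables maps directly to the option macros described earlier. The helper below is a small sketch of that mapping, using the short names and macro names from the "Configuration options" section. `MACROS` and `case_to_macros` are names made up for this example, and how the values are finally handed to the build (a header, compiler definitions, etc.) is left open here because it depends on the project setup.

```python
# Short names and macro names as listed in the "Configuration options" section.
MACROS = {
    "miwo": "MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS",
    "mssp": "MEMCPY_ARMV6M_MED_SIZE_SPEED",
    "opxip": "MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ",
    "opsz": "MEMCPY_ARMV6M_OPTIMIZE_SIZE",
    "msup": "MEMCPY_ARMV6M_MED_SIZE_UPWARDS",
}


def case_to_macros(case_name):
    """Split a name like 'miwo_2_mssp_2_...' into {macro name: value}."""
    parts = case_name.split("_")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected name/value pairs in {case_name!r}")
    # Even positions hold the short option names, odd positions the values.
    return {
        MACROS[short]: int(value)
        for short, value in zip(parts[0::2], parts[1::2])
    }


# The case marked as default in the table above.
for macro, value in case_to_macros("miwo_2_mssp_2_opxip_1_opsz_0_msup_0").items():
    print(f"{macro} = {value}")
```

An unknown short name raises a `KeyError`, which helps catch typos in a case name before it reaches the build.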

Results

The following graphs show the copy throughput measured in tests on real hardware.

Methodology:

The test for each size is repeated 100 times.
For each test, memcpy is called 10 times in case of a RAM test and only once for a XIP FLASH test.
For aligned copy, the 4 possible different offsets are tested (0, 1, 2, 3).
For misaligned data, all offset combinations are tested (src-dst: 0-1, 0-2, 0-3, 1-0, 1-2, 1-3, 2-0, 2-1, 2-3, 3-0, 3-1, 3-2).
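The two offset lists above can be generated rather than written by hand. This sketch assumes that "aligned" means source and destination share the same offset within a 32-bit word, and that "misaligned" means the two offsets differ. The variable names are made up for this example.

```python
OFFSETS = range(4)  # byte offsets within a 32-bit word

# Aligned copy (assumption): source and destination use the same offset.
aligned = [(offset, offset) for offset in OFFSETS]

# Misaligned copy: every (src, dst) pair whose offsets differ.
misaligned = [(src, dst) for src in OFFSETS for dst in OFFSETS if src != dst]

print(len(aligned), len(misaligned))
print(", ".join(f"{src}-{dst}" for src, dst in misaligned))
```

The nested loop yields the 12 src-dst pairs in the same order as the list in the methodology.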

Data processing:

Each graph curve shows the average of all offset combinations tested.

Colors and styles:

Solid lines: new code.
Dashed lines: original ROM code.
Green lines: aligned copy.
Red lines: misaligned copy.
Highlighted: range where medium size copy is used.

The horizontal limit line is the theoretical throughput for the best aligned copy loop.
Each graph shows the effect of one or two options on the copy throughput.

RAM copy throughput vs "miwo" option (X axis is logarithmic):

Options used: miwo_X_mssp_1_opxip_0_opsz_0_msup_0

Aligned copy:

Small improvement thanks to the 8 cycles saved.

Misaligned copy:

Speed is slightly faster than original code starting at 8 bytes.
The difference in speed increases with each "miwo" option after the 16 bytes size.
Miwo0: 1.5x faster around 50 bytes and 1.6x faster around 500 bytes.
Miwo1: 2x faster around 50 bytes and 2.7x faster around 500 bytes.
Miwo2: 2.3x faster around 50 bytes and 3.5x faster around 500 bytes.

The speed achieved at 512 bytes is close to the maximum possible.

Uncached XIP FLASH copy throughput vs "miwo" option (X axis is logarithmic):

Options used: miwo_X_mssp_1_opxip_1_opsz_0_msup_0

Aligned copy:

Almost the same speed; an 8-cycle difference is very small at XIP speed.

Misaligned copy:

Speed is 2 to 3 times faster than original code starting at 8 bytes.
The difference in speed increases with each "miwo" option after the 60 bytes size.
Miwo0: 3.2x faster around 500 bytes.
Miwo1: 3.8x faster around 500 bytes.
Miwo2: 4x faster around 500 bytes.

The speed achieved at 512 bytes is almost the maximum possible.

RAM copy throughput vs "mssp" & "opxip" options:

Options used: miwo_2_mssp_X_opxip_Y_opsz_0_msup_0

Aligned copy:

At this zoom level, the effect of the 8 cycles saved can be clearly seen.

Misaligned copy:

Speed between 8 and 16 bytes increases with each "mssp" option.
There is a small penalty due to the extra checks needed when opxip = 1.

Uncached XIP FLASH copy throughput vs "mssp" & "opxip" options:

Options used: miwo_2_mssp_X_opxip_Y_opsz_0_msup_0

Aligned copy:

Almost the same speed; an 8-cycle difference is very small at XIP speed.

Misaligned copy:

Speed between 8 and 16 bytes is only a little bit over original ROM code for "mssp0" and "mssp2_opxip0".
Mssp1_opxip0 achieves good performance in this range because it uses the "src_aligned" loop.
In the 16 to 60 bytes range the best results are obtained when using opxip = 1.

Fast ARMv6M memcpy - Part 1 - ASM code

Fast ARMv6M memcpy - Part 3 - Replace SDK memcpy
