Fast ARMv6M memcpy - Part 4 - Automated case generation
This is the 4th part of "Fast ARMv6M memcpy"; see the other parts you may have missed:
- Part 1 - ASM code
- Part 2 - Options & Results
- Part 3 - Replace SDK memcpy
- Part 4 - Automated case generation
- Part 5 - Test function
- Part 6 - Benchmark function
Full code is available on GitHub (test project, only fast ARMv6M memcpy).
Testing memcpy
Testing and comparing the results of every combination of compile options in a reliable, almost fully automated way requires the following:
- Automated generation of files needed to test all cases.
- Create a test function to ensure that all memcpy cases work correctly.
- Create a benchmark function to evaluate the speed of each implementation.
- Create a good comparison system.
To implement the first point (automated file generation) we need to:
- Generate all unique combinations of options (Test cases).
- Give each combination a unique function name.
- Create a C file with all function pointers and names, so they can be iterated from the C code.
- Create a header with all function names generated.
- Create a makefile (a ".cmake" file here) that adds the generated ".S" and ".c" files to the target.
Test case generator
I used a modified version of generate-cases.py, a script the SDCC compiler project uses to generate test cases for its automated testing.
Note
I worked on SDCC some time ago. I'm still an SDCC developer, but it's been a long time since my last commit.
This script generates a set of files from a template and several lists of replacement values.
It generates every combination, one file per combination, and names each file after the option names and values used in it.
Because it was used for SDCC, it also generates some extra code required for testing using SDCC's test suite:
void
__runSuite(void)
{
}
const int __numCases = 0;
__code const char *
__getSuiteName(void)
{
return "gen\\memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0";
}
I added an extra argument ("/ns") to suppress the generation of this code.
With my modifications, a command like this:
generate-cases.py memcpy_armv6m_test.S gen /ns
Using this template ("memcpy_armv6m_test.S"):
/* Template for memcpy ARMv6M implementation cases
// Test cases
miwo: 0, 1, 2
mssp: 0, 1, 2
opxip: 0, 1
opsz: 0, 1, 2
msup: 0, 1
// Grouped cases
=: miwo 0; opsz 0
=: miwo 0; opsz 1
=: miwo 0; opsz 2
// Skip cases
// msup only used when mssp is 2
-: mssp 0,1; msup 1
// opxip not used when mssp is 0
-: mssp 0; opxip 1
*/
#define MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS ({miwo})
#define MEMCPY_ARMV6M_MED_SIZE_SPEED ({mssp})
#define MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ ({opxip})
#define MEMCPY_ARMV6M_OPTIMIZE_SIZE ({opsz})
#define MEMCPY_ARMV6M_MED_SIZE_UPWARDS ({msup})
#define MEMCPY_ARMV6M_FUNCTION_NAME {testcaseFilename}
#define MEMCPY_ARMV6M_FUNCTION_END_SIGNATURE 1
#include "..\..\memcpy_armv6m.S"
Generates files like this ("memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0.S" in subfolder "gen"):
/* memcpy armv6 implementations
// Test cases
miwo: 0, 1, 2
mssp: 0, 1, 2
opxip: 0, 1
opsz: 0, 1, 2
msup: 0, 1
// Grouped cases
=: miwo 0; opsz 0
=: miwo 0; opsz 1
=: miwo 0; opsz 2
// Skip cases
// msup only used when mssp is 2
-: mssp 0,1; msup 1
// opxip not used when mssp is 0
-: mssp 0; opxip 1
*/
#define MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS (0)
#define MEMCPY_ARMV6M_MED_SIZE_SPEED (0)
#define MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ (0)
#define MEMCPY_ARMV6M_OPTIMIZE_SIZE (0)
#define MEMCPY_ARMV6M_MED_SIZE_UPWARDS (0)
#define MEMCPY_ARMV6M_FUNCTION_NAME memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0
#define MEMCPY_ARMV6M_FUNCTION_END_SIGNATURE 1
#include "..\..\memcpy_armv6m.S"
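The substitution step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual generate-cases.py: the helper names `case_name` and `render` are hypothetical, and the template is shortened to three lines. Plain `str.replace` is used instead of `str.format` so that other braces in an assembly template would be left alone.

```python
# Shortened, hypothetical template using the same placeholders as the real one.
TEMPLATE = """\
#define MEMCPY_ARMV6M_MED_SIZE_SPEED ({mssp})
#define MEMCPY_ARMV6M_OPTIMIZE_SIZE ({opsz})
#define MEMCPY_ARMV6M_FUNCTION_NAME {testcaseFilename}
"""


def case_name(base, case):
    """Build the file/function name from the option names and values."""
    return base + "".join(f"_{name}_{value}" for name, value in case.items())


def render(template, base, case):
    """Replace every {option} placeholder, then {testcaseFilename}."""
    text = template
    for name, value in case.items():
        text = text.replace("{" + name + "}", str(value))
    return text.replace("{testcaseFilename}", case_name(base, case))


case = {"miwo": 0, "mssp": 0, "opxip": 0, "opsz": 0, "msup": 0}
print(case_name("memcpy_armv6m_test", case))
print(render(TEMPLATE, "memcpy_armv6m_test", case))
```

The first line printed is the same name used for the generated file and for the function inside it.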
The names and values to replace are taken from the first comment block in the template:
miwo: 0, 1, 2
mssp: 0, 1, 2
opxip: 0, 1
opsz: 0, 1, 2
msup: 0, 1
The format is very simple: a name followed by ":" and a comma-separated list of values.
This means 3 miwo x 3 mssp x 2 opxip x 3 opsz x 2 msup = 108 combinations.
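As a rough sketch of how such lines can be read and the total computed (the `parse_options` helper and its regular expression are my assumptions for illustration, not the code of the SDCC script):

```python
import math
import re

HEADER = """\
miwo: 0, 1, 2
mssp: 0, 1, 2
opxip: 0, 1
opsz: 0, 1, 2
msup: 0, 1
"""


def parse_options(text):
    """Return {name: [values]} for every 'name: v1, v2, ...' line."""
    options = {}
    for line in text.splitlines():
        # A name made of word characters, a colon, then the value list.
        match = re.fullmatch(r"\s*(\w+)\s*:\s*(.+?)\s*", line)
        if match:
            name, values = match.groups()
            options[name] = [value.strip() for value in values.split(",")]
    return options


options = parse_options(HEADER)
total = math.prod(len(values) for values in options.values())
print(total)  # 108
```

Lines starting with "=:" or "-:" do not match this pattern, so a parser along these lines can handle them separately.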
The original script from SDCC blindly generates all combinations, but this is far from optimal in this scenario.
To eliminate the known duplicated cases, I modified the script to handle "special cases".
I added a way to specify two kinds of special cases:
- Grouped cases: starting with "=:" and followed by the values to match.
- Skipped cases: starting with "-:" and followed by the values to match.
Grouped case example:
=: miwo 0; opsz 0
Only the first combination with "miwo 0" and "opsz 0" is generated; every other combination matching these conditions is treated as a known duplicate and not generated.
Skipped case example:
-: mssp 0,1; msup 1
No case with ("mssp 0" or "mssp 1") and "msup 1" is generated.
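Both rule types can be modelled on top of `itertools.product`. The sketch below is my reading of the rules as described here, not the modified script itself: the data layout and the names `matches` and `generate_cases` are hypothetical, and skip rules are assumed to be applied before group rules. Under that reading, the 108 combinations shrink to 45, which agrees with the MEMCPY_ARMV6M_TEST_IMP_COUNT value in the generated header.

```python
from itertools import product

OPTIONS = {
    "miwo": [0, 1, 2],
    "mssp": [0, 1, 2],
    "opxip": [0, 1],
    "opsz": [0, 1, 2],
    "msup": [0, 1],
}
# Each rule maps an option name to the set of values it matches.
GROUPS = [
    {"miwo": {0}, "opsz": {0}},
    {"miwo": {0}, "opsz": {1}},
    {"miwo": {0}, "opsz": {2}},
]
SKIPS = [
    {"mssp": {0, 1}, "msup": {1}},
    {"mssp": {0}, "opxip": {1}},
]


def matches(case, rule):
    """True when the case satisfies every condition of the rule."""
    return all(case[name] in values for name, values in rule.items())


def generate_cases(options, groups, skips):
    names = list(options)
    seen_groups = set()
    cases = []
    for values in product(*options.values()):
        case = dict(zip(names, values))
        if any(matches(case, rule) for rule in skips):
            continue  # skipped case: never generated
        group = next((i for i, rule in enumerate(groups) if matches(case, rule)), None)
        if group is not None:
            if group in seen_groups:
                continue  # grouped case: only the first match is kept
            seen_groups.add(group)
        cases.append(case)
    return cases


cases = generate_cases(OPTIONS, GROUPS, SKIPS)
print(len(cases))  # 45
```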
Template details
The files generated with the template just define some macros and include the original "memcpy_armv6m.S".
So the exact same ASM code is used in each file, but the result depends on the macros defined.
Aside from the code generation options, two more macros are defined:
#define MEMCPY_ARMV6M_FUNCTION_NAME memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0
#define MEMCPY_ARMV6M_FUNCTION_END_SIGNATURE 1
#include "..\..\memcpy_armv6m.S"
The "MEMCPY_ARMV6M_FUNCTION_NAME" macro assigns a unique name for each implementation.
The "MEMCPY_ARMV6M_FUNCTION_END_SIGNATURE" macro enables a signature at the end of the function, which automates the calculation of the function length and a hash.
Having a hash for each implementation is a simple and effective way to check for duplicated cases.
Makefile and C code file generator
I created another Python script to generate these files from the list of files produced by the first script.
It is called like this:
generate-cases-includes.py memops_opt_test_imp gen/memcpy_armv6m_test*.S
The first argument is the base name used to generate the required names and files, and the second is the list of files to include.
It generates 3 files: one ".h", one ".c" and one ".cmake".
The generated "memops_opt_test_imp.h" contains the declarations of all the ASM functions, the total count of implementations and the declarations of the arrays defined in the ".c" file. It looks like this:
#ifndef MEMOPS_OPT_TEST_IMP_H
#include <stddef.h>
#define MEMOPS_OPT_TEST_IMP_H
extern void * memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0(void *dst, const void *src, size_t length);
...
extern void * memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_1(void *dst, const void *src, size_t length);
#define MEMCPY_ARMV6M_TEST_IMP_COUNT 45
extern void * (* const MEMCPY_ARMV6M_TEST_IMP_FUNCTIONS[MEMCPY_ARMV6M_TEST_IMP_COUNT])(void *, const void *, size_t);
extern const char * const MEMCPY_ARMV6M_TEST_IMP_NAMES[MEMCPY_ARMV6M_TEST_IMP_COUNT];
#endif
The generated "memops_opt_test_imp.c" defines two arrays, one with the function pointers of all implementations and another with their names. It looks like this:
#include "memops_opt_test_imp.h"
void * (* const MEMCPY_ARMV6M_TEST_IMP_FUNCTIONS[MEMCPY_ARMV6M_TEST_IMP_COUNT])(void *, const void *, size_t) =
{
&memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0,
...
&memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_1,
};
const char * const MEMCPY_ARMV6M_TEST_IMP_NAMES[MEMCPY_ARMV6M_TEST_IMP_COUNT] =
{
"memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0",
...
"memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_1",
};
And finally, the generated "memops_opt_test_imp.cmake" adds all the sources needed for the target:
target_sources(memops_opt_test_imp INTERFACE
${CMAKE_CURRENT_LIST_DIR}/gen/memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0.S
...
${CMAKE_CURRENT_LIST_DIR}/gen/memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_1.S
${CMAKE_CURRENT_LIST_DIR}/memops_opt_test_imp.c
)
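A generator for the header and the ".cmake" file can be quite small. The following is a sketch under assumptions, not the real generate-cases-includes.py: the function names, the `guard` and `prefix` parameters and the way they are passed in are all hypothetical, and the array declarations are left out for brevity.

```python
from pathlib import PurePosixPath

PROTOTYPE = "extern void * {name}(void *dst, const void *src, size_t length);"


def function_names(paths):
    """Derive one function name per generated file (file name without '.S')."""
    return sorted(PurePosixPath(path).stem for path in paths)


def render_header(guard, prefix, names):
    """Build the text of the header: guard, prototypes and the count macro."""
    lines = [f"#ifndef {guard}", "#include <stddef.h>", f"#define {guard}"]
    lines += [PROTOTYPE.format(name=name) for name in names]
    lines.append(f"#define {prefix}_COUNT {len(names)}")
    lines.append("#endif")
    return "\n".join(lines) + "\n"


def render_cmake(target, paths, c_file):
    """Build the target_sources() call listing every generated source."""
    lines = [f"target_sources({target} INTERFACE"]
    lines += [f"${{CMAKE_CURRENT_LIST_DIR}}/{path}" for path in list(paths) + [c_file]]
    lines.append(")")
    return "\n".join(lines) + "\n"


paths = [
    "gen/memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0.S",
    "gen/memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_1.S",
]
names = function_names(paths)
print(render_header("MEMOPS_OPT_TEST_IMP_H", "MEMCPY_ARMV6M_TEST_IMP", names))
print(render_cmake("memops_opt_test_imp", paths, "memops_opt_test_imp.c"))
```

Deriving the count from the same list that produces the prototypes keeps the header, the arrays and the build file in sync every time the cases are regenerated.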
Regenerating the cases takes a fraction of a second, because a ".bat" file executes all the commands:
del .\gen\*.S /q
generate-cases.py memcpy_armv6m_test.S gen /ns
generate-cases-includes.py memops_opt_test_imp gen/memcpy_armv6m_test*.S
Using the variables and functions from these files, the main function can print a list of all available implementations:
printf("\nImplementation\tSize\tHash\n");
for (int i = 0; i < MEMCPY_ARMV6M_TEST_IMP_COUNT; i++)
{
    // Clear bit 0 (the Thumb state bit) to get the address of the first instruction.
    uint8_t * fnBytes = (uint8_t *)((uintptr_t)MEMCPY_ARMV6M_TEST_IMP_FUNCTIONS[i] & ~(uintptr_t)1);
    size_t size = memcpy_get_implementation_size((void *)fnBytes);
    uint32_t hash = crc32b(fnBytes, size);
    printf("%s\t%zu\t0x%08" PRIX32 "\n", MEMCPY_ARMV6M_TEST_IMP_NAMES[i], size, hash);
}
This is a sample output, showing the name, the size and hash of each implementation:
Implementation Size Hash
memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0 246 0x05D504E9
memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_1_msup_0 220 0x7AE99E15
...
memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_0 420 0x9CADEF7D
memcpy_armv6m_test_miwo_2_mssp_2_opxip_1_opsz_2_msup_1 466 0x79E9752F
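With the size and hash printed for every implementation, spotting duplicates can be automated on the host. The sketch below assumes the output was captured as tab-separated text; `find_duplicates` is a hypothetical helper, and the third data line is an invented duplicate added only to show a match.

```python
from collections import defaultdict

SAMPLE = (
    "Implementation\tSize\tHash\n"
    "memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_0_msup_0\t246\t0x05D504E9\n"
    "memcpy_armv6m_test_miwo_0_mssp_0_opxip_0_opsz_1_msup_0\t220\t0x7AE99E15\n"
    "hypothetical_duplicate_of_first_case\t246\t0x05D504E9\n"
)


def find_duplicates(report):
    """Group implementation names that share the same size and hash."""
    by_key = defaultdict(list)
    for line in report.splitlines():
        fields = line.split("\t")
        if len(fields) != 3 or fields[0] == "Implementation":
            continue  # blank line or header row
        name, size, crc = fields
        by_key[(int(size), int(crc, 16))].append(name)
    return [names for names in by_key.values() if len(names) > 1]


for names in find_duplicates(SAMPLE):
    print("Same code:", ", ".join(names))
```

Any group reported this way is a candidate for a new "=:" or "-:" rule in the template.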
The function "memcpy_get_implementation_size" determines the size by searching for the signature located at the end of the "memcpy_armv6m.S" file:
#if MEMCPY_ARMV6M_FUNCTION_END_SIGNATURE // For automated tests, this pattern is searched to get the size of this function.
.word 0xFFFFFFFF
.word 0xFFFFFFFF
.word 0x0
.word 0x0
#endif
Once the size is known, the "crc32b" function can calculate a hash over the bytes of the function.
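The same idea can be modelled on the host in Python, for example to check a binary dump. This is a sketch and not the C code from the project: it assumes little-endian words, assumes the size does not include the signature itself, and uses `zlib.crc32` (the standard CRC-32) as a stand-in for "crc32b", so the hash values may differ from the ones printed on the device.

```python
import struct
import zlib

# The four .word values from the end of memcpy_armv6m.S, assumed little-endian.
SIGNATURE = struct.pack("<4I", 0xFFFFFFFF, 0xFFFFFFFF, 0x0, 0x0)


def implementation_size(image: bytes, start: int = 0) -> int:
    """Return the number of bytes between `start` and the end signature."""
    end = image.find(SIGNATURE, start)
    if end < 0:
        raise ValueError("end signature not found")
    return end - start


def implementation_hash(image: bytes, start: int = 0) -> int:
    """Hash only the function body, excluding the signature."""
    size = implementation_size(image, start)
    return zlib.crc32(image[start:start + size])


# Hypothetical stand-in for the machine code of one implementation.
fake_code = bytes(range(1, 11))
image = fake_code + SIGNATURE
print(implementation_size(image))  # 10
print(f"0x{implementation_hash(image):08X}")
```

A byte pattern this distinctive is unlikely to appear inside real code, which is what makes a plain search reliable enough for a test harness.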