Fast ARMv6M memcpy - Part 1 - ASM code
This is the 1st part of "Fast ARMv6M memcpy":
- Part 1 - ASM code
- Part 2 - Options & Results
- Part 3 - Replace SDK memcpy
- Part 4 - Automated case generation
- Part 5 - Test function
- Part 6 - Benchmark function
Full code is available on GitHub (test project, or just the fast ARMv6M memcpy).
Introduction
While working on a video application with a Raspberry Pi Pico (RP2040), I observed that memcpy performance dropped seemingly at random.
It turned out that the slowdown happened when the data was misaligned (not aligned to 32-bit boundaries).
Digging into the ROM code, I located the misaligned copy routine, "__memcpy_slow_lp", in bootrom_misc.S:
__memcpy_slow_lp:
sub r2, #1
ldrb r3, [r1, r2]
strb r3, [r0, r2]
bne __memcpy_slow_lp
It takes 7 cycles per byte, which is very slow compared to the copy loop used for aligned memory in the same ROM code:
// 16 byte loop
1:
ldmia r1!, {r3, r4, r5, r6}
stmia r0!, {r3, r4, r5, r6}
subs r2, #16
bcs 1b
That takes only 0.81 cycles per byte (13 cycles / 16 bytes).
It is so efficient thanks to "ldmia"/"stmia", which load and store 4 words in one go, taking only 1 cycle per word plus 1 cycle for the instruction itself.
In contrast, the misaligned loop uses "ldrb"/"strb", which load and store 1 byte at a time and take 2 cycles each.
All this means that misaligned copies were about 8.6 times slower than aligned copies.
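As a quick sanity check of these numbers (assuming the usual Cortex-M0+ timings: 1 cycle for ALU operations, 2 cycles for single loads/stores and for taken branches, and 1+N cycles for an ldmia/stmia of N registers): the byte loop costs 1 + 2 + 2 + 2 = 7 cycles per byte, while the word loop costs 5 + 5 + 1 + 2 = 13 cycles per 16 bytes, i.e. about 0.81 cycles/byte, hence the ~8.6x ratio.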
For programs handling graphics this can be a real problem: they move big chunks of data that can easily be misaligned (for example, when moving graphics in 1-pixel steps).
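As a minimal illustration (a hypothetical 8-bit, double-buffered framebuffer; the names and sizes are mine, not from the project), a 1-pixel horizontal scroll makes every row copy start one byte past a word boundary, so memcpy hits the misaligned path on every call:

#include <stdint.h>
#include <string.h>

#define FB_WIDTH  320
#define FB_HEIGHT 240

// Present a frame scrolled 1 pixel to the left.
// Even if both buffers are word aligned, the source of every row copy
// (&src[y][1]) is misaligned by 1 byte, so the whole frame goes through
// the slow misaligned memcpy path.
static void present_scrolled_1px(uint8_t dst[FB_HEIGHT][FB_WIDTH],
                                 uint8_t src[FB_HEIGHT][FB_WIDTH])
{
    for (int y = 0; y < FB_HEIGHT; y++) {
        memcpy(&dst[y][0], &src[y][1], FB_WIDTH - 1);
    }
}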
Is this important for any ARMv6-M?
Devices built around ARMv6-M cores (M0, M0+ & M1) normally have very small amounts of RAM:
92% of them have <=32KB of RAM.
A few exceptions in the top range:
| Brand | Device | RAM | FLASH | PRICE (Eur.) |
|---|---|---|---|---|
| ST | STM32G0B1xB | 144KB | 512KB | 2.5 |
| Renesas | R01DS0363EJ0110 | 144KB | 512KB | 16.5 |

With 0 wait states in flash @32MHz.
So, applications are not expected to rely on big amounts of RAM either for data or program execution (FLASH is usually fast enough for most use cases).
But RP2040 is a little bit different:
- Dual ARM Cortex-M0+.
- 264KB RAM (no on-chip Flash).
- SPI flash (max. 16MB), very slow compared to the on-chip Flash present in other microcontrollers.
- 16KB of cache memory to accelerate program execution from the SPI flash.
- All this for just 0.7 Eur. (plus about 0.5 Eur. for the external SPI flash).
So, for most devices, ARMv6-M performance with unaligned data was not a big deal, because of the small RAM sizes and target applications.
But with the RP2040, a 320x240 8-bit display can be double buffered in approximately 154KB (2 x 320 x 240 = 153,600 bytes), still leaving 110KB for code and application data.
That, coupled with the PIO and a powerful DMA, convinced me that it is well suited for some video/graphics applications.
(More on the video subject in another article.)
So I thought an improved memcpy version could be useful.
Starting point
Important
Be ready to see some insane levels of optimization. Optimizing can be a fun puzzle, but some may say that I've gone too far!
I invite you to follow me down the rabbit hole!
I thought that this must have been solved before for other similar ARM cores, and indeed it had been: I found code for ARMv7-M in the newlib "memcpy-armv7m.S" file, available at memcpy-armv7m.S @ eblot, memcpy-armv7m.S @ bminor and memcpy-armv7m.S @ sourceware.org.
The useful code was after the ".Lmisaligned_copy" label, but I had to replace all the ARMv7-M instructions not available in ARMv6-M and optimize the resulting code.
And after changing:
- RSB to NEG.
- ITT blocks to branches.
- Loads and stores with post-indexed addressing to load/store multiple with post-increment.
- Some register usage.
- And more things I can't remember.
I had working code much better than the RP2040 ROM version, even using fewer cycles than the ARMv7-M code (I saved 1 instruction in the copy loop).
The code basically does the copy using word loads and stores ("ldr"/"ldmia" - "str"/"stmia"), fixing the misalignment with shifts between load and store.
This is the original loop from "memcpy-armv7m.S":
/* Pre-load on word */
ldr r4, [r1], #4
...
1:
lsrs r4, r4, \shift
ldr r3, [r1], #4
lsls r5, r3, 32 - \shift
orr r4, r4, r5
str r4, [r0], #4
mov r4, r3
subs r2, #4
bhs 1b
And this is the new code for what I call the "miwo1" loop:
/* Pre-load first word */
ldmia r1!, {r4}
...
1:
lsrs r3, r4, \shift // Prepare data to be combined with next word.
ldmia r1!, {r4} // Read new word.
lsls r5, r4, 32 - \shift // Shift new data.
orrs r3, r3, r5 // Combine it with previous remaining data.
stmia r0!, {r3} // Store word.
subs r2, #4 // 4 bytes each loop.
bhs 1b
The optimization I did was removing the "wasteful" "mov" inside the loop by using the first "lsrs" to do its job, getting the shift and move ("r4" to "r3") with one instruction.
To accomplish that, the roles of "r3" and "r4" have been swapped after the "lsrs", so at the end of the loop the value from the last "ldmia" is ready in "r4" to be shifted in the next iteration.
This code takes 10 cycles per 4 bytes = 2.5 cycles/byte.
That is almost 3x faster than the 7 cycles/byte of the ROM "__memcpy_slow_lp" (and +10% over "memcpy-armv7m.S").
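For reference, this is roughly what the "miwo1" loop does, written as a C sketch (little-endian assumed; the names are mine, and the pointer/size bookkeeping the assembly does before and after the loop is omitted). On entry the destination is word aligned, the source pointer has been backed down to a word boundary, "shift" is 8 times the source misalignment (8, 16 or 24), and the first source word is pre-loaded:

#include <stdint.h>
#include <stddef.h>

// Rough C model of the "miwo1" loop body.
static void miwo1_model(uint32_t *dst, const uint32_t *src_w,
                        unsigned shift, size_t words)
{
    uint32_t prev = *src_w++;                  // Pre-load first word.
    while (words--) {
        uint32_t out  = prev >> shift;         // lsrs r3, r4, shift
        uint32_t next = *src_w++;              // ldmia r1!, {r4}
        out |= next << (32u - shift);          // lsls r5 + orrs r3
        *dst++ = out;                          // stmia r0!, {r3}
        prev = next;                           // The removed "mov": the value
                                               // simply stays in r4 for the
                                               // next iteration.
    }
}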
Doing 8 bytes per loop iteration
But I thought: can we go further? What if I unroll the loop a little bit to try to double the speed? After some trial and error to reduce register usage, I got the following code (using only 1 extra register).
Code for "miwo2" loop:
/* Pre-load first word */
ldmia r1!, {r4}
...
1:
lsrs r5, r4, \shift // Prepare data to be combined with next word.
ldmia r1!, {r3, r4} // Get 2 new words.
lsls r6, r3, 32 - \shift // Shift new data in 1st word.
orrs r5, r5, r6 // Combine with previous remaining data.
lsrs r3, \shift // Prepare data to be combined with the 2nd word.
lsls r6, r4, 32 - \shift // Shift new data in 2nd word.
orrs r6, r3, r6 // Combine with remaining data from 1st word.
stmia r0!, {r5, r6} // Store the 2 words.
subs r2, #8 // 8 bytes each loop.
bhs 1b
This results in 15 cycles per 8 bytes = 1.88 cycles/byte.
Not doubled, of course, but a good +33% increase in speed (and 3.7x in total vs "__memcpy_slow_lp").
Note
In both cases ("miwo1" and "miwo2") there is extra code before and after the fast copy loop to handle the remaining bytes as efficiently as possible.
Filling the gaps
The original ROM code uses a very fast path for sizes < 8:
__memcpy:
mov ip, r0
cmp r2, #8
blo _memcpy_short
...
_memcpy_short:
adr r3, _memcpy_short_end
lsl r2, #2
sub r3, r2
add r3, #1
bx r3
.align 2
ldrb r3, [r1, #6]
strb r3, [r0, #6]
ldrb r3, [r1, #5]
strb r3, [r0, #5]
ldrb r3, [r1, #4]
strb r3, [r0, #4]
ldrb r3, [r1, #3]
strb r3, [r0, #3]
ldrb r3, [r1, #2]
strb r3, [r0, #2]
ldrb r3, [r1, #1]
strb r3, [r0, #1]
ldrb r3, [r1, #0]
strb r3, [r0, #0]
_memcpy_short_end:
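The computed jump is essentially a hand-made jump table: the "bx" lands on the ldrb/strb pair for the highest offset to copy, and execution falls through the remaining pairs down to offset 0. In C it is roughly equivalent to a switch with deliberate fall-through (just a sketch to show the idea, not code from the SDK):

#include <stdint.h>
#include <stddef.h>

// C analog of _memcpy_short for n < 8: jump to the entry for n bytes
// and fall through, copying from the highest offset down to 0.
static void memcpy_short_model(uint8_t *dst, const uint8_t *src, size_t n)
{
    switch (n) {               // "bx r3" lands on one of these cases.
    case 7: dst[6] = src[6];   // fall through
    case 6: dst[5] = src[5];   // fall through
    case 5: dst[4] = src[4];   // fall through
    case 4: dst[3] = src[3];   // fall through
    case 3: dst[2] = src[2];   // fall through
    case 2: dst[1] = src[1];   // fall through
    case 1: dst[0] = src[0];   // fall through
    case 0: break;
    }
}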
For the rest it calls the original aligned code or the misaligned code.
But my new misaligned code only works for sizes >= 8 ("miwo1") or >= 12 ("miwo2"), so an intermediate version was needed to handle the misaligned cases in the size range [8..11].
In "memcpy-armv7m.S" this was handled by the following code:
.Lbyte_copy:
subs r2, #4
blo .Lcopy_less_than_4
.Lbyte_copy_loop:
subs r2, #1
ldrb r3, [r1], #1
strb r3, [r0], #1
bhs .Lbyte_copy_loop
ldrb r3, [r1]
strb r3, [r0]
ldrb r3, [r1, #1]
strb r3, [r0, #1]
ldrb r3, [r1, #2]
strb r3, [r0, #2]
.Lcopy_less_than_4:
adds r2, #4
beq .Ldone
lsls r2, r2, #31
itt ne
ldrbne r3, [r1], #1
strbne r3, [r0], #1
bcc .Ldone
ldrb r3, [r1]
strb r3, [r0]
ldrb r3, [r1, #1]
strb r3, [r0, #1]
.Ldone:
That would be in between "__memcpy_slow_lp" and "_memcpy_short" in terms of speed.
An obvious way to handle these medium sizes was to extend the existing "_memcpy_short" to more than 7 bytes:
_memcpy_short:
adr r3, _memcpy_short_end
lsl r2, #2
sub r3, r2
add r3, #1
bx r3
.align 2
.irp offset,15,14,13,12,11,10,9,8,7
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
.irp offset,6,5,4,3,2,1,0
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
_memcpy_short_end:
This longer version can handle up to 16 bytes and adds "just" 18 instructions, fewer than the code in "memcpy-armv7m.S".
(The listing looks very short, but it uses ".irp" to let the assembler generate the repeated instructions.)
I put the extra instructions in a separate ".irp" so that conditional compilation can be added to that block later.
Compared to the code in "memcpy-armv7m.S", these are the cycles taken by the copy for each size after the label indicated:
Copy size | "memcpy-armv7m.S", label: ".Lbyte_copy" | New code, label: "_memcpy_short" |
---|---|---|
8 | 2 + 5 x 7 - 1 + 3 x 4 = 48 | 6 + 8 * 4 = 38 |
9 | 2 + 6 x 7 - 1 + 3 x 4 = 55 | 6 + 9 * 4 = 42 |
10 | 2 + 7 x 7 - 1 + 3 x 4 = 62 | 6 + 10 * 4 = 46 |
16 | 2 + 13 x 7 - 1 + 3 x 4 = 104 | 6 + 16 * 4 = 70 |
The original idea was to use it only for the [8..11] range of misaligned data, but I knew that the extra code required before and after the "miwo" loops makes them less efficient for these medium sizes.
After some real tests I observed that the "miwo" and "_memcpy_short" code paths had comparable execution speed at 16-18 bytes.
I decided that 16 was a good threshold for a smooth transition:
.Lmemcpy_not_aligned:
cmp r2, #16 //14-16 - Threshold for using misaligned copy, see Lmisaligned_copy comments
bhi .Lmisaligned_copy
// Fall through to _memcpy_short
Another alternative for medium sizes
But what about an 8-byte unrolled loop version? It could be shorter than the new "_memcpy_short" while having a similar speed.
Code for "mssp_0":
/* ---------------------------------------------------------------
Low speed, but quite compact version - upwards copy.
Any size can be copied using an 8 byte unrolled loop.
---------------------------------------------------------------*/
.Lmemcpy_med_size:
// Get the offset up to the 8 byte boundary
movs r3, #7
ands r3, r2
beq .Lmemcpy_med_size_loop_start
// Calculate offset for loop entry point
negs r3, r3
adds r3, #8 // r3 = 8 - unaligned bytes
// Offset pointers by the same amount negated, to start at the right address
subs r0, r3
subs r1, r3
// Compute a jump setting PC = PC + 4 * entry point
lsls r3, #2
add pc, r3 // ADD PC and MOV PC branch within Thumb state without interworking
// This part is skipped in first iteration, because entry point is always after these instructions
.Lmemcpy_med_size_loop:
// Move src pointer to next 8 byte block
adds r1, #8 // Using an instruction here we avoid the need to add a fixed negative offset before the "add pc"
// Move the data
.Lmemcpy_med_size_loop_start:
.irp offset,0,1,2,3,4,5,6,7
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
adds r0, #8
subs r2, #8
bhi .Lmemcpy_med_size_loop
This is 8 instructions shorter (29 vs 37) and can handle any copy size.
Of course, some speed is lost: 37/8 = 4.63 cycles/byte vs 4 cycles/byte for "_memcpy_short" (not counting the instructions before the loop, of which this new version has 4 more).
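As a worked example of the entry-point math (for an arbitrary 13-byte copy): r3 = 13 & 7 = 5, so the entry offset is 8 - 5 = 3; both pointers are moved back by 3 bytes and the "add pc" skips the first 3 ldrb/strb pairs, so the first pass copies bytes 0-4. The loop then advances the pointers by 8 and copies bytes 5-12, and "subs r2, #8" going below zero ends the loop.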
Micro optimization for "_memcpy_short"
Having used "add pc, reg" instruction to do the computed jump in the code above, I realized that a similar strategy could be used to optimize one instruction in the original "_memcpy_short":
_memcpy_short:
adr r3, _memcpy_short_end
lsl r2, #2
sub r3, r2
add r3, #1
bx r3
.align 2
.irp offset,6,5,4,3,2,1,0
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
_memcpy_short_end:
New "_memcpy_short":
.Lmemcpy_short:
// Compute a jump setting PC = PC + FIXED_OFFSET - 4 * entry point
// Use "add pc", instead of the classic "adr" followed by "bx rx" or "mov pc":
// This is shorter than the "bx rx" version and avoids the align requirement.
// As a bonus, all the math can be done with only 1 register.
lsls r2, #2
subs r2, #1 + (.Lmemcpy_short_end - .Lmemcpy_short_jump_start - 4) // +1 should not be needed, but it doesn't hurt to use it
negs r2, r2
// From ARMv6-M reference manual:
// - ADD PC and MOV PC branch within Thumb state without interworking.
.Lmemcpy_short_jump_start:
add pc, r2
.irp offset,6,5,4,3,2,1,0
ldrb r3, [r1, #\offset]
strb r3, [r0, #\offset]
.endr
.Lmemcpy_short_end:
Testing with the hardware confirmed the saved cycle.
Optimizing misaligned copy when reading from XIP FLASH of RP2040
After more testing, I realized how slow reads from XIP FLASH can be, especially if the copy is done without cache and one byte at a time.
Disabling the cache for data read from flash can sometimes be a good decision: cache memory is a scarce resource, and overall performance can improve if the application handles data caching itself, leaving more of the cache available for instructions.
With the cache disabled, the RP2040 seems to perform one SPI read operation for each load instruction, no matter if it is a word, halfword or byte load.
Each of these reads takes approximately 50 cycles, compared to only 2 cycles for a word read from RAM.
It is clear that read operations must be minimized, so aligning the reads at the source and using word loads plus shifts seems the best option.
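To put rough numbers on that (using the ~50-cycle figure above): a 64-byte misaligned copy done with byte loads needs 64 flash reads, roughly 64 x 50 = 3200 cycles spent just on the SPI accesses, while the same copy done with aligned word loads needs at most 17 reads, roughly 850 cycles, before even counting the instructions themselves.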
This strategy is implemented in the "memcpy_copy_src_aligned" macro ("src_aligned" used from now on for brevity):
// Source is aligned, read using ldmia, store with strb.
src_align_done:
subs r2, #4
copy_word_loop:
ldmia r1!, {r3}
strb r3, [r0]
.irp offset,1,2,3
lsrs r3, #8
strb r3, [r0, #\offset]
.endr
adds r0, #4 // Next word address, r1 has already been incremented by ldmia.
subs r2, #4 // Sub size.
bhs copy_word_loop
Cycles per byte comparison:
| Code | XIP FLASH uncached | RAM |
|---|---|---|
| __memcpy_slow_lp | 57 | 7 |
| _memcpy_short | 54 | 4 |
| mssp_0 | 54.63 | 4.63 |
| src_aligned | 16.75 | 4.25 |
It achieves a decent speed for XIP FLASH, and its performance is between "_memcpy_short" and "mssp_0" for RAM.
So, it's a good all-rounder for aligned or unaligned data.
Before the loop there is code to copy 1-3 bytes before reaching alignment in the most efficient way, minimizing the number of reads.
Same thing after the loop, handling the 1-3 bytes that may be left.
All in all, the required pre-post code is much faster than that of the "miwo" loops, so it is better suited for medium sizes because of the reduced overhead.
Code path selection for RAM - XIP FLASH
To use the new "src_aligned" code for XIP FLASH, the source address must be checked so that the best code path can be selected.
Additionally, 2 different thresholds must be used for switching between the medium size copy code and the "miwo" code.
The optimal threshold for RAM is around 16; for XIP FLASH it is around 60 in most cases, and it is calculated at compile time depending on the options used (the equations are based on experimental tests):
#if MEMCPY_ARMV6M_MED_SIZE_SPEED <= 1 && MEMCPY_ARMV6M_OPTIMIZE_SIZE > 0
#define MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD (64 - (MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS - 1) * 9 - (MEMCPY_ARMV6M_OPTIMIZE_SIZE) * 6)
#else
#if MEMCPY_ARMV6M_OPTIMIZE_SIZE >= 2
#define MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD (20 - (MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS - 1))
#else
#define MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD (73 - (MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS - 1) * 13 - (MEMCPY_ARMV6M_OPTIMIZE_SIZE) * 3 - (MEMCPY_ARMV6M_MED_SIZE_UPWARDS) * 15)
#endif
#endif
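To make the formulas concrete (with illustrative option values, not necessarily the defaults, which are a Part 2 topic): with MEMCPY_ARMV6M_MISALIGNED_COPY_LOOP_WORDS = 2 ("miwo2"), MEMCPY_ARMV6M_OPTIMIZE_SIZE = 0 and MEMCPY_ARMV6M_MED_SIZE_UPWARDS = 0, the last formula gives 73 - 13 = 60, which is where the "around 60" figure comes from; with MEMCPY_ARMV6M_MED_SIZE_SPEED <= 1 and MEMCPY_ARMV6M_OPTIMIZE_SIZE = 1, the first formula gives 64 - 9 - 6 = 49 instead.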
The following code is added when needed to use the "src_aligned" copy loop when the source is uncached XIP FLASH.
It also uses a different size threshold to switch between the medium size code and the big size loops ("miwo", at the ".Lmisaligned_copy" label here):
.Lmemcpy_not_aligned:
#if MEMCPY_ARMV6M_OPTIMIZE_XIP_MEMORY_READ && (MEMCPY_ARMV6M_MED_SIZE_SPEED >= 1) \
&& \
(MEMCPY_ARMV6M_MED_SIZE_ONLY_FOR_XIP_MEMORY || (MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD > 16))
// Extra checks for XIP memory
lsrs r3, r1, #29
bne 1f // Address >= 0x20000000
bcc 1f // Address < 0x10000000
// Address is in XIP memory range: 0x1xxxxxxx
// Check if it is cached, if it is, it is better to use the code path used for RAM-RAM copy
lsls r3, r1, #6
bcs 1f // Check bit 26: XIP_SRAM 0x15XXXXXX, XIP_CTRL_BASE 0x14XXXXXX
lsrs r3, r3, #30
beq 1f // Check bits 25-24: 0x11XXXXXX-0x13XXXXXX -> no cache, 0x10XXXXXX -> cache
// A different threshold is used for XIP memory,
cmp r2, MEMCPY_ARMV6M_XIP_SIZE_THRESHOLD
bls .Lmemcpy_med_size
b .Lmisaligned_copy
1:
#endif
cmp r2, #16 //14-16 - Threshold for using misaligned copy, see Lmisaligned_copy comments
bhi .Lmisaligned_copy
// Fallthrough to "Lmemcpy_med_size" or "Lmemcpy_short"
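Tracing these checks with an example source address, say r1 = 0x13001234 (one of the uncached XIP aliases): "lsrs r3, r1, #29" leaves r3 = 0 with the carry holding bit 28 = 1, so neither "bne" nor "bcc" is taken and the address is in the 0x1xxxxxxx range; "lsls r3, r1, #6" puts bit 26 (= 0) into the carry, so it is not XIP_SRAM / XIP_CTRL; and "lsrs r3, r3, #30" leaves bits 25:24 = 0b11, which is not zero, so the address is an uncached alias and the XIP threshold and code path are used.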
All these checks (when enabled) add a small penalty in cycles for the misaligned copy case:
| | RAM | ROM | FLASH - WITH CACHE | FLASH - NO CACHE |
|---|---|---|---|---|
| Extra cycles | 3 | 4 | 8 | 8 - 9 |
The highest count is for FLASH, but XIP FLASH is so slow that these amounts are negligible.
Far more important is the fact that, with these checks enabled, the code used for uncached XIP FLASH can be up to 2.5 times faster, saving lots of cycles:
| Size | [8..16] | [17..60] |
|---|---|---|
| Cycles saved | [225..560] | [200..0] |
Copy loops comparison
The following graph shows the maximum theoretical loop throughput (@125MHz) of all the copy loops analyzed so far.
- RAM: Green axis and bars.
- XIP FLASH: Red axis and red/yellow bars.
Note
The throughput shown in this graph does not take into account any of the required pre-post code, only the loop cycles.
"memcpy_short" has no loop, so its per-byte cycle count is used instead.
Optimization of aligned code
There was very little room for improvement in the original code for aligned copy.
But I found one way of saving 8 cycles when memory was word aligned and 1 or 2 cycles otherwise.
This is the original code that copies the non-word-aligned bytes before the fast word-aligned copy loop:
sub r3, r0, r1
lsl r3, #30
bne __memcpy_slow_lp
// r0 and r1 are co-aligned
push {r4-r6}
sub r1, r0
mov r5, r0
lsr r3, r0, #1
bcc 1f
// byte at odd address
ldrb r4, [r0, r1]
strb r4, [r0]
add r0, #1
1:
lsr r3, r0, #2
bcc 1f
// halfword on non word boundary
ldrh r4, [r0, r1]
strh r4, [r0]
add r0, #2
1:
// adjust length
add r1, r0
sub r5, r0
add r2, r5
.Lmemcpy_word_aligned:
This code uses several checks and jumps to copy the required bytes or halfword before starting the word-aligned copy loop.
Following the code for the word-aligned case, it can be seen that after the push instruction, 11 cycles are required to reach the ".Lmemcpy_word_aligned" label, including 2 jumps and several unneeded operations.
Taking advantage of the fact that "r3" is already 0 after "bne __memcpy_slow_lp", it can be used as the offset for the "ldrh" / "strh" instructions, avoiding the math with "r1".
All the math required for the checks can be done with "r4" with a couple of shifts and branches.
With these changes, at most 1 jump is taken in any case (after the push instruction), and the math for incrementing the pointers and counter is skipped when not needed.
This is the improved code:
subs r3, r0, r1
lsls r3, #30
bne .Lmemcpy_not_aligned
// r0 and r1 are co-aligned
push {r4-r6}
// r3 must be 0 here
lsls r4, r0, #30 // Get the 2 lsbs of r0
beq .Lmemcpy_word_aligned // If all 0s, we are word aligned
lsls r4, #1 // Bit0 of r0 in r4, Bit1 goes to carry
beq 2f // Check the Bit0, eq means Bit0 = 0
// Instructions before the carry check must not affect carry
// Copy byte at odd address
ldrb r4, [r1]
strb r4, [r0]
movs r3, #1 // Add bytes done (used also as ldrh offset)
bcs .Lalign_done // Check Bit1: if it is 1 the original offset was 3, and 3 + 1 byte = word aligned, so we are done
//nop // Optional, this avoids an extra nop at next ".align 2" (16 byte copy loop),
2:
// Copy halfword on non word boundary
ldrh r4, [r1, r3]
strh r4, [r0, r3]
adds r3, #2 // Add bytes done
.Lalign_done:
// Adjust pointers and length
add r0, r3
add r1, r3
subs r2, r3
.Lmemcpy_word_aligned:
With this modification, the code path for the word-aligned case takes only 3 cycles instead of 11.
These are the cycles saved as measured in real tests:
| Alignment | word | half-word | others |
|---|---|---|---|
| Cycles saved | 8 | 2 | 1 |
But this is not the whole story: to achieve these results, instruction alignment must be tweaked, otherwise 1 cycle is lost.
This is because there is an ".align" directive before the copy loop (used to minimize instruction fetches from memory during the loop):
.Lmemcpy_word_aligned:
subs r2, #16
bcc 5f
.align 2 // Make sure the loop is word aligned, so, the 4 instructions are fetched only with 2 RAM reads (32bits each one).
..... Copy loop ...
The original code up to this point was word aligned, so the ".align 2" directive did not add any padding.
But the new code has 1 instruction less, so the assembler would add an extra "nop" there.
To avoid wasting a cycle on this padding, I added a "nop" before the function label, shifting the function entry point by 1 halfword:
.align 2
.global memcpy_armv6m
.type memcpy_armv6m, %function
.thumb_func
nop // Optional, this shifts function entry point by one halfword,
// avoiding an extra nop at next ".align 2" (16 byte copy loop)
// If we want to keep the function aligned, instead of this nop, another nop can be used, see below
memcpy_armv6m:
mov ip, r0
cmp r2, #8
Note
This wastes 2 bytes of space. If anyone knows a better way of doing this without adding the "nop", I'm all ears!
Note about endianness
Caution
The code assumes little-endianness; I don't know of any big-endian M0 or M0+.
Anyway, it should be easy to add support for it; as the ARMv7-M code shows, it's just a matter of changing the shift directions:
1:
#ifdef __ARM_BIG_ENDIAN
lsls r4, r4, \shift
#else
lsrs r4, r4, \shift
#endif
ldr r3, [r1], #4
#ifdef __ARM_BIG_ENDIAN
lsrs r5, r3, 32-\shift
#else
lsls r5, r3, 32-\shift
#endif
orr r4, r4, r5
str r4, [r0], #4
mov r4, r3
subs r2, #4
bhs 1b
At least these parts should be changed:
- "miwo" copy loops.
- "src_aligned" copy loops.