coding MMX

MMX corner

(assembly language knowledge required)

MMX, introduced with the Pentium MMX processors, offers high-speed SIMD (Single Instruction, Multiple Data) processing of large amounts of data. MMX instructions differ from standard instructions in that a single MMX instruction works on several data items at once (2 doublewords, 4 words or 8 bytes). MMX registers are 64 bits wide (unlike the standard registers, which are 32 bits wide) and hold 2 doublewords, 4 words, 8 bytes or a single 64-bit value. When you write ADD EAX,EBX you add two 32-bit values. When you write (MMX) PADDD MM0,MM1 you add four 32-bit values, two pairs at once; with PADDW or PADDB you add four pairs of words or eight pairs of bytes. That is the main power of MMX. However, if you plan to use MMX, think twice about whether your algorithm is suitable for the conversion. OK, what is it all about?
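To make the difference between ADD and a packed add concrete, here is a minimal sketch in plain C that emulates what a packed doubleword add does to a 64-bit value. The function name packed_add_dwords is made up for this illustration; real MMX code would use the PADDD instruction itself.

```c
#include <stdint.h>

/* Scalar emulation of a packed doubleword add (the PADDD behaviour):
   the two 32-bit lanes are added independently, and a carry out of
   the low lane never reaches the high lane.
   packed_add_dwords is a hypothetical name used only for this sketch. */
static uint64_t packed_add_dwords(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;                 /* lane 0, wraps mod 2^32 */
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32); /* lane 1, wraps mod 2^32 */
    return ((uint64_t)hi << 32) | lo;
}
```

An ordinary 64-bit add of the same inputs would let the low half carry into the high half; the packed version keeps the lanes separate, which is exactly what you want when each lane is an independent value.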

Advantages:

speed: well-written code can run up to 300% faster
speed: MMX fetches use 64-bit memory accesses
speed: MMX and standard integer code execute simultaneously
speed: MMX instructions are fully pipelined; most execute in 1 clock cycle (multiplies take longer to complete, but a new one can start every cycle), and more than one MMX instruction can be issued per clock cycle

Disadvantages:

MMX code and floating point code cannot execute simultaneously, since the MMX registers are aliased onto the FP registers (they are effectively the same)
MMX is poorly designed (few algorithms receive significant speed-up)
you cannot use MMX registers for addressing (mov mm0,[mm1] is illegal)
you cannot move the upper half of an MMX register directly into an integer register
not all processors support MMX
you must execute the EMMS instruction after using MMX
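Since not all processors support MMX, a program should check before calling an MMX routine. The CPU reports MMX support in bit 23 of EDX after CPUID with EAX = 1. The sketch below assumes GCC or Clang on x86 and their <cpuid.h> helper __get_cpuid; the function names edx_has_mmx and cpu_has_mmx are made up for this illustration, and on any other compiler or CPU the sketch simply reports "no MMX".

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>   /* GCC/Clang helper header (assumption: not MSVC) */
#endif

/* CPUID leaf 1 reports MMX support in bit 23 of EDX. */
static int edx_has_mmx(uint32_t edx)
{
    return (int)((edx >> 23) & 1u);
}

/* Returns 1 if the CPU reports MMX, 0 if it does not or if this
   build has no way to ask (non-x86 target or unsupported compiler). */
static int cpu_has_mmx(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return edx_has_mmx(edx);
#endif
    return 0;
}
```

A caller would typically test cpu_has_mmx() once at start-up and pick either the MMX routine or a plain integer fallback.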

There are eight 64-bit MMX registers (MM0 to MM7) and 57 MMX instructions. Basically, you can add, subtract, multiply, shift and compare packed bytes, words and doublewords, and apply the logical operations and, or, xor and and-not (PANDN, which is not the same as nand). A packed add on bytes means you add 8 bytes to another 8 bytes in a single instruction. There are some special instructions for packing/unpacking bytes/words/doublewords from/to MMX registers. For a more detailed description of MMX, please refer to the Intel web site at developer.intel.com.
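Packed adds come in wraparound and saturating forms, and the difference matters for pixel data. As a rough illustration, this plain C sketch emulates an 8-byte packed add in both modes (the behaviour of PADDB and of the unsigned saturating PADDUSB). The name packed_add_bytes and its saturate flag are made up for this example.

```c
#include <stdint.h>

/* Adds eight unsigned bytes lane by lane.
   saturate == 0: each lane wraps around mod 256, like PADDB.
   saturate != 0: each lane is clamped at 255, like PADDUSB.
   packed_add_bytes is a hypothetical name used only for this sketch. */
static uint64_t packed_add_bytes(uint64_t a, uint64_t b, int saturate)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (lane * 8)) & 0xFFu;
        unsigned y = (unsigned)(b >> (lane * 8)) & 0xFFu;
        unsigned sum = x + y;
        if (saturate && sum > 255u)
            sum = 255u;                       /* clamp instead of wrapping */
        result |= (uint64_t)(sum & 0xFFu) << (lane * 8);
    }
    return result;
}
```

With brightness values, 0xF0 + 0x20 wraps to 0x10 (a bright pixel turns dark) but saturates to 0xFF, which is usually what graphics code wants.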

The following example shows how to do 32-bit alpha blending using MMX. A similar routine was used in our demo WHY. A short description of the algorithm itself: we have two true-color pixels A and B, both in the 32-bit format 00000000 rrrrrrrr gggggggg bbbbbbbb.

We have alpha (0-255): 0 means A 0%, B 100%; 128 means A 50%, B 50%; 255 means almost 100% A, and so on. We need to mix these 2 pixels (with the given alpha value) into a single pixel C.

The equation is basically C = (A * alpha + B * (256 - alpha)) / 256, or if you want:
Cr = (Ar * alpha + Br * (256 - alpha)) / 256
Cg = (Ag * alpha + Bg * (256 - alpha)) / 256
Cb = (Ab * alpha + Bb * (256 - alpha)) / 256
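Written out in plain C, the three equations look like the sketch below. It is only a scalar reference for checking results, not the optimized routine; the function name blend_pixel is made up for this illustration, and the division by 256 is done as a right shift by 8.

```c
#include <stdint.h>

/* Blends one 0x00RRGGBB pixel per channel:
   c = (a * alpha + b * (256 - alpha)) / 256, with alpha in 0..255.
   blend_pixel is a hypothetical name used only for this sketch. */
static uint32_t blend_pixel(uint32_t a, uint32_t b, unsigned alpha)
{
    unsigned inv = 256u - alpha;   /* computed once, used by all channels */
    uint32_t r  = (((a >> 16) & 0xFFu) * alpha + ((b >> 16) & 0xFFu) * inv) >> 8;
    uint32_t g  = (((a >> 8)  & 0xFFu) * alpha + ((b >> 8)  & 0xFFu) * inv) >> 8;
    uint32_t bl = ((a & 0xFFu) * alpha + (b & 0xFFu) * inv) >> 8;
    return (r << 16) | (g << 8) | bl;
}
```

Note that with alpha = 0 the result is exactly B, while alpha = 255 gives 255/256 of A plus 1/256 of B, so a pure white A over a black B comes out as 0x00FEFEFE rather than 0x00FFFFFF.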

This algorithm requires (count with me) 6 multiplications, 3 additions, 3 shifts per pixel plus computing (256 - alpha) once. I used MMX to speed this up by packing RGB values into MMX registers, then doing all arithmetic on R,G and B simultaneously (as in the 'packed' equation):

MM0 and MM1 (pixel A and B)

16-bit	16-bit	16-bit	16-bit
0	RED	GREEN	BLUE
0000000000000000	00000000rrrrrrrr	00000000gggggggg	00000000bbbbbbbb

MM5

16-bit	16-bit	16-bit	16-bit
0	0	0	0

MM6

16-bit	16-bit	16-bit	16-bit
alpha	alpha	alpha	alpha

MM7

16-bit	16-bit	16-bit	16-bit
256 - alpha	256 - alpha	256 - alpha	256 - alpha

static unsigned short alphaMMXmul_const1[4] = {256,256,256,256};	/* 256 in each word */
static unsigned alphaMMXmul_0[2] = {1,1};				/* scratch: alpha in each word */

/* Blends len 32-bit pixels in place (32-bit MSVC inline assembly):
   dest = (dest * alpha + src * (256 - alpha)) / 256,
   where dest is pixel A, src is pixel B and alpha is the low 8 bits of opacity. */
void MixAlphaMMX32(void *dest,const void *src,unsigned len,unsigned opacity) {

__asm {
	mov		edi,dest
	mov		ebx,src
	mov		ecx,len
	mov		edx,opacity

	test		ecx,ecx		; len = 0: nothing to blend
	jz		MixAlphaMMX32_Done

	movzx		eax,dl		; eax = alpha (low 8 bits of opacity)
	movq		mm7,[alphaMMXmul_const1]

	mov		edx,eax
	shl		eax,16
	or		eax,edx		; eax = alpha in both words
	mov		[alphaMMXmul_0],eax
	mov		[alphaMMXmul_0 + 4],eax
	movq		mm6,[alphaMMXmul_0]	; mm6(X) = alpha (4 words)
	pxor		mm5,mm5		; mm5 = 0 (used for unpacking)
	psubusw		mm7,mm6		; mm7(Y) = 256 - alpha (4 words)

ALIGN 16
MixAlphaMMX32_MainLoop:

	movd		mm0,[edi]	; mm0(A) = 0 0 0 0 | 0 Ra Ga Ba
	add		edi,4
	movd		mm1,[ebx]	; mm1(B) = 0 0 0 0 | 0 Rb Gb Bb
	add		ebx,4
	punpcklbw	mm0,mm5		; mm0 = 0 0 0 Ra | 0 Ga 0 Ba
	punpcklbw	mm1,mm5		; mm1 = 0 0 0 Rb | 0 Gb 0 Bb
	pmullw		mm0,mm6		; mm0 = 0 Ra*X | Ga*X Ba*X
	pmullw		mm1,mm7		; mm1 = 0 Rb*Y | Gb*Y Bb*Y
	paddusw		mm0,mm1		; mm0 = 0 Ra*X+Rb*Y | Ga*X+Gb*Y Ba*X+Bb*Y
	psrlw		mm0,8		; mm0 = 0 0 0 Rc | 0 Gc 0 Bc
	packuswb	mm0,mm0		; mm0 = 0 0 0 0 | 0 Rc Gc Bc
	movd		[edi-4],mm0
	dec		ecx
	jnz		MixAlphaMMX32_MainLoop

MixAlphaMMX32_Done:
	emms				; leave MMX state so FP code works again
}
}
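For checking the MMX routine, or for machines without MMX, it helps to have a portable version that follows the same steps word by word. The sketch below is an emulation in plain C, not part of the original routine: the name mix_alpha_ref and the use of uint32_t pointers are assumptions made for this illustration, and the comments name the MMX instruction each step stands in for.

```c
#include <stdint.h>

/* Portable emulation of the MMX blend, one 16-bit lane at a time.
   mix_alpha_ref is a hypothetical name used only for this sketch. */
static void mix_alpha_ref(uint32_t *dest, const uint32_t *src,
                          unsigned len, unsigned opacity)
{
    uint16_t x = (uint16_t)(opacity & 0xFFu);  /* mm6: alpha in every word */
    uint16_t y = (uint16_t)(256u - x);         /* mm7: 256 - alpha         */

    for (unsigned i = 0; i < len; i++) {
        uint32_t out = 0;
        for (int lane = 0; lane < 4; lane++) {
            /* punpcklbw with zero: each byte becomes a 16-bit word */
            uint16_t a = (uint16_t)((dest[i] >> (lane * 8)) & 0xFFu);
            uint16_t b = (uint16_t)((src[i]  >> (lane * 8)) & 0xFFu);
            /* pmullw: keep the low 16 bits of each product */
            uint16_t pa = (uint16_t)((uint32_t)a * x);
            uint16_t pb = (uint16_t)((uint32_t)b * y);
            /* paddusw: unsigned add, saturating at 0xFFFF */
            uint32_t sum = (uint32_t)pa + pb;
            if (sum > 0xFFFFu)
                sum = 0xFFFFu;
            /* psrlw 8, then packuswb narrows the word back to a byte */
            out |= ((sum >> 8) & 0xFFu) << (lane * 8);
        }
        dest[i] = out;
    }
}
```

Running both versions over the same buffers and comparing the output is a quick way to catch mistakes when reordering or unrolling the assembly loop.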

This routine can be sped up further by processing 2 pixels per loop iteration and by reordering fetches and arithmetic for best pipeline usage ... exercise for the reader ;-)

Sayza