MMX corner

(assembly language knowledge required)

MMX, introduced with the Pentium MMX processors, offers high-speed SIMD (Single Instruction, Multiple Data) processing of large amounts of data. MMX instructions differ from standard instructions in that a single MMX instruction works on multiple data items (2 doublewords, 4 words or 8 bytes). MMX registers, unlike the standard 32-bit registers, are 64 bits wide (holding 2 doublewords, 4 words, 8 bytes or a single 64-bit value). When you write ADD EAX,EBX you are adding two 32-bit values. When you write (MMX) PADDD MM0,MM1 you are adding two pairs of 32-bit values, four values in all. That is the main power of MMX. However, if you plan to use MMX, think twice about whether your algorithm is suitable for the conversion. OK, what is it all about?
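To make the "one instruction, several values" idea concrete, here is a minimal sketch in plain C that models what a packed word add (PADDW) does to a 64-bit register. The names lane16 and paddw_model are made up for this illustration; real code would simply execute the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit lane i (0 = lowest) from a 64-bit packed value. */
static uint16_t lane16(uint64_t v, unsigned i)
{
    assert(i < 4);
    return (uint16_t)(v >> (16 * i));
}

/* Model of PADDW: four independent 16-bit additions. Each lane wraps
   around on overflow and no carry travels into the neighbouring lane. */
static uint64_t paddw_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint16_t sum = (uint16_t)(lane16(a, i) + lane16(b, i));
        r |= (uint64_t)sum << (16 * i);
    }
    return r;
}
```

The loop is what the hardware does in a single step; note that an overflow in one lane never disturbs the lane next to it, which is exactly what an ordinary 64-bit ADD would get wrong.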

Advantages:

Disadvantages:

There are eight 64-bit MMX registers (MM0 to MM7) and 57 MMX instructions. Basically, you can add, subtract, multiply, shift and compare (plus logical and, or, xor and and-not) packed bytes, words and doublewords, although not every operation exists for every element size (multiplication, for example, works on words only). A packed add on bytes means you add 8 bytes to another 8 bytes in a single instruction. There are also special instructions for packing and unpacking bytes, words and doublewords within MMX registers. For a more detailed description of MMX please refer to the Intel web site at developer.intel.com.
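The pack/unpack instructions are the least intuitive part, so here is a rough C model of the two used most often for pixel work: PUNPCKLBW against a zero register (bytes widened to words) and the low half of PACKUSWB (words narrowed back to bytes with unsigned saturation). The function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Byte i (0 = lowest) of a 64-bit packed value. */
static uint64_t byte_of(uint64_t v, unsigned i)
{
    assert(i < 8);
    return (v >> (8 * i)) & 0xFF;
}

/* Model of PUNPCKLBW mmX,zero: each of the low four bytes of v
   becomes a 16-bit word; the high four bytes are ignored. */
static uint64_t unpack_low_bytes(uint64_t v)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 4; i++)
        r |= byte_of(v, i) << (16 * i);
    return r;
}

/* Model of the low half of PACKUSWB: four signed 16-bit words are
   narrowed to bytes, clamping negatives to 0 and values above 255 to 255. */
static uint32_t pack_words_to_bytes(uint64_t v)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t w = (uint32_t)(v >> (16 * i)) & 0xFFFF;
        uint32_t b;
        if (w & 0x8000)   b = 0;    /* negative when read as a signed word */
        else if (w > 255) b = 255;  /* too large for a byte */
        else              b = w;
        r |= b << (8 * i);
    }
    return r;
}
```

Unpacking a pixel gives each colour channel 16 bits of headroom for arithmetic; packing afterwards puts the results back into one 32-bit pixel.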

The following example shows how to do 32-bit alpha blending using MMX. A similar routine was used in our demo WHY. A short description of the algorithm itself: we have two true-color pixels A and B, both in the 32-bit format 00000000 rrrrrrrr gggggggg bbbbbbbb.

We also have alpha (0-255), the weight of pixel A in the equation below: 0 means A 0%, B 100%; 128 means A 50%, B 50%; and so on. We need to mix these 2 pixels (with the given alpha value) into a single pixel C.

The equation is basically C = (A * alpha + B * (256 - alpha)) / 256, or if you want:
Cr = (Ar * alpha + Br * (256 - alpha)) / 256
Cg = (Ag * alpha + Bg * (256 - alpha)) / 256
Cb = (Ab * alpha + Bb * (256 - alpha)) / 256
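Written out in plain C, the three equations look roughly like this. It is only a reference sketch for checking results (the names blend_channel and blend_pixel are invented here), with the division by 256 done as a right shift by 8.

```c
#include <assert.h>
#include <stdint.h>

/* One colour channel: (a * alpha + b * (256 - alpha)) / 256. */
static uint32_t blend_channel(uint32_t a, uint32_t b, uint32_t alpha)
{
    return (a * alpha + b * (256 - alpha)) >> 8;
}

/* Blend two 00RRGGBB pixels channel by channel; alpha must be 0-255. */
static uint32_t blend_pixel(uint32_t a, uint32_t b, uint32_t alpha)
{
    assert(alpha <= 255);
    uint32_t r  = blend_channel((a >> 16) & 0xFF, (b >> 16) & 0xFF, alpha);
    uint32_t g  = blend_channel((a >> 8)  & 0xFF, (b >> 8)  & 0xFF, alpha);
    uint32_t bl = blend_channel(a & 0xFF, b & 0xFF, alpha);
    return (r << 16) | (g << 8) | bl;
}
```

For example, blending A = 0x00FF8000 with B = 0x00004080 at alpha = 128 gives 0x007F6040: red is (255 * 128) / 256 = 127, green is (128 * 128 + 64 * 128) / 256 = 96, blue is (128 * 128) / 256 = 64.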

This algorithm requires (count with me) 6 multiplications, 3 additions and 3 shifts per pixel, plus computing (256 - alpha) once. I used MMX to speed this up by packing the RGB values into MMX registers, then doing all the arithmetic on R, G and B simultaneously (as in the 'packed' equation):

Register layout:

MM0 and MM1 (pixel A and B)
16-bit            16-bit            16-bit            16-bit
0                 RED               GREEN             BLUE
0000000000000000  00000000rrrrrrrr  00000000gggggggg  00000000bbbbbbbb

MM5
16-bit            16-bit            16-bit            16-bit
0                 0                 0                 0

MM6
16-bit            16-bit            16-bit            16-bit
alpha             alpha             alpha             alpha

MM7
16-bit            16-bit            16-bit            16-bit
256 - alpha       256 - alpha       256 - alpha       256 - alpha

static unsigned short alphaMMXmul_const1[4] = {256,256,256,256};
static unsigned alphaMMXmul_0[2] = {1,1};

void MixAlphaMMX32(void *dest,const void *src,unsigned len,unsigned opacity) {

__asm {
    mov         edi,dest
    mov         ebx,src
    mov         ecx,len
    mov         edx,opacity

    test        ecx,ecx             ; len = 0: nothing to do (the loop
    jz          MixAlphaMMX32_Done  ; below would otherwise wrap around)

    movzx       eax,dl              ; eax = alpha (low 8 bits of opacity)
    movq        mm7,[alphaMMXmul_const1]

    mov         edx,eax
    shl         eax,16
    or          eax,edx             ; eax = alpha in both 16-bit halves
    mov         [alphaMMXmul_0],eax
    mov         [alphaMMXmul_0 + 4],eax
    movq        mm6,[alphaMMXmul_0] ; mm6(X) = alpha (4 words)
    pxor        mm5,mm5             ; mm5 = 0, used for unpacking
    psubusw     mm7,mm6             ; mm7(Y) = 256 - alpha (4 words)

ALIGN 16
MixAlphaMMX32_MainLoop:

    movd        mm0,[edi]           ; mm0(A) = 0 0 0 0 | 0 Ra Ga Ba
    add         edi,4
    movd        mm1,[ebx]           ; mm1(B) = 0 0 0 0 | 0 Rb Gb Bb
    add         ebx,4
    punpcklbw   mm0,mm5             ; mm0 = 0 0 0 Ra | 0 Ga 0 Ba
    punpcklbw   mm1,mm5             ; mm1 = 0 0 0 Rb | 0 Gb 0 Bb
    pmullw      mm0,mm6             ; mm0 = 0 Ra*X | Ga*X Ba*X
    pmullw      mm1,mm7             ; mm1 = 0 Rb*Y | Gb*Y Bb*Y
    paddusw     mm0,mm1             ; mm0 = 0 Ra*X+Rb*Y | Ga*X+Gb*Y Ba*X+Bb*Y
    psrlw       mm0,8               ; mm0 = 0 0 0 Rc | 0 Gc 0 Bc
    packuswb    mm0,mm0             ; mm0 = 0 0 0 0 | 0 Rc Gc Bc
    movd        [edi-4],mm0         ; store C over pixel A in dest
    dec         ecx
    jnz         MixAlphaMMX32_MainLoop

MixAlphaMMX32_Done:
    emms                            ; leave MMX state so the FPU is usable again
}
}
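
If you want to check the routine's output without running the assembly, the same steps can be modelled in portable C. This is only a sketch under my own naming (blend_pixel_packed and mix_alpha32_model are not part of the routine above): each loop pass stands for one 16-bit lane, and the comments name the instruction being imitated.

```c
#include <assert.h>
#include <stdint.h>

/* One pixel, following the MMX steps lane by lane; alpha must be 0-255. */
static uint32_t blend_pixel_packed(uint32_t a, uint32_t b, uint32_t alpha)
{
    assert(alpha <= 255);
    uint16_t x = (uint16_t)alpha;          /* one word of mm6 */
    uint16_t y = (uint16_t)(256 - alpha);  /* one word of mm7 */
    uint32_t c = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint16_t wa = (a >> (8 * i)) & 0xFF;   /* punpcklbw mm0,mm5 */
        uint16_t wb = (b >> (8 * i)) & 0xFF;   /* punpcklbw mm1,mm5 */
        uint16_t pa = (uint16_t)(wa * x);      /* pmullw: low 16 bits */
        uint16_t pb = (uint16_t)(wb * y);
        uint32_t sum = (uint32_t)pa + pb;      /* paddusw: saturating add */
        if (sum > 0xFFFF) sum = 0xFFFF;
        uint32_t w = sum >> 8;                 /* psrlw mm0,8 */
        uint32_t byte = w > 255 ? 255 : w;     /* packuswb */
        c |= byte << (8 * i);
    }
    return c;
}

/* Same argument order as MixAlphaMMX32: dest is pixel A and is overwritten. */
void mix_alpha32_model(uint32_t *dest, const uint32_t *src,
                       unsigned len, unsigned alpha)
{
    for (unsigned n = 0; n < len; n++)
        dest[n] = blend_pixel_packed(dest[n], src[n], alpha);
}
```

Since the largest possible sum is 255 * 256 = 65280, the saturation in paddusw and packuswb should never actually trigger for alpha in 0-255; the model keeps both clamps only to stay faithful to the instructions.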

This routine can be sped up further by processing 2 pixels per loop iteration and by reordering the fetches and arithmetic for best pipeline usage ... an exercise for the reader ;-)

Sayza