MMX, introduced with the Pentium MMX processors, offers high-speed SIMD (Single Instruction, Multiple Data) processing of large amounts of data. MMX instructions differ from standard instructions in that a single MMX instruction works on multiple data items (2 doublewords, 4 words or 8 bytes). MMX registers are 64 bits wide (holding 2 doublewords, 4 words, 8 bytes or a single 64-bit value), unlike the standard registers, which are 32 bits wide. When you write ADD EAX,EBX you are adding two 32-bit values. Writing (MMX) PADDD MM0,MM1 you are adding four 32-bit values (two pairs) in a single instruction. That is the main power of MMX. However, if you plan to use MMX, think twice about whether your algorithm is suitable for the conversion. OK, what is it all about?
Advantages:
Disadvantages:
There are eight 64-bit MMX registers (MM0 to MM7) and 57 MMX instructions. Basically, you can add, subtract, multiply, shift and compare (plus logical and, or, xor and and-not) packed bytes, words and doublewords. Not every operation exists for every size: multiplication, for instance, is available for words only. Packed add on bytes means you add 8 bytes to another 8 bytes in a single instruction. There are also some special instructions for packing/unpacking bytes/words/doublewords from/to MMX registers. For a more detailed description of MMX, please refer to the Intel web site at developer.intel.com.
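To make "packed add on bytes" concrete, here is a small sketch in portable C that mimics what the wraparound byte add (PADDB) does to a 64-bit value. It is only an emulation for illustration, not MMX code, and the function name packed_add_bytes is made up for this example.

```c
#include <stdint.h>

/* Emulates the wraparound packed byte add (PADDB): each of the eight
   bytes of a is added to the matching byte of b, and no carry ever
   crosses from one byte into the next. Hypothetical helper name. */
uint64_t packed_add_bytes(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (lane * 8));
        uint8_t y = (uint8_t)(b >> (lane * 8));
        uint8_t sum = (uint8_t)(x + y);           /* wraps modulo 256 */
        result |= (uint64_t)sum << (lane * 8);
    }
    return result;
}
```

Note the difference from an ordinary 64-bit add: 0x01FF + 0x0001 gives 0x0200 with a normal add, but 0x0100 with the packed byte add, because the overflow of the lowest byte is simply dropped.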
The following example shows how to do 32-bit alpha blending using MMX. A similar routine was used in our demo WHY. A short description of the algorithm itself: we have two true-color pixels A and B, both in the 32-bit format 00000000 rrrrrrrr gggggggg bbbbbbbb.
We have alpha (0-255): 0 means A 0%, B 100%; 128 means A 50%, B 50%; and so on (this matches the equation below, where A is weighted by alpha). We need to mix these two pixels (with the given alpha value) into a single pixel C.
The equation is basically C = (A * alpha + B * (256 - alpha)) / 256, or, if you want, per channel:
Cr = (Ar * alpha + Br * (256 - alpha)) / 256
Cg = (Ag * alpha + Bg * (256 - alpha)) / 256
Cb = (Ab * alpha + Bb * (256 - alpha)) / 256
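For reference, the per-channel equations can be written as a plain scalar C sketch. This is not the MMX routine, just a straightforward version to compare against; the names blend_channel and blend_pixel are made up here, and the division by 256 is done as a shift by 8.

```c
#include <stdint.h>

/* One channel: (a * alpha + b * (256 - alpha)) / 256, alpha in 0..255. */
static uint32_t blend_channel(uint32_t a, uint32_t b, uint32_t alpha)
{
    return (a * alpha + b * (256 - alpha)) >> 8;
}

/* Blends two pixels in the 0x00RRGGBB format described above.
   Hypothetical helper, assuming alpha is in the range 0..255. */
uint32_t blend_pixel(uint32_t a, uint32_t b, uint32_t alpha)
{
    uint32_t r  = blend_channel((a >> 16) & 0xFF, (b >> 16) & 0xFF, alpha);
    uint32_t g  = blend_channel((a >> 8) & 0xFF,  (b >> 8) & 0xFF,  alpha);
    uint32_t bl = blend_channel(a & 0xFF,         b & 0xFF,         alpha);
    return (r << 16) | (g << 8) | bl;
}
```

With alpha = 0 the result is exactly pixel B. With alpha = 255 pixel A gets a weight of 255/256, so the result is almost, but not exactly, pixel A.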
This algorithm requires (count with me) 6 multiplications, 3 additions and 3 shifts per pixel, plus computing (256 - alpha) once. I used MMX to speed this up by packing the RGB values into MMX registers and then doing all the arithmetic on R, G and B simultaneously (as in the 'packed' equation):
Register layout:

MM0 and MM1 (pixels A and B):
16-bit           | 16-bit           | 16-bit           | 16-bit           |
0                | RED              | GREEN            | BLUE             |
0000000000000000 | 00000000rrrrrrrr | 00000000gggggggg | 00000000bbbbbbbb |

MM5 (zero, used for unpacking):
16-bit           | 16-bit           | 16-bit           | 16-bit           |
0                | 0                | 0                | 0                |

MM6 (alpha):
16-bit           | 16-bit           | 16-bit           | 16-bit           |
alpha            | alpha            | alpha            | alpha            |

MM7 (256 - alpha):
16-bit           | 16-bit           | 16-bit           | 16-bit           |
256 - alpha      | 256 - alpha      | 256 - alpha      | 256 - alpha      |
static unsigned short alphaMMXmul_const1[4] = {256,256,256,256};
static unsigned alphaMMXmul_0[2] = {1,1};

void MixAlphaMMX32(void *dest,const void *src,unsigned len,unsigned opacity)
{
  __asm
  {
    mov edi,dest
    mov ebx,src
    mov ecx,len                      ; number of pixels (must be > 0)
    mov edx,opacity                  ; expected range 0..255

    movzx eax,dl
    movq mm7,[alphaMMXmul_const1]    ; mm7 = 256 (4 words)
    shl eax,16
    add eax,edx                      ; eax = alpha in both 16-bit halves

    mov [alphaMMXmul_0],eax
    mov [alphaMMXmul_0 + 4],eax

    movq mm6,[alphaMMXmul_0]         ; mm6(X) = alpha (4 words)
    pxor mm5,mm5                     ; mm5 = 0, used for unpacking
    psubusw mm7,mm6                  ; mm7(Y) = 256 - alpha (4 words)

    ALIGN 16
MixAlphaMMX32_MainLoop:
    movd mm0,[edi]                   ; mm0(A) = 0 0 0 0 | 0 Ra Ga Ba
    add edi,4
    movd mm1,[ebx]                   ; mm1(B) = 0 0 0 0 | 0 Rb Gb Bb
    add ebx,4
    punpcklbw mm0,mm5                ; mm0 = 0 0 0 Ra | 0 Ga 0 Ba
    punpcklbw mm1,mm5                ; mm1 = 0 0 0 Rb | 0 Gb 0 Bb
    pmullw mm0,mm6                   ; mm0 = 0 Ra*X | Ga*X Ba*X
    pmullw mm1,mm7                   ; mm1 = 0 Rb*Y | Gb*Y Bb*Y
    paddusw mm0,mm1                  ; mm0 = 0 Ra*X+Rb*Y | Ga*X+Gb*Y Ba*X+Bb*Y
    psrlw mm0,8                      ; mm0 = 0 0 0 Rc | 0 Gc 0 Bc
    packuswb mm0,mm0                 ; mm0 = 0 0 0 0 | 0 Rc Gc Bc
    movd [edi-4],mm0                 ; store blended pixel back to dest
    dec ecx
    jnz MixAlphaMMX32_MainLoop

    emms
  }
}
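If you want to check what one pass of the main loop computes without an MMX-capable compiler, the steps can be mimicked in portable C, one 16-bit lane at a time. This is only a sketch of the instruction semantics as I understand them (PUNPCKLBW, PMULLW, PADDUSW, PSRLW), not a replacement for the routine; the name mix_pixel_emulated is made up, and alpha is assumed to be in 0..255.

```c
#include <stdint.h>

/* Emulates one loop iteration of the MMX routine on four 16-bit lanes.
   a is the pixel read from dest, b the pixel read from src, both in
   0x00RRGGBB format. Hypothetical helper; assumes alpha is 0..255. */
uint32_t mix_pixel_emulated(uint32_t a, uint32_t b, uint32_t alpha)
{
    uint16_t x = (uint16_t)alpha;            /* one lane of mm6 */
    uint16_t y = (uint16_t)(256 - alpha);    /* one lane of mm7 */
    uint32_t result = 0;

    for (int lane = 0; lane < 4; lane++) {
        /* punpcklbw with zero: every byte becomes a 16-bit word */
        uint16_t wa = (uint16_t)((a >> (lane * 8)) & 0xFF);
        uint16_t wb = (uint16_t)((b >> (lane * 8)) & 0xFF);

        /* pmullw: keep only the low 16 bits of each product */
        uint16_t pa = (uint16_t)((uint32_t)wa * x);
        uint16_t pb = (uint16_t)((uint32_t)wb * y);

        /* paddusw: unsigned add that saturates at 0xFFFF */
        uint32_t sum = (uint32_t)pa + pb;
        if (sum > 0xFFFF)
            sum = 0xFFFF;

        /* psrlw 8: divide by 256; the result always fits in a byte,
           so the final packuswb has nothing to saturate */
        uint32_t c = sum >> 8;
        result |= c << (lane * 8);
    }
    return result;
}
```

Since the largest possible sum is 255 * 256 = 65280, the products and their sum always fit into 16 bits, which is why the word-sized multiply is enough here.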
This routine can be sped up further by processing 2 pixels per loop iteration and by reordering fetches and arithmetic for best pipeline usage ... exercise for the reader ;-)
Sayza