The AltiVec Difference
Pages: 1, 2
Scaling and Translation
Mathematically, this operation is represented by:
y = ax + b
where x and y are vectors and a and b are constants. In this operation, x is scaled by a factor of a and then shifted by b. With scalar computations, this can be implemented as:
for (i=1; i<=n; i++) {
y[i]=alpha*x[i]+beta;
}
where x, y, alpha and beta are defined as floats. Using AltiVec’s vector multiply-add instruction, the same operation can be written as:
for (i=1; i<=n/4; i++) {
y[i]=vec_madd(alphaV,x[i],betaV);
}
Here, x and y are vector floats, as are alphaV and betaV:
alphaV = (alpha, alpha, alpha, alpha)
betaV = (beta, beta, beta, beta)
These four-element vectors were created from alpha and beta in order to use them in the vec_madd function. On the PowerBook G4, again with n=1000, the results are 198 MFLOPS for the scalar computation and 386 MFLOPS for the vector computation. Here, the performance gain from AltiVec is a factor of 1.9.
2D Rotation
In this example, data points in a 2D x-y plane are mapped into a new u-v coordinate system through a rotation of axes by an angle q. The equations for this transformation are:
u = x cosq + y sinq
v = –x sinq + y cosq
To carry out this mapping transformation for a collection of “n” (x,y) points, the following scalar computation can be used:
for (i=1; i<=n; i++) {
u[i]=x[i]*c+y[i]*s;
v[i]=-x[i]*s+y[i]*c;
}
where s = sinq and c = cosq. Using AltiVec’s vector multiply-add function, the computation can be vectorized as follows:
for (i=1; i<=n/4; i++) {
u[i]=vec_madd(x[i],cV,zeroV);
u[i]=vec_madd(y[i],sV,u[i]);
v[i]=vec_madd(x[i],msV,zeroV);
v[i]=vec_madd(y[i],cV,v[i]);
}
As before, the following vector floats were created for this computation:
cV=(c, c, c, c)
sV=(s, s, s, s)
msV=(–s, –s, –s, –s)
zeroV=(0., 0., 0., 0.)
For n=1000, the two approaches give the following performance: 300 MFLOPS for the scalar computation, and 472 MFLOPS for the vector computation, a factor of 1.6 performance increase from AltiVec.
Matrix Multiplication
The last example involves the multiplication of n x n matrices A and B to form the n x n matrix C:
In this multiplication, the ij element of matrix C would be formed by the ith row – jth column dot product between matrices A and B as follows:
Using a scalar computation, the matrix multiplication can be structured as follows:
for (i=1; i<=n; i++) {
for (j=1; j<=n; j++) {
for (k=1; k<=n; k++) {
c[i,j]=c[i,j]+a[i,k]*b[k,j];
}
}
}
This algorithm involves 2n3 floating-point operations, carried out by a total of 6n3 instructions (3 loads, an add, a multiply, and a store are required each time the computation is evaluated). The easiest way to implement this matrix multiplication using AltiVec is to vectorize the inner “k” loop to use a vector multiply-add function. Conceptually, this would look like:
for (i=1; i<=n; i++) {
for (j=1; j<=n; j++) {
for (k=1; k<=n/4; k++) {
c[j+(i-1)*n]=vector_madd(a[k+(i-1)*n],b[j+(k-1)*n],c[j+(i-1)*n]);
}
}
}
Here, the 2D array’s [i,j] index has been replaced by a single index [j+(1-1)n] that threads through the data in "vector" form. Using the multiply-add, the aikbkj multiplication is carried out and added to cij, repeatedly, in vector form. AltiVec would march through this computation four instructions at a time, until all of the aikbkj products have been carried out.
In reality, the algorithm is not so simple, and neither is the contraction of indices -- a few additional lines of code are required to stuff the multidimensional arrays into Altivec vector functions. But the end result is quite good. On the PowerBook G4 with n=200, the scalar computation runs at 84 MFLOPS, while the AltiVec version runs at 384 MFLOPS, a factor of 4.6 improvement. An even better matrix multiplication algorithm available from Apple boosts that performance to a whopping 681 MFLOPS, an improvement close to a factor of 8.
This is a good example of how AltiVec can really pay off for complex computations. For a more detailed look at the AltiVec matrix multiplication code, see Apple’s Developer Web site.
One final note -- each of these examples implied that n was evenly divisible by four, but that is not a requirement. Arrays can be padded with dummy elements (rounding the array size up to the next largest multiple of 4), or a scalar "cleanup" loop can be used to pick up any leftovers. An example is shown below:
for (i=1; i<=n/4; i++) {
z[i]=vec_add(x[i],y[i]);
}
m=mod(n,4)
for (i=n-m+1; i<=n; i++) {
z[i]=x[i]+y[i];
}
Here, m=mod(n,4) is the remainder of n/4, either 0, 1, 2, or 3. In cases where n is evenly divisible by 4, m=0 and the scalar cleanup loop will be bypassed.
Final Thoughts
This article took a brief look at AltiVec and illustrated some of the performance gains available with this technology. For specific applications, you need to take a close look at the code and see where AltiVec can be used. In many cases, using AltiVec will involve very minor code changes -- more of a code tune-up -- while other cases will require in-depth code restructuring and rewriting to allow for vectorization. How well AltiVec works will depend on many factors, but if your code hinges on key computations that can be vectorized, AltiVec will surely make a difference.
Craig Hunter is an aerospace engineer at NASA Langley Research Center in Hampton, Virg.
Return to the Mac DevCenter.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 7 of 7.
-
The real world
2003-05-24 23:02:26 anonymous2 [Reply | View]
We use high end Avid editing systems. Avid are just about to bring to the market a new system called Adrenalane. It will now plug into a PC or Mac via FireWire. So it is now possible to compare speed. The Mac is running on dual 1.33 MHz and the PC on a HP dual 2.33 Mhz, so If AltiVec was say giving us a conervitive 2 times speedup you would expect a similiar speed. But is not the case, the Mac runs half as slow as the PC, and I assume part of the reason that the software has not been optimised for Altivec and is unlikely to be so, thats the real world talking, and if Apple do not get their processor act to-gether thats where they will be out in the cold real world and nobody buying their boxes
-
Its like horsepoer vs. torque
2002-12-05 14:50:06 anonymous2 [Reply | View]
Its true what Apple says. Like a car w/350 HP and 200 lb/ft of torque at say 2KRPM weighting 4,000 lbs will be much slower than a car of the same weight, with less HP, but 300 lb/ft of torque at say 2KRPM.
Think about it, its no clock speed, its what is done within a clock tick. CISC chips are 2 to 20X slower than most RISC chips *per tick*.
e
-
simd
2002-04-10 08:03:17 psheldon [Reply | View]
I am happy that you illustrated how many algorithms could be parallelized by the G4 instruction set. I had thought that Apple was only selling the cocoa frameworks and would never discuss parallel algorithms constructed in near pseudocode, but Apple is also in the business of selling hardware and it's instruction sets. I have hope that I shall learn how to organize myself into making my own frameworks of objects cotaining algorithms to save my "knowledge work" in science, not merely interface, be interface ever so important for science.
The fact that there is such a modifier (single instruction multiple data) means that there are other modifiers, for example multiple instruction multiple data. Perhaps here I am sensing familiarity in language of someone who consumes and specifies processors just like General Dynamics did.
I might even hope for an article pointing to the extended bacher nauer formalism for cocoa as a translation exercise for some algorithms packaged somewhere else such as cactus? A translation exercise without a nervous standards committee at my home alone might be a blast.
-
found your NASA trade study asking Jeeves on cactus then fortran90
2002-04-13 14:02:55 psheldon [Reply | View]
So both cactus and fortran 90 pushing into multiprocessor eventually but not now.
Seemed to me the major speedup is in Apple's matrix multiplication algorithm exploiting the G4 arrray processor not in the multiprocessor cluster.
Is that correct?
So far got to page 14 of the Jy 2000 report "Evaluation of Powermac G4 Systems for Fortran-..."
-
BLAS for Altivec
2002-04-09 16:17:48 professorxavier [Reply | View]
Does anybody know whether there is a (free?) source for BLAS routines for the G4/Altivec (Velocity Engine)? RSVP. -
BLAS for Altivec
2002-04-22 07:51:27 willic3 [Reply | View]
Check out http://math-atlas.sourceforge.net/. An even better option if you're using Mac OSX is to install fink on your mac (http://fink.sourceforge.net). This will allow you to easily install a number of useful software packages. Once you have fink installed, you can generate the BLAS (I believe LAPACK is also installed) libraries by doing the following:
1. sudo cp /sw/fink/dists/unstable/main/finkinfo/sci/atlas* /sw/fink/dists/local/main/finkinfo
2. sudo fink install atlas
You will then need to answer a number of questions for the interactive installer, and the installation will take some time as atlas tunes the routines for your machine. Also, you should probably install g77 prior to installing atlas (sudo fink install g77). Good luck.
Charles Williams





