Confessions of the World's Largest Switcher
by Daniel H. Steinberg
10/29/2003
It's a shame that Apple no longer runs the "Switch" campaign on television. Dr. Srinidhi Varadarajan would make an excellent spokesperson for moving to the Mac. Just as Ellen Feiss' switcher story was the hit at Macworld Expo, so has been Dr. Varadarajan's presentation at the O'Reilly Mac OS X Conference, where he received a standing ovation.
His ad might go something like this. "I was in the market for a new machine. I was hoping to get ten teraflops by the end of the year. I'd never used a Mac and had been looking at Dells and IBMs. Then Apple released the G5 on June 23. A week later I bought 1,100 duals online at the Apple Store. I'm Srinidhi Varadarajan and I build Supercomputers at Virginia Tech."
Goals in Building Virginia Tech's G5 Supercomputer
The timing was right to make a big move at Virginia Tech. Varadarajan explained that they had a new dean and a new program in Computational Sciences and Engineering (CSE). They also had experience building a previous, smaller cluster. The goals were pretty straightforward: To build a world-class program, you need to provide world-class resources. This included creating high-end computing facilities and high-performance networking capabilities to tie the computational facilities into national computational grids.
In addition, there were political goals to get communication and cooperation across department lines. Most universities have subcultures and pockets that don't speak to each other. Varadarajan asks how you get people to talk across these different fiefdoms and explains that another of the goals of the project was to get everyone on the same team by providing support for both experimental and production research. The cooperation was evident in the speed with which the project was accepted and supported. Conference co-chair Derrick Story asked how long the project took from start to finish. The answer was surprising. The project was started in March and April of this year. Within a month it had everyone's backing. Money was raised in April and May and the cluster was launched in ramping-up mode in September.
In addition to these goals, there were also architectural ideals. Varadarajan wanted a high-performance supercomputer based on a 64-bit processor and never looked at 32-bit. In addition, he felt that clusters imply gigabit Ethernet. You need high-performance interconnect with high bandwidth and ultra-low latency. He also wanted to offer the cluster as a service, which meant he needed connections to Internet1, Internet2, and soon, into the NLR (National Lambda Rail). NLR is a proposed high-speed network to support research institutions.
In addition, the project had usage goals--to provide easy access for new investigators and exploratory research. The access policy was open door. Varadarajan explained that he didn't want to shut people out just because they don't have a grant. Often, you need results in order to get a grant. He also wanted to support multi-site research activities. Finally, for a premium, he wanted to support on-demand access to computational cycles. For example, an external customer may ask for so much power within so much time. This required being able to checkpoint and store the current state of the system so that currently running applications wouldn't be lost.
Dude, You Need a Mac
A prime constraint in designing the Supercomputer was cost. Academia has small budgets, so the focus was on high price-performance. Competing installations include DOE (Department of Energy) installations, which can afford to pay top dollar. Virginia Tech wanted the same performance but for bottom dollar. The cost was more than just the machines. The existing facilities would need to be upgraded with cooling systems and power distribution. And they would have to account for the cost of the cables, memory, and back-up power. Varadarajan's team built one of the cheapest world-class Supercomputers. He laughed that "The fact that it's running is a big deal in itself."
He looked at various architecture options and was in the process of buying Dells when the deal fell through. He also worked with IBM and AMD and couldn't get the price to match. The budgets were coming in at $9 to $12 million. The IBM with a PowerPC 970 was a first choice but the earliest delivery date would have been January 2004. Varadarajan said that you can't design a Supercomputer and wait that long for delivery. "You wouldn't buy a car and leave it at the dealer for a year and a half. We wanted a short three-month build cycle and could not wait six months."
On June 23 Apple announced the G5. Varadarajan said that contrary to rumors, it was the first that they had heard about it as well. On June 26 they told Apple they were interested in placing a "fairly large order". A day later he flew to California and met with Apple. One of their first questions was how long he'd been a Mac owner. Varadarajan said he never had one. Twenty-four hours later Apple committed. Starting on September 5, the G5s arrived in Virginia. An audience member asked if he'd made the purchase through the Apple store. Varadarajan smiled and said that actually, yes, he had.
Performance and Power
Varadarajan said that a lot of people get the math wrong when calculating the performance of the machines. Each G5 processor has two double-precision floating-point units. Each unit is capable of a fused multiply-add operation per cycle, which counts as 2 flops, so a processor delivers 4 flops per cycle. This means that 2GHz corresponds to 8 GFlops, so each dual G5 can deliver a peak of 16 GFlops of double-precision performance. That is more than a modern Cray node.
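The peak arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation: the constant names are illustrative, and the cluster total assumes all 1,100 dual-processor machines mentioned earlier in the article contribute at their theoretical peak.

```python
# Peak double-precision throughput of a dual 2 GHz G5, from the talk's figures.
FPUS_PER_CPU = 2      # two double-precision floating-point units per processor
FLOPS_PER_FMA = 2     # one fused multiply-add counts as a multiply plus an add
CLOCK_GHZ = 2.0       # clock rate of each processor
CPUS_PER_NODE = 2     # dual-processor machines
NODES = 1100          # machines ordered, per the article

flops_per_cycle = FPUS_PER_CPU * FLOPS_PER_FMA        # per processor
gflops_per_cpu = flops_per_cycle * CLOCK_GHZ
gflops_per_node = gflops_per_cpu * CPUS_PER_NODE
tflops_cluster = gflops_per_node * NODES / 1000.0     # theoretical peak only

print(f"{flops_per_cycle} flops per cycle per processor")
print(f"{gflops_per_cpu:.0f} GFlops per processor")
print(f"{gflops_per_node:.0f} GFlops per dual G5")
print(f"{tflops_cluster:.1f} TFlops theoretical peak for {NODES} nodes")
```

The common mistake is to stop at 2 flops per cycle, which counts only one of the two floating-point units and halves every figure that follows.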
The primary communications architecture is built on InfiniBand's card, which has two ports on each node connecting into the network at 20 Gbps full-duplex bandwidth. Each node has a connection open to each other node and there is the potential to hold 150K connections per node. This translates into very low latency--less than 10 microseconds.
The computers and cables are just one piece of the infrastructure. Varadarajan also needed a large enough building to house the cluster, with a raised floor, environmental controls, fire suppression, and round-the-clock controlled access. In addition, the power needs include 1.5 MW of power coming in from two substations with back-up UPS and finally, a back-up diesel generator.
If you've ever sat with a TiBook in your lap, you understand that there is a further significant issue. As hot as a G4 runs, a G5 runs hotter. With a traditional air-conditioning setup, the calculations showed that instead of emptying out the air three times an hour as would be typical, they would need to empty the air three times per minute. Computers tend to cool front to back. So the plan was to arrange the computers in rows back to back and pull the hot air out of the hot aisle. This would have required wind velocity under the floor of more than 60 miles per hour and still would have resulted in some hot spots. They decided instead to use a refrigerator-like system. Chillers cool water to 40 degrees to 50 degrees, which is then used to chill refrigerant, which is piped into a matrix of copper pipes. Effectively, you have a distributed refrigerator.
Tuning
The computers ran with few customizations. The volunteers started the computers, connected the InfiniBand card, restarted the computers, and cabled them up. The machines are currently running stock Mac OS X 10.2.7. An audience member asked if they use Software Update. Varadarajan said no but that there are plans to Pantherize the system in the next few weeks. This will require an install and a recompile of some of the code. Custom code included InfiniBand drivers and a parallel communication library known as MVAPICH, developed in Dr. Dhabaleswar Panda's lab at Ohio State University. This library had to be ported from Linux to Mac OS X. The PCI-X timing was changed to increase InfiniBand performance to 870MBps. Also, message caching and dynamic memory management were added for improved scientific application performance.
The LINPACK benchmark solves a very large system of linear equations, involving dense matrix operations. The main phase is LU decomposition (Gaussian elimination with partial row pivoting). The G5 cluster solved a system of equations at N=500K. The team realized that the only way to improve the benchmark score is to improve the numerical libraries. This boils down to the BLAS libraries. The core routine--matrix multiply (GEMM)--was optimized by Kazushige Goto. The current impressive benchmark results are due to a mix of Goto's libraries and Apple's vecLib framework.
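For readers who have not met the algorithm, here is a toy version of Gaussian elimination with partial row pivoting in plain Python. It is a minimal, unblocked sketch for small matrices, not the benchmark code the cluster ran: production LINPACK runs use a blocked, distributed factorization in which most of the time goes into GEMM calls, which is why the BLAS libraries matter so much. The function name `solve_dense` is made up for this example.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial row pivoting.

    a is a list of n rows of n floats and b is a list of n floats.
    The inputs are copied, not modified.
    """
    n = len(a)
    # Work on an augmented copy [a | b].
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]

    for k in range(n):
        # Partial pivoting: choose the row with the largest |entry| in column k.
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]

        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# A small 3-by-3 system whose exact solution is x = 2, y = 3, z = -1.
example_a = [[2.0, 1.0, -1.0],
             [-3.0, -1.0, 2.0],
             [-2.0, 1.0, 2.0]]
example_b = [8.0, -11.0, -3.0]
print(solve_dense(example_a, example_b))
```

The inner elimination loop is where the work grows as the cube of N. In a blocked implementation that update becomes a matrix multiply, the routine Goto optimized.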
Varadarajan reported that "our latest numbers are 9.555 tera and we still have more tricks left. We are hoping for another 10 percent boost to become the first academic machine to cross 10 tera. The last ratings put us at number three worldwide." During the question-and-answer period at the end, an audience member from the Lawrence Livermore National Laboratory introduced himself as coming from the institution that had the Supercomputer that the Virginia Tech cluster had just passed. He asked whether the details of the Supercomputer would be published. The reply was that in addition to documentation and papers, the plans are to return the changes to MVAPICH to the open source project so that it would be freely available. There are also plans to open source the caching code and Varadarajan expects that Mellanox's code will be available.
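To put the reported number in context, the sketch below compares it with the theoretical peak and estimates how long a run would take. Several things here are assumptions rather than figures from the talk: the peak assumes 1,100 nodes at 16 GFlops each, the operation count is the standard LINPACK formula of 2/3 N^3 + 2 N^2, and the run time supposes that the 9.555 figure was sustained for a problem of size N = 500,000.

```python
# Figures quoted in the article.
RMAX_TFLOPS = 9.555            # reported LINPACK result
N = 500_000                    # problem size mentioned for the G5 cluster
NODES = 1100                   # machines ordered
PEAK_GFLOPS_PER_NODE = 16.0    # dual 2 GHz G5, double precision

rpeak_tflops = NODES * PEAK_GFLOPS_PER_NODE / 1000.0
efficiency = RMAX_TFLOPS / rpeak_tflops      # fraction of peak achieved
boosted_tflops = RMAX_TFLOPS * 1.10          # the hoped-for 10 percent boost

# Standard LINPACK operation count for solving a dense n-by-n system.
linpack_flops = (2.0 / 3.0) * N**3 + 2.0 * N**2

# ASSUMPTION: the whole run sustains RMAX_TFLOPS at this problem size.
run_seconds = linpack_flops / (RMAX_TFLOPS * 1e12)

# Storage for the matrix alone, at 8 bytes per double-precision entry.
matrix_terabytes = N * N * 8 / 1e12

print(f"theoretical peak: {rpeak_tflops:.1f} TFlops")
print(f"efficiency:       {efficiency:.1%}")
print(f"with 10% boost:   {boosted_tflops:.2f} TFlops")
print(f"estimated run:    {run_seconds / 3600:.1f} hours")
print(f"matrix storage:   {matrix_terabytes:.1f} TB")
```

Under these assumptions the cluster is running at a little over half of its theoretical peak, and a 10 percent improvement would be enough to clear the 10 TFlops mark.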
Varadarajan said that they are getting requests for clones. "Expect to see a lot more G5 clusters."
Daniel H. Steinberg is the editor for the new series of Mac Developer titles for the Pragmatic Programmers. He writes feature articles for Apple's ADC web site and is a regular contributor to Mac Devcenter. He has presented at Apple's Worldwide Developer Conference, MacWorld, MacHack and other Mac developer conferences.
-
Next time M$-disciples downplay Apple. WE say: nr. # 3 !
2003-12-03 13:55:15 anonymous2
Yes folks... If all these Mac-bashers, IT CE's etc. say anything bad about Apple Inc. and its goods we just simply say:
nr. # 3 in the World!
Where were you btw with your fastest Xeon consumer whatever PC...*GRIN*
-
CPU Agnostic
2003-11-18 09:47:26 anonymous2
Up front, I'm a Wintel guy who's just scratching the surface of becoming a Lintel guy. I'm no scientist or engineer. I've used a Mac maybe twice in the pre-OS-X days, and never used one long enough to get the hang of it.
But I'm not married to any particular OS or architecture. I'm hearing impressive things about OS-X, and if I had the budget for a Mac I'd want to play with it. I say, if they can build a supercluster based on a G-5, then bravo! The engineering of the G-5 was obviously up to the task. Two things matter in the real world... price and results. Sounds like VT hit their mark, and I'm impressed that they were able to see outside the box and make it work.
Congratulations, Virginia Tech!
-
Could be the fastest
2003-11-13 21:50:02 anonymous2
You know..
if he bought another 10 million dollars in G5s it would be faster than the world's fastest computer in Japan.
which costs nearly 250 million dollars..
and it has 5124 processors which is more than double the G5....
If he spent a few more million dollars he'd have the fastest computer in the world!!!
-
Ordering 1,100 G5s
2003-11-03 08:38:43 anonymous2
Through the Apple store, he says. Were there any special offers going on at that time with the G5—Like he also got 1,100 Epson C60 printers with that order? I feel sorry for the person running the delivery dock at that uni!
-
Why the G5?
2003-11-01 07:02:43 anonymous2
I can't believe that people don't understand this. Everyone is commenting about how they bought the G5s "sight unseen." According to the article, "An IBM with the PowerPC 970 would have been the first choice, but the earliest delivery date would have been January 2004." So, they had their hopes pinned on a single processor IBM 970 based machine, but IBM couldn't deliver in time. One week later, Apple announces a DUAL processor machine using the same chip, available in a 60 day timeframe. Seems to me that the choice of the G5 was a complete and total no-brainer. They had already specced the 970 as having the capabilities they wanted, and now Apple was offering a consumer version with two of them. It was no reach at all for them to decide to grab those and run with them, it seems to me. Probably the only real issues were converting what was undoubtedly plans for a rack based facility into one that could handle the G5s larger form factor.
-
what the heck are they going to do with it?
2003-10-31 15:34:17 anonymous2
1,100 G5s all in a row/parallel? What will they test with that? Perhaps play Go??
-
I still don't get it
2003-10-31 00:35:31 anonymous2
How can anyone make such a huge investment in hardware without 'test-driving' it first? Without seeing whether the machine has any problems? Whether the processor stands up to promises?
Maybe the most exciting part in this project is that it seems like no suits ever had a chance to glimpse at the plan - pure hacker power...
Still a gamble I don't understand.
-
Flop calculations are off
2003-10-30 14:52:49 anonymous2
In addition to the two floating point units, there's also the vector unit which can do 4 * 32-bit floating point multiply accumulates (vmaddfp) per cycle. I'd have to dig deeper into the docs to find if the bus can fully supply all three units without stalls.
-
"Kazushige Goto Not Considered Harmful"
2003-10-30 08:54:01 anonymous2
... would have been a good subheading near the end there.
-
Thinking clearly, taking chances
2003-10-30 08:33:12 scienceman
The entire Mac (and scientific) community owes a debt of thanks to this group, who demonstrated by direct action that they could think clearly, evaluate the numbers, and take this step based on their own knowledge of supercomputing. In one step -- though no doubt based on earlier experience with different hardware -- they catapulted their institution into prominence and their own work into effective productivity by making a large-scale resource available at comparatively low cost.
While it's only a matter of time before someone else imitates these techniques with different hardware, we can credit these folks with being first (the first at this price point, the first to create a nearly 10 Tflops machine at a university, and the first to do this with commodity 64-bit hardware at such a low cost) and for having the courage to act on their own technical knowledge. I'll bet the Dean and president at VT are breathing a sigh of relief now, having taken such a chance!
-
an important piece is missing.....
2003-10-29 12:38:31 anonymous2
Why doesn't the article include the most interesting part of the story? How can you keep a cluster of over 1000 non-failsafe computers running? Varadarajan has devised a system to make the cluster reliable, even though it, for one, doesn't have ECC RAM.





