Confessions of the World's Largest Switcher
by Daniel H. Steinberg
10/29/2003
It's a shame that Apple no longer runs the "Switch" campaign on television. Dr. Srinidhi Varadarajan would make an excellent spokesperson for moving to the Mac. Just as Ellen Feiss' switcher story was the hit at Macworld Expo, so has been Dr. Varadarajan's presentation at the O'Reilly Mac OS X Conference, where he received a standing ovation.
His ad might go something like this. "I was in the market for a new machine. I was hoping to get ten teraflops by the end of the year. I'd never used a Mac and had been looking at Dells and IBMs. Then Apple released the G5 on June 23. A week later I bought 1,100 duals online at the Apple Store. I'm Srinidhi Varadarajan and I build Supercomputers at Virginia Tech."
Goals in Building Virginia Tech's G5 Supercomputer
The timing was right to make a big move at Virginia Tech. Varadarajan explained that they had a new dean and a new program in Computational Sciences and Engineering (CSE). They also had experience building a previous, smaller cluster. The goals were pretty straightforward: To build a world-class program, you need to provide world-class resources. This included creating high-end computing facilities and high-performance networking capabilities to tie the computational facilities into national computational grids.
In addition, there were political goals to get communication and cooperation across department lines. Most universities have subcultures and pockets that don't speak to each other. Varadarajan asks how you get people to talk across these different fiefdoms and explains that another of the goals of the project was to get everyone on the same team by providing support for both experimental and production research. The cooperation was evident in the speed with which the project was accepted and supported. Conference co-chair Derrick Story asked how long the project took from start to finish. The answer was surprising. The project was started in March and April of this year. Within a month it had everyone's backing. Money was raised in April and May and the cluster was launched in ramping-up mode in September.
In addition to these goals, there were also architectural ideals. Varadarajan wanted a high-performance supercomputer based on a 64-bit processor and never looked at 32-bit processors. In addition, he felt that clusters imply gigabit Ethernet. You need a high-performance interconnect with high bandwidth and ultra-low latency. He also wanted to offer the cluster as a service, which meant he needed connections to Internet1, Internet2, and soon, into the NLR (National Lambda Rail). The NLR is a proposed high-speed network to support research institutions.
In addition, the project had usage goals--to provide easy access for new investigators and exploratory research. The access policy was open door. Varadarajan explained that he didn't want to shut people out just because they don't have a grant. Often, you need results in order to get a grant. He also wanted to support multi-site research activities. Finally, for a premium, he wanted to support on-demand access to computational cycles. For example, an external customer may ask for so much power and in so much time. This required being able to check-point and store the current state of the system so that currently running applications wouldn't be lost.
Dude, You Need a Mac
A prime constraint in designing the Supercomputer was cost. Academia has small budgets so the focus was on price-performance. Competing installations include DOE (Department of Energy) installations, which can afford to pay top dollar. The Virginia Tech team wanted the same performance but for bottom dollar. The cost was more than just the machines. The existing facilities would need to be upgraded with cooling systems and power distribution. And they would have to account for the cost of the cables, memory, and back-up power. Varadarajan's team built one of the cheapest world-class Supercomputers. He laughed that "The fact that it's running is a big deal in itself."
He looked at various architecture options and was in the process of buying Dells when the deal fell through. He also worked with IBM and AMD and couldn't get the price to match. The budgets were coming in at $9 to $12 million. The IBM with a PowerPC 970 was a first choice but the earliest delivery date would have been January 2004. Varadarajan said that you can't design a Supercomputer and wait that long for delivery. "You wouldn't buy a car and leave it at the dealer for a year and a half. We wanted a short three-month build cycle and could not wait six months."
On June 23 Apple announced the G5. Varadarajan said that contrary to rumors, it was the first that they had heard about it as well. On June 26 they told Apple they were interested in placing a "fairly large order". A day later he flew to California and met with Apple. One of their first questions was how long he'd been a Mac owner. Varadarajan said he never had one. Twenty-four hours later Apple committed. Starting on September 5, the G5s arrived in Virginia. An audience member asked if he'd made the purchase through the Apple Store. Varadarajan smiled and said that actually, yes, he had.
Performance and Power
Varadarajan said that a lot of people get the math wrong when calculating the performance of the machines. Each G5 processor has two double-precision floating-point units. Each unit is capable of one fused multiply-add operation per cycle, which counts as 2 flops, so a processor delivers 4 flops per cycle. This means that 2 GHz corresponds to 8 GFlops per processor, so each dual G5 can deliver a peak of 16 GFlops of double-precision performance. That is more than a modern Cray node.
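The arithmetic is easy to get wrong, so here is a minimal sketch of it in Python. The constants come from the talk and from the 1,100-machine order mentioned earlier; the function names are made up for illustration, and the cluster total is a theoretical peak, not a measured result.

```python
# Theoretical peak double-precision throughput, using the figures from the talk.
FPUS_PER_CPU = 2      # two double-precision floating-point units per G5
FLOPS_PER_FMA = 2     # a fused multiply-add counts as one multiply plus one add
CLOCK_GHZ = 2.0       # clock speed of each processor
CPUS_PER_NODE = 2     # each machine is a dual G5
NODES = 1100          # machines ordered, per the article

def peak_gflops_per_cpu(clock_ghz=CLOCK_GHZ):
    """Flops per cycle times cycles per nanosecond gives GFlops."""
    return FPUS_PER_CPU * FLOPS_PER_FMA * clock_ghz

def peak_gflops_per_node():
    return CPUS_PER_NODE * peak_gflops_per_cpu()

def peak_tflops_cluster(nodes=NODES):
    return nodes * peak_gflops_per_node() / 1000.0

print(peak_gflops_per_cpu())    # 8.0
print(peak_gflops_per_node())   # 16.0
print(peak_tflops_cluster())    # 17.6
```

The last figure assumes all 1,100 machines contribute to the computation, which the talk does not state explicitly.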
The primary communications architecture is built on InfiniBand's card, which has two ports on each node connecting into the network at 20 Gbps full-duplex bandwidth. Each node has a connection open to each other node and there is the potential to hold 150K connections per node. This translates into very low latency--less than 10 microseconds.
The computers and cables are just one piece of the infrastructure. Varadarajan also needed a large enough building to house the cluster, with a raised floor, environmental controls, fire suppression, and round-the-clock controlled access. In addition, the power needs include 1.5 MW of power coming in from two substations with back-up UPS and finally, a back-up diesel generator.
If you've ever sat with a TiBook in your lap, you understand that there is a further significant issue. As hot as a G4 runs, a G5 runs hotter. With a traditional air-conditioning setup, the calculations showed that instead of emptying out the air three times an hour as would be typical, they would need to empty the air three times per minute. Computers tend to cool front to back. So the plan was to arrange the computers in rows back to back and pull the hot air out of the hot aisle. This would have required wind velocity under the floor of more than 60 miles per hour and still would have resulted in some hot spots. They decided instead to use a refrigerator-like system. Chillers cool water to 40 degrees to 50 degrees, which is then used to chill refrigerant, which is piped into a matrix of copper pipes. Effectively, you have a distributed refrigerator.
Tuning
The computers ran with few customizations. The volunteers started the computers, connected the InfiniBand card, restarted the computers, and cabled them up. The machines are currently running stock Mac OS X 10.2.7. An audience member asked if they use Software Update. Varadarajan said no but that there are plans to Pantherize the system in the next few weeks. This will require an install and a recompile of some of the code. Custom code included InfiniBand drivers and a parallel communication library known as MVAPICH, developed in Dr. Dhabaleswar Panda's lab at Ohio State University. This library had to be ported from Linux to Mac OS X. The PCI-X timing was changed to increase InfiniBand performance to 870 MBps. Also, message caching and dynamic memory management were added for improved scientific application performance.
The LINPACK benchmark solves a very large system of linear equations, involving dense matrix operations. The main phase is LU decomposition (Gaussian elimination with partial row pivoting). The G5 cluster solved a system of equations at N=500K. The team realized that the only way to improve the benchmark score is to improve the numerical libraries. This boils down to the BLAS libraries. The core routine--matrix multiply (GEMM)--was optimized by Kazushige Goto. The current impressive benchmark results are due to a mix of Goto's libraries and Apple's vecLib framework.
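For readers who haven't seen it, Gaussian elimination with partial row pivoting can be sketched in a few lines of Python. This is a toy illustration for small dense systems, not the benchmark code: the real benchmark runs a blocked, distributed version built on tuned BLAS routines. The names lu_solve and linpack_flops are made up for this sketch. The operation count is the conventional one used to score LINPACK results, 2/3 N^3 + 2 N^2.

```python
def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial row pivoting."""
    n = len(a)
    # Work on an augmented copy [a | b] so the caller's data is untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: use the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def linpack_flops(n):
    """Conventional operation count for solving a dense n-by-n system."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2

print(lu_solve([[2, 1], [1, 3]], [3, 5]))
print(linpack_flops(500_000))
```

Nearly all of the work sits in the innermost update loop, which in a blocked implementation becomes a matrix multiply. That is why a faster GEMM translates so directly into a better score.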
Varadarajan reported that "our latest numbers are 9.555 tera and we still have more tricks left. We are hoping for another 10 percent boost to become the first academic machine to cross 10 tera. The last ratings put us at number three worldwide." During the question-and-answer period at the end, an audience member from the Lawrence Livermore National Laboratory introduced himself as coming from the institution that had the Supercomputer that the Virginia Tech cluster had just passed. He asked whether the details of the Supercomputer would be published. The reply was that in addition to documentation and papers, the plans are to return the changes to MVAPICH to the open source project so that it would be freely available. There are also plans to open source the caching code and Varadarajan expects that Mellanox's code will be available.
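To put the 9.555 figure in context, here is a rough calculation in Python. It assumes all 1,100 dual machines take part, each processor peaking at the 8 GFlops quoted above; the variable and function names are made up for illustration.

```python
# Measured LINPACK result versus theoretical peak, using figures from the article.
RPEAK_TFLOPS = 1100 * 2 * 8 / 1000.0   # assumes 1,100 dual nodes at 8 GFlops per processor
RMAX_TFLOPS = 9.555                    # LINPACK figure quoted in the talk

def efficiency(rmax, rpeak):
    """Fraction of theoretical peak actually achieved on the benchmark."""
    return rmax / rpeak

def with_boost(rmax, fraction):
    """Result after a fractional improvement, e.g. 0.10 for 10 percent."""
    return rmax * (1.0 + fraction)

print(round(efficiency(RMAX_TFLOPS, RPEAK_TFLOPS), 3))
print(round(with_boost(RMAX_TFLOPS, 0.10), 2))
```

Under those assumptions the cluster is running at a little over half of its theoretical peak, and a 10 percent improvement would put it at roughly 10.5 teraflops, past the 10-teraflop mark.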
Varadarajan said that they are getting requests for clones. "Expect to see a lot more G5 clusters."
Daniel H. Steinberg is the editor for the new series of Mac Developer titles for the Pragmatic Programmers. He writes feature articles for Apple's ADC web site and is a regular contributor to Mac Devcenter. He has presented at Apple's Worldwide Developer Conference, MacWorld, MacHack and other Mac developer conferences.
-
Next time M$-discipels downplay Apple. WE say: nr. # 3 !
2003-12-03 13:55:15 anonymous2 [Reply | View]
Yes folks... If all these Mac-bashers, IT CE's etc. say anything bad about Apple inc. and it's goods we just simply say:
nr. # 3 in the World!
Where were you btw with your fastest Xeon consumer whatever PC...*GRIN*
-
CPU Agnostic
2003-11-18 09:47:26 anonymous2 [Reply | View]
Up front, I'm a Wintel guy who's just scratching the surface of becoming a Lintel guy. I'm no scientist or engineer. I've used a Mac maybe twice in the pre-OS-X days, and never used one long enough to get the hang of it.
But I'm not married to any particular OS or architecture. I'm hearing impressive things about OS-X, and if I had the budget for a Mac I'd want to play with it. I say, if they can build a supercluster based on a G-5, then bravo! The engineering of the G-5 was obviously up to the task. Two things matter in the real world... price and results. Sounds like VT hit their mark, and I'm impressed that they were able to see outside the box and make it work.
Congratulations, Virginia Tech!
-
Could be the fastest
2003-11-13 21:50:02 anonymous2 [Reply | View]
You know..
if he bought another 10 million dollars in G5s it would be faster than the world's fastest computer in Japan.
which costs nearly 250 million dollars..
and it has 5124 processors which is more than double the G5....
If he spent a few more million dollars he'd have the fastest computer in the world!!!
-
Ordering 1,100 G5s
2003-11-03 08:38:43 anonymous2 [Reply | View]
Through the Apple store, he says. Were there any special offers going on at that time with the G5—Like he also got 1,100 Epson C60 printers with that order? I feel sorry for the person running the delivery dock at that uni!
-
Why the G5?
2003-11-01 07:02:43 anonymous2 [Reply | View]
I can't believe that people don't understand this. Everyone is commenting about how they bought the G5s "sight unseen." According to the article, "An IBM with the PowerPC 970 would have been the first choice, but the earliest delivery date would have been January 2004." So, they had their hopes pinned on a single processor IBM 970 based machine, but IBM couldn't deliver in time. One week later, Apple announces a DUAL processor machine using the same chip, available in a 60 day timeframe. Seems to me that the choice of the G5 was a complete and total no-brainer. They had already specced the 970 as having the capabilities they wanted, and now Apple was offering a consumer version with two of them. It was no reach at all for them to decide to grab those and run with them, it seems to me. Probably the only real issues were converting what was undoubtedly plans for a rack based facility into one that could handle the G5s larger form factor. -
Why the G5?
2003-11-04 20:58:48 anonymous2 [Reply | View]
There's one small difference though. The G5 produced for apple is actually a slightly crippled PPC970. It's lacking a co-processing unit that isn't useful for desktop users, but in this case is a loss for them.... In choosing G5's they decided that the sacrifice was worth their time. -
Why the G5?
2003-12-29 23:32:33 anonymous2 [Reply | View]
the sacrifice was worth their TIME??? what about money? wouldn't they have had to buy more IBMs to get the same power, given that the IBMs are single processor units? And how does the price compare even so?
Given their results, it's hard to say they made a terribly large sacrifice. *Maybe* they could have gotten better results for the same money by waiting, but that's always true in this world: if you wait long enough, the power you want will come down and down and down, all depending on how long you're willing to wait. Time was not an option for them, therefore they got the best possible choice. To say they sacrificed seems a distortion of the truth, unless you think they also sacrificed by not waiting until 2010, when something much more powerful might be a tenth the price.
-
what the heck are they going to do with it?
2003-10-31 15:34:17 anonymous2 [Reply | View]
1,100 G5s all in a row/parallel? What will they test with that? Perhaps play Go??
-
I still don't get it
2003-10-31 00:35:31 anonymous2 [Reply | View]
How can anyone make such a huge investment in hardware without 'test-driving' it first? Without seeing whether the machine has any problems? Whether the processor stands up to promises?
Maybe the most exciting part in this project is that it seems like no suits ever had a chance to glimpse at the plan - pure hacker power...
Still a gamble I don't understand. -
I still don't get it
2003-11-01 05:08:07 anonymous2 [Reply | View]
Money. Apple probably made an offer he couldn't refuse. They also probably promised full-time on site support and a really cheap replacement agreement for failed machines. In return Apple gets the ultimate in advertisement within the academic and scientific communities. The shock factor alone has people looking at Macs that would not have otherwise. -
I still don't get it
2003-10-31 11:35:04 anonymous2 [Reply | View]
Oh, come on!!
You don't seriously believe he didn't test drive the G5 when he did his little "visit" to Apple do you?!?
The article simply states that he had never owned a Mac personally. Not that he'd never used one, nor even that no one involved in the project had used one.
Get a grip on reality.
-
confidence
2003-10-31 05:22:41 anonymous2 [Reply | View]
I know Srinidhi. He's VERY smart. If he had any problems, he would have solved them. There are very few problems that cannot be solved if you are willing to work very hard at solving them. So, he took the risk. -
I still don't get it
2003-10-31 05:13:03 anonymous2 [Reply | View]
This is the real world; most people in the world actually buy products w/o ever "test driving". I.e., if you want to try a new soda pop, you buy a can; you don't "test drive" before you buy.
He didn't need to "test drive" because the equipment satisfied all of his design requirements. That is a true "low bid is good bid": a low bid is only good when you know the specifications that you want. If the product satisfies the specifications you desire, and the price you desire, then it is a good low bid. Often people look at the price, and fail to look at the specifications, thus resulting in an acquisition that people are not happy with.
The good professor knew what he wanted, and knew how to achieve what he wanted. This is also the beauty of why a unix based OS X system. He didn't need to learn a new language.
Peace
DS -
I still don't get it
2003-10-31 01:31:16 anonymous2 [Reply | View]
Trust Apple. When PowerMac 6100, the first PPC Mac, was born, it was good and stable. Hardly see any problem. -
I still don't get it
2003-11-01 06:12:17 anonymous2 [Reply | View]
Yeah, the last major example you can think of was in 1995. That relates well to the argument at hand. Cheap consumer hardware running an OS without protected memory and preemptive multitasking. You really made your point with that gem! -
I still don't get it
2003-11-14 11:38:58 anonymous2 [Reply | View]
You know, I have a feeling you have a hard time accepting the fact that Apple has been proven to build a stable, reliable product compared to Wintel. The hardware is one of the main reasons why. You seem like the person who always wants to do things the hard way: asking why it doesn't work versus just getting one that works.
In addition, if it wasn't for Apple, about 90% of the computer products you use today would cost you a lot more. Apple has always placed quality of product and the latest technology first on the consumer computer market. Examples: Appletalk (the first consumer computer network), the Newton (pda), SCSI Drives, USB, Firewire, Crossplatform compatibility, the desktop interface (from Xerox), the first to use RISC processing in a personal computer, first to use a 3 1/2" floppy, first to use DVD in a factory built unit, first to install a DVD-R and DVD-RW Super drive, etc...etc...etc... Let's face it, Wintel users should thank us cutting edgers for making the stuff so cheap for you who like to work. -
You all still don't get it
2003-11-03 05:24:09 anonymous2 [Reply | View]
Yeah, bash, bash.
That helps!
Maybe it's not so bad until someone starts shouting 'Nuke 'm'!!!
Very civilized, people! -
I still don't get it
2003-11-01 23:57:31 anonymous2 [Reply | View]
No protected memory and preemptive multitasking?!... Obviously you overlooked the little fact that these machines are running Mac OS X 10.2.7... A system which has FreeBSD UNIX at its heart...
Talking about gems... Yours is a very shiny one... -
I still don't get it
2003-11-01 18:31:28 anonymous2 [Reply | View]
Ok, I'm an engineer and I am designing and building a car. I am going to use a turbocharged Ecotec 2.2 from General Motors. I have never driven an Opel Speedster (the only GM product with a factory Ecotec Turbo), in fact I have never even seen one in person. I have driven a Saturn Ion and a Pontiac Sunfire (same engine, no turbo), but I know that the performance and handling are nothing like a speedster.
At this point Mr.'I don't get it' is asking me if I have ever test driven my car that I haven't even got wheels for yet!
It's called a custom application! I don't need to spend hundreds of hours in powerplant development. I don't need to spend thousands of dollars in custom manufacturing and machine work. Someone else has done that for me! GM has Ecotecs pushing over 800 horsepower demonstrating the stability and versatility of their new powerplant. If my parts locator calls me and says he has an engine for me, I'm not gonna ask him if he will take it back if I can't turn 9 second quarter mile times in my one of a kind prototype! WAKE UP YOU MORON!
-
Flop calculations are off
2003-10-30 14:52:49 anonymous2 [Reply | View]
In addition to the two floating point units, there's also the vector unit which can do 4 * 32-bit floating point multiply accumulates (vmaddfp) per cycle. I'd have to dig deeper into the docs to find if the bus can fully supply all three units without stalls. -
Flop calculations are off
2003-10-30 17:35:09 anonymous2 [Reply | View]
Yes, but those are single-precision calculations. As such they probably won't be that useful to most people doing high-performance computing. -
Flop calculations are off
2003-10-30 23:00:42 anonymous2 [Reply | View]
It's a matter of algorithm design: if you can split an algorithm into two single-precision streams, and then just do the final combine in double precision, you can get some mighty mighty mighty impressive performance from a G5.
Often double precision is only necessary because two inputs have massively different scales. If you are working with similarly scaled numbers then you can get away with single precision... and then just use the double precision for recombining the various scales.
-
"Kazushige Goto Not Considered Harmful"
2003-10-30 08:54:01 anonymous2 [Reply | View]
... would have been a good subheading near the end there.
-
Thinking clearly, taking chances
2003-10-30 08:33:12 scienceman [Reply | View]
The entire Mac (and scientific) community owes a debt of thanks to this group, who demonstrated by direct action that they could think clearly, evaluate the numbers, and take this step based on their own knowledge of supercomputing. In one step -- though no doubt based on earlier experience with different hardware -- they catapulted their institution into prominence and their own work into effective productivity by making a large-scale resource available at comparatively low cost.
While it's only a matter of time before someone else imitates these techniques with different hardware, we can credit these folks with being first (the first at this price point, the first to create a nearly 10 Tflops machine at a university, and the first to do this with commodity 64-bit hardware at such a low cost) and for having the courage to act on their own technical knowledge. I'll bet the Dean and president at VT are breathing a sigh of relief now, having taken such a chance!
-
an important piece is missing.....
2003-10-29 12:38:31 anonymous2 [Reply | View]
Why doesn't the article include the most interesting part of the story ? How can you keep a cluster of over 1000 non-failsafe computers running ? Varadarajan has devised a system to make the cluster reliable, even though it, for one, doesn't have ECC RAM. -
an important piece is missing.....
2003-10-30 14:22:31 anonymous2 [Reply | View]
"Also, message caching and dynamic memory management were added for improved scientific application performance."
They have software that performs dynamic memory management. It would be nice if Apple brings ECC to at least the XServes in time, but I'm sure the memory management through software (which was developed at VT) was a big part of the cost savings, and they were never looking at machines with ECC RAM because the software could handle the error corrections. -
an important piece is missing.....
2003-10-29 22:54:59 anonymous2 [Reply | View]
You don't need ECC RAM if the software is designed with data redundancy in mind. If I recall correctly, HP showed a number of years back that in a demonstration of a machine that had some 50,000 known problems in hardware that software could be designed to compensate for problems and return accurate results.
In the case of VT, I believe they built error correction and fault tolerance into the software, allowing them to forgo expensive ECC RAM-based hardware. -
an important piece is missing.....
2003-10-29 15:20:11 anonymous2 [Reply | View]
Part of the advantage of clusters is that you don't NEED to keep every node running. If 1 of your 1100 nodes breaks, you have a 1099-node cluster. You simply take it out and either repair or replace it. There's no real 'system' to it's reliability, other than redundancy. -
an important piece is missing.....
2003-11-25 08:32:03 anonymous2 [Reply | View]
From what I understand, it's actually important that each node remain stable. The primary ways of dividing processing time on a supercomputer are nodes and hours. If you grab 10 nodes and are going to be doing computations for the next 10 hours, you'll have a significant issue with one node going down.
Especially if you have to then piece together a dataset with incomplete data. -
an important piece is missing.....
2003-10-29 20:29:10 anonymous2 [Reply | View]
That's not what anonymous 1 meant ... statistically there is a possibility that there will be an error in RAM that flips a bit randomly (caused by a stray cosmic ray or whatever). ECC RAM has an extra chip on the memory module to compensate for that possibility. The longer the calculation and the more computers involved the more likely that a RAM error will occur. It would be interesting to know how they do resolve the problem. -
an important piece is missing.....
2003-10-29 22:30:39 anonymous2 [Reply | View]
this was mentioned elsewhere - the uni has developed specific fault tolerant software. -
an important piece is missing.....
2003-11-05 02:16:55 anonymous2 [Reply | View]
There is no way that external software can always find an intermittent memory fault.
You can do some things that are acceptable on a workstation that does graphics, but if you do science or manufacturing work where every result is important, you basically have to calculate everything twice on different nodes. This will lower the peak performance by 50% but is the only way to know that the result is the right one.
This is a fantastic system, but organizations where the result has to be correct had better look at a system with ECC.
-
an important piece is missing.....
2003-11-05 13:10:01 anonymous2 [Reply | View]
Wow, anonymous, it's too bad all of those supercomputing guys didn't ask you! You could have set them straight before they wasted all that time and money. I guess you'd better let the folks at top500.com know, quick - they must have missed it before now.
OR, just MAYBE, you have no idea what you're talking about. It's a relatively simple exercise in software development to identify spurious results via multiple iterations. ECC memory won't protect you from processor faults and other glitches anyway, so the software has to be robust enough to allow for bad results even with expensive memory.
But thanks for playing. -
an important piece is missing.....
2003-11-06 00:03:09 anonymous2 [Reply | View]
Thank you too.
You even prove me right. What does this mean:
It's a relatively simple exercise in software development to identify spurious results via multiple iterations.
This means that you have to do everything multiple times to prove your result.
But it looks nice to have the peak performance to put you on the TOP500 list.
-
++++++++++an important message++++++++++++
2003-11-06 08:02:28 anonymous2 [Reply | View]
both Apple and Dr. Varadarajan seem to be clear about what they are doing and it certainly seems like a success story for both.
SO ALL YOU GUYS CAN SHUT THE HELL UP AND GET BACK TO WORK!!
+++++++++++++++++++++++++++++++++ -
++++++++++an important message++++++++++++
2003-11-07 06:50:17 anonymous2 [Reply | View]
If I KNEW WHAT YOU ALL ARE TALKING ABOUT,
I WOULD NEVER SAY ANYTHING
BRAINS





