FFT acceleration
RealityFrontier announces the early availability of its FFT acceleration appliances.
Signal processing and FFT
Fast Fourier transforms (FFTs) are widely used in digital signal processing. RealityFrontier has been facing two needs: lowering energy consumption for mobile applications and increasing processing throughput for real-time applications.
The computation is performed by a small-form-factor PC housing an NVIDIA card programmed with the CUDA Toolkit. Throughput is up to 50x that of the same FFT computation performed on a similarly priced CPU board, and still about 10x that of a computation optimized for an Intel multi-core CPU (see our analysis below).

One example of FFT use is in mobile robotics. For instance, the 2D FFT of an image can be used to implement a high-pass filter in an image processing pipeline.
CoroWare, a RealityFrontier partner, has been speeding up the signal processing of its mobile robots by using the RealityFrontier acceleration appliance.
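The high-pass filtering mentioned above can be illustrated with a minimal sketch in plain Python. This is not the CUDA/CUFFT code that runs on the appliance: the function names (`fft`, `fft2`, `high_pass`), the circular cutoff rule, and the power-of-two image size are all assumptions made for illustration only.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT computed with the conjugation trick."""
    n = len(x)
    y = fft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in y]

def transpose(m):
    return [list(row) for row in zip(*m)]

def fft2(img, transform=fft):
    """2D transform: 1D transform over every row, then over every column."""
    rows = [transform(list(r)) for r in img]
    cols = [transform(c) for c in transpose(rows)]
    return transpose(cols)

def high_pass(img, cutoff):
    """Zero every frequency bin within `cutoff` of DC, then transform back."""
    h, w = len(img), len(img[0])
    spec = fft2(img)
    for u in range(h):
        for v in range(w):
            fu = min(u, h - u)  # wrapped (signed) frequency index, as magnitude
            fv = min(v, w - v)
            if fu * fu + fv * fv <= cutoff * cutoff:
                spec[u][v] = 0j
    filtered = fft2(spec, transform=ifft)
    return [[val.real for val in row] for row in filtered]
```

Calling `high_pass(image, 1)` on a small square image (a list of rows of numbers) removes the constant offset and the slowest variations while leaving fine detail such as edges. On the appliance, the same three steps (forward 2D FFT, masking, inverse 2D FFT) would be delegated to the GPU.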
Appliance specifications
RealityFrontier appliances are aimed at speeding up FFT computation while keeping the appliance’s power consumption to a minimum. RealityFrontier offers two FFT acceleration appliances.
Embedded systems

| GPU | Stream processors | Power consumption | Cooling | CUDA Toolkit | Speed-up (2) |
|---|---|---|---|---|---|
| ION 2 | 16 | 35 Watts | Fan-less | 3.2 | up to 10 |

For the lab

| GPU | Stream processors | Power consumption | Cooling | CUDA Toolkit | Speed-up (2) |
|---|---|---|---|---|---|
| GTS 450 (1) | 192 | 350 Watts | 2 fans | 3.2 | up to 50 |

(2) Speed-up using the GPU when compared to a single-thread implementation on the CPU
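
As a rough reading aid, the figures in the specifications above can be turned into a best-case speed-up per watt. The sketch below is only arithmetic on the listed numbers. It assumes that the "up to" speed-ups are reached and that the listed power consumption applies under load. Because each speed-up is measured against the CPU of its own appliance, the two results are not a strict like-for-like throughput comparison. The names `APPLIANCES`, `speedup_per_watt` and `rank_by_efficiency` are hypothetical.

```python
# Figures copied from the appliance specifications (best-case speed-ups).
APPLIANCES = {
    "embedded (ION 2)": {"watts": 35, "max_speedup": 10},
    "lab (GTS 450)": {"watts": 350, "max_speedup": 50},
}

def speedup_per_watt(spec):
    """Best-case speed-up divided by the listed power consumption."""
    return spec["max_speedup"] / spec["watts"]

def rank_by_efficiency(appliances):
    """Appliance names sorted from highest to lowest speed-up per watt."""
    return sorted(
        appliances,
        key=lambda name: speedup_per_watt(appliances[name]),
        reverse=True,
    )
```

Under these assumptions, the embedded appliance delivers about twice the speed-up per watt of the lab appliance (10/35 against 50/350), while the lab appliance delivers the higher absolute speed-up.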
NVIDIA CUDA Toolkit 3.2
NVIDIA has recently updated its CUDA programming toolkit. CUDA Toolkit 3.2 provides many performance improvements over previous versions, making it RealityFrontier’s platform of choice over OpenCL or specialized math libraries such as the Intel MKL.
The latest CUDA Toolkit brings noticeable improvements for signal processing. The CUDA FFT library (known as CUFFT) delivers a 300% increase in performance compared to the previous Toolkit version. The benefits gained with CUDA Toolkit 3.2 apply only to the lab version of the RealityFrontier compute appliance, as only that version uses the power-hungry Fermi architecture.
Contrasting NVIDIA to Intel
NVIDIA has published how its latest solution compares with the one from Intel, its nemesis in the war of many-core GPUs against multi-core CPUs. According to NVIDIA, the CUDA Toolkit delivers 2 to 10 times the performance of Intel’s MKL. NVIDIA also compares favorably in terms of software licence cost: Intel has priced its MKL at USD 399.00, while NVIDIA does not charge for its CUDA Toolkit 3.2.
These figures require the latest architectures from both vendors: the Fermi architecture in the case of NVIDIA, and the Xeon™ or Core™ processor family in the case of Intel (or a processor compatible with Intel’s Streaming SIMD Extensions instruction set).
Let’s evaluate how RealityFrontier’s compute appliance would perform if one decided to deploy Intel’s MKL on it. Only the lab version of our compute appliance can support the MKL. The embedded version uses an ION 2, which is not SSE capable.
The compute appliance is powered by a dual-core CPU. Each CPU core runs two hardware threads, bringing the total to 4 hardware threads. Intel has published a whitepaper that shows an almost linear increase in performance with the number of threads, up to 8. With 4 threads, we would therefore expect a speed-up of about 4x from using the MKL.
Even with the Intel MKL, the CPU’s expected speed-up of about 4x remains far below the up to 50x speed-up obtained with the GPU.
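
The reasoning above can be written out as a short calculation. This is a sketch of the estimate only, not a benchmark. It assumes perfectly linear scaling with the thread count (the whitepaper reports "almost linear"), caps the scaling at 8 threads because nothing is claimed beyond that, and takes the best-case 50x GPU figure from the appliance specifications. The function names are hypothetical.

```python
def expected_cpu_speedup(cores, threads_per_core, max_linear_threads=8):
    """Speed-up over one thread, assuming linear scaling up to a thread cap."""
    threads = cores * threads_per_core
    return min(threads, max_linear_threads)

def gpu_advantage(gpu_speedup, cpu_speedup):
    """How many times faster the GPU is than the multi-threaded CPU."""
    return gpu_speedup / cpu_speedup

# Lab appliance: dual-core CPU with two hardware threads per core.
cpu = expected_cpu_speedup(cores=2, threads_per_core=2)   # 4
ratio = gpu_advantage(gpu_speedup=50, cpu_speedup=cpu)    # 12.5
```

A ratio of 12.5 is in line with the roughly 10x advantage over an optimized multi-core CPU computation quoted at the top of this announcement.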
Other Toolkit 3.2 improvements
Does your application need other applied-math improvements? The latest CUDA Toolkit brings benefits there as well. NVIDIA has announced that the new CUDA version delivers a similar 300% performance improvement over the previous version for the CUDA BLAS library (useful for linear algebra such as matrix multiplication), making it 8 times faster than MKL. Improvements to CURAND (useful for random number generation) make that library 10 to 20 times faster than MKL. Finally, the latest version of CUSPARSE delivers fast routines that operate on sparse matrices and that are between 6 and 30 times faster than MKL.
Additionally, CUDA Toolkit 3.2 now comes with built-in support for H.264 encoding and decoding. Both the H.264 support and the BLAS performance improvement are of particular importance in the field of video processing. RealityFrontier’s customers will benefit from these improvements when processing video streams on mobile devices or in the lab.
Benefits
Using GPU cards to accelerate FFT is believed to provide one of the best throughput-per-watt ratios as well as the most cost-effective solution. Such a massively parallel architecture is a particularly good fit for signal processing.
RealityFrontier primarily offers these compute appliances as part of its CUDA consulting service. The appliances can also be ordered as stand-alone equipment by customers who wish to code their own CUDA algorithms on a pre-configured system.
This is an early availability announcement. Contact us for pricing information.
More information
CUDA Toolkit 3.2. NVIDIA programming toolkit (November 2010).
Intel MKL. Intel math library optimized for multi-core CPUs.