CUDA 6, Available as Free Download, Makes Parallel Programming Easier, Faster

We’re always striving to make parallel programming better, faster and easier for developers creating next-gen scientific, engineering, enterprise and other applications.

With the latest release of the CUDA parallel programming model, we’ve made improvements in all these areas.

Available now to all developers on the CUDA website, the CUDA 6 Release Candidate is packed with several new features that are sure to please developers.

A few highlights:

‣ Unified Memory – This major new feature lets CUDA applications access CPU and GPU memory without the need to manually copy data from one to the other. This is a major time saver that simplifies the programming process, and makes it easier for programmers to add GPU acceleration in a wider range of applications.

‣ Drop-in Libraries – Want to instantly accelerate your application by up to 8X? The new drop-in libraries can automatically accelerate your BLAS and FFTW calculations by simply replacing the existing CPU-only BLAS or FFTW library with the new, GPU-accelerated equivalent.

‣ Multi-GPU Scaling – Re-designed BLAS and FFT GPU libraries automatically scale performance across up to eight GPUs in a single node. This provides over nine teraflops of double-precision performance per node, supporting larger workloads than ever before (up to 512 GB).

And there’s more.
The following are known issues with the CUDA 6.0 Release Candidate that will be resolved in the production release:

‣ The minBlocksPerMultiprocessor parameter for the __launch_bounds__() qualifier only accepts values up to 16 when used in compiling for sm_50, even though values up to 32 are possible on that architecture.

‣ There is a performance issue with the new SIMD video intrinsics __v*2() and __v*4() when used in compiling for the sm_50 architecture.

‣ The sm_50 architecture supports 48 KB of shared memory per block; however, the check for this limit is not functioning properly in the compiler. This can allow programs that use more than 48 KB of shared memory per block to compile successfully, although they will fail to run because the driver component does check the limit properly.

‣ The MT19937 random number generator in the cuRAND library generates non-deterministic results for curandGenerateUniformDouble().

‣ The NPP library function nppiAlphaComp_8u_AC4R() generates incorrect results when used with the NPPI_OP_ALPHA_ATOP_PREMUL option.

‣ The NPP library functions FilterSobelHorizSecondBorder() and FilterSobelVertSecondBorder() may generate incorrect results.
devicequery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750 Ti"
  CUDA Driver Version / Runtime Version          6.0 / 6.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
  GPU Clock rate:                                1268 MHz (1.27 GHz)
  Memory Clock rate:                             2700 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GeForce GTX 750 Ti
Result = PASS
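For readers who want headline numbers out of the device query, here is a rough Python sketch that derives a few from the values printed above. Two things in it are assumptions and not part of the log: that each CUDA core retires one fused multiply-add per clock (counted as 2 flops), and that the reported memory clock is doubled for double-data-rate transfers, as in the usual theoretical-bandwidth formula. Treat the results as theoretical ceilings, not measurements.

```python
# Values copied from the deviceQuery output for the GeForce GTX 750 Ti.
sm_count = 5                  # multiprocessors
cores_per_sm = 128            # CUDA cores per multiprocessor
core_clock_mhz = 1268         # GPU clock rate
mem_clock_mhz = 2700          # memory clock rate
bus_width_bits = 128          # memory bus width
shared_mem_bytes = 49152      # shared memory per block

# Core count, as reported on the "Multiprocessors" line.
cuda_cores = sm_count * cores_per_sm

# ASSUMPTION: one FMA per core per clock, counted as 2 flops.
peak_sp_gflops = cuda_cores * core_clock_mhz * 2 / 1000

# ASSUMPTION: double data rate, so two transfers per memory clock.
mem_bandwidth_gbs = 2 * mem_clock_mhz * (bus_width_bits // 8) / 1000

# Shared memory per block in KB (1 KB = 1024 bytes here).
shared_mem_kib = shared_mem_bytes // 1024

print(f"CUDA cores:            {cuda_cores}")
print(f"Peak single precision: {peak_sp_gflops:.2f} GFLOP/s (theoretical)")
print(f"Memory bandwidth:      {mem_bandwidth_gbs:.1f} GB/s (theoretical)")
print(f"Shared memory/block:   {shared_mem_kib} KB")
```

The shared-memory figure works out to 48 KB, which matches the sm_50 per-block limit mentioned in the known issues.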
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure perfomance.

> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "GeForce GTX 750 Ti" with compute capability 5.0

> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
5120 bodies, total time for 10 iterations: 8.208 ms
= 31.936 billion interactions per second
= 638.723 single-precision GFLOP/s at 20 flops per interaction

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure perfomance.
 -fp64 (use double precision floating point values for simulation)

> Double precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "GeForce GTX 750 Ti" with compute capability 5.0

> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
5120 bodies, total time for 10 iterations: 220.679 ms
= 1.188 billion interactions per second
= 35.637 double-precision GFLOP/s at 30 flops per interaction
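The throughput lines in the nbody output can be cross-checked from the body count and timings. The sketch below assumes the benchmark counts all-pairs interactions, that is numBodies × numBodies per iteration; the log does not state this, and the helper name `nbody_throughput` is only illustrative. Under that assumption the results agree with the logged figures to within the rounding of the printed times.

```python
def nbody_throughput(num_bodies, iterations, total_ms, flops_per_interaction):
    """Return (billion interactions/s, GFLOP/s) for one benchmark run.

    ASSUMPTION: every body interacts with every body each iteration,
    so one iteration is num_bodies * num_bodies interactions.
    """
    interactions = num_bodies * num_bodies * iterations
    seconds = total_ms / 1000.0
    interactions_per_s = interactions / seconds
    gflops = interactions_per_s * flops_per_interaction / 1e9
    return interactions_per_s / 1e9, gflops


# Figures taken from the two runs in the log.
sp_rate, sp_gflops = nbody_throughput(5120, 10, 8.208, 20)
dp_rate, dp_gflops = nbody_throughput(5120, 10, 220.679, 30)

print(f"single: {sp_rate:.3f} billion interactions/s, {sp_gflops:.1f} GFLOP/s")
print(f"double: {dp_rate:.3f} billion interactions/s, {dp_gflops:.1f} GFLOP/s")
print(f"single/double GFLOP/s ratio: {sp_gflops / dp_gflops:.1f}")
```

Note that the two runs use different flops-per-interaction weights (20 and 30), so the GFLOP/s ratio understates the gap in raw interaction rate between single and double precision on this card.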