
Topics - JeGX

Some notes on implementing the ARB_shader_storage_buffer_object OpenGL extension in Mesa and the Intel i965 driver.

In my previous post I introduced ARB_shader_storage_buffer_object, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I’ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.


Another interesting thing we had to deal with is address alignment. UBOs work with the std140 layout. In this setup, array elements and structures in the UBO definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense. However, as I explained in my previous post, SSBOs also introduce a packed layout mode known as std430.
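To make the difference concrete, here is a minimal Python sketch of how the two layouts space out array elements. It assumes the array rules of the GLSL layouts (std140 rounds the array stride up to 16 bytes, std430 does not). The names round_up, array_stride and element_offsets are made up for this illustration and are not part of Mesa or OpenGL.

```python
def round_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# Base alignment in bytes of a few GLSL types (identical in both layouts).
BASE_ALIGNMENT = {"float": 4, "vec2": 8, "vec3": 16, "vec4": 16}
# Bytes actually occupied by the data of each type.
SIZE = {"float": 4, "vec2": 8, "vec3": 12, "vec4": 16}


def array_stride(element_type, layout):
    """Byte distance between two consecutive array elements."""
    stride = round_up(SIZE[element_type], BASE_ALIGNMENT[element_type])
    if layout == "std140":
        # std140 additionally rounds the stride up to the size of a vec4.
        stride = round_up(stride, 16)
    return stride


def element_offsets(element_type, count, layout):
    """Byte offsets of the first `count` elements of an array."""
    stride = array_stride(element_type, layout)
    return [i * stride for i in range(count)]


print(element_offsets("float", 4, "std140"))  # [0, 16, 32, 48]
print(element_offsets("float", 4, "std430"))  # [0, 4, 8, 12]
```

With std140, a float array wastes 12 of every 16 bytes. With std430 the same array is tightly packed, which is why writes can no longer be assumed to start on a 16-byte boundary.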

Intel hardware provides a number of messages that we can use through the Data Port interface to write to memory. Each message has different characteristics that make it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages can write data in chunks of 16 bytes (that is, they write vec4 elements, or OWORDs in the language of the technical docs). One could think that these messages are great when you work with vector data types. However, they also introduce the problem of dealing with partial writes: what happens when you write to only one element of a vector? Or to a buffer variable that is smaller than a vector? What if you write columns in a row_major matrix?

In these scenarios, using these messages introduces the need to mask the writes, because you need to disable the channels in the vec4 element that you don’t want to write. Of course, the hardware provides the means to do this: we only need to set the writemask of the destination register of the message instruction to select the right channels.
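The idea of a masked partial write can be modelled on the CPU with a few lines of Python. This is only a sketch of the concept, not the i965 driver code or the actual hardware message format. The channel constants and the function masked_vec4_write are hypothetical names chosen for the example, and memory is modelled as a flat list with one entry per 4-byte component.

```python
# Hypothetical channel bits for the x, y, z and w components of a vec4.
X, Y, Z, W = 1, 2, 4, 8


def masked_vec4_write(memory, base, values, writemask):
    """Write a vec4 at memory[base:base + 4], touching only enabled channels.

    Channels whose bit is not set in writemask keep their previous contents.
    """
    for channel in range(4):
        if writemask & (1 << channel):
            memory[base + channel] = values[channel]
    return memory


buffer = [0.0] * 8
# Only the y and w components of the second vec4 are written.
masked_vec4_write(buffer, 4, [1.0, 2.0, 3.0, 4.0], Y | W)
print(buffer)  # [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 4.0]
```

Without the mask, a 16-byte write used to store a single float would overwrite the three neighbouring components as well.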

Full post:

A simple introduction to SSBO can be found here:

AMD FirePro S9170 Server GPU Offers Unmatched Onboard Memory to Support Large Dataset Computations.

AMD (NASDAQ: AMD) today announced the new AMD FirePro™ S9170 server GPU, the world’s first and fastest 32GB single-GPU server card for DGEMM-heavy double-precision workloads, with support for OpenCL™ 2.0. Based on the second-generation AMD Graphics Core Next (GCN) GPU architecture, this new addition to the AMD FirePro™ server GPU family is capable of delivering up to 5.24 TFLOPS of peak single precision compute performance while enabling full throughput double precision performance, providing up to 2.62 TFLOPS of peak double precision performance.

Designed with compute-intensive workflows in mind, the AMD FirePro S9170 server GPU is ideal for data center managers who oversee clusters within academic or government bodies, oil and gas industries, or deep neural network compute cluster development.

“AMD is recognized as an HPC industry innovator as the graphics provider with the top spot on the November 2014 Green500 List. Today the best GPU for compute just got better with the introduction of the AMD FirePro S9170 server GPU to complement AMD’s impressive array of server graphics offerings for high performance compute environments,” said Sean Burke, corporate vice president and general manager, AMD Professional Graphics group. “The AMD FirePro S9170 server GPU can accelerate complex workloads in scientific computing, data analytics, or seismic processing, wielding an industry-leading 32GB of memory. We designed the new offering for supercomputers to achieve massive compute performance while maximizing available power budgets.”

“There are some HPC workloads which require as much data as possible to stay resident on the device, and so the 32GB of memory provided by AMD FirePro S9170, the largest available on a single GPU, will enable the acceleration of scientific calculations that were previously impossible,” said Simon McIntosh-Smith, head of the Microelectronics Research Group at the University of Bristol. “For example, our new OpenCL version of the SNAP transport code from Los Alamos National Laboratory needs to keep as much data resident on the device as possible, and so the 32GB of memory will let us run problems of a much more interesting size faster than ever before. The large memory, combined with the 320GB/s memory bandwidth and double precision floating point performance, will make the AMD FirePro S9170 server GPU a ‘killer’ solution device for many HPC applications.”


“We have been developing a fully-parallel computational tool based on the AMD GPU heterogeneous computing platform and OpenCL,” said Omid Mahahadi, co-founder and director, Geomechanica Inc. “This tool accurately captures the complex physics of massive mines plus oil and gas fields rapidly and reliably. Thanks to the impressive 32GB of memory of the new cards, we expect to run computations on massive data structures containing tens of millions of data elements. The combination of rapid double-precision operations with the large memory capacity enables accurate, detailed, and reliable computations. A similar performance using CPUs would likely require much higher capital and maintenance costs. Moving forward, we plan to take advantage of the recent features of the OpenCL 2.0 open API to further enhance the performance of our software.”

Full press release:

- The Vulkan API is more low-level than OpenGL (the programmer is responsible for memory and thread management, for example). What triggered this decision?

A low-level API has simpler drivers. This means reduced driver overhead, which results in higher performance for CPU-limited applications, and fewer differences between multiple GPU vendors’ implementations. Another fundamental advantage of handing the application more control is that the driver has to do less ‘behind the scenes’ management, resulting in much more reliable and predictable performance that doesn’t hit unexpected road bumps as the driver undertakes complex housekeeping tasks.

- What concrete improvements will gamers see when Vulkan is used by video game studios? Can they expect better performance and better graphics, or is it just about simplifying studios’ back-end work?

For applications that are CPU-limited, which happens on desktop and even more on mobile, end users should notice better-performing applications with less stuttering and halting.

- When will the first version of Vulkan be released?

Vulkan is still on schedule to have specs and implementations before the end of the year.

Full interview:

NVIDIA has released a driver that brings fixes for the title Sony Vegas Pro.

You can download it from this page:

Havok®, a leading provider of AAA game development technology, extends congratulations to all of the nominees of the Game Critics Awards for “Best of E3 2015,” which include several of Havok’s developer partners. The nominees pushed the limits of realism to create remarkable AAA game experiences. These great titles range from Bethesda Softworks’ wildly ambitious and gigantic open-world role-playing game, Fallout 4, to the deeply compelling narrative and intense adventure of Sony and Naughty Dog’s Uncharted 4: A Thief’s End, to the visually spectacular and blistering firefights of Star Wars Battlefront. Havok’s cross-platform suite of technology helps create some of the most ambitious AAA games that leverage the new hardware, and this same technology continues to service future projects that will become milestone titles for years to come.

Havok congratulates the following AAA projects powered by Havok technology that are among the titles nominated for the Best of E3 2015 awards:

- Star Wars Battlefront - Electronic Arts
- Dark Souls III - BANDAI NAMCO Entertainment
- Fallout 4 - Bethesda Softworks
- Halo 5: Guardians - Microsoft Studios
- Horizon: Zero Dawn - Sony Computer Entertainment
- Just Cause 3 - Square Enix
- Need for Speed - Electronic Arts
- No Man's Sky - Hello Games
- The Last Guardian - Sony Computer Entertainment
- Tom Clancy's Rainbow Six Siege - Ubisoft
- Tom Clancy's The Division - Ubisoft
- Uncharted 4: A Thief's End - Sony Computer Entertainment
- DOOM - Bethesda Softworks


3D-Tech News Around The Web / NVIDIA R 353.30 WHQL driver for Quadro
« on: June 24, 2015, 03:56:11 PM »
NVIDIA R353.30 graphics drivers for Quadro graphics cards:

- R353.30 Win7/Win8 64-bit
- R353.30 Win7/Win8 32-bit

- R353.30 Win10 64-bit
- R353.30 Win10 32-bit

- release notes

GeeXLab - english forum / GLSL Hacker available
« on: June 22, 2015, 09:20:20 PM »
A new update of GLSL Hacker is ready.

Download for Windows 64, Linux 64, OSX and Raspberry Pi here:


Big Pictures / ASUS Strix GTX 960 DC2 OC 4GB (8 pictures total)
« on: June 20, 2015, 06:27:17 PM »
Big pictures of ASUS' GTX 960 DirectCU 2 OC 4GB GDDR5.

The review is available here:

GeeXLab - english forum / Simple slideshow demo (Lua + GLSL 150)
« on: June 19, 2015, 06:34:23 PM »
Here is a very simple slideshow demo that displays several images, each one shown for 1.0 second (the duration can be changed in the code):
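The timing logic behind such a slideshow is small enough to sketch. The demo itself is written in Lua and GLSL, so the Python below is not the demo's code, only an illustration of one way to pick the image to display from the elapsed time. The function name current_image_index and its parameters are assumptions made for this example.

```python
def current_image_index(elapsed_seconds, image_count, seconds_per_image=1.0):
    """Return the index of the image to show after elapsed_seconds.

    Each image stays on screen for seconds_per_image, and the sequence
    wraps around to the first image after the last one.
    """
    return int(elapsed_seconds // seconds_per_image) % image_count


print(current_image_index(0.5, 3))       # 0
print(current_image_index(1.0, 3))       # 1
print(current_image_index(3.5, 3))       # 0 (wrapped around)
print(current_image_index(2.5, 3, 2.0))  # 1 (two seconds per image)
```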

3D-Tech News Around The Web / MagPi #34 available
« on: June 18, 2015, 02:26:31 PM »
MagPi #34 is available in PDF format here:


#34 talks about VNC for Raspberry Pi, Windows 10 IoT Core, and more cool articles about RPi.

All previous issues are available here:


3D-Tech News Around The Web / Cache And How To Work For It
« on: June 18, 2015, 02:19:08 PM »
The cache is always trying to guess what memory you’ll need before you request it; this prediction is called cache prefetching. This is why, when working on an array, it’s best to go through it in sequential order instead of jumping around randomly: the cache prefetcher will be able to guess what you’re using and have it ready before you need it. The cache loads memory in fixed-size groups called cache lines. The size is CPU-dependent and can be checked in your CPU’s specification under cache line size, although it’s typically 64 bytes. This means that if you have an array of integers and you grab one of those integers, the cache has also grabbed the whole cache line that it sits on. Grabbing the integer stored next to it will usually be a cache hit and consequently extremely fast. A cache line always starts at an address that is a multiple of the cache line size, meaning that if you fetch memory at 0x00 (0), then what will be cached is everything from 0x00 (0) up to, but not including, 0x40 (64), and if you fetch something at 0x4F (79), then you’ll get everything from 0x40 (64) up to, but not including, 0x80 (128).
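The address arithmetic above can be checked with a short Python sketch. It assumes a 64-byte cache line, and the helper names cache_line_range and count_lines_touched are invented for this illustration. It only models which lines are touched, not the behaviour of a real cache or prefetcher.

```python
CACHE_LINE_SIZE = 64  # assumed; check your CPU's specification


def cache_line_range(address, line_size=CACHE_LINE_SIZE):
    """Return (start, end) of the cache line holding address, end exclusive."""
    start = address - (address % line_size)
    return start, start + line_size


def count_lines_touched(addresses, line_size=CACHE_LINE_SIZE):
    """Count how many distinct cache lines a list of addresses falls into."""
    return len({address // line_size for address in addresses})


print(cache_line_range(0x00))  # (0, 64)
print(cache_line_range(0x4F))  # (64, 128)

# Sixteen 4-byte integers read sequentially share a single cache line...
sequential = [i * 4 for i in range(16)]
print(count_lines_touched(sequential))  # 1

# ...while sixteen reads spaced 64 bytes apart each land on their own line.
strided = [i * 64 for i in range(16)]
print(count_lines_touched(strided))  # 16
```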

Full article:

In the horror genre, gore and guts are commonplace. And the folks at Tripwire Interactive are using our technology to take the gore in their new horror survival game to a new level. In Killing Floor 2, you must fight your way through waves of mutated specimens, called Zeds. The longer you fight, the messier things get.

The three foundations of the game’s initial design mantra were “Bullets, Blades and Blood.” And that led to the creation of the M.E.A.T. (massive evisceration and trauma) system to depict dynamic gore, blood splatter and detailed graphic violence. To get the M.E.A.T. just right, Tripwire made Killing Floor 2 the first game to use our NVIDIA PhysX FleX technology for soft tissue and fluid interaction. That’s geek for guts and blood splatter.

- A Bloody Masterpiece: Killing Floor 2 is First with NVIDIA FleX

- NVIDIA FleX SDK v0.8

3D-Tech News Around The Web / Demoscene Statistics (1991 - 2014)
« on: June 15, 2015, 02:55:47 PM »
Some statistics about the demoscene:

More stats and full story here:

3D-Tech News Around The Web / OS X Metal - Raw Notes
« on: June 15, 2015, 11:24:29 AM »
Metal for desktop has instancing, sane constant buffers, texture barrier, occlusion query, and draw-indirect. It looks like it does not have transform feedback, geometry shaders or tessellation. (The docs do mention outputting vertices to a buffer with a nil fragment function, but I don't see a way to specify the output buffer for vertex transform. I also don't see any function attach points for geometry shaders or tessellation shaders.)


3D-Tech News Around The Web / AMD Radeon R9 Fury X latest news
« on: June 15, 2015, 11:14:47 AM »
Latest news about AMD Radeon Fury (Fiji GPU).



The board has 2 x 8-pin power connectors that allow the card to draw up to 375W.


Scores in 3DMark Fire Strike Extreme (1440p) and Ultra (2160p):

We present a technique for synthesizing the effects of skin microstructure deformation by anisotropically convolving a high-resolution displacement map to match normal distribution changes in measured skin samples. We use a 10-micron resolution scanning technique to measure several in vivo skin samples as they are stretched and compressed in different directions, quantifying how stretching smooths the skin and compression makes it rougher. We tabulate the resulting surface normal distributions, and show that convolving a neutral skin microstructure displacement map with blurring and sharpening filters can mimic normal distribution changes and microstructure deformations. We implement the spatially-varying displacement map filtering on the GPU to interactively render the effects of dynamic microgeometry on animated faces obtained from high-resolution facial scans.
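The core observation, that blurring a displacement map smooths the surface while sharpening roughens it, can be illustrated in one dimension. The Python sketch below is a heavy simplification and not the paper's method: the real technique filters a 2D displacement map anisotropically and fits the filters to measured data. The kernels, the roughness measure and all names here are assumptions made for the example.

```python
def convolve(heights, kernel):
    """Convolve a 1D height profile with a 3-tap kernel, clamping at the edges."""
    last = len(heights) - 1
    result = []
    for i, center in enumerate(heights):
        left = heights[max(i - 1, 0)]
        right = heights[min(i + 1, last)]
        result.append(kernel[0] * left + kernel[1] * center + kernel[2] * right)
    return result


def roughness(heights):
    """Mean absolute height difference between neighbours (a crude proxy)."""
    steps = [abs(b - a) for a, b in zip(heights, heights[1:])]
    return sum(steps) / len(steps)


BLUR = (0.25, 0.5, 0.25)       # stands in for stretched skin
SHARPEN = (-0.25, 1.5, -0.25)  # stands in for compressed skin

neutral = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
stretched = convolve(neutral, BLUR)
compressed = convolve(neutral, SHARPEN)

# The blurred profile is flatter and the sharpened one bumpier than neutral.
print(roughness(stretched) < roughness(neutral) < roughness(compressed))  # True
```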


3D-Tech News Around The Web / Unreal Engine 4.8 released
« on: June 11, 2015, 05:07:41 PM »
The largest update to Unreal Engine to date, this release includes 189 great changes that were submitted from our amazing community of developers, plus loads of new upgrades from Epic.

- Release notes

A bit lost with DirectX feature levels?

A DirectX feature level, in contrast, defines the level of support a GPU gives while still supporting the underlying specification. This capability was first introduced in DirectX 11. Microsoft defines a feature level as “a well defined set of GPU functionality. For instance, the 9_1 feature level implements the functionality that was implemented in Microsoft Direct3D 9, which exposes the capabilities of shader models ps_2_x and vs_2_x, while the 11_0 feature level implements the functionality that was implemented in Direct3D 11.”


A 24.8 precision texture interpolator means that there are at most 256 intermediate values possible between two adjacent pixels of a texture. 256 values are a lot for albedo textures for sure, but often in computer graphics textures encode not only surface properties: they also serve as lookup tables (LUTs), heightfields (for terrain rendering), or who knows what. In those cases, you can easily find yourself needing more resolution than 256 values between pixels. This article is about why this problem manifests and how it can be easily worked around. In the image below you can see the difference between a regular GLSL texture() or texture2D() call, which triggers the hardware texture interpolation with its 256 intermediate values and produces staircase artifacts, versus the correct full floating-point texture interpolation, which produces the desired smooth results.
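The 256-step limit is easy to reproduce on the CPU. The Python sketch below models an interpolator whose blend weight keeps only 8 fractional bits and compares it with a full-precision lerp. It assumes the weight is truncated (real hardware may round instead), and the names lerp, hardware_like_lerp and distinct_values are made up for this illustration.

```python
def lerp(a, b, t):
    """Full-precision linear interpolation between a and b."""
    return a + (b - a) * t


def hardware_like_lerp(a, b, t, fraction_bits=8):
    """Interpolate with a weight quantized to fraction_bits fractional bits."""
    steps = 1 << fraction_bits
    quantized_t = int(t * steps) / steps  # truncation is an assumption
    return lerp(a, b, quantized_t)


def distinct_values(interpolate, samples=4096):
    """Count the different results obtained between two adjacent texels."""
    return len({interpolate(0.0, 1.0, i / samples) for i in range(samples)})


print(distinct_values(hardware_like_lerp))  # 256
print(distinct_values(lerp))                # 4096
```

However finely the texture is sampled between two texels, the quantized version can only ever return 256 different values, which is where the staircase comes from.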


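In short, for GLSL shaders, replace:
Code:
// regular texture fetching
vec4 textureBad( sampler2D sam, vec2 uv )
{
    return texture( sam, uv );
}

with:
Code:
// improved bilinear interpolated texture fetch
vec4 textureGood( sampler2D sam, vec2 uv )
{
    // texture size in texels at mip level 0 (textureSize() returns an ivec2)
    vec2 res = vec2( textureSize( sam, 0 ) );

    // position in texel space, shifted so that texel centers fall on integers
    vec2 st = uv*res - 0.5;

    vec2 iuv = floor( st );   // lower-left texel of the 2x2 neighborhood
    vec2 fuv = fract( st );   // full-precision interpolation weights

    // fetch the four neighboring texels exactly at their centers
    vec4 a = texture( sam, (iuv+vec2(0.5,0.5))/res );
    vec4 b = texture( sam, (iuv+vec2(1.5,0.5))/res );
    vec4 c = texture( sam, (iuv+vec2(0.5,1.5))/res );
    vec4 d = texture( sam, (iuv+vec2(1.5,1.5))/res );

    // blend in the shader with full floating-point precision
    return mix( mix( a, b, fuv.x),
                mix( c, d, fuv.x), fuv.y );
}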
