M1R 2024-2025 project:
«Making GPU Tensor Cores compute texture interpolation (and more)»

Advisors

Antoine RICHERMOZ & Fabrice NEYRET – Maverick team, LJK, at INRIA-Montbonnot (Grenoble)

Context

GPU shaders perform an incredible amount of computations per second, allowing more and more complex effects in games. Still, the texture unit hides even more intensive computations that it does for us: texture interpolation, in 2D for surface textures or in 3D for volumes. Alas, there are cases where we can no longer rely on it: cubic interpolation, unusual texture formats (e.g. made of tiles), unusual blending modes (e.g. alpha-premultiply), textures of more than 4 channels (e.g. spectra). Doing all these computations manually in the fragment shader instead would be tedious and very costly, especially in 3D. But nowadays GPUs have another powerful unit, the Tensor Core, and it is possible to call it from the shaders.
→ Could we divert it to do the job for us?
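For reference, here is a minimal sketch of what the manual fallback looks like for bilinear interpolation of an N-channel texture – 4 fetches plus a weighted sum per pixel and per channel, growing to 8 fetches for trilinear and 64 for tricubic. It is written as CUDA-style C for concreteness; the function name and the flat buffer layout are illustrative assumptions, not part of the subject:

  #include <math.h>

  // Manual bilinear interpolation of an N-channel texture stored as a flat
  // row-major buffer of texels with N contiguous channels each (hypothetical
  // layout; edge clamping omitted for brevity). This is the per-pixel work
  // the texture unit normally hides from us.
  __device__ void bilerpManual(const float *tex, int W, int H, int N,
                               float u, float v, float *out)
  {
      float x = u * W - 0.5f, y = v * H - 0.5f;   // texel-space coordinates
      int   x0 = (int)floorf(x), y0 = (int)floorf(y);
      float fx = x - x0, fy = y - y0;             // fractional position in the cell
      // the 4 corner weights of the bilinear footprint
      float w00 = (1 - fx) * (1 - fy), w10 = fx * (1 - fy);
      float w01 = (1 - fx) * fy,       w11 = fx * fy;
      for (int c = 0; c < N; ++c)                 // 4 fetches per channel
          out[c] = w00 * tex[( y0      * W + x0    ) * N + c]
                 + w10 * tex[( y0      * W + x0 + 1) * N + c]
                 + w01 * tex[((y0 + 1) * W + x0    ) * N + c]
                 + w11 * tex[((y0 + 1) * W + x0 + 1) * N + c];
  }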


Description of the subject

Tensor Cores, built to accelerate deep-learning computations, are basically engines dedicated to matrix multiply-and-accumulate. We can express texture interpolation as a sum of the texture values at the footprint corners, each weighted by its interpolation coefficient (this holds for 2D bilinear as well as for 3D trilinear or tricubic), for the N channels of a texture buffer. Even adding an extra task such as volume ray-marching with accumulation over a few steps would still fit this scheme. So, the purpose of this internship is to decompose the texture interpolation task into the preparation of the coefficients in the fragment shader and the dialogue with the Tensor Core that completes the calculation: first with basic 2D bilinear color texture interpolation, then 2D bicubic, 3D trilinear, and as far as experimental fate allows – knowing that the expected gain is larger for larger footprints.
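To make the mapping concrete, here is a minimal CUDA sketch using the 16×16×16 half-precision tile of CUDA's nvcuda::wmma API (the subject itself targets shader-side access such as Vulkan cooperative matrices, and the kernel name, buffer layouts, and the shared-window assumption are all illustrative):

  #include <mma.h>
  #include <cuda_fp16.h>
  using namespace nvcuda;

  // One warp (32 threads, sm_70 or later) interpolates 16 pixels x 16 channels
  // in a single Tensor Core operation, assuming the batch's footprints fit in
  // a shared 4x4-texel window.
  // weights (16x16, row-major): row p holds pixel p's 4 bilinear coefficients
  //   (1-u)(1-v), u(1-v), (1-u)v, u*v, placed at the columns of its 4 corners
  //   within the shared window; the other 12 entries are 0.
  // corners (16x16, row-major): row t holds texel t of the shared window,
  //   one column per channel.
  __global__ void bilerpTensor(const half *weights, const half *corners,
                               float *out)
  {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

      wmma::fill_fragment(acc, 0.0f);
      wmma::load_matrix_sync(a, weights, 16);   // leading dimension = 16
      wmma::load_matrix_sync(b, corners, 16);
      wmma::mma_sync(acc, a, b, acc);           // out(p,c) = sum_t w(p,t)*tex(t,c)
      wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
  }

The shader side is then reduced to computing 4 coefficients per pixel and gathering the window texels into this layout; the multiply-accumulate over all channels is done by the Tensor Core.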
Sadly, as often with the GPU, there are implementation constraints, such as a fixed set of possible matrix sizes (strangely API-dependent, e.g. 16×8×16 and 16×8×8 in Vulkan) which do not conveniently fit our data sizes. Padding with 0 is always possible, but it wastes resources, so we should find more creative ways – knowing for instance that a “sparse matrix” feature exists in some APIs. Also, since we are exploring at the fringe of advanced and rarely explored ways of tickling the GPU, we sometimes bump into undocumented features or performance limitations: this can feel pretty close to a lab experiment. But sometimes we do unveil powerful secret valleys of new ways to smartly use the GPU. In all cases, we obtain a better understanding of how the GPU behaves at undocumented levels.
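To illustrate both the padding problem and one way around it: with a K=16 tile, plain bilinear fills only 4 of the 16 accumulated terms, a 2D bicubic footprint (4×4 = 16 texels) fills K exactly, and a 3D tricubic footprint (4×4×4 = 64 texels) overflows it – but the accumulator lets us split the sum into four chained multiplies. A minimal sketch, under the same illustrative CUDA wmma setting as above and assuming the 16 pixels sample within the same 64-texel brick (e.g. consecutive ray-marching steps):

  #include <mma.h>
  #include <cuda_fp16.h>
  using namespace nvcuda;

  // Tricubic: 64 taps per pixel exceed K = 16, so we split the dot product
  // into four K-sized chunks, out = W0xC0 + W1xC1 + W2xC2 + W3xC3,
  // with no zero-padding along K at all.
  // weights (16x64, row-major, ld = 64): row p = the 64 tricubic
  //   coefficients of pixel p over the shared brick.
  // corners (64x16, row-major, ld = 16): row t = texel t of the brick,
  //   one column per channel.
  __global__ void tricubicTensor(const half *weights, const half *corners,
                                 float *out)
  {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

      wmma::fill_fragment(acc, 0.0f);
      for (int k = 0; k < 4; ++k) {                  // 4 chunks of 16 taps
          wmma::load_matrix_sync(a, weights + k * 16,      64);
          wmma::load_matrix_sync(b, corners + k * 16 * 16, 16);
          wmma::mma_sync(acc, a, b, acc);            // acc += Wk x Ck
      }
      wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
  }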


Bibliography and examples

Cf. links in the description.

Prerequisite