GPU Compute Basics

So, why should you dive into the world of GPU programming?

The answer is relevant on multiple levels. Firstly, GPUs are undeniably the future of high-performance computing. Their parallel processing capabilities are becoming increasingly crucial as computational demands surge. Secondly, and perhaps more immediately compelling, is the sheer performance advantage. Learning to program a GPU unlocks access to vast computational power that CPUs simply can't touch. For certain types of problems, the speedup can reach several orders of magnitude, transforming previously intractable tasks into readily solvable ones. Imagine simulations running in real-time that previously took days to complete, or machine learning models training in hours instead of weeks. This level of performance opens doors to entirely new possibilities in fields like scientific research, artificial intelligence, and high-fidelity graphics.

However, it's crucial to acknowledge that GPUs aren't a universal solution for computing. They excel in specific domains, and their strengths are complemented by the versatility of CPUs. For instance, algorithms that rely heavily on recursion, a fundamental technique for tackling problems involving hierarchical or graph-like data structures, pose a significant challenge for GPU architectures. The inherent parallelism of GPUs doesn't lend itself easily to the sequential nature of recursive calls. Moreover, GPUs don't provide a call stack the way a CPU does, so recursive algorithms generally have to be rewritten in an iterative form before they can run there. Similarly, data structures on GPUs are often more constrained than their CPU counterparts. While it's possible to implement complex structures like hash tables, the limitations of current shading languages can make the implementation convoluted and less efficient than on a CPU. This means that choosing the right tool for the job – CPU or GPU – is paramount.
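To make the "no call stack" limitation concrete, here is a small, hypothetical C++ sketch: a recursive tree sum next to the same computation rewritten with an explicit, fixed-size stack. The Node structure and the stack depth of 64 are assumptions chosen purely for illustration, but the iterative shape is roughly what the algorithm must take before it can be expressed in a shading language.

#include <array>
#include <cstddef>
#include <vector>

struct Node
{
   float value;
   int   left;   // Index of the left child, or -1 if none.
   int   right;  // Index of the right child, or -1 if none.
};

// Recursive version: natural on a CPU, but it relies on a call stack
// that GPU shading languages do not expose.
float SumRecursive( const std::vector<Node>& nodes, int index )
{
   if( index < 0 )
      return 0.0f;

   const Node& node = nodes[ index ];
   return node.value
        + SumRecursive( nodes, node.left )
        + SumRecursive( nodes, node.right );
}

// Iterative version with an explicit, fixed-size stack (assumes the tree
// is no deeper than 64 levels).
float SumIterative( const std::vector<Node>& nodes, int root )
{
   std::array<int, 64> stack{};
   size_t top = 0;
   float  sum = 0.0f;

   if( root >= 0 )
      stack[ top++ ] = root;

   while( top > 0 )
   {
      const Node& node = nodes[ stack[ --top ] ];
      sum += node.value;

      if( node.left  >= 0 ) stack[ top++ ] = node.left;
      if( node.right >= 0 ) stack[ top++ ] = node.right;
   }

   return sum;
}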

Despite these limitations, for a specific and growing class of problems, the performance boost offered by GPUs is, to put it mildly, unparalleled. While you might need to adapt your thinking to simpler data structures and navigate some initial complexities, the resulting gains in computational speed are often more than worth the effort. The ability to tackle massive datasets, run complex simulations at interactive rates, and train sophisticated AI models efficiently makes GPU programming an invaluable skill in the modern computing landscape. It's not about replacing CPUs, but rather about expanding your computational toolbox to leverage the unique power of GPUs for the tasks they're best suited to handle.

While GPUs offer tremendous computational power, harnessing that power often requires a substantial amount of CPU-side code. The CPU acts as the orchestrator, managing data transfer to and from the GPU's memory, launching compute kernels, and synchronizing execution. This CPU code is responsible for setting up the necessary data structures, configuring the GPU environment, and handling any pre- or post-processing of the data. It effectively forms the "glue" that binds the GPU's raw processing capabilities to the overall application. This layer of CPU code can be quite complex, especially when dealing with advanced GPU features or intricate data dependencies.
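To give a feel for how much host-side "glue" even a trivial kernel needs, here is a stripped-down, hypothetical sketch using the OpenGL 4.3+ compute API. The function names CompileKernel and RunKernel are our own, kernelSource is assumed to hold the GLSL source of a compute shader declaring local_size_x = 256, and context creation, header inclusion, and error checking are all omitted to keep the sketch short.

// Compile and link a compute kernel from GLSL source.
GLuint CompileKernel( const char* kernelSource )
{
   GLuint shader = glCreateShader( GL_COMPUTE_SHADER );
   glShaderSource( shader, 1, &kernelSource, nullptr );
   glCompileShader( shader );

   GLuint program = glCreateProgram();
   glAttachShader( program, shader );
   glLinkProgram( program );
   glDeleteShader( shader );
   return program;
}

// Upload the data, launch the kernel, wait, and read the results back.
void RunKernel( GLuint program, std::vector<uint8_t>& pixels )
{
   GLuint buffer = 0;
   glGenBuffers( 1, &buffer );
   glBindBuffer( GL_SHADER_STORAGE_BUFFER, buffer );
   glBufferData( GL_SHADER_STORAGE_BUFFER, pixels.size(), pixels.data(), GL_DYNAMIC_COPY );
   glBindBufferBase( GL_SHADER_STORAGE_BUFFER, 0, buffer );

   // One workgroup per 256 elements, matching local_size_x in the kernel.
   const GLuint groupCount = static_cast<GLuint>( ( pixels.size() + 255 ) / 256 );
   glUseProgram( program );
   glDispatchCompute( groupCount, 1, 1 );

   // Make the kernel's writes visible before reading the buffer back.
   glMemoryBarrier( GL_BUFFER_UPDATE_BARRIER_BIT );
   glGetBufferSubData( GL_SHADER_STORAGE_BUFFER, 0, pixels.size(), pixels.data() );

   glDeleteBuffers( 1, &buffer );
}

Even in this minimal form, the kernel itself would be a handful of lines, while the orchestration code around it is already longer — and a production version with error handling, synchronization, and resource management grows considerably from here.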

Ironically, the actual GPU compute code—the kernels or shaders that perform the core computations—is often surprisingly small. These kernels are highly optimized and focused, designed to execute a specific task on a massive number of data elements in parallel. While they are undeniably crucial and require careful design, their size is frequently dwarfed by the surrounding CPU code. This disparity highlights the fact that GPU programming isn't simply about writing compute kernels; it's about building a complete system that effectively leverages the GPU's parallel processing capabilities.

Therefore, developing GPU-accelerated applications is often a balancing act. A significant portion of the development effort is dedicated to the CPU-side infrastructure, ensuring efficient data management and kernel orchestration. The GPU compute code itself, though small, represents the critical core of the application, where the actual parallel processing takes place. This division of labor necessitates a strong understanding of both CPU and GPU programming paradigms, making GPU development a challenging but ultimately rewarding endeavor.

Let's take a look at how a CPU might process an image.

Figure 1.1. Sample Image. This image is 512x512 pixels, which means there are a total of 262,144 pixels in the image. Fundamentally, the image is composed of 512 scanlines that are 512 pixels wide.

A sample image that is 512x512 pixels.

Figure 1.2. Sample Image. If we zoom in, the horizontal scanlines and individual pixels are clearly visible.

Zooming in to the image to see individual scan lines and pixels.

Let's examine some C++ code to saturate the red channel of the image. This is a skeletal example designed to show how image processing might be done, but it still has quite a lot of code.

Figure 1.3. C++ code for processing the image.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure to represent image dimensions.
// Makes code more readable.
struct ImageDimensions
{
    size_t width;
    size_t height;
};

class Image
{
public:
   Image( ImageDimensions dims ) :
   m_oDims( dims ), m_oData( dims.width * dims.height * m_uBytesPerPixel ) {}

   ImageDimensions getDimensions() const { return m_oDims; }
   std::vector<uint8_t>& getData() { return m_oData; }
   const std::vector<uint8_t>& getData() const { return m_oData; }

   // Access pixel data at (x , y ).
   // Performs bounds checking ( Optional, but good practice ).
   uint8_t* GetPixel( size_t x, size_t y )
   {
      assert( x < m_oDims.width );
      assert( y < m_oDims.height );

      return &m_oData[ ( y * m_oDims.width + x ) * m_uBytesPerPixel ];
   }

   const uint8_t* GetPixel( size_t x, size_t y ) const
   {
      assert( x < m_oDims.width );
      assert( y < m_oDims.height );

      return &m_oData[ ( y * m_oDims.width + x ) * m_uBytesPerPixel ];
   }

private:
   ImageDimensions m_oDims;
   std::vector<uint8_t> m_oData;
   static constexpr size_t m_uBytesPerPixel = 3; // RGB (Red, Green, Blue)
};

int main()
{
   ImageDimensions dims{ 512, 512 };
   Image image( dims );

   // Saturate the red channel ( set to 255 ).
   for( size_t y = 0; y < dims.height; ++y )
   {
      for (size_t x = 0; x < dims.width; ++x)
      {
         uint8_t* pixel = image.GetPixel( x, y );
         pixel[0] = 255; // Red channel
         // pixel[1] = 0;   // Green channel (optional: set to 0)
         // pixel[2] = 0;   // Blue channel (optional: set to 0)
      }
   }

   return 0;
}

Figure 1.4. Classic nested for() loop for image processing.

This section of code forms the core of our image processing example. We represent the image data using a std::vector, a one-dimensional array laid out scanline by scanline. To simplify working with individual pixels, the GetPixel() helper converts an (x, y) coordinate into an offset within this array: pixel (x, y) begins at byte ( y * width + x ) * 3, so pixel (3, 2) in our 512-pixel-wide RGB image starts at byte (2 * 512 + 3) * 3 = 3081. The code within the inner loop, which performs the actual pixel manipulation, will be executed 262,144 times — once for each pixel in our 512x512 image.

// Saturate the red channel ( set to 255 ).
for( size_t y = 0; y < dims.height; ++y )
{
   for (size_t x = 0; x < dims.width; ++x)
   {
      uint8_t* pixel = image.GetPixel( x, y );
      pixel[0] = 255; // Saturate the red channel.
   }
}

The double nested for loop is a fundamental and straightforward approach for processing image data. The outer loop typically iterates over the image's scanlines (rows), while the inner loop traverses the pixels within each scanline. This row-by-row, pixel-by-pixel traversal ensures that every single pixel in the image is visited and can be manipulated as needed. This pattern is particularly useful for tasks like applying image filters, color transformations, or analyzing pixel data. It mirrors how raster-based images are stored in memory, making it a natural way to access and modify pixel values.
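As one more example of the same pattern, here is a short sketch that reuses the Image class from Figure 1.3 to convert the image to grayscale using the common BT.601 luminance weights; a more careful implementation would round rather than truncate.

// Convert the image to grayscale in place using the nested-loop pattern.
void ConvertToGrayscale( Image& image )
{
   const ImageDimensions dims = image.getDimensions();

   for( size_t y = 0; y < dims.height; ++y )
   {
      for( size_t x = 0; x < dims.width; ++x )
      {
         uint8_t* pixel = image.GetPixel( x, y );

         // BT.601 luminance: 0.299 R + 0.587 G + 0.114 B.
         const uint8_t gray = static_cast<uint8_t>(
            0.299f * pixel[0] + 0.587f * pixel[1] + 0.114f * pixel[2] );

         pixel[0] = gray;
         pixel[1] = gray;
         pixel[2] = gray;
      }
   }
}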

However, when dealing with very large images, the nested loop approach can put significant pressure on the memory system, especially if the processing creates intermediate copies of the image rather than modifying it in place. If each pixel requires a substantial amount of processing or if the image dimensions are enormous, the program may suffer from cache misses or even run out of memory. The CPU is fastest when the data for the current row is already sitting in cache; if the rows are too large to fit, the CPU has to wait for the next chunk of the row to be loaded from main memory, which is significantly slower than the CPU cache.

Strategies to mitigate memory pressure include processing the image in tiles or chunks, rather than loading the entire image into memory at once. This allows for more efficient memory management by working with smaller, manageable portions of the image. Additionally, careful consideration should be given to data structures and algorithms used within the loops to minimize memory allocations and unnecessary data copying. If possible, operations should be performed in place to avoid creating multiple copies of the image data.
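Here is a minimal sketch of the tiled traversal, again reusing the Image class from Figure 1.3. The 64x64 tile size is an arbitrary choice that would need tuning for the target CPU, and in a true streaming pipeline each tile would be loaded, processed, and written out before the next one rather than living in a single in-memory image.

#include <algorithm> // for std::min

// Process the image in 64x64 tiles so the working set stays cache-sized.
void SaturateRedTiled( Image& image )
{
   constexpr size_t kTileSize = 64;
   const ImageDimensions dims = image.getDimensions();

   for( size_t tileY = 0; tileY < dims.height; tileY += kTileSize )
   {
      for( size_t tileX = 0; tileX < dims.width; tileX += kTileSize )
      {
         // Clamp the tile to the image border.
         const size_t endY = std::min( tileY + kTileSize, dims.height );
         const size_t endX = std::min( tileX + kTileSize, dims.width );

         for( size_t y = tileY; y < endY; ++y )
         {
            for( size_t x = tileX; x < endX; ++x )
            {
               uint8_t* pixel = image.GetPixel( x, y );
               pixel[0] = 255; // Saturate the red channel.
            }
         }
      }
   }
}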

Figure 1.5. Results. The red channel is saturated.

Shows the results of saturating the red channel.

Within the heart of the image processing loop, pixels are drawn into the inner loop like matter spiraling into a computational black hole. The CPU, acting as the gravitational force, fetches each pixel's data from memory, one by one. This data, typically representing the red, green, and blue color values of the pixel, is then subjected to the transformations defined within the loop's body. Whether it's a color adjustment, a filter effect, or some other manipulation, the pixel's data is modified, undergoing a change within this computational vortex.

Once the pixel has been processed, it's ejected from the inner loop, its altered data now ready to be written back to memory. Like energy radiating from a black hole, the modified pixel emerges, carrying the imprint of the transformations it has undergone. This constant flow of pixels into and out of the inner loop creates a continuous stream of processed image data, gradually transforming the entire image.

The efficiency of this process depends heavily on how quickly the CPU can feed the inner loop with pixel data. Memory access patterns, caching strategies, and the complexity of the pixel manipulations all play a role in determining the overall performance. Just as the behavior of matter near a black hole is governed by extreme gravitational forces, the speed and efficiency of pixel processing are dictated by the underlying hardware and the design of the image processing algorithms.

Figure 1.6. Inner Loop / Black Hole. Note that the diagram shows the source pixels being sucked into the inner loop, and ejected out into the destination image.

Shows the source pixels flowing into the inner loop and out to the destination image.

Things are quite different on the GPU.

While CPU-based image processing often involves sequentially fetching individual pixels into the inner loop for modification, GPU-based processing takes a fundamentally different approach. Instead of pixels being pulled into the processing unit, the program itself is broadcast or dispatched over the vast sea of pixel data. The GPU, with its massively parallel architecture, executes the same processing kernel (the program) concurrently across thousands of pixels. Imagine the inner loop being replicated and applied simultaneously to every pixel in the image, rather than iterating through them one by one. This "data-parallel" execution model allows the GPU to perform image processing at a dramatically accelerated rate, as the same instructions are applied to a multitude of data points in parallel. The GPU doesn't wait for pixels to come to it; it sends the processing logic to the pixels, achieving massive speedups for many image processing tasks.

Figure 1.7. Fragment Shader. This is the image processing kernel that will be run on every pixel in the image.

// #version 430
// The version number is automatically injected by the application.
// It is included above for reference purposes only.
#include <SPA_Version.glsl>
#include <SPA_Constants.glsl>
#include <Modules/SPA_EditStateFragmentColorOverride.glsl>
#include "vertex_attributes.glsl"

in Data { vertexData attributes; } DataIn;

layout( binding = 0 ) uniform sampler2D src_image;

out vec4 fragColor;

void main(void)
{
   vec4 pixel_color = texture( src_image, DataIn.attributes.texcoord );
   pixel_color.r = 1.0;
   fragColor = pixel_color;
}

Figure 1.8. Pixel Shader Wavefronts. A stylized representation of pixel shader wavefronts. Imagine the program shown above running inside each grid square. Every program is the same, but they all execute at the same time.

Shows the program being dispatched over the image on the GPU.

GPU programming presents a powerful paradigm shift in how we approach computationally intensive tasks, particularly those involving large datasets. While CPUs excel at sequential processing and complex control flow, GPUs shine in their ability to execute the same operation across vast amounts of data concurrently. This data-parallel approach makes GPUs ideal for problems where the same algorithm can be applied independently to many data points, such as image processing, scientific simulations, and machine learning. Instead of fetching individual data elements into a processing core, as in CPU-based workflows, GPUs broadcast the processing logic (a kernel) to the data itself, enabling massive parallelization.

This "program-centric" approach, as opposed to the CPU's "data-centric" model, unlocks significant performance gains. In the context of image processing, for example, a single fragment shader program can be executed simultaneously on every pixel of an image, drastically accelerating operations like color adjustments, filtering, and transformations. Furthermore, techniques like array textures and shader storage buffers allow for efficient access to large datasets within these kernels, enabling complex operations on volumes of data. This massively parallel programming model allows developers to focus on defining the core processing logic, while the GPU's hardware and scheduler handle the complexities of parallel execution.

Ultimately, GPU programming offers a compelling solution for a wide range of parallel data processing challenges. While not a universal replacement for CPUs, GPUs provide a specialized tool for tasks that benefit from data-parallelism. By embracing the GPU's unique architecture and programming model, developers can unlock unprecedented performance, enabling them to tackle previously intractable problems and push the boundaries of computation in fields ranging from graphics and visualization to scientific computing and artificial intelligence.

GPU programming can be difficult to learn, but instead of having to solve the problems of parallelism entirely on your own, or relying on parallel-programming libraries that deliver limited results, you can leave the scheduling of work to the GPU's thread scheduler.

Or in other words:

Write once, run millions and millions and millions of times.