Learning CUDA

In this worklog I'll be documenting my process of learning how to make a simple CUDA kernel.


To Start

I'll be using this tutorial/blog post[1] published by Mark Harris.

In preparation I installed g++ and nvidia-cuda-toolkit.

The blog post begins with the following C++ code:

#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

Let's analyze what this does.
To start, we define add(int n, float *x, float *y), which is just a for-loop iterating through two arrays of size N and adding x into y element by element.

What is size N? N is defined as int N = 1 << 20;. The << operator is a bitwise left shift: it moves the 1 bit 20 positions to the left, so the expression can be read as $2^{20} = 1{,}048{,}576$. This all means that we have two arrays $\sim1{,}000{,}000$ elements in size.
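
As a quick sanity check (a standalone snippet of my own, not part of the tutorial), printing the shifted value confirms the arithmetic:

#include <iostream>

int main(void)
{
  int N = 1 << 20;              // shift the 1 bit 20 places to the left
  std::cout << N << std::endl;  // prints 1048576, i.e. 2^20
  return 0;
}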

Continuing, both arrays are initialized on the host: every element of x is set to the float $1.0$ and every element of y to $2.0$. The call add(N, x, y); then adds x into y element by element, so every y[i] becomes $1.0 + 2.0 = 3.0$. The result is checked for errors by taking the absolute difference between each element of y and $3.0$ and keeping track of the maximum, which starts at $0.0$.

Then the memory is freed and, for convention's sake, a 0 is returned.


Running CPU/Host

Running this program (compiled with g++) of course prints Max error: 0 in the console. So far this code runs entirely on the CPU, so let's get onto the GPU implementation.


GPU Start

The theory behind it is quite simple to understand. First off, we need to put the code in a CUDA C++ (.cu) file.
Here is the CUDA code:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}

Thankfully the author, Harris, has already commented the code, but I will explain what it's doing nonetheless.

We only want to run the add(int n, float *x, float *y) function on the GPU, so we declare that it runs on the GPU using the __global__ specifier, which turns add(int n, float *x, float *y) into a CUDA kernel. To summarize, the CUDA C++ compiler now understands that this function runs on the GPU and can be called from CPU code.
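
As a minimal, self-contained illustration of the pattern (a hello-world kernel of my own, not part of the tutorial), a __global__ function is defined as device code and launched from host code with the triple angle bracket syntax:

#include <cstdio>

// __global__ marks a function that runs on the GPU (device)
// but is called from CPU (host) code
__global__ void hello()
{
  printf("hello from the GPU\n"); // device-side printf
}

int main(void)
{
  hello<<<1, 1>>>();       // launch with 1 block of 1 thread
  cudaDeviceSynchronize(); // wait for the kernel (and its output) to finish
  return 0;
}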

Nvidia uses something called Unified Memory[2], which makes allocating memory feel similar to dynamic memory allocation in C++ (in my opinion). First we have our int N = 1 << 20; from the CPU-only code, which we know is just the integer $1{,}048{,}576$, and the pointers float *x, *y;. This time, instead of dynamically allocating the memory with float *x = new float[N] (i.e. creating an array N elements deep), we call cudaMallocManaged with N*sizeof(float) to allocate memory for N floats, for x and y respectively.
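
Side by side, the two allocation styles from the listings above look like this:

// CPU-only version: plain C++ heap allocation, visible to the host only
float *x = new float[N];
// ... use x on the CPU ...
delete [] x;

// Unified Memory version: one pointer, accessible from both CPU and GPU
float *x;
cudaMallocManaged(&x, N*sizeof(float));
// ... use x on the CPU and/or inside a kernel ...
cudaFree(x);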

We go on to initializing the arrays on the host (CPU), same as before. Here is the interesting part: we can now call add<<<1, 1>>>(N, x, y);, which launches the kernel with a single thread on the GPU, and that thread walks through our arrays adding them just like the CPU did. When it's finished we cannot simply act as if we had been running solely on the CPU, though: a cudaDeviceSynchronize(); is necessary to let the GPU completely finish before the host accesses the result.
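
To make the ordering explicit, here are the same lines from the listing above with the synchronization point annotated (the firstResult variable is just illustrative):

add<<<1, 1>>>(N, x, y);   // the launch is asynchronous: this call returns to the host immediately

cudaDeviceSynchronize();  // block the host until all preceding GPU work has finished

float firstResult = y[0]; // only now is it safe to read y on the host; expected 3.0f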

The errors are checked the same as before, and like before we need to free the memory, which is done with cudaFree() for x and y respectively.


Running GPU/Device

Compiling the .cu file with nvcc (which comes with the CUDA toolkit) and running it, as expected there are no errors and Max error: 0 is printed, same as before. Additionally though Harris

WIP

This post is actively being written!


  1. Mark Harris, Jan 25th 2017, An Even Easier Introduction to CUDA

  2. A single pool of memory accessible from both the CPU and the GPU; see the NVIDIA technical blog.