
This is the first article in the ‘Hello world’ series for the CUDA platform. In this article I will write a really simple kernel to introduce the CUDA environment and to lay the foundations for further work. I will also explain what a kernel is, by the way 😉

CUDA ‘Hello world’ articles:

1. Part 1: simple use of kernel with managed environment
2. Part 2: simple example of real parallel programming
3. Part 3: real life CUDA example: time series denoising with Discrete Wavelet Transform (wavelet Daubechies 4)


Prerequisites:

1. First of all, you need a computer with an NVIDIA graphics card that supports CUDA. The full list can be found at http://en.wikipedia.org/wiki/CUDA.

2. The next thing is to download and install the CUDA Toolkit. The newest version can be found at https://developer.nvidia.com/cuda-toolkit.

3. Since I will show how to interact with the .NET managed environment, the last step is to download the managedCuda framework. It can be found at http://managedcuda.codeplex.com/.

CUDA code:

There are a few things that need to be clarified before we begin. When we write code to be run on the GPU we need to define three terms:
1. Device
2. Host
3. Kernel
The Device is your GPU and the Host is your CPU. Each has its own memory. This is important because you will need to take care of variables on both ends. The Kernel is simply the function that runs on the Device.

Now that we know all that, we are ready to have some fun. As a .NET developer I am going to use Visual Studio 2012 as my environment. Choose ‘New project’ and then ‘CUDA 5.5 Runtime’ (the name might differ depending on the CUDA Toolkit version). This project type is installed by the CUDA Toolkit (mentioned in the prerequisites section). Open the kernel.cu file and delete its contents. Next, copy and paste the code below into kernel.cu.

#include "cuda_runtime.h"
#include <stdio.h>

__global__ void kernel(int a, int b, int *c)
{
    *c = (a + b)*(a + b);
}

int main()
{
    return 0;
}

Those twelve lines are a fully functional CUDA program (I mean kernel). Not rocket science, is it? I left the main method empty because we are going to “consume” the kernel from the .NET environment. The crucial part here is the method kernel (the name can be chosen freely). This is what we want to run on the GPU. The __global__ keyword comes with the CUDA headers (cuda_runtime.h). It means that this method will be run on the Device (that is, on the GPU) and can be launched from the Host. The method takes two integer parameters, a and b, and produces an integer result. A kernel must return void, so the output parameter is a pointer to an integer. We will simply compute (a + b)^2. Let’s keep it as simple as possible.
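Before involving the GPU, it can help to check the arithmetic on the CPU. The sketch below is plain host-only C++, not a CUDA kernel: it mirrors the kernel's parameter list and body without the __global__ qualifier. The names kernel_reference and squared_sum are made up for this illustration and appear nowhere in the project.

```cpp
// Host-only stand-in for the CUDA kernel: same parameters and the same
// arithmetic, but an ordinary C++ function that runs on the CPU.
// The name kernel_reference is hypothetical, chosen for this sketch.
void kernel_reference(int a, int b, int *c)
{
    // The result is written through the pointer, exactly as in the kernel.
    *c = (a + b) * (a + b);
}

// Hypothetical convenience wrapper that hands the result back by value.
int squared_sum(int a, int b)
{
    int result = 0;
    kernel_reference(a, b, &result);
    return result;
}
```

Calling squared_sum(3, 10) gives 169, which is the value the GPU version should print later on.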

We may now compile our code. The build result for kernel.cu should appear in the output directory. We are almost done, but what we need is not the .cu file but a .ptx file (PTX is the intermediate assembly generated by the CUDA compiler; Visual Studio lists it among the ‘preprocessed’ files). To get it we need to tell the compiler to keep it for us: in the project properties, set ‘Keep preprocessed file’ to ‘Yes (--keep)’. There should now be far more files in the output directory, and among them the one we wanted: ‘kernel.ptx’. We are now ready to consume it in .NET.

.NET environment – C#:

Now we are ready to use our kernel method in managed code. Let us add a new project to our solution: ‘C# Console Application’. The next step is to add a reference to managedCuda (ManagedCuda.dll). To use our kernel we need to do three things:
1. Create CUDA context
2. Load module with our kernel (the ptx file)
3. Create new kernel

Let us have a look at the code below. Within the Main method we invoke the InitKernels method to do all three of those things.

    using System;
    using ManagedCuda;
    using ManagedCuda.BasicTypes;

    class Program
    {
        static CudaKernel addWithCuda;

        static void InitKernels()
        {
            CudaContext cntxt = new CudaContext();
            CUmodule cumodule = cntxt.LoadModule(@"C:\CUDAKernels\kernel.ptx");
            addWithCuda = new CudaKernel("_Z6kerneliiPi", cumodule, cntxt);
        }

        static Func<int, int, int> cudaAdd = (a, b) =>
        {
            // init output parameters
            CudaDeviceVariable<int> result_dev = 0;
            int result_host = 0;
            // run CUDA method
            addWithCuda.Run(a, b, result_dev.DevicePointer);
            // copy return to host
            result_dev.CopyToHost(ref result_host);
            return result_host;
        };

        static void Main(string[] args)
        {
            InitKernels();
            Console.WriteLine(cudaAdd(3, 10));
        }
    }

The first line in the InitKernels method creates a new context. From CUDA 4.0 on, the CUDA API demands (at least) one context per process per device. So for each device you want to use, you need to create a CudaContext instance. The second line loads a module: the ptx file that we produced earlier. The LoadModule method takes the path to the ptx file, so do not forget to customize it. The last line creates the kernel. The constructor takes three parameters, the first of which is the name of the function in the ptx file. Note that the method name in the ptx differs from the one we used in the CUDA project. To find ‘the real’ name you have to view the ptx. It is basically a text file, so you may use Notepad to do that. Look for the line with the .entry directive: the name that follows it is the kernel name.
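The odd-looking name _Z6kerneliiPi is a C++ mangled name: _Z is the prefix, 6kernel is the six-character function name, ii stands for two int parameters and Pi for a pointer to int. As a rough illustration, the host-only C++ sketch below decodes such a name with abi::__cxa_demangle. It assumes a GCC or Clang toolchain, since <cxxabi.h> is not part of Visual C++, and the helper name demangle is made up for this example.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Returns the human-readable form of an Itanium-ABI mangled symbol,
// or the input unchanged if it cannot be demangled.
// Assumption: compiled with GCC or Clang, which provide abi::__cxa_demangle.
std::string demangle(const std::string &mangled)
{
    int status = 0;
    char *buffer = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || buffer == nullptr)
        return mangled;
    std::string readable(buffer);
    std::free(buffer); // __cxa_demangle allocates the buffer with malloc
    return readable;
}
```

With this helper, demangle("_Z6kerneliiPi") yields kernel(int, int, int*), which matches the signature in kernel.cu. A common way to avoid the mangled name altogether is to declare the kernel extern "C", in which case the ptx should contain the plain name kernel.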

I have wrapped the kernel method in a Func delegate, and this is where the magic happens. First of all we define the output parameter: one variable on the Host side and one on the Device. We pass the result_dev device pointer to the kernel, and after running the kernel we copy the value it points to into result_host (from Device to Host, that is, from GPU to CPU).

After running the program we should see 169 on the console screen, which is (3 + 10)^2. This is not real parallel programming yet, but the computation did run on the GPU. I think that was pretty easy and a good point to start the CUDA adventure.

.NET environment – F#:

The code to run the kernel in F# is very similar:

open System
open ManagedCuda
open ManagedCuda.BasicTypes

let cntxt = new CudaContext()
let cumodule = cntxt.LoadModule(@"C:\CUDAKernels\kernel.ptx")
let addWithCuda = new CudaKernel("_Z6kerneliiPi", cumodule, cntxt)

let doWorkWithCuda a b =
    let intSize = 4;
    let result_host = ref 0
    let mutable result_dev : CudaDeviceVariable<int> = new CudaDeviceVariable<int>(new SizeT(intSize))
    // run cuda method
    addWithCuda.Run(a, b, result_dev.DevicePointer) |> ignore
    // copy return to host
    result_dev.CopyToHost(result_host)
    !result_host

[<EntryPoint>]
let main argv =
    printfn "%d" (doWorkWithCuda 3 10)
    Console.ReadKey() |> ignore
    0 // return an integer exit code

However, there are a few differences. The first one is in the creation of the result_dev variable: we have to call the constructor explicitly. Unfortunately there is no parameterless one, so I passed the size of the stored object, which is 4 bytes (an integer). The second difference is more interesting: line 11 of the listing. We have to define result_host as a reference cell. F# has constructs inherited from OCaml. A “ref” is a mutable heap-allocated cell that holds one piece of data; the function “ref” allocates a new ref cell given an initial value. Line 17, with the prefix operator ‘!’, dereferences the ref cell (returns the value stored inside it). After running the program we should see the magic number 169 on the console screen.


I hope I have shown you how easy it can be to run code on your GPU and use it from the .NET environment. The examples are in C# and F#, but any other CLI language can be used. This is a ‘Hello world’ program, so I tried to keep it as simple as possible. For that reason I have not shown the real advantage of using the GPU, which is parallelism, but I will do so in my next post, so stay in touch.