Up and running with Theano (GPU) + PyCUDA on Windows

tooling

Getting CUDA to work with python on Windows is really frustrating. Its not exactly hard, but is sure irritating when you start doing it. If you have ever tried it, you might be knowing that many possible combinations of compilers, cuda toolkit, python etc. don't work.

This post describes the steps that I followed for a working setup of theano working with GPU acceleration and PyCUDA for general access to GPU from python. Hopefully, it will help if you haven't found the sweet spot yet.

1. Setting up

Starting with my machine, it is a Pavilion DV6 7012tx Laptop with Nvidia GeForce GT 630m card. Right now its running Windows 10 x64. If you are already having cygwin or mingw based gcc in place, you might want to remove that since our scientific python stack will provide that.

1.1. Install Visual Studio

This is needed to get Nvidia's CUDA compiler (nvcc) working. For choosing the version, go to the latest CUDA on Windows doc and see which version of visual studio the current CUDA toolkit supports.

At the time of writing, CUDA 7 was the latest release and Visual Studio 2013 was the latest supported version. You also don't need to install 2008 or 2010 version of compiler for python. This will be taken care of later, just go with everything latest.

After installation, you don't actually need to add cl.exe (usually in a directory like C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin, depending on your Visual Studio version) to PATH for theano since we will define this explicitly in .theanorc, but it is better to do this as many other tools might be using it.

1.2. Install CUDA toolkit

This should be easy, get the latest CUDA and install it. Keep the samples while installing, they are nice for checking if things are working fine.

1.3. Setup Python

This is where most of the trouble is. Its easy to get lost while setting up vanilla python for theano specially since you also are setting up gcc and related tools. The theano installation tutorial will bog you down in this phase if you don't actually read it carefully. Most likely you would end up downloading lots of legacy Visual Studio versions and other stuff. We won't be going that way.

Install a scientific python distribution like Anaconda. I haven't tried setting up theano using other distributions but this should be one of the easier ways because of conda package manager. This really relieves you from setting up a separate mingw environment and handling commonly used libraries which are as easy as conda install boost in Anaconda.

If you feel Anaconda is a bit too heavy, try miniconda and adding basic packages like numpy on top of it.

Once you install Anaconda, install additional dependencies.

conda install mingw libpython

1.4. Install theano

Install theano using pip install theano and create a .theanorc file in your HOME directory with following contents.

[global]
floatX = float32
device = gpu

[nvcc]
flags=-LC:\Anaconda\libs
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin

Make sure to change the path C:\Anaconda\libs according to your Anaconda install directory and compiler_bindir to the path with cl.exe in it.

1.5. Install PyCUDA

Best way to install PyCUDA is to get the Unofficial Windows Binaries by Christoph Gohlke here.

For quick setup, you can use pipwin which basically automates the process of installing Gohlke's packages.

pip install pipwin
pipwin install pycuda

2. Test it out

A very basic test is to simply import theano.

import theano
Using gpu device 0: GeForce GT 630M (CNMeM is disabled)

This should tell you if the GPU is getting used.

One error says that CUDA is installed, but device gpu is not available. For me, this was solved after installing mingw and libpython via conda since Anaconda doesn't setup gcc along with it as it used to do earlier.

For a more extensive test, try the following snippet taken from theano docs.

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

You should see something like this.

Using gpu device 0: GeForce GT 630M (CNMeM is disabled)
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.42199993134 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu

For PyCUDA, here is a quick test snippet from their web page

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print dest-a*b

Seeing an array of zeros ? Its working fine then.

Hopefully, this should give you a working pythonish CUDA setup using the latest versions of VS, Windows, Python etc.