Journey to GAN

orlovcs · November 6th, 2020 · 2 min read

Initial Goals

The original goal was to get a GAN generating images on an Arch Linux system. What follows are the hurdles encountered along the way while installing Keras, CUDA (Compute Unified Device Architecture), cuDNN (CUDA Deep Neural Network library) and TensorFlow on that system.

Installing CUDA

At the time of writing, TensorFlow does not officially support Arch Linux as an operating system; however, all the necessary packages could still be found in the official and community repositories.

Based on the GPGPU article from the wiki, CUDA could be installed simply by running the following command:

sudo pacman -S nvidia cuda cudnn

At the time, this installed CUDA 11.1.1 and cuDNN 8.0.4. After a successful installation and reboot, the sample files could be built with the installed nvcc compiler. These samples were located in /opt/cuda/samples. The simplest way to check for a successful installation would be to copy them to the home directory, then build and run deviceQuery:

# Copy the samples to a writable location (no sudo needed)
cp -r /opt/cuda/samples ~/samples
cd ~/samples/1_Utilities/deviceQuery
# Build and run the device query sample
make
./deviceQuery

This would however result in an error similar to the following:

code=999(cudaErrorUnknown) cudaGetDeviceCount(&device_count)

While trying to debug why my device was not being recognized by CUDA, I mistakenly uninstalled the nvidia package. This resulted in several hours of attempting to start an X server after getting a hang on

Reached target Graphical Interface

with the nvidia drivers reinstalled, and a black screen when logging in through the SDM with the nouveau drivers, despite the headless interface being correctly scaled when using only the nouveau drivers. With the nvidia drivers and a regenerated X configuration, the X server would crash, but before crashing the log in /var/log/Xorg.0.log would show:

Failed to initialize the NVIDIA kernel module

After belatedly remembering that package updates were important, both the X server crash and the CUDA device error were fixed simply by running a full system upgrade:

sudo pacman -Syu
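
To confirm after such an upgrade and reboot that the NVIDIA kernel module is actually loaded, one option is to look at /proc/modules, which lists loaded kernel modules one per line. The following is a minimal sketch rather than anything from the original setup; the helper name module_loaded is hypothetical, and the script only reports what it finds:

```python
from pathlib import Path


def module_loaded(name, modules_text):
    """Return True if `name` is listed as a module in /proc/modules-style text."""
    # Each non-empty line starts with the module name, followed by size, use count, etc.
    return any(
        line.split()[0] == name
        for line in modules_text.splitlines()
        if line.strip()
    )


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        print("nvidia loaded:", module_loaded("nvidia", proc_modules.read_text()))
    else:
        print("/proc/modules is not available on this system")
```

If the module is missing here, deviceQuery and the X server can both be expected to fail regardless of how CUDA itself is installed.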

Installing TensorFlow

With CUDA working, all that was left was to perform:

pip install tensorflow-gpu keras

in order to create the GAN. However, any Python script that imported TensorFlow was unable to locate the graphics card when printing the available devices. This was remedied by performing the following actions:

pip uninstall tensorflow-gpu
sudo pacman -S tensorflow-cuda
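
The device check mentioned above can be sketched as follows. This is an illustrative script, not the one used originally; it assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available, and the helper name visible_gpus is hypothetical:

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if the import fails."""
    try:
        import tensorflow as tf
    except ImportError:
        # Covers both a missing package and a missing shared library at import time.
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow could not be imported in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("TensorFlow imported, but no GPU was found")
```

An empty list corresponds to the situation described above, where TensorFlow runs but cannot locate the graphics card.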

With the correct version of TensorFlow installed, TensorFlow could access the graphics card and almost all the libraries loaded successfully. Every library loaded with the exception of libcusolver.so.10:

ImportError: libcusolver.so.10.0: cannot open shared object file: No such file or directory

Setting the following did not resolve the problem, although it did change the import error slightly to include the path where CUPTI lives:

export LD_LIBRARY_PATH=/opt/cuda/extras/CUPTI/lib64
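
When chasing a missing shared object like this, it can help to ask the dynamic loader directly which library versions it can open, independently of TensorFlow. The sketch below uses Python's ctypes for that. The helper name can_load is hypothetical, and the library names in the loop are examples only, since the exact sonames depend on the installed CUDA and cuDNN versions:

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Example sonames; adjust them to the versions installed on the system.
    candidates = ("libcudart.so", "libcusolver.so.10", "libcusolver.so.11", "libcudnn.so.8")
    for name in candidates:
        print(f"{name}: {'found' if can_load(name) else 'missing'}")
```

If a newer libcusolver is found while libcusolver.so.10 is missing, that would point to a version mismatch between the installed CUDA toolkit and the one TensorFlow was built against, rather than a search path problem.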

Docker - The Way Out

With every option for getting things working natively seemingly exhausted, Docker seemed like a viable option, as the official TensorFlow containers would be sure to ship matching versions of CUDA and TensorFlow.

The following three commands took care of all the issues described above and, once a volume was mounted, provided a fully functional development environment:

docker run --gpus all --rm nvidia/cuda nvidia-smi
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
  python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
docker run --gpus all -it tensorflow/tensorflow:latest-gpu bash

Three commands took care of everything
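
The commands above do not show the volume mount itself. As a hedged sketch of how it might look, the following Python builds a docker run command that bind-mounts a project directory into the container using the standard -v and -w options. The function name docker_dev_command and the /workspace mount point are assumptions made for illustration, not the original setup:

```python
import shlex
from pathlib import Path


def docker_dev_command(project_dir, image="tensorflow/tensorflow:latest-gpu",
                       mount_point="/workspace"):
    """Build a `docker run` argument list that mounts project_dir into the container."""
    host_path = Path(project_dir).resolve()
    return [
        "docker", "run", "--gpus", "all", "-it", "--rm",
        "-v", f"{host_path}:{mount_point}",  # bind-mount the project directory
        "-w", mount_point,                   # start the shell inside the mount
        image, "bash",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(docker_dev_command(".")))
```

Files written under the mount point inside the container then persist in the project directory on the host after the container exits.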

Building the GAN

After installing the correct pip packages, the environment was ready to create a GAN model. This article was the inspiration for the model.

After building the model and tuning the number of epochs and the batch size, the following are images generated at several iterations when the GAN was fed a dataset of cubism paintings:

[Images: GAN output at the 4th, 33rd, 48th, 54th and 65th iterations]
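
For a rough sense of how the batch size and number of epochs interact during tuning, the arithmetic can be sketched as below. It assumes each iteration above corresponds to one full pass over the dataset. The function name training_schedule is hypothetical, and the dataset size of 1000 images is made up for illustration, since the actual size is not stated:

```python
import math


def training_schedule(num_images, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for a dataset of num_images."""
    if num_images <= 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("num_images, batch_size and epochs must be positive")
    steps_per_epoch = math.ceil(num_images / batch_size)  # last batch may be partial
    return steps_per_epoch, steps_per_epoch * epochs


if __name__ == "__main__":
    # 1000 images is an invented dataset size, used only to show the trend.
    for batch_size in (16, 32, 64):
        steps, total = training_schedule(num_images=1000, batch_size=batch_size, epochs=65)
        print(f"batch_size={batch_size}: {steps} steps/epoch, {total} steps in total")
```

Smaller batches mean more generator and discriminator updates per epoch, which is one reason the two settings tend to be tuned together.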

Conclusion

Despite the hurdles, an effective way to run a GAN using TensorFlow, Keras and CUDA on Arch Linux was found: by running everything in Docker containers, driver incompatibilities could be almost completely avoided.