Initial Goals
The original intention was to get a GAN generating images on an Arch Linux system. The following are the hurdles encountered along the way to successfully installing Keras, CUDA (Compute Unified Device Architecture), cuDNN (CUDA Deep Neural Network library), and TensorFlow on Arch Linux.
Installing CUDA
At the time of writing, TensorFlow does not officially support Arch Linux as an operating system; however, all the necessary packages can still be found in the official and community repositories.
Intuitively, based on the GPGPU article from the Arch wiki, CUDA could be installed simply by running the following command:
sudo pacman -S nvidia cuda cudnn
At the time, this installed CUDA 11.1.1 and cuDNN 8.0.4. After a successful installation and reboot, the sample files could be tested with the installed nvcc compiler. These samples were located in /opt/cuda/samples. The simplest way to check for a successful installation would be to copy, build, and run the deviceQuery sample:
sudo cp -r /opt/cuda/samples ~/samples
cd ~/samples
cd 1_Utilities
cd deviceQuery
make
./deviceQuery
This would, however, result in an error similar to the following:
code=999(cudaErrorUnknown) cudaGetDeviceCount(&device_count)
In the midst of attempting to debug why the device was not being recognized by CUDA, I mistakenly uninstalled the nvidia package. This resulted in several hours of attempting to start an X server: with the nvidia drivers reinstalled, boot would hang on

Reached target Graphical Interface

while with the nouveau drivers, logging in through the display manager gave a black screen, despite a correctly scaled headless interface when using only the nouveau drivers. Using the nvidia drivers with a regenerated X configuration, the X server would still crash, but before crashing the log in /var/log/Xorg.0.log would show:

Failed to initialize the NVIDIA kernel module
After offhandedly remembering that package updates are important, the X server crash and the CUDA device-not-found error were both fixed simply by running:
sudo pacman -Syu
Installing TensorFlow
With CUDA working, all that was left was to perform:
pip install tensorflow-gpu keras
in order to create the GAN. However, any Python script importing TensorFlow was perpetually unable to locate the graphics card when printing out the available devices. This was remedied by performing the following:
pip uninstall tensorflow-gpu
sudo pacman -S tensorflow-cuda
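For reference, the device check mentioned above amounted to printing TensorFlow's visible devices; a minimal sketch of such a check (assuming a TensorFlow 2.x build, where tf.config.list_physical_devices is available) would be:

# Quick sanity check that TensorFlow can see the GPU (assumes TensorFlow 2.x).
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # expect one entry per visible GPU
print(tf.test.is_built_with_cuda())            # True if this build links against CUDA

An empty device list was the symptom described above.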
With the correct version of TensorFlow installed, the graphics card was being accessed by TensorFlow and all the libraries successfully loaded - almost. Every single library loaded with the exception of libcusolver.so.10:

ImportError: libcusolver.so.10.0: cannot open shared object file: No such file or directory
Setting the following environment variable did not seem to affect anything, although it did change the import error slightly to include the path where CUPTI would lie:

export LD_LIBRARY_PATH=/opt/cuda/extras/CUPTI/lib64
Docker - The Way Out
With every option for a smoothly working native setup seemingly exhausted, Docker looked like a viable way out, as the official TensorFlow containers are built with matching versions of CUDA and TensorFlow.
The following three commands automatically took care of all of the issues described above and, upon mounting a volume, created a fully functional development environment:
docker run --gpus all --rm nvidia/cuda nvidia-smi
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
docker run --gpus all -it tensorflow/tensorflow:latest-gpu bash
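The volume mount itself is not shown above; a typical invocation along these lines (the /workspace path and the use of the current directory are illustrative choices, not taken from the original setup) would expose a project directory inside the container:

# Mount the current directory into the container as the working directory (paths are illustrative).
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace -w /workspace \
    tensorflow/tensorflow:latest-gpu bash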
Three commands took care of everything
Building the GAN
After installing the correct pip packages, the environment was ready to create a GAN model. This article was the inspiration for the model.
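The exact architecture follows that article; purely to illustrate the shape of the model, a minimal Keras sketch of this kind of GAN might look like the following (the layer sizes, the 64x64 RGB image shape, and the 100-dimensional latent vector are assumptions made for the sketch, not values taken from the article):

# Minimal GAN sketch (illustrative; layer sizes, the 64x64 RGB image shape and the
# 100-dimensional latent vector are assumptions, not the referenced article's values).
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 100

def build_generator():
    # Upsample a latent vector into a 64x64 RGB image in [-1, 1].
    return tf.keras.Sequential([
        layers.Dense(8 * 8 * 256, input_shape=(LATENT_DIM,)),
        layers.LeakyReLU(0.2),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh"),
    ])

def build_discriminator():
    # Classify 64x64 RGB images as real or generated.
    return tf.keras.Sequential([
        layers.Conv2D(64, 4, strides=2, padding="same", input_shape=(64, 64, 3)),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ])

discriminator = build_discriminator()
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

generator = build_generator()
discriminator.trainable = False  # freeze the discriminator while training the generator
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

Training alternates between a discriminator step on a batch of real and generated images and a generator step through the combined model with the discriminator frozen.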
After building the model and tuning the epoch and batch sizes, the following are samples generated at several epoch iterations when the GAN was fed a dataset of cubism paintings:
Generated samples at the 4th, 33rd, 48th, 54th, and 65th iterations.
Conclusion
Despite the hurdles, an effective way to run a GAN using TensorFlow, Keras, and CUDA on Arch Linux was found, and by using Dockerized containers, driver incompatibilities could be almost completely avoided.