Working on Tribble/Trouble

Trouble and Tribble are IBM Power9 machines, each with four NVIDIA V100 GPUs for use with neural networks. They share the same file system, so once you set up one, you'll be able to use either. We'll be using TensorFlow, Google's neural network library. To get TensorFlow to work on Trouble/Tribble with GPU support, follow these steps:

Sharing GPUs

Neural network libraries try to be fast by allocating all of the memory on every visible GPU in advance. That's not conducive to sharing a machine! You only need one of those GPUs! So, when you're about to run a job, run the command nvidia-smi to see which GPUs are currently in use, then request one of the others. The following example makes only the one requested GPU visible, using a command-line argument:

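import argparse
import os

parser = argparse.ArgumentParser()
# index of the GPU to use; each machine has four, numbered 0-3
parser.add_argument("gpu", type=int, choices=range(4))
# other argparse stuff for your program
args = parser.parse_args()

# Make only the requested GPU visible. Set this before importing
# TensorFlow so that it only ever sees this one GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

# other imports (for example, import tensorflow as tf)
# rest of your code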
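
If you would rather not read the nvidia-smi table by eye, the check for GPUs that are not in use could be scripted. The following is only a rough sketch, not part of the official setup: the helper names (parse_gpu_memory, idle_gpus, query_gpu_memory) are made up for this example, and the 100 MiB cutoff for calling a GPU "idle" is an assumption you may want to adjust. It relies on nvidia-smi's CSV query output (--query-gpu=index,memory.used with --format=csv,noheader,nounits), which prints one "index, MiB used" row per GPU.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def idle_gpus(usage, max_used_mib=100):
    """Return sorted GPU indices using at most max_used_mib of memory.

    The 100 MiB default is an assumed cutoff, not an official one.
    """
    return sorted(i for i, used in usage.items() if used <= max_used_mib)


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use (needs nvidia-smi installed)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

On Trouble or Tribble, calling idle_gpus(query_gpu_memory()) should list the GPU numbers that look free at that moment; pass one of them as the gpu argument to your program. Someone else may grab the same GPU between your check and your job starting, so it is still worth glancing at nvidia-smi again once your job is running.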