get_nr_gpu() now really counts the GPUs

Commit 1a5d3f4f authored Feb 10, 2017 by Yuxin Wu
5 changed files
--- a/docs/tutorial/efficient-data.md
+++ b/docs/tutorial/efficient-data.md
@@ -59,9 +59,10 @@ there are ways to understand which one is the bottleneck:
 ### Load ImageNet efficiently
-We take ImageNet dataset as an example of how to optimize a DataFlow for speed.
+We take ImageNet dataset as an example of how to optimize a DataFlow.
 We use ILSVRC12 training set, which contains 1.28 million images.
-Following the [ResNet example](../examples/ResNet), our pre-processing need images in their original resolution, so we don't resize them.
+Following the [ResNet example](../examples/ResNet), our pre-processing need images in their original resolution, so we'll read the original
+dataset instead of a down-sampled version here.
 The average resolution is about 400x350 <sup>[[1]]</sup>.
 The original images (JPEG compressed) are 140G in total.

--- a/examples/A3C-Gym/README.md
+++ b/examples/A3C-Gym/README.md
 ### Code and models for Atari games in gym
-Implemented A3C in [Asynchronous Methods for Deep Reinforcement Learning](http://arxiv.org/abs/1602.01783).
+Implemented Multi-GPU version of the A3C algorithm in [Asynchronous Methods for Deep Reinforcement Learning](http://arxiv.org/abs/1602.01783).
 Results of the same code trained on 47 different Atari games were uploaded on OpenAI Gym.
 You can see them in [my gym page](https://gym.openai.com/users/ppwwyyxx).
@@ -8,14 +8,16 @@ Most of them are the best reproducible results on gym.
 ### To train on an Atari game:
-`./train-atari.py --env Breakout-v0 --gpu 0`
+`CUDA_VISIBLE_DEVICES=0 ./train-atari.py --env Breakout-v0`
 It should run at a speed of 6~10 iteration/s on 1 GPU plus 12+ CPU cores.
-Training with a significant slower speed (e.g. on CPU) will give bad performance,
+Training with a significant slower speed (e.g. on CPU) will result in very bad score,
 probably because of async issues.
 The pre-trained models are all trained with 4 GPUs for about 2 days.
+But note that multi-GPU doesn't give you obvious speedup here,
+because the bottleneck is not computation but data.
-Occasionally processes may not get terminated completely, therefore it is suggested to use systemd-run to run any
+Occasionally, processes may not get terminated completely, therefore it is suggested to use `systemd-run` to run any
 multiprocess Python program to get a cgroup dedicated for the task.
 ### To run a pretrained Atari model for 100 episodes:

--- a/examples/A3C-Gym/train-atari.py
+++ b/examples/A3C-Gym/train-atari.py
@@ -254,8 +254,8 @@ if __name__ == '__main__':
        elif args.task == 'eval':
            eval_model_multithread(cfg, EVAL_EPISODE)
    else:
-        if args.gpu:
+        nr_gpu = get_nr_gpu()
-            nr_gpu = get_nr_gpu()
+        if nr_gpu > 0:
            if nr_gpu > 1:
                predict_tower = range(nr_gpu)[-nr_gpu // 2:]
            else:
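
Note on the slice in this hunk: unary minus binds tighter than `//`, so `-nr_gpu // 2` is evaluated as `(-nr_gpu) // 2`, and floor division rounds toward negative infinity. With an odd GPU count the prediction towers therefore get the larger half. A minimal sketch, assuming only the `nr_gpu > 1` branch shown above; `predict_towers` is a hypothetical helper name and is not part of `train-atari.py`:

```python
def predict_towers(nr_gpu):
    """Return the GPU indices selected by the slice in the hunk.

    Hypothetical helper for illustration; only meaningful for nr_gpu > 1,
    which is the branch shown in the diff.
    """
    # (-nr_gpu) // 2 floors toward negative infinity, e.g. -3 // 2 == -2,
    # so the slice keeps the last ceil(nr_gpu / 2) indices.
    return list(range(nr_gpu)[-nr_gpu // 2:])


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, predict_towers(n))
```

For 4 GPUs this selects the last two devices for prediction; for 3 GPUs it also selects two.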

--- a/examples/SpatialTransformer/mnist-addition.py
+++ b/examples/SpatialTransformer/mnist-addition.py
@@ -7,6 +7,7 @@ import numpy as np
 import tensorflow as tf
 import os
 import sys
+import cv2
 import argparse
 from tensorpack import *

--- a/tensorpack/utils/gpu.py
+++ b/tensorpack/utils/gpu.py
@@ -5,6 +5,7 @@
 import os
 from .utils import change_env
+from . import logger
 __all__ = ['change_gpu', 'get_nr_gpu']
@@ -26,5 +27,10 @@ def get_nr_gpu():
        int: the number of GPU from ``CUDA_VISIBLE_DEVICES``.
    """
    env = os.environ.get('CUDA_VISIBLE_DEVICES', None)
-    assert env is not None, 'gpu not set!'  # TODO
+    if env is not None:
-    return len(env.split(','))
+        return len(env.split(','))
+    logger.info("Loading local devices by TensorFlow ...")
+    from tensorflow.python.client import device_lib
+    device_protos = device_lib.list_local_devices()
+    gpus = [x.name for x in device_protos if x.device_type == 'GPU']
+    return len(gpus)
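
The new control flow of `get_nr_gpu()` can be sketched without importing TensorFlow. This is a rough illustration under stated assumptions, not the library's code: `count_gpus`, its `environ` argument, and the `list_device_types` callback are hypothetical names introduced here so the logic can be exercised on any machine. The callback stands in for `device_lib.list_local_devices()` and is assumed to return device type strings such as `'CPU'` or `'GPU'`.

```python
import os


def count_gpus(environ=None, list_device_types=None):
    """Mirror the decision order of the patched get_nr_gpu() (hypothetical helper).

    environ: mapping to read CUDA_VISIBLE_DEVICES from (defaults to os.environ).
    list_device_types: assumed stand-in for the TensorFlow device query;
        a callable returning an iterable of device type strings.
    """
    if environ is None:
        environ = os.environ
    env = environ.get('CUDA_VISIBLE_DEVICES', None)
    if env is not None:
        # Same expression as in the diff. Note that an empty string
        # splits into [''] and is therefore counted as one device.
        return len(env.split(','))
    if list_device_types is None:
        # Assumption for this sketch only: with no query available, report 0.
        return 0
    return len([t for t in list_device_types() if t == 'GPU'])


if __name__ == "__main__":
    print(count_gpus({'CUDA_VISIBLE_DEVICES': '0,1'}))
    print(count_gpus({}, lambda: ['CPU', 'GPU', 'GPU']))
```

One behaviour worth checking when reviewing: because the environment variable takes precedence and is only split on commas, `CUDA_VISIBLE_DEVICES=""` yields a count of 1 rather than 0 in this logic.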