Using Accelerate to run multiple training processes on specific GPUs

Suppose your server has 4 GPUs.

First, make sure the accelerate command is installed. If it is not, run:

pip install accelerate

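Second, make sure CUDA_VISIBLE_DEVICES can be set in your shell. It is an environment variable rather than a command: CUDA reads it at startup and exposes only the listed GPUs to the process.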
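To make the effect of the variable concrete, here is a minimal Python sketch of how a script could inspect it. The helper names `visible_gpus` and `physical_id` are hypothetical and not part of Accelerate or PyTorch. The sketch only parses the variable; it does not query the driver.

```python
import os


def visible_gpus(environ=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES, or None if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # variable not set: no restriction is applied
    # Entries are comma-separated; an empty value hides every GPU.
    return [item.strip() for item in raw.split(",") if item.strip()]


def physical_id(logical_index, environ=None):
    """Map the index a process sees (0, 1, ...) back to the listed physical GPU."""
    gpus = visible_gpus(environ)
    if gpus is None:
        return str(logical_index)
    return gpus[logical_index]
```

With CUDA_VISIBLE_DEVICES=2,3 the process sees two devices numbered 0 and 1, and `physical_id(0)` maps the first one back to "2".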

Third, select the GPU directly on the command line.

Assign task 1 to GPU 0:

CUDA_VISIBLE_DEVICES=0 nohup  accelerate launch a.py  >log.txt &

Assign task 2 to GPU 1:

CUDA_VISIBLE_DEVICES=1 nohup  accelerate launch --main_process_port 20655 a.py  >log.txt &

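This method runs successfully.

Here nohup keeps the process running after the terminal hangs up, > redirects standard output to the log file, and & runs the command in the background. Both commands above write to log.txt, so give each task its own log file name when they are started from the same directory; otherwise the second redirect truncates the first log.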
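The two commands above can also be started from Python. The following is a rough sketch, not the author's method: the helpers `build_launch` and `start_in_background` are hypothetical names, and `start_new_session=True` (POSIX only) is used here as a stand-in for nohup, on the assumption that detaching from the terminal session is what is wanted.

```python
import os
import subprocess


def build_launch(gpu_ids, script, port=None, base_env=None):
    """Build the environment and argv for one `accelerate launch` task."""
    env = dict(os.environ if base_env is None else base_env)
    # Restrict the task to the chosen GPUs, as in the shell commands above.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = ["accelerate", "launch"]
    if port is not None:
        # A second concurrent task needs its own main process port.
        argv += ["--main_process_port", str(port)]
    argv.append(script)
    return env, argv


def start_in_background(gpu_ids, script, log_path, port=None):
    """Start the task detached from the terminal, logging stdout and stderr."""
    env, argv = build_launch(gpu_ids, script, port)
    log = open(log_path, "w")
    return subprocess.Popen(
        argv,
        env=env,
        stdout=log,
        stderr=subprocess.STDOUT,  # also capture error output in the log
        start_new_session=True,    # new session, so no hangup from the terminal
    )
```

For example, `start_in_background([0], "a.py", "log_task1.txt")` followed by `start_in_background([1], "a.py", "log_task2.txt", port=20655)` mirrors the two tasks, with a separate log file for each.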

====================================================

The config-file approach below still has problems and may raise errors.

Create a default launch config file, default_config.yaml:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

Create a second launch config file, second_config.yaml, which differs only in main_process_port:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: 20655
main_training_function: main
num_machines: 1
num_processes: 2
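Since the two files differ in a single key, the second one can be derived from the first. Below is a minimal sketch that assumes a flat file of top-level `key: value` lines like the ones shown. `with_setting` is a hypothetical helper, and it is not a general YAML parser.

```python
def with_setting(config_text, key, value):
    """Return config_text with the top-level `key: ...` line set to value."""
    out = []
    found = False
    for line in config_text.splitlines():
        is_top_level = not line.startswith((" ", "\t"))
        if is_top_level and line.split(":", 1)[0].strip() == key:
            out.append(f"{key}: {value}")  # replace the existing entry
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}: {value}")  # key was absent: append it
    return "\n".join(out) + "\n"
```

Reading default_config.yaml, calling `with_setting(text, "main_process_port", 20655)`, and writing the result to second_config.yaml reproduces the second file.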

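Then run the training script. The config file has to be passed with --config_file; a bare path after accelerate launch is treated as the script to run.

CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file default_config.yaml  train_script.py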

or

CUDA_VISIBLE_DEVICES=0,1 accelerate launch  train_script.py

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file second_config.yaml train_script.py

# Specify the main process port

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --main_process_port 20655 train_script.py
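The port 20655 is an arbitrary choice; what matters is that each concurrent launch on the same machine uses a different, unused port. One common way to pick one, sketched here with the standard library (`find_free_port` is a hypothetical helper name), is to let the operating system assign a free port and read it back.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]
```

The returned number can be passed to --main_process_port. The port is released before the launcher binds it, so another process could in principle grab it in between; treat this as a convenience, not a guarantee.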

# PyTorch: selecting a GPU together with nohup fails with "No such file or directory"

CUDA_VISIBLE_DEVICES=0 nohup python -u main.py >log.txt &
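# Note: CUDA_VISIBLE_DEVICES must come before nohup. Written after it, nohup takes
# the assignment as the name of the program to run, which does not exist.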
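The ordering rule can be illustrated with a simplified model of how a shell reads a command line: leading NAME=value words are environment assignments, and the first word after them is the program. The sketch below is an approximation for illustration only (`split_env_prefix` is a hypothetical helper, and real shells have more rules).

```python
import re
import shlex

_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def split_env_prefix(command_line):
    """Split a command line into leading NAME=value assignments and the argv."""
    tokens = shlex.split(command_line)
    env = {}
    index = 0
    # Only assignments that appear before the program name count as environment.
    while index < len(tokens) and _ASSIGNMENT.match(tokens[index]):
        name, value = tokens[index].split("=", 1)
        env[name] = value
        index += 1
    return env, tokens[index:]
```

For "CUDA_VISIBLE_DEVICES=0 nohup python -u main.py" the assignment lands in the environment and nohup is the program. For "nohup CUDA_VISIBLE_DEVICES=0 python -u main.py" the environment is empty and nohup receives CUDA_VISIBLE_DEVICES=0 as the program it should run.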

The same script can be run in any of the following configurations:

  • single CPU or single GPU
  • multi GPUs (using PyTorch distributed mode)
  • (multi) TPUs
  • fp16 (mixed-precision) or fp32 (normal precision)

To run it in each of these various modes, use the following commands:

  • single CPU:
    • from a server without GPU
      python ./nlp_example.py
    • from any server by passing cpu=True to the Accelerator.
      python ./nlp_example.py --cpu
    • from any server with Accelerate launcher
      accelerate launch --cpu ./nlp_example.py
  • single GPU:
    python ./nlp_example.py  # from a server with a GPU
  • with fp16 (mixed-precision)
    • from any server by passing fp16=True to the Accelerator.
      python ./nlp_example.py --fp16
    • from any server with Accelerate launcher
      accelerate launch --fp16 ./nlp_example.py
  • multi GPUs (using PyTorch distributed mode)
    • With Accelerate config and launcher
      accelerate config  # This will create a config file on your server
      accelerate launch ./nlp_example.py  # This will run the script on your server
    • With traditional PyTorch launcher
      python -m torch.distributed.launch --nproc_per_node 2 --use_env ./nlp_example.py
  • multi GPUs, multi node (several machines, using PyTorch distributed mode)
    • With Accelerate config and launcher, on each machine:
      accelerate config  # This will create a config file on each server
      accelerate launch ./nlp_example.py  # This will run the script on each server
    • With PyTorch launcher only
      python -m torch.distributed.launch --nproc_per_node 2 \
          --use_env \
          --node_rank 0 \
          --master_addr master_node_ip_address \
          ./nlp_example.py  # On the first server
      python -m torch.distributed.launch --nproc_per_node 2 \
          --use_env \
          --node_rank 1 \
          --master_addr master_node_ip_address \
          ./nlp_example.py  # On the second server
  • (multi) TPUs
    • With Accelerate config and launcher
      accelerate config  # This will create a config file on your TPU server
      accelerate launch ./nlp_example.py  # This will run the script on each server
    • In PyTorch: Add an xmp.spawn line in your script as you usually do.
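The --cpu and --fp16 options in the list above are flags of the example script itself, which forwards them to the Accelerator. A minimal sketch of that wiring follows, assuming the keyword names cpu and fp16 quoted in the list; `accelerator_kwargs` is a hypothetical helper, and the sketch stops short of importing accelerate.

```python
import argparse


def accelerator_kwargs(argv=None):
    """Parse the script's own flags into keyword arguments for Accelerator."""
    parser = argparse.ArgumentParser(description="hypothetical training script")
    parser.add_argument("--cpu", action="store_true",
                        help="train on CPU even if a GPU is present")
    parser.add_argument("--fp16", action="store_true",
                        help="use mixed precision")
    args = parser.parse_args(argv)
    # Keyword names follow the text above (cpu=True, fp16=True).
    return {"cpu": args.cpu, "fp16": args.fp16}
```

Inside a training script the result would be passed on as `Accelerator(**accelerator_kwargs())`. Newer Accelerate releases select mixed precision through a `mixed_precision` argument instead of `fp16`, so check the installed version before relying on these names.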

Author: xiaoxingxing (admin team)