Running multiple training processes on specified GPUs with Accelerate

Suppose your server has 4 GPUs.

First, make sure the accelerate command is installed. If it is not, run:

pip install accelerate

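Second, note that CUDA_VISIBLE_DEVICES is an environment variable, not a command, so there is nothing to install. It is read by the CUDA runtime and restricts which GPUs a process can see.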
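
CUDA_VISIBLE_DEVICES is normally written as a prefix on a command line, where it affects only that one command. The following is a minimal sketch of that scoping; sh -c stands in for a real training script (an assumption for illustration), so it runs on a machine without a GPU.

```shell
# Start from a clean state so the result does not depend on the caller's shell.
unset CUDA_VISIBLE_DEVICES

# The prefix sets the variable only in the environment of this one command.
# sh -c stands in for a real training script here.
child_view=$(CUDA_VISIBLE_DEVICES=1 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')

# The current shell itself is unaffected by the prefix.
parent_view=${CUDA_VISIBLE_DEVICES:-unset}

echo "child sees: $child_view"
echo "parent sees: $parent_view"
```

Because the assignment is not exported, a second command started from the same shell can be given a different GPU list without any cleanup in between.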

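Third, specify the GPU directly on the command line.

Assign task 1 to GPU 0:
CUDA_VISIBLE_DEVICES=0 nohup accelerate launch a.py >log1.txt &
Assign task 2 to GPU 1. It needs its own main process port, because otherwise both launches would try to use the same default port, and its own log file, so the two tasks do not overwrite each other's output:
CUDA_VISIBLE_DEVICES=1 nohup accelerate launch --main_process_port 20655 a.py >log2.txt &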

This method runs successfully.

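Here, nohup keeps the process running after you log out (it makes the process ignore the hangup signal), > redirects standard output to the log file, and & runs the command in the background.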
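
The launch pattern can be tried out without a GPU. In this sketch sh -c stands in for accelerate launch a.py, and the temporary log directory and log file names are hypothetical. Each task writes to its own log file, and 2>&1 sends standard error to the same file as standard output.

```shell
# Hypothetical log location for this sketch.
log_dir=$(mktemp -d)

# Task 1: pinned to GPU 0, immune to hangups, logged to its own file.
CUDA_VISIBLE_DEVICES=0 nohup sh -c 'echo "task 1 sees GPU $CUDA_VISIBLE_DEVICES"' \
    > "$log_dir/task1.log" 2>&1 &
pid1=$!

# Task 2: pinned to GPU 1, with a separate log file.
CUDA_VISIBLE_DEVICES=1 nohup sh -c 'echo "task 2 sees GPU $CUDA_VISIBLE_DEVICES"' \
    > "$log_dir/task2.log" 2>&1 &
pid2=$!

# $! holds the PID of the most recent background job; wait blocks until both exit.
wait "$pid1" "$pid2"
cat "$log_dir/task1.log" "$log_dir/task2.log"
```

Saving the PIDs also makes it possible to stop one task later with kill without touching the other.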

====================================================

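The approach below still has problems and produced errors.

Alternatively, as the third step, create a default launch configuration file, default_config.yaml: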

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

Then create a second launch configuration file, second_config.yaml, which differs only in main_process_port:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: 20655
main_training_function: main
num_machines: 1
num_processes: 2
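
Since the two files differ only in the main_process_port line, the second can be derived from the first. A minimal sketch, assuming the file contents listed above; the temporary working directory is hypothetical.

```shell
# Hypothetical working directory for this sketch.
work_dir=$(mktemp -d)

cat > "$work_dir/default_config.yaml" <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
EOF

# Rewrite only the port line; every other key is copied unchanged.
sed 's/^main_process_port: .*/main_process_port: 20655/' \
    "$work_dir/default_config.yaml" > "$work_dir/second_config.yaml"

# Show what changed between the two files (diff exits non-zero when they differ).
diff "$work_dir/default_config.yaml" "$work_dir/second_config.yaml" || true
```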

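Fourth, run the training script. A config file has to be passed with --config_file. Given as a bare positional argument, accelerate launch takes the YAML file to be the training script, which is a likely cause of the errors mentioned above.

CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file default_config.yaml train_script.py
or
CUDA_VISIBLE_DEVICES=0,1 accelerate launch train_script.py
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file second_config.yaml train_script.py
# Specify the main process port
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --main_process_port 20655 train_script.py
# In plain PyTorch, combining a GPU selection with nohup fails with "No such file or directory" when the assignment is placed after nohup
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py >log.txt &
# Note: CUDA_VISIBLE_DEVICES must come before nohup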
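
The ordering issue with nohup can be reproduced without a GPU. In this sketch sh -c stands in for python -u main.py (an assumption for illustration). When the assignment is placed after nohup, nohup tries to execute CUDA_VISIBLE_DEVICES=0 as a program and fails; when it comes first, the shell sets the variable and nohup runs the real command.

```shell
# Wrong order: nohup treats "CUDA_VISIBLE_DEVICES=0" as the program to run.
if nohup CUDA_VISIBLE_DEVICES=0 sh -c 'echo ok' > /dev/null 2>&1; then
    wrong_order="succeeded"
else
    wrong_order="failed"
fi

# Right order: the shell sets the variable, then nohup runs the real command.
right_order=$(CUDA_VISIBLE_DEVICES=0 nohup sh -c 'echo "$CUDA_VISIBLE_DEVICES"' 2> /dev/null)

echo "assignment after nohup: $wrong_order"
echo "assignment before nohup: visible GPU is $right_order"
```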

The same script (nlp_example.py in the commands below) can be run in any of the following configurations:

- single CPU or single GPU
- multi GPUs (using PyTorch distributed mode)
- (multi) TPUs
- fp16 (mixed-precision) or fp32 (normal precision)

To run it in each of these various modes, use the following commands:

single CPU:
- from a server without GPU
```
python ./nlp_example.py
```
- from any server by passing cpu=True to the Accelerator.
```
python ./nlp_example.py --cpu
```
- from any server with Accelerate launcher
```
accelerate launch --cpu ./nlp_example.py
```

single GPU:

python ./nlp_example.py  # from a server with a GPU

with fp16 (mixed-precision); note that recent Accelerate releases spell this option --mixed_precision fp16 rather than --fp16
- from any server by passing fp16=True to the Accelerator.
```
python ./nlp_example.py --fp16
```
- from any server with Accelerate launcher
```
accelerate launch --fp16 ./nlp_example.py
```

multi GPUs (using PyTorch distributed mode)

With Accelerate config and launcher

accelerate config  # This will create a config file on your server
accelerate launch ./nlp_example.py  # This will run the script on your server
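
When the GPUs are selected through CUDA_VISIBLE_DEVICES rather than through the config file, the process count given to the launcher should normally match the number of visible GPUs. A minimal sketch of deriving it: count_visible_gpus is a hypothetical helper, the GPU list 2,3 is an example value, and the resulting command is only printed, not executed. --num_processes is the launcher option that corresponds to num_processes in the config file.

```shell
# Hypothetical helper: count the comma-separated entries in a GPU list.
count_visible_gpus() {
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

gpus="2,3"   # example value
num_procs=$(count_visible_gpus "$gpus")

# Print the command instead of running it, so the sketch works without GPUs.
launch_cmd="CUDA_VISIBLE_DEVICES=$gpus accelerate launch --num_processes $num_procs ./nlp_example.py"
echo "$launch_cmd"
```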

With traditional PyTorch launcher

python -m torch.distributed.launch --nproc_per_node 2 --use_env ./nlp_example.py

multi GPUs, multi node (several machines, using PyTorch distributed mode)

With Accelerate config and launcher, on each machine:

accelerate config  # This will create a config file on each server
accelerate launch ./nlp_example.py  # This will run the script on each server

With PyTorch launcher only

python -m torch.distributed.launch --nproc_per_node 2 \
    --use_env \
    --node_rank 0 \
    --master_addr master_node_ip_address \
    ./nlp_example.py  # On the first server
python -m torch.distributed.launch --nproc_per_node 2 \
    --use_env \
    --node_rank 1 \
    --master_addr master_node_ip_address \
    ./nlp_example.py  # On the second server

(multi) TPUs

With Accelerate config and launcher

accelerate config  # This will create a config file on your TPU server
accelerate launch ./nlp_example.py  # This will run the script on each server

In PyTorch: Add an xmp.spawn line in your script as you usually do.
