Suppose your server has 4 GPUs.
First, make sure the accelerate command is installed. If it is not, run:
pip install accelerate
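Second, note that CUDA_VISIBLE_DEVICES is an environment variable read by CUDA, not a command, so there is nothing extra to install.
Third, select the GPU directly on the command line.
Run task 1 on GPU 0: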
CUDA_VISIBLE_DEVICES=0 nohup accelerate launch a.py >log.txt &
Run task 2 on GPU 1:
CUDA_VISIBLE_DEVICES=1 nohup accelerate launch --main_process_port 20655 a.py >log.txt &
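This method runs successfully.
Here nohup keeps the process running after the terminal is closed (it ignores the hangup signal), > redirects standard output to the log file, and & runs the process in the background. If both tasks are started from the same directory, give each its own log file so that one does not overwrite the other.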
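The two shell commands above can also be driven from a small Python launcher. The following is a minimal sketch, not part of the original workflow: the helper names build_job and launch_jobs, the per-job log file names, and the choice of 20655 as the base port (borrowed from the command above) are assumptions for illustration. It pins each job to one GPU through CUDA_VISIBLE_DEVICES, gives each job a distinct --main_process_port, and writes each job's output to its own log file.

```python
import os
import subprocess


def build_job(gpu_id, script, port):
    """Return (command, environment) for one single-GPU accelerate job."""
    env = dict(os.environ)
    # Only this GPU is visible to the child process, where it appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    cmd = [
        "accelerate", "launch",
        "--main_process_port", str(port),  # should differ between concurrent jobs
        script,
    ]
    return cmd, env


def launch_jobs(gpu_ids, script, base_port=20655):
    """Start one background job per GPU, each with its own port and log file."""
    procs = []
    for offset, gpu_id in enumerate(gpu_ids):
        cmd, env = build_job(gpu_id, script, base_port + offset)
        # Hypothetical log name; the child keeps its own copy of the file handle.
        with open(f"log_gpu{gpu_id}.txt", "w") as log:
            procs.append(
                subprocess.Popen(
                    cmd,
                    env=env,
                    stdout=log,
                    stderr=subprocess.STDOUT,
                    # New session (POSIX), roughly what "nohup ... &" is used for above.
                    start_new_session=True,
                )
            )
    return procs
```

Under these assumptions, launch_jobs([0, 1], "a.py") would start one job on GPU 0 with port 20655 and one on GPU 1 with port 20656.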
====================================================
The config-file method below still has problems and raised errors when I tried it.
Create a default launch configuration file, default_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
Then create a second configuration file, second_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: 20655
main_training_function: main
num_machines: 1
num_processes: 2
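The two configuration files above differ only in main_process_port, so they could be generated from one template. Below is a minimal sketch that assumes exactly the keys shown above (newer Accelerate releases may expect different keys, for example for mixed precision). render_config and write_config are hypothetical helpers, not part of the Accelerate API.

```python
CONFIG_TEMPLATE = """\
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: {port}
main_training_function: main
num_machines: 1
num_processes: {num_processes}
"""


def render_config(port=None, num_processes=2):
    """Return YAML text mirroring the files above; port=None is written as null."""
    return CONFIG_TEMPLATE.format(
        port="null" if port is None else int(port),
        num_processes=int(num_processes),
    )


def write_config(path, port=None, num_processes=2):
    """Write the rendered configuration to the given path."""
    with open(path, "w") as f:
        f.write(render_config(port, num_processes))
```

For example, write_config("second_config.yaml", port=20655) would reproduce the second file shown above.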
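Next, run the training script. The configuration file has to be passed with --config_file; otherwise accelerate launch treats the YAML file itself as the script to run.
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file default_config.yaml train_script.py
or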
CUDA_VISIBLE_DEVICES=0,1 accelerate launch train_script.py
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file second_config.yaml train_script.py
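# specify the main process port
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --main_process_port 20655 train_script.py
# Combining GPU selection with nohup for a PyTorch job can fail with "No such file or directory"
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py >log.txt &
# Note: CUDA_VISIBLE_DEVICES must come before nohup; placed after it, nohup tries to execute the assignment as a command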
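Two distributed launches on the same machine need different main process ports, which is why 20655 is passed explicitly above. Instead of hard-coding a number, one option is to ask the operating system for an unused port. This is a sketch using only the standard library; find_free_port is a hypothetical helper name, and the port could in principle be taken by another process between the check and the launch.

```python
import socket


def find_free_port():
    """Ask the OS for a TCP port that is currently unused on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]
```

The result could then be formatted into the command line, for example f"accelerate launch --main_process_port {find_free_port()} train_script.py".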
The same script can be run in any of the following configurations:
- single CPU or single GPU
- multi GPUs (using PyTorch distributed mode)
- (multi) TPUs
- fp16 (mixed-precision) or fp32 (normal precision)
To run it in each of these various modes, use the following commands:
- single CPU:
  - from a server without GPU
    python ./nlp_example.py
  - from any server by passing cpu=True to the Accelerator
    python ./nlp_example.py --cpu
  - from any server with Accelerate launcher
    accelerate launch --cpu ./nlp_example.py
- single GPU:
python ./nlp_example.py # from a server with a GPU
- with fp16 (mixed-precision)
  - from any server by passing fp16=True to the Accelerator
    python ./nlp_example.py --fp16
  - from any server with Accelerate launcher
    accelerate launch --fp16 ./nlp_example.py
- multi GPUs (using PyTorch distributed mode)
  - With Accelerate config and launcher
    accelerate config  # This will create a config file on your server
    accelerate launch ./nlp_example.py  # This will run the script on your server
  - With traditional PyTorch launcher
    python -m torch.distributed.launch --nproc_per_node 2 --use_env ./nlp_example.py
- multi GPUs, multi node (several machines, using PyTorch distributed mode)
  - With Accelerate config and launcher, on each machine:
    accelerate config  # This will create a config file on each server
    accelerate launch ./nlp_example.py  # This will run the script on each server
  - With PyTorch launcher only
    python -m torch.distributed.launch --nproc_per_node 2 \
        --use_env \
        --node_rank 0 \
        --master_addr master_node_ip_address \
        ./nlp_example.py  # On the first server
    python -m torch.distributed.launch --nproc_per_node 2 \
        --use_env \
        --node_rank 1 \
        --master_addr master_node_ip_address \
        ./nlp_example.py  # On the second server
- (multi) TPUs
  - With Accelerate config and launcher
    accelerate config  # This will create a config file on your TPU server
    accelerate launch ./nlp_example.py  # This will run the script on each server
  - In PyTorch: Add an xmp.spawn line in your script as you usually do.
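The --cpu and --fp16 flags in the commands above are read by the script itself and forwarded to the Accelerator. The sketch below shows one way such a script might do this. It is not the actual nlp_example.py; parse_args and build_accelerator are hypothetical names, and it assumes a recent Accelerate release in which mixed precision is requested with mixed_precision="fp16" rather than fp16=True.

```python
import argparse


def parse_args(argv=None):
    """Parse the two mode flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Sketch of example-script flags")
    parser.add_argument("--cpu", action="store_true", help="force training on CPU")
    parser.add_argument("--fp16", action="store_true", help="use mixed precision")
    return parser.parse_args(argv)


def build_accelerator(args):
    """Create an Accelerator configured from the parsed flags."""
    # Imported lazily so that flag parsing works without accelerate installed.
    from accelerate import Accelerator

    # Assumption: mixed_precision takes "fp16" or "no" in recent releases.
    return Accelerator(
        cpu=args.cpu,
        mixed_precision="fp16" if args.fp16 else "no",
    )
```

A script would typically call build_accelerator(parse_args()) once at startup and then pass its model, optimizer, and data loaders through accelerator.prepare.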