Why do I get "RuntimeError: CUDA error: the launch timed out and was terminated" when using Google Cloud Compute Engine?

xiaoxingxing 2 years ago pytorch 687

I have a Google Cloud Compute Engine instance with 4 Nvidia K80 GPUs running Ubuntu 20.04 (Python 3.8). When I try to train a YOLOv5 model, I get the following error:

RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[W CUDAGuardImpl.h:113] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f62be2c17d2 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x239de (0x7f62f6ea69de in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7f62f6ea857d in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x300568 (0x7f63736d9568 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f62be2aa005 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2e9 (0x7f62fa8ca5e9 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::Reducer::~Reducer() + 0x205 (0x7f62fa8bcd25 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f6373bb7212 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f63735c7506 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x7e182f (0x7f6373bba82f in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1f5b20 (0x7f63735ceb20 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1f6cce (0x7f63735cfcce in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: /usr/bin/python3() [0x5d1ec4]
frame #13: /usr/bin/python3() [0x5a958d]
frame #14: /usr/bin/python3() [0x5ed1a0]
frame #15: /usr/bin/python3() [0x544188]
frame #16: /usr/bin/python3() [0x5441da]
frame #17: /usr/bin/python3() [0x5441da]
frame #18: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #19: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #20: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #21: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #22: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #23: __libc_start_main + 0xf3 (0x7f6378be40b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: _start + 0x2e (0x5fb12e in /usr/bin/python3)

I am training the model with this command:

python3 -m torch.distributed.run  --nproc_per_node 4 train.py --batch 16 --data coco128.yaml --weights yolov5s.pt --device 0,1,2,3
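
The error text itself suggests passing CUDA_LAUNCH_BLOCKING=1 so that kernel errors are reported at the call that caused them. Below is a minimal sketch of re-running the same command with that variable set, assuming the training command above is otherwise unchanged. The helper names `build_debug_launch` and `run_debug` are hypothetical and not part of PyTorch or YOLOv5.

```python
import os
import shlex
import subprocess

# The training command from the question, unchanged.
TRAIN_CMD = (
    "python3 -m torch.distributed.run --nproc_per_node 4 train.py "
    "--batch 16 --data coco128.yaml --weights yolov5s.pt --device 0,1,2,3"
)


def build_debug_launch(cmd, base_env=None):
    """Return (argv, env) for running cmd with synchronous CUDA launches.

    Hypothetical helper for illustration: it only prepares the arguments
    and environment and does not start any process.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Suggested by the error message: kernel launches become synchronous,
    # so the reported stack trace points at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return shlex.split(cmd), env


def run_debug(cmd=TRAIN_CMD):
    """Run the command with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    argv, env = build_debug_launch(cmd)
    return subprocess.run(argv, env=env).returncode
```

Calling `run_debug()` from the YOLOv5 directory would start the run. Blocking launches slow training down, so this setting is meant for debugging only.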

Am I missing something here?

Thanks

Original link: https://stackoverflow.com//questions/71491932/why-i-get-runtimeerror-cuda-error-the-launch-timed-out-and-was-terminated-wh

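CruelCow commented
We also run CUDA on Google Cloud, and our server restarted at roughly the time you posted your question. Although we could not detect any changes, our service then failed to start with "RuntimeError: No CUDA GPUs are available". So there are some similarities, but also some differences.

In any case, we opted for the good old uninstall and reinstall, and that fixed it:

Uninstall: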
```
# Remove the CUDA toolkit packages (cuBLAS, CUDA, Nsight)
sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"
# Remove the NVIDIA driver packages
sudo apt-get --purge remove "*nvidia*"
```
Plus deleting anything matching /usr/local/*cuda*
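
Before deleting anything under /usr/local, it may help to list what the pattern actually matches. The following is a small sketch using only the Python standard library. `find_cuda_leftovers` and `describe` are hypothetical helper names, and the `root` parameter exists only so the function can be pointed at another directory for testing.

```python
import glob
import os


def find_cuda_leftovers(root="/usr/local"):
    """List entries under root that match the *cuda* pattern.

    Hypothetical helper: it only reports matches and deletes nothing.
    """
    pattern = os.path.join(root, "*cuda*")
    return sorted(glob.glob(pattern))


def describe(paths):
    """Return one line per path, showing the target of any symlink."""
    lines = []
    for path in paths:
        if os.path.islink(path):
            lines.append(f"{path} -> {os.readlink(path)}")
        else:
            lines.append(path)
    return lines
```

Printing `describe(find_cuda_leftovers())` gives a dry-run view of what a manual cleanup would touch.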

Install:
```
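# Register NVIDIA's CUDA apt repository and its signing key.
# The path below is for Debian 11; use the directory matching your distribution.
# NVIDIA rotated its repository signing key in April 2022, so 7fa2af80.pub may
# no longer verify; check NVIDIA's current install instructions.
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/ /"
sudo add-apt-repository contrib
sudo apt-get update
# Install CUDA 11.3
sudo apt-get -y install cuda-11-3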
```
We also reinstalled cuDNN, but that may or may not be part of your stack.
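
After a reinstall like the one above, a quick check that both the driver and PyTorch see the GPUs can save a failed training run. The following is a sketch under assumptions: it expects `nvidia-smi` on the PATH and PyTorch to be importable, and it reports `None` rather than failing when either is missing. `check_gpu_stack` is a hypothetical name.

```python
import shutil
import subprocess


def check_gpu_stack():
    """Collect what the NVIDIA driver and PyTorch report about the GPUs.

    Hypothetical helper: missing tools are reported as None, not raised.
    """
    report = {
        "driver_gpus": None,
        "torch_cuda_available": None,
        "torch_gpu_count": None,
        "torch_cuda_version": None,
    }

    # Ask the driver which GPUs it sees (one CSV line per GPU).
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            report["driver_gpus"] = result.stdout.strip().splitlines()

    # Ask PyTorch whether it can use CUDA at all.
    try:
        import torch
    except ImportError:
        return report
    report["torch_cuda_available"] = torch.cuda.is_available()
    report["torch_gpu_count"] = torch.cuda.device_count()
    report["torch_cuda_version"] = torch.version.cuda
    return report
```

If `driver_gpus` lists the GPUs but `torch_cuda_available` is `False`, the problem more likely sits between PyTorch and the driver than in the VM's hardware. Treat that as a heuristic, not a diagnosis.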
2 years ago, 0 comments
