While learning PyTorch, I came across the following sample code in the official documentation:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
There are three key steps:
1. Zero the gradients
2. Backpropagate to obtain the gradient of each parameter
3. Perform one parameter update via gradient descent
Let's look at each of these steps in detail.
optimizer.zero_grad()
Reading the source code
The implementation is as follows:
def zero_grad(self, set_to_none: bool = False):
    r"""Sets the gradients of all optimized :class:`torch.Tensor` s to zero.

    Args:
        set_to_none (bool): instead of setting to zero, set the grads to None.
            This will in general have lower memory footprint, and can modestly improve performance.
            However, it changes certain behaviors. For example:
            1. When the user tries to access a gradient and perform manual ops on it,
               a None attribute or a Tensor full of 0s will behave differently.
            2. If the user requests ``zero_grad(set_to_none=True)`` followed by a backward pass, ``.grad``\ s
               are guaranteed to be None for params that did not receive a gradient.
            3. ``torch.optim`` optimizers have a different behavior if the gradient is 0 or None
               (in one case it does the step with a gradient of 0 and in the other it skips
               the step altogether).
    """
    foreach = self.defaults.get('foreach', False)

    if not hasattr(self, "_zero_grad_profile_name"):
        self._hook_for_profile()
    if foreach:
        per_device_and_dtype_grads = defaultdict(lambda: defaultdict(list))
    with torch.autograd.profiler.record_function(self._zero_grad_profile_name):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    if set_to_none:
                        p.grad = None
                    else:
                        if p.grad.grad_fn is not None:
                            p.grad.detach_()
                        else:
                            p.grad.requires_grad_(False)
                        if (not foreach or p.grad.is_sparse):
                            p.grad.zero_()
                        else:
                            per_device_and_dtype_grads[p.grad.device][p.grad.dtype].append(p.grad)
        if foreach:
            for _, per_dtype_grads in per_device_and_dtype_grads.items():
                for grads in per_dtype_grads.values():
                    torch._foreach_zero_(grads)
The function iterates over all of the model's parameters.
If the gradients are to be zeroed (the default here), p.grad.detach_() first cuts the gradient tensor off from the backward graph, and p.grad.zero_() then sets each parameter's gradient to 0, so the previous batch's gradients are not accumulated.
If set_to_none=True is chosen, the .grad attribute is set to None instead. This generally uses less memory, but it changes behavior: code that later reads or manually modifies .grad will see None rather than a tensor of zeros, and the optimizer may skip the update for such parameters entirely (see the docstring above and the sketch below).
Because training usually uses mini-batches, if the gradients were not cleared they would still contain contributions from the previous batch; that is why this call comes before backpropagation and the parameter update.
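A small sketch (the nn.Linear model and the shapes are made up for illustration, not from the original text) showing the visible difference between the two modes:

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Zeroing: gradients are kept as tensors filled with zeros.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.zero_grad(set_to_none=False)
print(model.weight.grad)    # tensor of zeros, same shape as the weight

# set_to_none=True: the .grad attribute itself becomes None.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.zero_grad(set_to_none=True)
print(model.weight.grad)    # None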
Call frequency and effect
Normally optimizer.zero_grad() is called once per batch to clear the parameter gradients.
It can also be called once every several batches, which accumulates gradients across batches and is roughly equivalent to training with a larger batch_size; see the sketch below.
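A hypothetical gradient-accumulation sketch (the model, the dummy data, and the accum_steps value are made up for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy data standing in for a real dataset.
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)
accum_steps = 4  # assumed value: step and zero the gradients only every 4 batches

optimizer.zero_grad()
for batch, (X, y) in enumerate(dataloader):
    loss = loss_fn(model(X), y)
    # Scale the loss so the accumulated gradient approximates the larger-batch average.
    (loss / accum_steps).backward()
    if (batch + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()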
loss.backward()
PyTorch's backpropagation is implemented through autograd (see the official tutorial).
When creating a tensor you can set requires_grad=True; then, after backward() is executed, the gradient of that tensor is available in its .grad attribute. For example:
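A minimal autograd sketch (the values are made up for illustration):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2 + x3^2
y.backward()         # computes dy/dx for every tensor that requires grad
print(x.grad)        # tensor([2., 4., 6.]), i.e. 2 * x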
optimizer.step()
The step() function performs one optimization step, i.e. it updates every parameter using its current gradient according to the optimizer's update rule.
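For plain SGD this is just p = p - lr * p.grad; a minimal sketch (the values are made up for illustration):

import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # gradient of loss w.r.t. w is 2*w = [2., 4.]
loss.backward()
optimizer.step()        # w <- w - 0.1 * [2., 4.]
print(w)                # tensor([0.8000, 1.6000], requires_grad=True)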