(Sections 4.1 and 4.2 are covered in the previous NNDL experiment post; this experiment again uses the moon1000 dataset.)
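4.3 Automatic Gradient Computation
Although building a neural network from modules keeps the assembly manageable, computing the gradient of each module by hand remains tedious and error-prone. Deep learning frameworks already provide automatic gradient computation, so we can concentrate on the model architecture instead of deriving gradients ourselves.
PyTorch provides the torch.nn.Module class for implementing custom layers and models quickly. Both layers and models are built by subclassing nn.Module; a model is simply a special kind of layer. An operator that inherits from nn.Module can directly call other nn.Module operators internally. The framework automatically recognises these nested nn.Module operators, computes their gradients, and updates their parameters during optimisation.
torch.nn is a modular interface designed specifically for neural networks. nn is built on top of autograd and can be used to define and run neural networks. nn.Module is one of the most important classes in nn; it contains the definitions of the network layers and the forward method.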
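To make the nesting behaviour concrete, here is a minimal sketch. The class names Block and TwoBlockNet are hypothetical and exist only for this illustration; the sketch assumes nothing beyond the standard torch.nn.Module API. It shows that parameters of nested modules are registered under dotted names and receive gradients after a single backward call.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Hypothetical sub-module: one linear layer followed by a sigmoid."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))


class TwoBlockNet(nn.Module):
    """Hypothetical model assembled from two nested Block modules."""

    def __init__(self):
        super().__init__()
        self.block1 = Block(2, 5)
        self.block2 = Block(5, 1)

    def forward(self, x):
        return self.block2(self.block1(x))


net = TwoBlockNet()
# Parameters of nested modules are collected automatically
param_names = [name for name, _ in net.named_parameters()]
print(param_names)

# One backward call fills .grad for every registered parameter
out = net(torch.randn(4, 2))
out.sum().backward()
grads_ready = all(p.grad is not None for p in net.parameters())
print(grads_ready)
```

Because the sub-modules are assigned as attributes, an optimizer created from net.parameters() would update all four tensors without any extra bookkeeping.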
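4.3.1 Re-implementing the FNN Binary Classification Task with Predefined Operators
1. Re-implement the binary classification task using PyTorch's predefined operators.
The main predefined operator used here is torch.nn.Linear:
class torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)
torch.nn.Linear accepts an input tensor of shape [batch_size, ∗, in_features], where "∗" stands for any number of additional dimensions. It multiplies the input by the transpose of its weight matrix, whose shape is [out_features, in_features], and produces an output tensor of shape [batch_size, ∗, out_features].
The operator has a bias parameter by default; pass bias=False to create a layer without bias.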
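The shape rules above can be checked with a short sketch. The sizes 8, 5, 3 and 4 are arbitrary values chosen for illustration, not values from the experiment.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes: in_features=3, out_features=4
layer = nn.Linear(in_features=3, out_features=4)

# Input with one extra middle dimension: [batch_size, *, in_features]
x = torch.randn(8, 5, 3)
y = layer(x)

# PyTorch stores the weight as [out_features, in_features]
weight_shape = tuple(layer.weight.shape)

# The layer computes x @ W^T + b over the last dimension
manual = x @ layer.weight.T + layer.bias

# A layer without bias has no bias parameter at all
no_bias = nn.Linear(3, 4, bias=False)

print(tuple(y.shape), weight_shape, no_bias.bias)
```

Only the last dimension of the input has to match in_features; all leading dimensions are carried through unchanged.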
The code is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model_MLP_L2_V2(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V2, self).__init__()
        # Define the linear layers with torch.nn.Linear.
        # The first argument (in_features) is the input dimension of the layer;
        # the second argument (out_features) is its output dimension.
        # Weights and biases keep PyTorch's default initialisation here.
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        # Use torch.sigmoid as the Logistic activation function
        self.act_fn = torch.sigmoid

    # Forward pass
    def forward(self, inputs):
        z1 = self.fc1(inputs)
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        return a2
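4.3.2 Completing the Runner Class
Building on the Runner class implemented in the previous section, the RunnerV2_2 class in this section uses automatic gradient computation during training. When saving a model it obtains the parameters with the state_dict method, and when loading a model it restores them with the load_state_dict method.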
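As a small illustration of the save and load mechanism, the sketch below round-trips the parameters of one layer into another. It writes to an in-memory buffer instead of a file purely to keep the example self-contained; the names src and dst are hypothetical.

```python
import io

import torch
import torch.nn as nn

# Two independently initialised layers with the same architecture
src = nn.Linear(2, 1)
dst = nn.Linear(2, 1)

# Save the parameters of src (an in-memory buffer stands in for a file path)
buffer = io.BytesIO()
torch.save(src.state_dict(), buffer)
buffer.seek(0)

# Load them back and copy them into dst
state = torch.load(buffer)
dst.load_state_dict(state)

# After loading, both layers hold identical parameters
same = all(
    torch.equal(a, b)
    for a, b in zip(src.state_dict().values(), dst.state_dict().values())
)
print(sorted(state.keys()), same)
```

Note that torch.load only returns the dictionary of tensors; the parameters change only once load_state_dict is called on the target module.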
import torch


class RunnerV2_2(object):
    def __init__(self, model, optimizer, metric, loss_fn, **kwargs):
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.metric = metric
        # Evaluation scores recorded during training
        self.train_scores = []
        self.dev_scores = []
        # Losses recorded during training
        self.train_loss = []
        self.dev_loss = []

    def train(self, train_set, dev_set, **kwargs):
        # Number of training epochs; defaults to 0 if not given
        num_epochs = kwargs.get("num_epochs", 0)
        # Logging interval in epochs; defaults to 100 if not given
        log_epochs = kwargs.get("log_epochs", 100)
        # Path for saving the best model; defaults to "best_model.pdparams"
        save_path = kwargs.get("save_path", "best_model.pdparams")
        # Optional logging callback; defaults to None
        custom_print_log = kwargs.get("custom_print_log", None)
        # Best dev score seen so far
        best_score = 0
        # Train for num_epochs epochs
        for epoch in range(num_epochs):
            # evaluate() switches to eval mode, so re-enter training mode each epoch
            self.model.train()
            X, y = train_set
            # Model predictions
            logits = self.model(X)
            # Cross-entropy loss
            trn_loss = self.loss_fn(logits, y)
            self.train_loss.append(trn_loss.item())
            # Evaluation metric
            trn_score = self.metric(logits, y).item()
            self.train_scores.append(trn_score)
            # Compute parameter gradients automatically
            trn_loss.backward()
            if custom_print_log is not None:
                # Print per-layer information (e.g. gradients)
                custom_print_log(self)
            # Update the parameters
            self.optimizer.step()
            # Clear the gradients
            self.optimizer.zero_grad()
            dev_score, dev_loss = self.evaluate(dev_set)
            # Save the model whenever the dev score improves
            if dev_score > best_score:
                self.save_model(save_path)
                print(f"[Evaluate] best accuracy performence has been updated: {best_score:.5f} --> {dev_score:.5f}")
                best_score = dev_score
            if log_epochs and epoch % log_epochs == 0:
                print(f"[Train] epoch: {epoch}/{num_epochs}, loss: {trn_loss.item()}")

    # Evaluation: torch.no_grad() disables gradient computation and storage
    @torch.no_grad()
    def evaluate(self, data_set):
        # Switch the model to evaluation mode
        self.model.eval()
        X, y = data_set
        # Model output
        logits = self.model(X)
        # Loss
        loss = self.loss_fn(logits, y).item()
        self.dev_loss.append(loss)
        # Evaluation metric
        score = self.metric(logits, y).item()
        self.dev_scores.append(score)
        return score, loss

    def predict(self, X):
        # Switch the model to evaluation mode
        self.model.eval()
        return self.model(X)

    # Get the model parameters with model.state_dict() and save them
    def save_model(self, saved_path):
        torch.save(self.model.state_dict(), saved_path)

    # Load the model parameters with model.load_state_dict()
    def load_model(self, model_path):
        state_dict = torch.load(model_path)
        self.model.load_state_dict(state_dict)
4.3.3 Model Training
Instantiate the RunnerV2_2 class and pass in the training configuration. The code is as follows:
from nndl.dataset import make_moons
from nndl.metric import accuracy

# Model
input_size = 2
hidden_size = 5
output_size = 1
model = Model_MLP_L2_V2(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Loss function
loss_fn = F.binary_cross_entropy
# Optimizer
learning_rate = 0.2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Evaluation metric
metric = accuracy
# Other settings
epoch_num = 1000
saved_path = 'best_model.pdparams'
# Sample 1000 points from the moon dataset
n_samples = 1000
X, y = make_moons(n_samples=n_samples, shuffle=True, noise=0.02)
num_train = 640
num_dev = 160
num_test = 200
X_train, y_train = X[:num_train], y[:num_train]
X_dev, y_dev = X[num_train:num_train + num_dev], y[num_train:num_train + num_dev]
X_test, y_test = X[num_train + num_dev:], y[num_train + num_dev:]
y_train = y_train.reshape([-1, 1])
y_dev = y_dev.reshape([-1, 1])
y_test = y_test.reshape([-1, 1])
# Instantiate RunnerV2_2 and pass in the training configuration
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=epoch_num, log_epochs=50, save_path=saved_path)
The output is as follows:
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.45000
[Train] epoch: 0/1000, loss: 0.7110588550567627
[Evaluate] best accuracy performence has been updated: 0.45000 --> 0.55000
[Evaluate] best accuracy performence has been updated: 0.55000 --> 0.55625
[Evaluate] best accuracy performence has been updated: 0.55625 --> 0.63125
[Evaluate] best accuracy performence has been updated: 0.63125 --> 0.70000
[Evaluate] best accuracy performence has been updated: 0.70000 --> 0.74375
[Evaluate] best accuracy performence has been updated: 0.74375 --> 0.75000
[Evaluate] best accuracy performence has been updated: 0.75000 --> 0.76250
[Evaluate] best accuracy performence has been updated: 0.76250 --> 0.78125
[Evaluate] best accuracy performence has been updated: 0.78125 --> 0.80625
[Evaluate] best accuracy performence has been updated: 0.80625 --> 0.81250
[Evaluate] best accuracy performence has been updated: 0.81250 --> 0.82500
[Evaluate] best accuracy performence has been updated: 0.82500 --> 0.83125
[Evaluate] best accuracy performence has been updated: 0.83125 --> 0.83750
[Evaluate] best accuracy performence has been updated: 0.83750 --> 0.84375
[Train] epoch: 50/1000, loss: 0.666471004486084
[Train] epoch: 100/1000, loss: 0.600623607635498
[Train] epoch: 150/1000, loss: 0.5047504305839539
[Train] epoch: 200/1000, loss: 0.419649213552475
[Train] epoch: 250/1000, loss: 0.36309003829956055
[Train] epoch: 300/1000, loss: 0.32698655128479004
[Train] epoch: 350/1000, loss: 0.30272698402404785
[Evaluate] best accuracy performence has been updated: 0.84375 --> 0.85000
[Evaluate] best accuracy performence has been updated: 0.85000 --> 0.85625
[Train] epoch: 400/1000, loss: 0.2856598496437073
[Train] epoch: 450/1000, loss: 0.27338916063308716
[Train] epoch: 500/1000, loss: 0.26450374722480774
[Evaluate] best accuracy performence has been updated: 0.85625 --> 0.86250
[Train] epoch: 550/1000, loss: 0.2580506205558777
[Train] epoch: 600/1000, loss: 0.25334611535072327
[Train] epoch: 650/1000, loss: 0.2498963177204132
[Evaluate] best accuracy performence has been updated: 0.86250 --> 0.86875
[Train] epoch: 700/1000, loss: 0.2473476678133011
[Train] epoch: 750/1000, loss: 0.24544887244701385
[Train] epoch: 800/1000, loss: 0.24402157962322235
[Train] epoch: 850/1000, loss: 0.24293871223926544
[Train] epoch: 900/1000, loss: 0.2421092540025711
[Train] epoch: 950/1000, loss: 0.24146752059459686
Visualize how the loss and accuracy on the training and dev sets change during training.
import matplotlib.pyplot as plt


# Plot the training and dev metrics over the epochs
def plot(runner, fig_name):
    plt.figure(figsize=(10, 5))
    epochs = list(range(len(runner.train_scores)))
    # Left panel: loss curves
    plt.subplot(1, 2, 1)
    plt.plot(epochs, runner.train_loss, color='#e4007f', label="Train loss")
    plt.plot(epochs, runner.dev_loss, color='#f19ec2', linestyle='--', label="Dev loss")
    # Axis labels and legend
    plt.ylabel("loss", fontsize='large')
    plt.xlabel("epoch", fontsize='large')
    plt.legend(loc='upper right', fontsize='x-large')
    # Right panel: accuracy curves
    plt.subplot(1, 2, 2)
    plt.plot(epochs, runner.train_scores, color='#e4007f', label="Train accuracy")
    plt.plot(epochs, runner.dev_scores, color='#f19ec2', linestyle='--', label="Dev accuracy")
    # Axis labels and legend
    plt.ylabel("score", fontsize='large')
    plt.xlabel("epoch", fontsize='large')
    plt.legend(loc='lower right', fontsize='x-large')
    plt.savefig(fig_name)
    plt.show()


plot(runner, 'fw-acc.pdf')
The output is as follows:
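4.3.4 Performance Evaluation
# Evaluate the model: restore the best saved parameters, then score the test set
model.load_state_dict(torch.load("best_model.pdparams"))
score, loss = runner.evaluate([X_test, y_test])
print("[Test] score/loss: {:.4f}/{:.4f}".format(score, loss))
The output is as follows: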
The following listing repeats the experiment with one additional hidden layer, whose size is given by hidden_size_2, so the model now has three linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt


class Model_MLP_L2_V2(nn.Module):
    def __init__(self, input_size, hidden_size, hidden_size_2, output_size):
        super(Model_MLP_L2_V2, self).__init__()
        # Define the linear layers with torch.nn.Linear.
        # The first argument (in_features) is the input dimension of the layer;
        # the second argument (out_features) is its output dimension.
        # Weights and biases keep PyTorch's default initialisation here.
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size_2)
        self.fc3 = nn.Linear(hidden_size_2, output_size)
        # Use torch.sigmoid as the Logistic activation function
        self.act_fn = torch.sigmoid

    # Forward pass
    def forward(self, inputs):
        z1 = self.fc1(inputs)
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        z3 = self.fc3(a2)
        a3 = self.act_fn(z3)
        return a3
from nndl.dataset import make_moons
from nndl.metric import accuracy

# Sample 1000 points
n_samples = 1000
X, y = make_moons(n_samples=n_samples, shuffle=True, noise=0.02)
num_train = 640
num_dev = 160
num_test = 200
X_train, y_train = X[:num_train], y[:num_train]
X_dev, y_dev = X[num_train:num_train + num_dev], y[num_train:num_train + num_dev]
X_test, y_test = X[num_train + num_dev:], y[num_train + num_dev:]
y_train = y_train.reshape([-1, 1])
y_dev = y_dev.reshape([-1, 1])
y_test = y_test.reshape([-1, 1])
# Model
input_size = 2
hidden_size = 5
hidden_size_2 = 3
output_size = 1
model = Model_MLP_L2_V2(input_size=input_size, hidden_size=hidden_size, hidden_size_2=hidden_size_2, output_size=output_size)
# Loss function
loss_fn = F.binary_cross_entropy
# Optimizer
learning_rate = 5
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Evaluation metric
metric = accuracy
# Other settings
epoch_num = 1000
saved_path = 'best_model.pdparams'
# Instantiate RunnerV2_2 and pass in the training configuration
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=epoch_num, log_epochs=50, save_path=saved_path)
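plot(runner, 'fw-acc.pdf')
# Evaluate the model: restore the best saved parameters, then score the test set
model.load_state_dict(torch.load("best_model.pdparams"))
score, loss = runner.evaluate([X_test, y_test])
print("[Test] score/loss: {:.4f}/{:.4f}".format(score, loss))
The output is as follows: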
import math

# Build a grid of points covering the input plane
x1, x2 = torch.meshgrid(torch.linspace(-math.pi, math.pi, 200), torch.linspace(-math.pi, math.pi, 200), indexing='ij')
x = torch.stack([torch.flatten(x1), torch.flatten(x2)], dim=1)
# Predict the class of every grid point
y = runner.predict(x)
y = torch.squeeze((y >= 0.5).to(torch.float32), dim=-1)
# Draw the class regions
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(x[:, 0].tolist(), x[:, 1].tolist(), c=y.tolist(), cmap=plt.cm.Spectral)
# Overlay the training, dev and test samples
plt.scatter(X_train[:, 0].tolist(), X_train[:, 1].tolist(), marker='*', c=torch.squeeze(y_train, dim=-1).tolist())
plt.scatter(X_dev[:, 0].tolist(), X_dev[:, 1].tolist(), marker='*', c=torch.squeeze(y_dev, dim=-1).tolist())
plt.scatter(X_test[:, 0].tolist(), X_test[:, 1].tolist(), marker='*', c=torch.squeeze(y_test, dim=-1).tolist())
plt.show()
The output is as follows:
[Question] Manual versus automatic gradient computation: compare them in terms of computational performance, computed results and other aspects, and give your own view.
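4.4 Optimization Problems
In this section we use experiments to uncover optimization problems in neural network models and think about how to address them.
4.4.1 Parameter Initialization
Before training a neural network, its parameters must be initialized. If the weights and biases of every layer are initialized to 0, all hidden neurons produce the same activation in the first forward pass, and in backpropagation all of their weights receive the same update. The hidden neurons therefore never become different from one another, which is known as the symmetric weight phenomenon.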
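The symmetry argument can be verified numerically with a tiny hand-written network. The sketch below is an illustration in plain Python, not the model used in this experiment: the function train_zero_init, the four data points and the hyperparameters are all made up for the example. It trains a 2-3-1 sigmoid network from all-zero parameters with full-batch gradient descent on the binary cross-entropy loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_zero_init(data, hidden=3, lr=0.5, steps=20):
    """Hypothetical helper: full-batch gradient descent from all-zero parameters."""
    n_in = len(data[0][0])
    n = len(data)
    W1 = [[0.0] * n_in for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [0.0] * hidden
    b2 = 0.0
    for _ in range(steps):
        gW1 = [[0.0] * n_in for _ in range(hidden)]
        gb1 = [0.0] * hidden
        gW2 = [0.0] * hidden
        gb2 = 0.0
        for x, y in data:
            # Forward pass
            a1 = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(hidden)]
            a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
            # Gradient of the mean BCE loss w.r.t. the output pre-activation
            d2 = (a2 - y) / n
            gb2 += d2
            for j in range(hidden):
                gW2[j] += d2 * a1[j]
                # Back-propagate through the hidden sigmoid
                d1 = d2 * W2[j] * a1[j] * (1.0 - a1[j])
                gb1[j] += d1
                for i in range(n_in):
                    gW1[j][i] += d1 * x[i]
        # Gradient descent update
        for j in range(hidden):
            W2[j] -= lr * gW2[j]
            b1[j] -= lr * gb1[j]
            for i in range(n_in):
                W1[j][i] -= lr * gW1[j][i]
        b2 -= lr * gb2
    return W1, b1, W2, b2


# Made-up toy data: three positive samples and one negative sample
data = [([0.0, 1.0], 1.0), ([1.0, 1.0], 1.0), ([0.5, -1.0], 1.0), ([1.0, 0.0], 0.0)]
W1, b1, W2, b2 = train_zero_init(data)
print(W1)
print(W2)
```

The parameters do move away from zero, but every row of W1 and every entry of W2 stays identical, so the three hidden units behave like a single unit.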
Next, we initialize all model parameters to 0 and look at the result. A new class is defined here in which the parameters of both linear layers are set to 0.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.init import constant_, normal_, uniform_


class Model_MLP_L2_V4(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V4, self).__init__()
        # Initialise the weights and biases of both layers to 0
        self.fc1 = nn.Linear(input_size, hidden_size)
        constant_(self.fc1.weight, val=0.0)
        constant_(self.fc1.bias, val=0.0)
        self.fc2 = nn.Linear(hidden_size, output_size)
        constant_(self.fc2.weight, val=0.0)
        constant_(self.fc2.bias, val=0.0)
        self.act_fn = torch.sigmoid

    # Forward pass
    def forward(self, inputs):
        z1 = self.fc1(inputs)
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        return a2


# Print the parameters of every layer
def print_weights(runner):
    print('The weights of the Layers:')
    for item in runner.model.named_parameters():
        print(item)
Train the model with the Runner class:
# Model
input_size = 2
hidden_size = 5
output_size = 1
model = Model_MLP_L2_V4(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Loss function
loss_fn = F.binary_cross_entropy
# Optimizer
learning_rate = 0.2  # 5e-2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Evaluation metric
metric = accuracy
# Other settings
epoch = 2000
saved_path = 'best_model.pdparams'
# Instantiate RunnerV2_2 and pass in the training configuration
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=5, log_epochs=50, save_path=saved_path, custom_print_log=print_weights)
The output is as follows:
outer_circ_x.shape: torch.Size([500]) outer_circ_y.shape: torch.Size([500])
inner_circ_x.shape: torch.Size([500]) inner_circ_y.shape: torch.Size([500])
after cat shape: torch.Size([1000])
X shape: torch.Size([1000, 2])
y shape: torch.Size([1000])
The weights of the Layers:
('fc1.weight', Parameter containing:
tensor([[ 0.4968, 0.5866],
[-0.0979, -0.3070],
[ 0.5947, 0.3038],
[-0.6433, -0.2954],
[ 0.2408, 0.6601]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6201, 0.5617, -0.3732, 0.5566, -0.3341], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.2932, -0.0201, -0.1784, 0.0685, -0.3482]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.4307], requires_grad=True))
('fc1.weight', Parameter containing:
tensor([[ 0.4968, 0.5866],
[-0.0979, -0.3070],
[ 0.5947, 0.3038],
[-0.6433, -0.2954],
[ 0.2408, 0.6601]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6201, 0.5617, -0.3732, 0.5566, -0.3341], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.2932, -0.0201, -0.1784, 0.0685, -0.3482]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.4307], requires_grad=True))
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.53750
[Train] epoch: 0/5, loss: 0.730908215045929
The weights of the Layers:
('fc1.weight', Parameter containing:
tensor([[ 0.5013, 0.5842],
[-0.0982, -0.3069],
[ 0.5920, 0.3053],
[-0.6423, -0.2959],
[ 0.2353, 0.6629]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6182, 0.5616, -0.3744, 0.5571, -0.3364], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3059, -0.0018, -0.1603, 0.0782, -0.3383]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.4033], requires_grad=True))
('fc1.weight', Parameter containing:
tensor([[ 0.5013, 0.5842],
[-0.0982, -0.3069],
[ 0.5920, 0.3053],
[-0.6423, -0.2959],
[ 0.2353, 0.6629]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6182, 0.5616, -0.3744, 0.5571, -0.3364], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3059, -0.0018, -0.1603, 0.0782, -0.3383]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.4033], requires_grad=True))
The weights of the Layers:
('fc1.weight', Parameter containing:
tensor([[ 0.5059, 0.5817],
[-0.0982, -0.3068],
[ 0.5897, 0.3066],
[-0.6412, -0.2966],
[ 0.2300, 0.6657]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6164, 0.5616, -0.3753, 0.5575, -0.3384], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3173, 0.0147, -0.1438, 0.0863, -0.3299]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.3790], requires_grad=True))
('fc1.weight', Parameter containing:
tensor([[ 0.5059, 0.5817],
[-0.0982, -0.3068],
[ 0.5897, 0.3066],
[-0.6412, -0.2966],
[ 0.2300, 0.6657]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6164, 0.5616, -0.3753, 0.5575, -0.3384], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3173, 0.0147, -0.1438, 0.0863, -0.3299]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.3790], requires_grad=True))
The weights of the Layers:
('fc1.weight', Parameter containing:
tensor([[ 0.5105, 0.5791],
[-0.0980, -0.3070],
[ 0.5877, 0.3078],
[-0.6400, -0.2973],
[ 0.2250, 0.6685]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6148, 0.5616, -0.3760, 0.5580, -0.3401], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3276, 0.0295, -0.1287, 0.0930, -0.3230]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.3573], requires_grad=True))
('fc1.weight', Parameter containing:
tensor([[ 0.5105, 0.5791],
[-0.0980, -0.3070],
[ 0.5877, 0.3078],
[-0.6400, -0.2973],
[ 0.2250, 0.6685]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6148, 0.5616, -0.3760, 0.5580, -0.3401], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3276, 0.0295, -0.1287, 0.0930, -0.3230]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.3573], requires_grad=True))
The weights of the Layers:
('fc1.weight', Parameter containing:
tensor([[ 0.5152, 0.5763],
[-0.0976, -0.3072],
[ 0.5859, 0.3089],
[-0.6387, -0.2981],
[ 0.2202, 0.6712]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6133, 0.5618, -0.3766, 0.5584, -0.3416], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3369, 0.0429, -0.1149, 0.0984, -0.3174]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.3381], requires_grad=True))
('fc1.weight', Parameter containing:
tensor([[ 0.5152, 0.5763],
[-0.0976, -0.3072],
[ 0.5859, 0.3089],
[-0.6387, -0.2981],
[ 0.2202, 0.6712]], requires_grad=True))
('fc1.bias', Parameter containing:
tensor([-0.6133, 0.5618, -0.3766, 0.5584, -0.3416], requires_grad=True))
('fc2.weight', Parameter containing:
tensor([[ 0.3369, 0.0429, -0.1149, 0.0984, -0.3174]], requires_grad=True))
('fc2.bias', Parameter containing:
tensor([-0.3381], requires_grad=True))
Visualize the accuracy and loss on the training and dev sets:
plot(runner, "fw-zero.pdf")
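The output is as follows:
The output shows a binary classification accuracy of about 50%, which means the model has learned nothing. The training and dev losses barely decrease.
To avoid the symmetric weight phenomenon, the parameters of a neural network can be initialized from a Gaussian or a uniform distribution.
The code below samples from a Gaussian and a uniform distribution and visualizes both:
# Draw 10000 samples from N(0, 1) and from U(-1, 1)
gaussian_weights = torch.normal(mean=0.0, std=1.0, size=[10000])
uniform_weights = torch.empty(10000).uniform_(-1, 1)
# Plot the two parameter distributions
plt.figure()
plt.subplot(1, 2, 1)
plt.title('Gaussian Distribution')
plt.hist(gaussian_weights.numpy(), bins=200, density=True, color='#f19ec2')
plt.subplot(1, 2, 2)
plt.title('Uniform Distribution')
plt.hist(uniform_weights.numpy(), bins=200, density=True, color='#e4007f')
plt.savefig('fw-gausian-uniform.pdf')
plt.show()
The output is as follows: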
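4.4.2 The Vanishing Gradient Problem
Vanishing gradient: the gradients approach zero, so the network weights are not updated or change only minimally, and training brings no improvement however long it runs.
In theory, the fitting capacity of a neural network should improve as more layers are added. In practice, parameter learning becomes harder as the network gets deeper, and the vanishing gradient problem appears easily.
Because Sigmoid-type functions saturate, their derivative is close to 0 in the saturated regions, and the error is attenuated every time it passes through a layer. In a very deep network the gradient keeps shrinking and may vanish altogether, which makes the whole network hard to train. This is the vanishing gradient problem.
There are many ways to mitigate the vanishing gradient problem in deep networks. A simple and effective one is to use an activation function with a larger derivative, such as ReLU.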
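The attenuation can be made concrete with a little arithmetic. The sketch below is a simplification that looks only at the activation derivatives and ignores the weight matrices that also appear in the chain rule; the helper chained_factor is a hypothetical name used for illustration. The derivative of the sigmoid is at most 0.25, so even in the best case four sigmoid layers scale the gradient by 0.25^4, whereas active ReLU units pass it through unchanged.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_factor(grad_fn, pre_activations):
    """Hypothetical helper: product of activation derivatives along one path."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


# Best case for the sigmoid: every pre-activation is 0, where the derivative peaks at 0.25
sig_factor = chained_factor(sigmoid_grad, [0.0] * 4)
# A saturated path: pre-activations of 5 give derivatives of roughly 0.0066 each
sat_factor = chained_factor(sigmoid_grad, [5.0] * 4)
# ReLU units that are active have derivative 1
relu_factor = chained_factor(relu_grad, [1.0] * 4)

print(sig_factor, sat_factor, relu_factor)
```

With four hidden layers, as in the network defined next, the sigmoid path keeps less than half a percent of the gradient even under ideal conditions.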
The following simple experiment demonstrates the vanishing gradient phenomenon in a feedforward network and a way to alleviate it.
4.4.2.1 Building the Model
Define a feedforward neural network with 4 hidden layers and 1 output layer, whose activation function is selected by an argument. The code is as follows:
from functools import partial

from torch.nn.init import constant_, normal_


class Model_MLP_L5(nn.Module):
    # w_init and b_init are callables that initialise a tensor in place
    def __init__(self, input_size, output_size, act='sigmoid',
                 w_init=partial(normal_, mean=0.0, std=0.01),
                 b_init=partial(constant_, val=1.0)):
        super(Model_MLP_L5, self).__init__()
        self.fc1 = nn.Linear(input_size, 3)
        self.fc2 = nn.Linear(3, 3)
        self.fc3 = nn.Linear(3, 3)
        self.fc4 = nn.Linear(3, 3)
        self.fc5 = nn.Linear(3, output_size)
        # Select the activation function used by the network
        if act == 'sigmoid':
            self.act = torch.sigmoid
        elif act == 'relu':
            self.act = F.relu
        elif act == 'lrelu':
            self.act = F.leaky_relu
        else:
            raise ValueError("Please enter sigmoid relu or lrelu!")
        # Initialise the weights and biases of the linear layers
        self.init_weights(w_init, b_init)

    # Initialise the weights and biases of the linear layers
    def init_weights(self, w_init, b_init):
        # Iterate over all sub-modules with modules()
        for m in self.modules():
            # Apply the given initialisers to every linear layer
            if isinstance(m, nn.Linear):
                w_init(m.weight)
                b_init(m.bias)

    def forward(self, inputs):
        outputs = self.fc1(inputs)
        outputs = self.act(outputs)
        outputs = self.fc2(outputs)
        outputs = self.act(outputs)
        outputs = self.fc3(outputs)
        outputs = self.act(outputs)
        outputs = self.fc4(outputs)
        outputs = self.act(outputs)
        outputs = self.fc5(outputs)
        outputs = torch.sigmoid(outputs)
        return outputs
4.4.2.2 Training with the Sigmoid Function
Use the Sigmoid function as the activation function. To make the vanishing gradient easy to observe, only one round of optimization is run. The code is as follows:
Define the gradient printing function:
def print_grads(runner):
    # Print the L2 norm of the gradient of each layer's weight matrix
    print('The gradient of the Layers:')
    for name, item in runner.model.named_parameters():
        if len(item.size()) == 2:
            print(name, torch.norm(item.grad, p=2))


# Learning rate
lr = 0.01
# Define the network with the sigmoid activation function
model = Model_MLP_L5(input_size=2, output_size=1, act='sigmoid')
# Define the optimizer
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# Define the loss function (cross-entropy)
loss_fn = F.binary_cross_entropy
# Define the evaluation metric
metric = accuracy
# Specify the gradient printing function
custom_print_log = print_grads
Instantiate the RunnerV2_2 class and pass in the training configuration. The code is as follows:
# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
Train the model and print the ℓ2 norm of the gradient of each layer. The code is as follows:
# Start training
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=1, log_epochs=None, save_path="best_model.pdparams", custom_print_log=custom_print_log)
The output is as follows:
The results show that the gradient is attenuated every time it passes through a layer; by the time it reaches the first layer it has almost completely vanished.
4.4.2.3 Training with the ReLU Function
torch.manual_seed(102)
# Learning rate
lr = 0.01
# Define the network with the ReLU activation function
model = Model_MLP_L5(input_size=2, output_size=1, act='relu')
# Define the optimizer
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# Define the loss function (cross-entropy)
loss_fn = F.binary_cross_entropy
# Define the evaluation metric
metric = accuracy
# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
# Start training
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=1, log_epochs=None, save_path="best_model.pdparams", custom_print_log=custom_print_log)
The output is as follows:
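The figure below shows the norm of the gradient of each layer for the different activation functions. When the 5-layer fully connected feedforward network uses the Sigmoid function as its activation, the gradient is attenuated at every layer and has almost completely vanished by the time it reaches the first layer. After switching to the ReLU activation function, the vanishing gradient is alleviated and the parameters of every layer receive a gradient.
4.4.3 The Dying ReLU Problem
The ReLU activation function alleviates the vanishing gradient problem to some extent, but in some situations it is prone to the dying ReLU problem, which makes the network hard to train. The reason is that the output of ReLU is always 0 when x<0. If, after an unfortunate parameter update during training, a ReLU neuron is not activated on any training sample (its output is always 0), the gradient of that neuron's own parameters stays at 0 and the neuron can never be activated again in later training. A simple and effective remedy is to replace the activation function with a ReLU variant such as Leaky ReLU.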
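A single neuron is enough to see the effect. The following plain-Python sketch uses made-up inputs, small weights and a bias of -8, so the pre-activation is negative for every sample; the helper weight_grads is hypothetical and assumes an upstream gradient of 1 for each sample.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, slope=0.01):
    return 1.0 if z > 0 else slope


def weight_grads(inputs, w, b, act_grad, upstream=1.0):
    """Hypothetical helper: summed gradient of one neuron's weights over a batch."""
    grads = [0.0] * len(w)
    for x in inputs:
        # Pre-activation of the neuron for this sample
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Chain rule: upstream gradient times the activation derivative
        g = upstream * act_grad(z)
        for i, xi in enumerate(x):
            grads[i] += g * xi
    return grads


# Made-up inputs and parameters; the large negative bias keeps z below 0
inputs = [[0.5, 1.0], [-1.0, 0.3], [2.0, -0.7]]
w = [0.01, -0.02]
b = -8.0

dead = weight_grads(inputs, w, b, relu_grad)
leaky = weight_grads(inputs, w, b, leaky_relu_grad)
print(dead, leaky)
```

With ReLU every sample contributes a zero gradient, so the weights can never move the neuron back into its active region; with the leaky variant a small but non-zero gradient survives.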
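4.4.3.1 Training with ReLU
We reuse the multi-layer fully connected feedforward network defined in Section 4.4.2, with ReLU as the activation function, to observe the dying ReLU phenomenon and how to remedy it. When the bias of a layer is initialized to a negative value that is large relative to the weights, the output of the layer after processing its input will be negative, which leads to dying ReLU.
from functools import partial

# Define the network and initialise the biases with a large negative value
model = Model_MLP_L5(input_size=2, output_size=1, act='relu', b_init=partial(constant_, val=-8.0))
# Rebind the optimizer to the parameters of the new model
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
Instantiate the RunnerV2_2 class, start training, and print the ℓ2 norm of the gradient of each layer. The code is as follows:
# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
# Start training
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=1, log_epochs=0, save_path="best_model.pdparams", custom_print_log=custom_print_log)
The output is as follows: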
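The output shows that with ReLU as the activation function, the dying ReLU problem occurs once the condition above is met: the gradients of the ReLU neurons remain 0 throughout training and the parameters cannot be updated.
A simple and effective remedy for the dying ReLU problem is to replace the activation function with a ReLU variant such as Leaky ReLU. Next, we observe the gradients after switching the activation function to Leaky ReLU.
4.4.3.2 Training with Leaky ReLU
Replace the activation function with Leaky ReLU and train the model to observe the gradients. The code is as follows:
from functools import partial

# Redefine the network with the Leaky ReLU activation function
model = Model_MLP_L5(input_size=2, output_size=1, act='lrelu', b_init=partial(constant_, val=-8.0))
# Rebind the optimizer to the parameters of the new model
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# Instantiate the Runner class
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
# Start training
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=1, log_epochs=None, save_path="best_model.pdparams", custom_print_log=custom_print_log)
The output is as follows: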
The gradient of the Layers:
fc1.weight tensor(0.7638, grad_fn=<NormBackward1>)
fc2.weight tensor(1.1522, grad_fn=<NormBackward1>)
fc3.weight tensor(1.0465, grad_fn=<NormBackward1>)
fc4.weight tensor(1.0655, grad_fn=<NormBackward1>)
fc5.weight tensor(0.5679, grad_fn=<NormBackward1>)
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.48750
[Train] epoch: 0/10, loss: 0.6964393854141235
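The output shows that after switching the activation function to Leaky ReLU, the dying ReLU problem is alleviated: the gradients return to normal and the parameters can be updated. However, because the slope of Leaky ReLU for x<0 defaults to only 0.01, the gradient still becomes smaller and smaller during backpropagation as the network gets deeper. To improve this, increase the slope that Leaky ReLU uses for x<0.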
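One way to raise the slope is the negative_slope argument of torch.nn.functional.leaky_relu. The sketch below binds a larger slope with functools.partial; the value 0.2 and the name steeper are arbitrary choices for illustration, not settings taken from this experiment.

```python
from functools import partial

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# Default negative slope is 0.01
default_out = F.leaky_relu(x)

# Bind a larger negative slope (0.2 is an arbitrary illustrative value)
steeper = partial(F.leaky_relu, negative_slope=0.2)
steeper_out = steeper(x)

print(default_out)
print(steeper_out)
```

A callable built this way has the same call signature as the activation functions stored in the model's act attribute, so it could be assigned there in place of F.leaky_relu.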
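Reflections
Somehow my follower count has passed one hundred, which feels like quite an achievement. Through this experiment I studied the use of torch.nn.Module in detail, as well as the differences between manual and automatic gradient computation. It was well worth the effort.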