TensorFlow 2: A Log of Pitfalls and Fixes

【Problem 1】 TypeError: call() got an unexpected keyword argument 'use_causal_mask'

【Source code】

class CausalSelfAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.add = tf.keras.layers.Add()
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        attn_output = self.mha(query=x, value=x, key=x, use_causal_mask=True)  # use_causal_mask requires TF >= 2.10
        x = self.add([x, attn_output])
        x = self.layer_norm(x)
        return x

【Solution】

Reference: TypeError: call() got an unexpected keyword argument 'use_causal_mask' —> the same error seen on the flickr8k/flickr30k datasets

My initial environment was tensorflow-gpu 2.6.0, but the use_causal_mask argument of tf.keras.layers.MultiHeadAttention was only introduced in TensorFlow 2.10.0, so I installed tensorflow-gpu 2.10.0 (only because I did not want to write the mask by hand again...).

(For CUDA 11.2-11.4 and cuDNN 8.1, see: how to determine the dependency versions required for installing TensorFlow 2.10.0)

conda create -n tensorflow python=3.8
conda activate tensorflow
pip install tensorflow-gpu==2.10.0 -i https://pypi.douban.com/simple/

Verify the installation in the Anaconda Prompt:

python
import tensorflow as tf
tf.test.is_gpu_available()  # deprecated; tf.config.list_physical_devices('GPU') is the current equivalent
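
Alternatively, for anyone staying on TensorFlow < 2.10, the causal mask can be built by hand and passed through the attention_mask argument, which older versions of MultiHeadAttention already accept (per the docs it broadcasts over the batch and head dimensions). A minimal sketch of my own, with made-up layer sizes:

import tensorflow as tf

class CausalSelfAttentionCompat(tf.keras.layers.Layer):
    """Causal self-attention without use_causal_mask (TF < 2.10)."""

    def __init__(self, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.add = tf.keras.layers.Add()
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        seq_len = tf.shape(x)[1]
        # Lower-triangular (T, T) mask: query position i may attend only to keys j <= i.
        causal_mask = tf.range(seq_len)[:, None] >= tf.range(seq_len)[None, :]
        attn_output = self.mha(query=x, value=x, key=x, attention_mask=causal_mask)
        x = self.add([x, attn_output])
        return self.layer_norm(x)

# Example with made-up sizes: (batch=4, seq_len=10, features=64)
layer = CausalSelfAttentionCompat(num_heads=2, key_dim=32)
print(layer(tf.random.normal((4, 10, 64))).shape)  # (4, 10, 64)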

【Problem 2】 ValueError: Cannot convert a partially known TensorShape (None, None, 200) to a Tensor.

【Source code】

def call(x):  # x is a Tensor
    random_data = tf.random.normal(x.shape, mean=0.0, stddev=1.0)  # fails: x.shape is only partially known in graph mode
    data_score = tf.nn.softmax(x + random_data, axis=-1)
    return data_score

【Solution】

tf.random.normal(shape, mean=0.0, stddev=1.0, dtype=tf.dtypes.float32)

This random-generation function accepts a 1-D integer Tensor or a Python list as shape, but what was passed in is x.shape, a (possibly partially known) TensorShape: inside a graph function some of its dimensions are None, so it cannot be converted to a Tensor.

Therefore, changing x.shape to tf.shape(x), which returns the dynamic shape as a Tensor, fixes the error:

random_data = tf.random.normal(tf.shape(x), mean=0.0, stddev=1.0)

【Extension】 The various shape accessors of a Tensor

Reference: Cannot convert a partially known TensorShape to a Tensor

The problem above involves the various shape accessors of a Tensor: tf.shape() returns a Tensor, while Tensor.shape and Tensor.get_shape() return a TensorShape (a tuple-like object).

(tf.Tensor.get_shape() and tf.Tensor.shape are equivalent.)

feature = tf.constant([[3.7, 3.8, 2.5], [3.6, 3.7, 3.8]])
# tf.Tensor(
# [[3.7 3.8 2.5]
#  [3.6 3.7 3.8]], shape=(2, 3), dtype=float32)

shape1 = tf.shape(feature)
print(shape1)  # tf.Tensor([2 3], shape=(2,), dtype=int32)

shape2 = feature.shape
print(shape2)  # TensorShape (2, 3)

shape3 = feature.get_shape()
print(shape3)  # TensorShape (2, 3)

The difference is even clearer when extracting a single dimension from the shape:

shape1 = tf.shape(feature)[-1]
print(shape1)  # tf.Tensor(3, shape=(), dtype=int32)

shape2 = feature.shape[-1]
print(shape2)  # {int} 3

shape3 = feature.get_shape()[-1]
print(shape3)  # {int} 3

Accordingly, when iterating in a loop:

for i in range(tensor.shape[0]):         ...  # i is a Python int (static shape)
for i in tf.range(tf.shape(tensor)[0]):  ...  # i is a tf.Tensor (dynamic, graph-safe)
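
The distinction matters most inside a tf.function, where the batch dimension is often unknown at trace time. A small sketch of my own (not from the original post) contrasting the two:

import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def describe(x):
    static_batch = x.shape[0]       # None at trace time: statically unknown
    dynamic_batch = tf.shape(x)[0]  # scalar int32 Tensor, resolved at run time
    tf.print('static:', 'None' if static_batch is None else static_batch,
             'dynamic:', dynamic_batch)
    return x

describe(tf.zeros((5, 3)))  # prints: static: None dynamic: 5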

【Problem 3】 AttributeError: 'Tensor' object has no attribute 'numpy'

【Source code】

features = features.numpy()  # features is a Tensor

【Solution】

This seems related to eager execution (switching to Tensor.eval() does not help either).

For Tensor.eval(), the official docs (tensorflow.google.cn/versions/r2.10/api_docs/python/tf/Tensor#eval) say that if you are not using the compat.v1 library, there is no need to call eval under eager execution (or within a tf.function).

Note: Before invoking Tensor.eval(), its graph must have been launched in a session, and either a default session must be available, or session must be specified explicitly.

However, printing numpy() from the console works fine (the problem seems to occur inside call_fn; I don't remember exactly):

print(features.numpy())  # ndarray

1) Reference: Tensorflow 2.3: AttributeError: 'Tensor' object has no attribute 'numpy' with eager mode enabled

Add experimental_run_tf_function=False when calling model.compile; tried it, but the same error persists.

2) Reference: AttributeError: 'Tensor' object has no attribute 'numpy' in custom loss function (Tensorflow 2.1.0)

Set run_eagerly=True when calling model.compile; tried that too, and it fails with the error below (still unresolved...):

W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.

[[{{node PyFunc}}]]
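
For context: inside a traced tf.function the tensors are symbolic and carry no concrete value, so .numpy() cannot work there, while the console runs eagerly on concrete EagerTensors. One common workaround, shown here as my own sketch rather than a fix from the referenced threads, is to push the numpy-dependent logic into tf.py_function, whose body executes eagerly:

import numpy as np
import tensorflow as tf

def numpy_side_effect(features):
    # This runs eagerly, so .numpy() is available here.
    return np.float32(features.numpy().mean())

@tf.function
def call_fn(features):
    # features is symbolic here; delegate to py_function for eager access.
    mean = tf.py_function(numpy_side_effect, inp=[features], Tout=tf.float32)
    mean.set_shape(())  # py_function loses static shape information
    return mean

print(call_fn(tf.constant([[1.0, 2.0], [3.0, 4.0]])))  # tf.Tensor(2.5, ...)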

【Problem 4】 The tensor <tf.Tensor 'mo_c_item_rec/while/ExpandDims:0' shape=(1, None, 300) dtype=float32> cannot be accessed from FuncGraph(name=train_function, id=2786494381984), because it was defined in FuncGraph(name=mo_c_item_rec_while_body_6692810, id=2786386237952), which is out of scope.

【Source code】

def _batch_encoder(self, x):
    item_representations = []  # python list

    for i in tf.range(tf.shape(x)[0]):
        item_embed = self.item_encoder(x[i])
        item_representations.append(tf.expand_dims(item_embed, axis=0))

    batch_item_representations = tf.concat(item_representations, axis=0)
    return batch_item_representations

【Solution】

Reference: TensorFlow: cannot be accessed from here, because it was defined in another function or code block

The error is caused by using a Python list to temporarily hold Tensor objects. After the function is traced, memory management drops what the list captured, so those tensors cannot be accessed across function graphs. To keep such temporary tensors, use tf.TensorArray instead.

The modified code is as follows:

def _batch_item_encoder(self, x):
    item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)

    for i in tf.range(tf.shape(x)[0]):
        item_embed = self.item_encoder(x[i])
        item_representations = item_representations.write(i, item_embed)

    batch_item_representations = item_representations.stack()
    return batch_item_representations

【Problem 5】 ERROR:tensorflow:==================================

Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>): <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x0000024FB744FC70>

If you want to mark it as used call its "mark_used()" method.

【Source code】

item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
for i in tf.range(tf.shape(x)[0]):
    item_embed = self.item_encoder(x[i])
    item_representations.write(i, item_embed)  # wrong

【Solution】

Reference: some pitfalls when manipulating a TensorArray in TensorFlow loops

When writing a value into a tf.TensorArray at a given index, the write method returns a new TensorArray, which must be assigned back. Compare:

ta = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
ta.write(ids, tensor)       # wrong 
ta = ta.write(ids, tensor)  # right 

Therefore, the erroneous code above should be modified as follows:

item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
for i in tf.range(tf.shape(x)[0]):
    item_embed = self.item_encoder(x[i])
    item_representations = item_representations.write(i, item_embed)  # right
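
Putting Problems 4 and 5 together, here is a self-contained runnable sketch of the whole pattern (my example; the Dense layer is a hypothetical stand-in for the original self.item_encoder):

import tensorflow as tf

# Stand-in for the original item_encoder; built up front because Keras layers
# cannot create their variables inside a graph-mode loop body.
item_encoder = tf.keras.layers.Dense(8)
item_encoder.build((None, 16))

@tf.function
def batch_item_encoder(x):  # x: (batch, items, features)
    item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True,
                                          clear_after_read=False)
    for i in tf.range(tf.shape(x)[0]):
        item_embed = item_encoder(x[i])
        # write() returns the updated TensorArray; the result must flow back.
        item_representations = item_representations.write(i, item_embed)
    return item_representations.stack()

print(batch_item_encoder(tf.random.normal((4, 5, 16))).shape)  # (4, 5, 8)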

【Problem 6】 OperatorNotAllowedInGraphError: Iterating over a symbolic `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

【Source code】

feature = tf.constant([[3.7, 3.8, 2.5], [3.6, 2.6, 3.8]])
indices = tf.constant([[0, 2], [0, 2]])
updates = tf.constant([[0.12, 0.13], [0.26, 0.23]])

for index, value in enumerate(feature):  # value is a 1-D Tensor (one row)
    token_idx = indices[index]  # 1-D Tensor of column indices
    token_val = updates[index]  # 1-D Tensor of replacement values
    for idx, val in enumerate(token_idx):
        value[val] = token_val[idx]  # wrong: a tf.Tensor does not support item assignment

Note: the last line of the source code above is also wrong in itself; a Tensor cannot be assigned to this way, and it raises:

NotImplementedError: Cannot convert a symbolic tf.Tensor to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.

【Solution】

Reference: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function

My goal was to update values of the Tensor at the positions given by indices, which is why I wanted to traverse the whole Tensor with both index and value at once.

But a symbolic Tensor is clearly not iterable, so I switched to tf.range plus tf.tensor_scatter_nd_update (or tf.TensorArray). The modified code:

def function(x):  # x is a 2-D tensor
    token_top_mask = tf.zeros_like(x)
    for index in tf.range(tf.shape(x)[0]):  # index is a 0-D Tensor
        indices = tf.constant([[0], [2]])
        updates = tf.constant([0.13, 0.12])
        token_mask = tf.tensor_scatter_nd_update(x[index], indices, updates)

        indices = tf.expand_dims(tf.expand_dims(index, axis=-1), axis=-1)
        token_mask = tf.expand_dims(token_mask, axis=0)
        token_top_mask = tf.tensor_scatter_nd_update(token_top_mask, indices, token_mask)
    return token_top_mask
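
As an aside (my addition, not from the referenced answer): when all target positions are known up front, the Python-level loop can be dropped entirely by handing tf.tensor_scatter_nd_update full (row, column) index pairs in a single call:

import tensorflow as tf

feature = tf.constant([[3.7, 3.8, 2.5], [3.6, 2.6, 3.8]])

# (row, column) pairs covering what the original nested loops attempted.
indices = tf.constant([[0, 0], [0, 2], [1, 0], [1, 2]])
updates = tf.constant([0.12, 0.13, 0.26, 0.23])

result = tf.tensor_scatter_nd_update(feature, indices, updates)
print(result)
# tf.Tensor(
# [[0.12 3.8  0.13]
#  [0.26 2.6  0.23]], shape=(2, 3), dtype=float32)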

【Problem 7】 AttributeError: 'tuple' object has no attribute 'shape'
Traceback (most recent call last):
  File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
AttributeError: in user code:

    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1146, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1135, in run_step  **
        outputs = model.train_step(data)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 998, in train_step
        return self.compute_metrics(x, y, y_pred, sample_weight)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1092, in compute_metrics
        self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 577, in update_state
        self.build(y_pred, y_true)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 483, in build
        self._metrics = tf.__internal__.nest.map_structure_up_to(
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 631, in _get_metric_objects
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 631, in <listcomp>
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 652, in _get_metric_object
        y_t_rank = len(y_t.shape.as_list())

    AttributeError: 'tuple' object has no attribute 'shape'

【Source code】

model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(lr=0.00005),
              metrics=['acc'])

【Solution】

Reference: as_list() is not defined on an unknown TensorShape on y_t_rank = len(y_t.shape.as_list()) and related to metrics

The traceback points at the metrics argument. I don't know why the string 'acc' fails here, but replacing it with an explicit metric object, tf.keras.metrics.Accuracy(), made the error go away:

model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.00005),  # lr is a deprecated alias for learning_rate
              metrics=[tf.keras.metrics.Accuracy()])
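
One caveat worth noting (my addition, not from the original post): tf.keras.metrics.Accuracy() checks element-wise exact equality between y_true and y_pred, which is usually not meaningful for one-hot labels compared against softmax probabilities. The metric that matches CategoricalCrossentropy is CategoricalAccuracy, so the following variant may be closer to what 'acc' was meant to do:

import tensorflow as tf

# `model` is assumed to be the same model as above.
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.00005),
              metrics=[tf.keras.metrics.CategoricalAccuracy()])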

【Problem 8】 GPU resource exhausted (OOM) while allocating a tensor via allocator GPU_0_bfc

The allocator GPU_0_bfc ran out of memory while trying to allocate the 10.0KiB requested by an op.

If the cause is memory fragmentation, the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async may improve the situation.

The current allocation summary follows:

......
2023-03-08 17:45:50.687264: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 6039a5f00 of size 3514624 next 18446744073709551615
2023-03-08 17:45:50.687542: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2023-03-08 17:45:50.687791: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 25248 Chunks of size 256 totalling 6.16MiB
2023-03-08 17:45:50.688040: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 58 Chunks of size 768 totalling 43.5KiB
......
2023-03-08 17:45:50.763248: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 3.88GiB
2023-03-08 17:45:50.763478: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 4163895296 memory_limit_: 4163895296 available bytes: 0 curr_region_allocation_bytes_: 8327790592
2023-03-08 17:45:50.763885: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      4163895296
InUse:                      4163893504
MaxInUse:                   4163895296
NumAllocs:                     1745563
MaxAllocSize:                 65280000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
2023-03-08 17:45:50.765619: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ****************************************************************************************************
2023-03-08 17:45:50.765978: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at matmul_op_impl.h:681 : Resource exhausted: OOM when allocating tensor with shape[150,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2023-03-08 17:46:00.774886: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 10.0KiB (rounded to 10240)requested by op mo_e_news_rec/while/body/_1/mo_c_item_rec/while/item_encoder/naive_layer/dense/Tensordot_1/MatMul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-03-08 17:46:00.776133: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2023-03-08 17:46:00.776371: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256): Total Chunks: 25255, Chunks in use: 25248. 6.17MiB allocated for chunks. 6.16MiB in use in bin. 330.2KiB client-requested in use in bin.
......
2023-03-08 17:46:00.794044: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 10.0KiB was 8.0KiB, Chunk State: 
2023-03-08 17:46:00.794290: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 4163895296
2023-03-08 17:46:00.794513: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 50ba00000 of size 256 next 1
2023-03-08 17:46:00.794750: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 50ba00100 of size 1280 next 2
......

【Solution】

I found several reference threads on this; the gist is simply that GPU memory is exhausted.

So: either reduce batch_size, or shrink the model parameters, or move to a machine with more GPUs and more memory.

(If the model parameters overshoot the memory by a lot, reducing batch_size barely helps; memory still runs out, and the only option is to shrink the model...)

TensorFlow not working on GPU: using too much memory. How to fix?

On python: TensorFlow out of memory and CPU/GPU usage

TensorFlow runs out of GPU memory: Allocator (GPU_0_bfc) ran out of memory when trying to allocate
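
Before shrinking the model, the fragmentation hint printed in the log above is cheap to try (a sketch; whether it helps depends on the workload, and the variable must be set before TensorFlow touches the GPU):

import os

# Must be set before TensorFlow initializes the GPU, i.e. before importing it.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf  # imported after setting the variable on purpose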

【Problem 9】 Attempting to perform a BLAS operation using StreamExecutor without BLAS support, while multiple kernels are running
2023-03-08 22:56:44.493180: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-03-08 22:56:44.493261: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at matmul_op_impl.h:442 : Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/training.py", line 1975, in fit_generator
    return self.fit(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/training.py", line 1134, in fit
    data_handler = data_adapter.get_data_handler(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 1383, in get_data_handler
    return DataHandler(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 1138, in __init__
    self._adapter = adapter_cls(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 917, in __init__
    super(KerasSequenceAdapter, self).__init__(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 801, in __init__
    model.distribute_strategy.run(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1286, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2849, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3632, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
    return func(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 802, in <lambda>
    lambda x: model(x, training=False), args=(concrete_x,))
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1037, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/layers/einsum_dense.py", line 197, in call
    ret = tf.einsum(self.equation, inputs, self.kernel)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/ops/special_math_ops.py", line 751, in einsum
    return _einsum_v2(equation, *inputs, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/ops/special_math_ops.py", line 1180, in _einsum_v2
    return gen_linalg_ops.einsum(inputs, resolved_equation)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 1076, in einsum
    _ops.raise_from_not_ok_status(e, name)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 6941, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:Einsum]

【Solution】

Reference: InternalError: Attempting to perform BLAS operation using StreamExecutor without BLAS support

【TensorFlow error log】 InternalError: Attempting to perform BLAS operation using StreamExecutor without BL

"Attempting to perform BLAS operation using StreamExecutor without BLAS support" error occurs
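
The common thread in these references is that cuBLAS fails to create its handle when the GPU's memory has already been fully claimed, either by another process or by TensorFlow's default strategy of pre-allocating everything. A minimal sketch of the mitigation those threads usually suggest, run before any model code:

import tensorflow as tf

# Allocate GPU memory on demand instead of claiming it all at startup,
# leaving room for cuBLAS to initialize its handle.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)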

【Problem 10】 Masking in TensorFlow

The following articles may be useful for reference:

Study notes on masking and padding in TensorFlow Keras

Masking and padding (padding & mask) in TensorFlow 2, plus dynamic_rnn study notes

A code walkthrough of masks in TensorFlow 2.0
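
As a minimal taste of the topic (my sketch, not taken from those articles): an Embedding layer with mask_zero=True treats token id 0 as padding and propagates a boolean mask that mask-aware layers such as LSTM consume automatically:

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8, mask_zero=True)
inputs = tf.constant([[3, 7, 0, 0],   # two real tokens, two padding positions
                      [5, 2, 9, 0]])

mask = embedding.compute_mask(inputs)
print(mask)
# tf.Tensor(
# [[ True  True False False]
#  [ True  True  True False]], shape=(2, 4), dtype=bool)

# A mask-aware layer skips the padded steps when aggregating.
out = tf.keras.layers.LSTM(4)(embedding(inputs), mask=mask)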

【Problem 11】 High GPU memory usage but low GPU utilization during TensorFlow training

Reference: TensorFlow high GPU memory usage with low GPU utilization

【Solutions】 the problem of high GPU memory share but low GPU utilization in TensorFlow

First, check the GPU usage:

nvidia-smi -l 5  # refresh every 5 seconds

A summary of the approaches I have seen online:

【1】 Increase batch_size (applies when memory usage is low + GPU utilization is low)

【2】 Decrease batch_size (applies when memory usage is high + GPU utilization is low)

【3】 A problem in the code:

During data preprocessing, the CPU reads data from disk into memory and the GPU reads it from memory to train on; if the CPU cannot keep up with the GPU's read-and-train speed, the GPU sits idle for long stretches waiting for data; see the input-pipeline sketch after the references below.

Reference: TensorFlow learning: why GPU memory is full while utilization (util) stays low, and how to improve it

Training inefficient? GPU utilization won't go up? Come and see other people's tricks~
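
For the input-pipeline bottleneck described in 【3】, the standard remedy is to overlap CPU preprocessing with GPU training via tf.data. A sketch with made-up file names and preprocessing:

import tensorflow as tf

def parse_example(path):
    # Hypothetical preprocessing: decode an image file into a training tensor.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, (224, 224)) / 255.0

dataset = (tf.data.Dataset.list_files('data/*.jpg')                  # made-up file pattern
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU preprocessing
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                              # prepare the next batch while the GPU trains

# model.fit(dataset, epochs=10)  # the GPU no longer idles waiting for data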
