TensorFlow 2: A Log of Pitfalls and Fixes

【Problem 1】 TypeError: call() got an unexpected keyword argument 'use_causal_mask'

【Source code】

class CausalSelfAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.add = tf.keras.layers.Add()
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        attn_output = self.mha(query=x, value=x, key=x, use_causal_mask=True)  # use_causal_mask requires TF >= 2.10
        x = self.add([x, attn_output])
        x = self.layer_norm(x)
        return x

【Solution】

Reference: TypeError: call() got an unexpected keyword argument 'use_causal_mask' —> the same error seen on the flickr8k/flickr30k datasets

My initial environment was tensorflow-gpu 2.6.0, but the use_causal_mask argument of tf.keras.layers.MultiHeadAttention was only introduced in TensorFlow 2.10.0, so I installed tensorflow-gpu 2.10.0 (only because I did not want to write the mask by hand again...).

(For CUDA 11.2-11.4 and cuDNN 8.1, see: how to determine the dependency versions required for installing TensorFlow 2.10.0)

conda create -n tensorflow python=3.8
conda activate tensorflow
pip install tensorflow-gpu==2.10.0 -i https://pypi.douban.com/simple/

Verify the installation in the Anaconda Prompt:

python
import tensorflow as tf
tf.test.is_gpu_available()  # deprecated; tf.config.list_physical_devices('GPU') is the current equivalent
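
Alternatively, for anyone staying on TensorFlow < 2.10, the causal mask can be built by hand and passed through the attention_mask argument, which older versions of MultiHeadAttention already accept (per the docs it broadcasts over the batch and head dimensions). A minimal sketch of my own, with made-up layer sizes:

import tensorflow as tf

class CausalSelfAttentionCompat(tf.keras.layers.Layer):
    """Causal self-attention without use_causal_mask (TF < 2.10)."""

    def __init__(self, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.add = tf.keras.layers.Add()
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        seq_len = tf.shape(x)[1]
        # Lower-triangular (T, T) mask: query position i may attend only to keys j <= i.
        causal_mask = tf.range(seq_len)[:, None] >= tf.range(seq_len)[None, :]
        attn_output = self.mha(query=x, value=x, key=x, attention_mask=causal_mask)
        x = self.add([x, attn_output])
        return self.layer_norm(x)

# Example with made-up sizes: (batch=4, seq_len=10, features=64)
layer = CausalSelfAttentionCompat(num_heads=2, key_dim=32)
print(layer(tf.random.normal((4, 10, 64))).shape)  # (4, 10, 64)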

【Problem 2】 ValueError: Cannot convert a partially known TensorShape (None, None, 200) to a Tensor.

【Source code】

def call(x):  # x is a Tensor
    random_data = tf.random.normal(x.shape, mean=0.0, stddev=1.0)  # fails: x.shape is only partially known in graph mode
    data_score = tf.nn.softmax(x + random_data, axis=-1)
    return data_score

【Solution】

tf.random.normal(shape, mean=0.0, stddev=1.0, dtype=tf.dtypes.float32)

This random-generation function accepts a 1-D integer Tensor or a Python list as shape, but what was passed in is x.shape, a (possibly partially known) TensorShape: inside a graph function some of its dimensions are None, so it cannot be converted to a Tensor.

Therefore, changing x.shape to tf.shape(x), which returns the dynamic shape as a Tensor, fixes the error:

random_data = tf.random.normal(tf.shape(x), mean=0.0, stddev=1.0)

【Extension】 The various shape accessors of a Tensor

Reference: Cannot convert a partially known TensorShape to a Tensor

The problem above involves the various shape accessors of a Tensor: tf.shape() returns a Tensor, while Tensor.shape and Tensor.get_shape() return a TensorShape (a tuple-like object).

(tf.Tensor.get_shape() and tf.Tensor.shape are equivalent.)

feature = tf.constant([[3.7, 3.8, 2.5], [3.6, 3.7, 3.8]])
# tf.Tensor(
# [[3.7 3.8 2.5]
#  [3.6 3.7 3.8]], shape=(2, 3), dtype=float32)

shape1 = tf.shape(feature)
print(shape1)  # tf.Tensor([2 3], shape=(2,), dtype=int32)

shape2 = feature.shape
print(shape2)  # TensorShape (2, 3)

shape3 = feature.get_shape()
print(shape3)  # TensorShape (2, 3)

The difference is even clearer when extracting a single dimension from the shape:

shape1 = tf.shape(feature)[-1]
print(shape1)  # tf.Tensor(3, shape=(), dtype=int32)

shape2 = feature.shape[-1]
print(shape2)  # {int} 3

shape3 = feature.get_shape()[-1]
print(shape3)  # {int} 3

Accordingly, when iterating in a loop:

for i in range(tensor.shape[0]):         ...  # i is a Python int (static shape)
for i in tf.range(tf.shape(tensor)[0]):  ...  # i is a tf.Tensor (dynamic, graph-safe)
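
The distinction matters most inside a tf.function, where the batch dimension is often unknown at trace time. A small sketch of my own (not from the original post) contrasting the two:

import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def describe(x):
    static_batch = x.shape[0]       # None at trace time: statically unknown
    dynamic_batch = tf.shape(x)[0]  # scalar int32 Tensor, resolved at run time
    tf.print('static:', 'None' if static_batch is None else static_batch,
             'dynamic:', dynamic_batch)
    return x

describe(tf.zeros((5, 3)))  # prints: static: None dynamic: 5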

【Problem 3】 AttributeError: 'Tensor' object has no attribute 'numpy'

【Source code】

features = features.numpy()  # features is a Tensor

【Solution】

This seems related to eager execution (switching to Tensor.eval() does not help either).

For Tensor.eval(), the official docs (tensorflow.google.cn/versions/r2.10/api_docs/python/tf/Tensor#eval) say that if you are not using the compat.v1 library, there is no need to call eval under eager execution (or within a tf.function).

Note: Before invoking Tensor.eval(), its graph must have been launched in a session, and either a default session must be available, or session must be specified explicitly.

However, printing numpy() from the console works fine (the problem seems to occur inside call_fn; I don't remember exactly):

print(features.numpy())  # ndarray

1) Reference: Tensorflow 2.3: AttributeError: 'Tensor' object has no attribute 'numpy' with eager mode enabled

Add experimental_run_tf_function=False when calling model.compile; tried it, but the same error persists.

2) Reference: AttributeError: 'Tensor' object has no attribute 'numpy' in custom loss function (Tensorflow 2.1.0)

Set run_eagerly=True when calling model.compile; tried that too, and it fails with the error below (still unresolved...):

W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.

[[{{node PyFunc}}]]
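
For context: inside a traced tf.function the tensors are symbolic and carry no concrete value, so .numpy() cannot work there, while the console runs eagerly on concrete EagerTensors. One common workaround, shown here as my own sketch rather than a fix from the referenced threads, is to push the numpy-dependent logic into tf.py_function, whose body executes eagerly:

import numpy as np
import tensorflow as tf

def numpy_side_effect(features):
    # This runs eagerly, so .numpy() is available here.
    return np.float32(features.numpy().mean())

@tf.function
def call_fn(features):
    # features is symbolic here; delegate to py_function for eager access.
    mean = tf.py_function(numpy_side_effect, inp=[features], Tout=tf.float32)
    mean.set_shape(())  # py_function loses static shape information
    return mean

print(call_fn(tf.constant([[1.0, 2.0], [3.0, 4.0]])))  # tf.Tensor(2.5, ...)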

【Problem 4】 The tensor <tf.Tensor 'mo_c_item_rec/while/ExpandDims:0' shape=(1, None, 300) dtype=float32> cannot be accessed from FuncGraph(name=train_function, id=2786494381984), because it was defined in FuncGraph(name=mo_c_item_rec_while_body_6692810, id=2786386237952), which is out of scope.

【Source code】

def _batch_encoder(self, x):
    item_representations = []  # python list

    for i in tf.range(tf.shape(x)[0]):
        item_embed = self.item_encoder(x[i])
        item_representations.append(tf.expand_dims(item_embed, axis=0))

    batch_item_representations = tf.concat(item_representations, axis=0)
    return batch_item_representations

【Solution】

Reference: TensorFlow: cannot be accessed from here, because it was defined in another function or code block

The error is caused by using a Python list to temporarily hold Tensor objects. After the function is traced, memory management drops what the list captured, so those tensors cannot be accessed across function graphs. To keep such temporary tensors, use tf.TensorArray instead.

The modified code is as follows:

def _batch_item_encoder(self, x):
    item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)

    for i in tf.range(tf.shape(x)[0]):
        item_embed = self.item_encoder(x[i])
        item_representations = item_representations.write(i, item_embed)

    batch_item_representations = item_representations.stack()
    return batch_item_representations

【Problem 5】 ERROR:tensorflow:==================================

Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>): <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x0000024FB744FC70>

If you want to mark it as used call its "mark_used()" method.

【Source code】

item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
for i in tf.range(tf.shape(x)[0]):
    item_embed = self.item_encoder(x[i])
    item_representations.write(i, item_embed)  # wrong

【Solution】

Reference: some pitfalls when manipulating a TensorArray in TensorFlow loops

When writing a value into a tf.TensorArray at a given index, the write method returns a new TensorArray, which must be assigned back. Compare:

ta = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
ta.write(ids, tensor)       # wrong 
ta = ta.write(ids, tensor)  # right 

Therefore, the erroneous code above should be modified as follows:

item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
for i in tf.range(tf.shape(x)[0]):
    item_embed = self.item_encoder(x[i])
    item_representations = item_representations.write(i, item_embed)  # right
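
Putting Problems 4 and 5 together, here is a self-contained runnable sketch of the whole pattern (my example; the Dense layer is a hypothetical stand-in for the original self.item_encoder):

import tensorflow as tf

# Stand-in for the original item_encoder; built up front because Keras layers
# cannot create their variables inside a graph-mode loop body.
item_encoder = tf.keras.layers.Dense(8)
item_encoder.build((None, 16))

@tf.function
def batch_item_encoder(x):  # x: (batch, items, features)
    item_representations = tf.TensorArray(tf.float32, size=0, dynamic_size=True,
                                          clear_after_read=False)
    for i in tf.range(tf.shape(x)[0]):
        item_embed = item_encoder(x[i])
        # write() returns the updated TensorArray; the result must flow back.
        item_representations = item_representations.write(i, item_embed)
    return item_representations.stack()

print(batch_item_encoder(tf.random.normal((4, 5, 16))).shape)  # (4, 5, 8)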

【Problem 6】 OperatorNotAllowedInGraphError: Iterating over a symbolic `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

【Source code】

feature = tf.constant([[3.7, 3.8, 2.5], [3.6, 2.6, 3.8]])
indices = tf.constant([[0, 2], [0, 2]])
updates = tf.constant([[0.12, 0.13], [0.26, 0.23]])

for index, value in enumerate(feature):  # value is a 1-D Tensor (one row)
    token_idx = indices[index]  # 1-D Tensor of column indices
    token_val = updates[index]  # 1-D Tensor of replacement values
    for idx, val in enumerate(token_idx):
        value[val] = token_val[idx]  # wrong: a tf.Tensor does not support item assignment

Note: the last line of the source code above is also wrong in itself; a Tensor cannot be assigned to this way, and it raises:

NotImplementedError: Cannot convert a symbolic tf.Tensor to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.

【Solution】

Reference: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function

My goal was to update values of the Tensor at the positions given by indices, which is why I wanted to traverse the whole Tensor with both index and value at once.

But a symbolic Tensor is clearly not iterable, so I switched to tf.range plus tf.tensor_scatter_nd_update (or tf.TensorArray). The modified code:

def function(x):  # x is a 2-D tensor
    token_top_mask = tf.zeros_like(x)
    for index in tf.range(tf.shape(x)[0]):  # index is a 0-D Tensor
        indices = tf.constant([[0], [2]])
        updates = tf.constant([0.13, 0.12])
        token_mask = tf.tensor_scatter_nd_update(x[index], indices, updates)

        indices = tf.expand_dims(tf.expand_dims(index, axis=-1), axis=-1)
        token_mask = tf.expand_dims(token_mask, axis=0)
        token_top_mask = tf.tensor_scatter_nd_update(token_top_mask, indices, token_mask)
    return token_top_mask
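
As an aside (my addition, not from the referenced answer): when all target positions are known up front, the Python-level loop can be dropped entirely by handing tf.tensor_scatter_nd_update full (row, column) index pairs in a single call:

import tensorflow as tf

feature = tf.constant([[3.7, 3.8, 2.5], [3.6, 2.6, 3.8]])

# (row, column) pairs covering what the original nested loops attempted.
indices = tf.constant([[0, 0], [0, 2], [1, 0], [1, 2]])
updates = tf.constant([0.12, 0.13, 0.26, 0.23])

result = tf.tensor_scatter_nd_update(feature, indices, updates)
print(result)
# tf.Tensor(
# [[0.12 3.8  0.13]
#  [0.26 2.6  0.23]], shape=(2, 3), dtype=float32)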

【Problem 7】 AttributeError: 'tuple' object has no attribute 'shape'
Traceback (most recent call last):
  File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
AttributeError: in user code:

    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1146, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1135, in run_step  **
        outputs = model.train_step(data)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 998, in train_step
        return self.compute_metrics(x, y, y_pred, sample_weight)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\training.py", line 1092, in compute_metrics
        self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 577, in update_state
        self.build(y_pred, y_true)
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 483, in build
        self._metrics = tf.__internal__.nest.map_structure_up_to(
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 631, in _get_metric_objects
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 631, in <listcomp>
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "D:\Python\Anaconda3\envs\tensorflow2\lib\site-packages\keras\engine\compile_utils.py", line 652, in _get_metric_object
        y_t_rank = len(y_t.shape.as_list())

    AttributeError: 'tuple' object has no attribute 'shape'

【Source code】

model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(lr=0.00005),
              metrics=['acc'])

【Solution】

Reference: as_list() is not defined on an unknown TensorShape on y_t_rank = len(y_t.shape.as_list()) and related to metrics

The traceback points at the metrics argument. I don't know why the string 'acc' fails here, but replacing it with an explicit metric object, tf.keras.metrics.Accuracy(), made the error go away:

model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.00005),  # lr is a deprecated alias for learning_rate
              metrics=[tf.keras.metrics.Accuracy()])
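
One caveat worth noting (my addition, not from the original post): tf.keras.metrics.Accuracy() checks element-wise exact equality between y_true and y_pred, which is usually not meaningful for one-hot labels compared against softmax probabilities. The metric that matches CategoricalCrossentropy is CategoricalAccuracy, so the following variant may be closer to what 'acc' was meant to do:

import tensorflow as tf

# `model` is assumed to be the same model as above.
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.00005),
              metrics=[tf.keras.metrics.CategoricalAccuracy()])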

【Problem 8】 GPU resource exhausted (OOM) while allocating a tensor via allocator GPU_0_bfc

The allocator GPU_0_bfc ran out of memory while trying to allocate the 10.0KiB requested by an op.

If the cause is memory fragmentation, the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async may improve the situation.

The current allocation summary follows:

......
2023-03-08 17:45:50.687264: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 6039a5f00 of size 3514624 next 18446744073709551615
2023-03-08 17:45:50.687542: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2023-03-08 17:45:50.687791: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 25248 Chunks of size 256 totalling 6.16MiB
2023-03-08 17:45:50.688040: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 58 Chunks of size 768 totalling 43.5KiB
......
2023-03-08 17:45:50.763248: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 3.88GiB
2023-03-08 17:45:50.763478: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 4163895296 memory_limit_: 4163895296 available bytes: 0 curr_region_allocation_bytes_: 8327790592
2023-03-08 17:45:50.763885: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      4163895296
InUse:                      4163893504
MaxInUse:                   4163895296
NumAllocs:                     1745563
MaxAllocSize:                 65280000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
2023-03-08 17:45:50.765619: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ****************************************************************************************************
2023-03-08 17:45:50.765978: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at matmul_op_impl.h:681 : Resource exhausted: OOM when allocating tensor with shape[150,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2023-03-08 17:46:00.774886: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 10.0KiB (rounded to 10240)requested by op mo_e_news_rec/while/body/_1/mo_c_item_rec/while/item_encoder/naive_layer/dense/Tensordot_1/MatMul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-03-08 17:46:00.776133: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2023-03-08 17:46:00.776371: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256): Total Chunks: 25255, Chunks in use: 25248. 6.17MiB allocated for chunks. 6.16MiB in use in bin. 330.2KiB client-requested in use in bin.
......
2023-03-08 17:46:00.794044: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 10.0KiB was 8.0KiB, Chunk State: 
2023-03-08 17:46:00.794290: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 4163895296
2023-03-08 17:46:00.794513: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 50ba00000 of size 256 next 1
2023-03-08 17:46:00.794750: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 50ba00100 of size 1280 next 2
......

【Solution】

I found several reference threads on this; the gist is simply that GPU memory is exhausted.

So: either reduce batch_size, or shrink the model parameters, or move to a machine with more GPUs and more memory.

(If the model parameters overshoot the memory by a lot, reducing batch_size barely helps; memory still runs out, and the only option is to shrink the model...)

TensorFlow not working on GPU: using too much memory. How to fix?

On python: TensorFlow out of memory and CPU/GPU usage

TensorFlow runs out of GPU memory: Allocator (GPU_0_bfc) ran out of memory when trying to allocate
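
Before shrinking the model, the fragmentation hint printed in the log above is cheap to try (a sketch; whether it helps depends on the workload, and the variable must be set before TensorFlow touches the GPU):

import os

# Must be set before TensorFlow initializes the GPU, i.e. before importing it.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf  # imported after setting the variable on purpose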

【Problem 9】 Attempting to perform a BLAS operation using StreamExecutor without BLAS support, while multiple kernels are running
2023-03-08 22:56:44.493180: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-03-08 22:56:44.493261: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at matmul_op_impl.h:442 : Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/training.py", line 1975, in fit_generator
    return self.fit(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/training.py", line 1134, in fit
    data_handler = data_adapter.get_data_handler(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 1383, in get_data_handler
    return DataHandler(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 1138, in __init__
    self._adapter = adapter_cls(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 917, in __init__
    super(KerasSequenceAdapter, self).__init__(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 801, in __init__
    model.distribute_strategy.run(
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1286, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2849, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3632, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
    return func(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/data_adapter.py", line 802, in <lambda>
    lambda x: model(x, training=False), args=(concrete_x,))
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1037, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/keras/layers/einsum_dense.py", line 197, in call
    ret = tf.einsum(self.equation, inputs, self.kernel)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/ops/special_math_ops.py", line 751, in einsum
    return _einsum_v2(equation, *inputs, **kwargs)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/ops/special_math_ops.py", line 1180, in _einsum_v2
    return gen_linalg_ops.einsum(inputs, resolved_equation)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 1076, in einsum
    _ops.raise_from_not_ok_status(e, name)
  File "/home/xxx/miniconda3/envs/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 6941, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:Einsum]

【Solution】

Reference: InternalError: Attempting to perform BLAS operation using StreamExecutor without BLAS support

【TensorFlow error log】 InternalError: Attempting to perform BLAS operation using StreamExecutor without BL

"Attempting to perform BLAS operation using StreamExecutor without BLAS support" error occurs
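
The common thread in these references is that cuBLAS fails to create its handle when the GPU's memory has already been fully claimed, either by another process or by TensorFlow's default strategy of pre-allocating everything. A minimal sketch of the mitigation those threads usually suggest, run before any model code:

import tensorflow as tf

# Allocate GPU memory on demand instead of claiming it all at startup,
# leaving room for cuBLAS to initialize its handle.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)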

【Problem 10】 Masking in TensorFlow

The following articles may be useful for reference:

Study notes on masking and padding in TensorFlow Keras

Masking and padding (padding & mask) in TensorFlow 2, plus dynamic_rnn study notes

A code walkthrough of masks in TensorFlow 2.0
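
As a minimal taste of the topic (my sketch, not taken from those articles): an Embedding layer with mask_zero=True treats token id 0 as padding and propagates a boolean mask that mask-aware layers such as LSTM consume automatically:

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8, mask_zero=True)
inputs = tf.constant([[3, 7, 0, 0],   # two real tokens, two padding positions
                      [5, 2, 9, 0]])

mask = embedding.compute_mask(inputs)
print(mask)
# tf.Tensor(
# [[ True  True False False]
#  [ True  True  True False]], shape=(2, 4), dtype=bool)

# A mask-aware layer skips the padded steps when aggregating.
out = tf.keras.layers.LSTM(4)(embedding(inputs), mask=mask)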

【Problem 11】 High GPU memory usage but low GPU utilization during TensorFlow training

Reference: TensorFlow high GPU memory usage with low GPU utilization

【Solutions】 the problem of high GPU memory share but low GPU utilization in TensorFlow

First, check the GPU usage:

nvidia-smi -l 5  # refresh every 5 seconds

A summary of the approaches I have seen online:

【1】 Increase batch_size (applies when memory usage is low + GPU utilization is low)

【2】 Decrease batch_size (applies when memory usage is high + GPU utilization is low)

【3】 A problem in the code:

During data preprocessing, the CPU reads data from disk into memory and the GPU reads it from memory to train on; if the CPU cannot keep up with the GPU's read-and-train speed, the GPU sits idle for long stretches waiting for data; see the input-pipeline sketch after the references below.

Reference: TensorFlow learning: why GPU memory is full while utilization (util) stays low, and how to improve it

Training inefficient? GPU utilization won't go up? Come and see other people's tricks~
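
For the input-pipeline bottleneck described in 【3】, the standard remedy is to overlap CPU preprocessing with GPU training via tf.data. A sketch with made-up file names and preprocessing:

import tensorflow as tf

def parse_example(path):
    # Hypothetical preprocessing: decode an image file into a training tensor.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, (224, 224)) / 255.0

dataset = (tf.data.Dataset.list_files('data/*.jpg')                  # made-up file pattern
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU preprocessing
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                              # prepare the next batch while the GPU trains

# model.fit(dataset, epochs=10)  # the GPU no longer idles waiting for data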
