Using tf.data.Dataset objects in TensorFlow 2 (a summary of common functions)

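tf.data.Dataset is a high-level interface for building and transforming datasets. It is especially helpful when working with large datasets, and it is the data-handling approach recommended by the official TensorFlow documentation.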

1 Imports

import tensorflow as tf
import numpy as np

2 Creating a Dataset

Creating from a list

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7])
print(dataset)
"""Output:
<TensorSliceDataset shapes: (), types: tf.int32>
"""

Iterating over the data

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6, 7])
for ele in dataset:
    print(ele,"   ",ele.numpy())
"""
tf.Tensor(1, shape=(), dtype=int32)     1
tf.Tensor(2, shape=(), dtype=int32)     2
tf.Tensor(3, shape=(), dtype=int32)     3
tf.Tensor(4, shape=(), dtype=int32)     4
tf.Tensor(5, shape=(), dtype=int32)     5
tf.Tensor(6, shape=(), dtype=int32)     6
tf.Tensor(7, shape=(), dtype=int32)     7
 """

Creating from a nested (multi-dimensional) list

dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4], [5, 6]])
for ele in dataset:
    print(ele.numpy())
"""
[1 2]
[3 4]
[5 6]
"""

Creating from a dictionary

dataset_dic = tf.data.Dataset.from_tensor_slices({'a': [1, 2, 3, 4],
                                                  'b': [6, 7, 8, 9],
                                                  'c': [12, 13, 14, 15]})
for ele in dataset_dic:
    print(ele)

{'a': <tf.Tensor: id=60, shape=(), dtype=int32, numpy=1>, 'b': <tf.Tensor: id=61, shape=(), dtype=int32, numpy=6>, 'c': <tf.Tensor: id=62, shape=(), dtype=int32, numpy=12>}
{'a': <tf.Tensor: id=66, shape=(), dtype=int32, numpy=2>, 'b': <tf.Tensor: id=67, shape=(), dtype=int32, numpy=7>, 'c': <tf.Tensor: id=68, shape=(), dtype=int32, numpy=13>}
{'a': <tf.Tensor: id=72, shape=(), dtype=int32, numpy=3>, 'b': <tf.Tensor: id=73, shape=(), dtype=int32, numpy=8>, 'c': <tf.Tensor: id=74, shape=(), dtype=int32, numpy=14>}
{'a': <tf.Tensor: id=78, shape=(), dtype=int32, numpy=4>, 'b': <tf.Tensor: id=79, shape=(), dtype=int32, numpy=9>, 'c': <tf.Tensor: id=80, shape=(), dtype=int32, numpy=15>}

Creating from a NumPy array

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7,100]))
# Take the first four elements
for ele in dataset.take(4):
    print(ele.numpy())

Getting the first element

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7,100]))
print(next(iter(dataset.take(1))))

<tf.Tensor: id=109, shape=(), dtype=int64, numpy=1>

Creating from a tuple. This is the form used most often in practice, for example when building a Dataset from (feature, label) pairs.

feature = np.array([[1, 2], [3, 4], [5, 6]])
print("feature shape:", feature.shape)
label = np.array(['pig', 'dog', 'cat'])
print("label shape:", label.shape)
mydataset = tf.data.Dataset.from_tensor_slices((feature, label))
# Note: as_numpy_iterator() is not available in TF 2.0; a newer version is required
for element_numpy in mydataset.as_numpy_iterator():
    print(element_numpy)

feature shape: (3, 2)
label shape: (3,)
(array([1, 2]), b'pig')
(array([3, 4]), b'dog')
(array([5, 6]), b'cat')

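3 Shuffling the data

shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)

buffer_size: the number of elements held in the shuffle buffer, from which the next element is drawn at random
seed: the random seed
reshuffle_each_iteration: whether the data is reshuffled every time the dataset is iterated over

In general, shuffle(buffer_size) is used to randomize the order of the data so that samples are not presented in the same fixed order on every training pass. buffer_size sets the size of the shuffle buffer and is usually set to a large value; for a fully uniform shuffle it should be at least as large as the dataset.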
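To see why buffer_size matters, the following is a simplified pure-Python model of a shuffle buffer. It is only a sketch of the general idea (fill a buffer, emit a random element, refill the freed slot); the function name buffered_shuffle is hypothetical, and this is not TensorFlow's actual implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of a shuffle buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Drain phase: the input is exhausted, so empty the buffer at random.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


data = list(range(1, 11))

# A buffer of size 1 cannot reorder anything: the input order comes back.
print(list(buffered_shuffle(data, buffer_size=1)))

# A small buffer only mixes elements locally.
print(list(buffered_shuffle(data, buffer_size=3, seed=0)))

# A buffer as large as the data allows any ordering.
print(list(buffered_shuffle(data, buffer_size=10, seed=0)))
```

In this model an element can never be emitted more than buffer_size - 1 positions earlier than where it started, which is why a small buffer on sorted data leaves the data mostly sorted.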

# Shuffle randomly
dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.shuffle(7)
for ele in dataset:
    print(ele)

tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
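The seed and reshuffle_each_iteration parameters listed above can be illustrated with a short sketch. It assumes a TensorFlow 2 installation; the variable names are only for illustration. With reshuffle_each_iteration=False the data is shuffled once and the same order is replayed on every pass, whereas the default behaviour draws a new order each time.

```python
import numpy as np
import tensorflow as tf

base = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))

# Shuffle once and replay the same order on every pass over the dataset.
fixed = base.shuffle(7, seed=42, reshuffle_each_iteration=False)
first_pass = [int(x.numpy()) for x in fixed]
second_pass = [int(x.numpy()) for x in fixed]
print(first_pass)
print(second_pass)  # same order as first_pass

# Default behaviour: a new order is drawn each time the dataset is iterated.
reshuffled = base.shuffle(7)
print([int(x.numpy()) for x in reshuffled])
print([int(x.numpy()) for x in reshuffled])  # usually a different order
```

A fixed order can be handy for debugging, while reshuffling on every pass is generally what is wanted between training epochs.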

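4 Setting the batch size

batch(batch_size, drop_remainder=False) combines consecutive elements of the dataset into batches.

batch_size: the number of elements per batch
drop_remainder: whether to drop the final batch when it contains fewer than batch_size elements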

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7,8,9,10]))
dataset = dataset.batch(3)
for ele in dataset:
    print(ele)

tf.Tensor([1 2 3], shape=(3,), dtype=int64)
tf.Tensor([4 5 6], shape=(3,), dtype=int64)
tf.Tensor([7 8 9], shape=(3,), dtype=int64)
tf.Tensor([10], shape=(1,), dtype=int64)
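The output above keeps the short final batch [10] because drop_remainder defaults to False. A minimal sketch of the other setting, assuming TensorFlow 2 (variable names are illustrative):

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(1, 11))

# Default: the short final batch is kept.
with_remainder = dataset.batch(3)
# drop_remainder=True: every batch has exactly 3 elements.
full_only = dataset.batch(3, drop_remainder=True)

kept = [b.numpy().tolist() for b in with_remainder]
dropped = [b.numpy().tolist() for b in full_only]
print(kept)     # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
print(dropped)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Dropping the remainder also makes the batch dimension statically known.
print(with_remainder.element_spec.shape)  # batch dimension is unknown
print(full_only.element_spec.shape)       # batch dimension is 3
```

Dropping the remainder discards a little data on each pass, but it is useful when downstream code requires a fixed batch shape.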

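5 Repeating the data

repeat(count=None) repeats the dataset; count is the number of times the data is read. For example, if the original data is {1,2}, then after repeat(2) it becomes {1,2,1,2}. When count is omitted, the dataset repeats indefinitely.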

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4,5])).repeat(3).batch(3)
for ele in dataset:
    print(ele)

tf.Tensor([1 2 3], shape=(3,), dtype=int64)
tf.Tensor([4 5 1], shape=(3,), dtype=int64)
tf.Tensor([2 3 4], shape=(3,), dtype=int64)
tf.Tensor([5 1 2], shape=(3,), dtype=int64)
tf.Tensor([3 4 5], shape=(3,), dtype=int64)
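In the output above, batches such as [4 5 1] straddle the boundary between two passes because repeat was applied before batch. The sketch below, assuming TensorFlow 2, contrasts the two orderings and shows how take can bound an endless repeat(); the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

source = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5]))

# repeat -> batch: batches may mix the end of one pass with the next.
repeat_then_batch = [b.numpy().tolist() for b in source.repeat(2).batch(3)]
print(repeat_then_batch)  # [[1, 2, 3], [4, 5, 1], [2, 3, 4], [5]]

# batch -> repeat: each pass is batched on its own.
batch_then_repeat = [b.numpy().tolist() for b in source.batch(3).repeat(2)]
print(batch_then_repeat)  # [[1, 2, 3], [4, 5], [1, 2, 3], [4, 5]]

# repeat() with no argument never ends, so bound it with take().
endless_prefix = [int(x.numpy()) for x in source.repeat().take(12)]
print(endless_prefix)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2]
```

Which ordering is preferable depends on whether batches are allowed to span pass boundaries.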

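6 Mapping the data

map(map_func, num_parallel_calls=None) applies map_func to every element of the dataset and returns a new dataset.

map_func: the function applied to each element

num_parallel_calls: the number of elements to process in parallel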

Example 1

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.map(tf.square)
for ele in dataset:
    print(ele.numpy())

Example 2

dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.map(lambda x:x+1)
for ele in dataset:
    print(ele.numpy())

Example 3

def re_xxx(x):  # Define the processing function (cubes its input)
    return x*x*x
dataset = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3, 4, 5, 6, 7]))
dataset = dataset.map(re_xxx)
for ele in dataset:
    print(ele.numpy())
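The examples above map over single values. When the dataset holds (feature, label) tuples, map_func receives one argument per tuple component. The following is a minimal sketch assuming TensorFlow 2; the scale function, the sample data, and the choice of num_parallel_calls=2 are illustrative assumptions rather than anything prescribed by the API.

```python
import numpy as np
import tensorflow as tf

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)
labels = np.array([0, 1, 0], dtype=np.int64)


def scale(feature, label):
    """Rescale the feature vector and pass the label through unchanged."""
    return feature / 10.0, label


dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Process up to two elements in parallel; the element order is preserved
# by default.
dataset = dataset.map(scale, num_parallel_calls=2)

results = [(f.numpy(), int(l.numpy())) for f, l in dataset]
for feature, label in results:
    print(feature, label)
```

Parallel calls mainly pay off when map_func is expensive, such as decoding or augmenting images; for arithmetic this cheap the difference is negligible.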

7 Concatenating datasets

A.concatenate(B) appends the elements of dataset B after the elements of dataset A.

dataset_A = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3])).shuffle(3)
dataset_B = tf.data.Dataset.from_tensor_slices(np.array([4, 5, 6])).shuffle(3)
dataset_AB = dataset_A.concatenate(dataset_B)
for ele in dataset_AB:
    print(ele.numpy())
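Because each dataset is shuffled separately before concatenation, the result is not a global shuffle: every element of A still comes before every element of B. A small sketch, assuming TensorFlow 2, makes this visible (variable names are illustrative):

```python
import numpy as np
import tensorflow as tf

dataset_a = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3])).shuffle(3)
dataset_b = tf.data.Dataset.from_tensor_slices(np.array([4, 5, 6])).shuffle(3)

values = [int(x.numpy()) for x in dataset_a.concatenate(dataset_b)]
print(values)

# The first three values are some ordering of A, the last three of B.
print(sorted(values[:3]))  # [1, 2, 3]
print(sorted(values[3:]))  # [4, 5, 6]
```

To mix the two sources together, shuffle after concatenating instead, for example dataset_a.concatenate(dataset_b).shuffle(6). Note also that the two datasets must have compatible element types and structure.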

8 Zipping into tuples

zip(datasets) combines several datasets into a new dataset whose elements are tuples, much like Python's built-in zip function.

dataset_fea = tf.data.Dataset.from_tensor_slices(np.array([1, 2, 3]))
dataset_lab = tf.data.Dataset.from_tensor_slices(np.array([4, 5, 6]))
datasets = tf.data.Dataset.zip((dataset_fea,dataset_lab))
for ele in datasets:
    print(ele)

(<tf.Tensor: shape=(), dtype=int64, numpy=1>, <tf.Tensor: shape=(), dtype=int64, numpy=4>)
(<tf.Tensor: shape=(), dtype=int64, numpy=2>, <tf.Tensor: shape=(), dtype=int64, numpy=5>)
(<tf.Tensor: shape=(), dtype=int64, numpy=3>, <tf.Tensor: shape=(), dtype=int64, numpy=6>)
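As with Python's built-in zip, the zipped dataset ends as soon as the shortest input is exhausted. A minimal sketch, assuming TensorFlow 2, with illustrative dataset names:

```python
import numpy as np
import tensorflow as tf

short_ds = tf.data.Dataset.from_tensor_slices(np.array([1, 2]))
long_ds = tf.data.Dataset.from_tensor_slices(np.array([10, 20, 30, 40]))

zipped = tf.data.Dataset.zip((short_ds, long_ds))
pairs = [(int(a.numpy()), int(b.numpy())) for a, b in zipped]
print(pairs)  # [(1, 10), (2, 20)]
```

The extra elements of the longer dataset are silently ignored, so it is worth checking that feature and label datasets have the same length before zipping them.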
