How to apply k-fold cross validation using tf.keras.utils.image_dataset_from_directory


Original title: How apply kfold cross validation using tf.keras.utils.image_dataset_from_directory

My goal is to train a VGG19 model with k-fold cross validation. To do that, I read the images from a directory with the following code:

import os
import pathlib
import tensorflow as tf

DIR = "/Images"
data_dir = pathlib.Path(os.getcwd()) / 'Images'

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(224, 224),
  batch_size=32)

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(224, 224),
  batch_size=32)

It works fine without k-fold cross validation. But when I want to use k-fold cross validation, I have to separate the labels from the images, and I could not find a way to do that without reading the images with a different method. So I decided to read the images with ImageDataGenerator's flow_from_directory instead. But as far as I understand, to load images with flow_from_directory I would need two separate subsets, training and test, and in my case I don't have training and test folders. Is there a solution with either of these methods?
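As a side note, flow_from_directory does not strictly require pre-split training/test folders: if validation_split is set on the ImageDataGenerator itself, its subset argument can split a single directory, much like image_dataset_from_directory does. A minimal sketch, assuming the same single Images directory with one sub-folder per class:

import tensorflow as tf

# Assumption: 'Images' is one directory with a sub-folder per class
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    validation_split=0.2)  # enables subset="training"/"validation" below

train_gen = img_gen.flow_from_directory(
    'Images', subset="training", seed=123, target_size=(224, 224))
val_gen = img_gen.flow_from_directory(
    'Images', subset="validation", seed=123, target_size=(224, 224))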

Also, the number of images found by the first method, tf.keras.utils.image_dataset_from_directory, differs from what flow_from_directory finds. This is the output of the first method:

Found 1060 files belonging to 4 classes. Using 848 files for training.

And this is the second method, with its output:

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    vertical_flip=True)
Wheat_data = img_gen.flow_from_directory(data_dir,
                                         subset="training",
                                         seed=123)

Found 849 images belonging to 4 classes.

Original link: https://stackoverflow.com//questions/71676222/how-apply-kfold-cross-validation-using-tf-keras-utils-image-dataset-from-directo

Replies

  • AloneTogether replied:

    You can convert the datasets to numpy arrays, and everything should work as usual:

    import tensorflow as tf
    import pathlib
    import numpy as np
    from sklearn.model_selection import KFold
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    
    batch_size = 32
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size)
    
    val_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="validation",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size)
    
    # Materialize the datasets as numpy arrays in a single pass so that
    # images and labels stay aligned (iterating a shuffled dataset twice
    # can pair them in two different orders)
    train_batches = list(train_ds)
    train_images = np.concatenate([x for x, y in train_batches])
    train_labels = np.concatenate([y for x, y in train_batches])
    
    val_batches = list(val_ds)
    val_images = np.concatenate([x for x, y in val_batches])
    val_labels = np.concatenate([y for x, y in val_batches])
    
    # Merge the two splits back into one pool so KFold can re-split it
    inputs = np.concatenate((train_images, val_images), axis=0)
    targets = np.concatenate((train_labels, val_labels), axis=0)
    
    kfold = KFold(n_splits=5, shuffle=True)
    
    for train, test in kfold.split(inputs, targets):
      
      model = tf.keras.Sequential([
          tf.keras.layers.Rescaling(1./255, input_shape=(180, 180, 3)),
          tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dense(5)])
    
      model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
      history = model.fit(inputs[train], targets[train],
                  batch_size=batch_size,
                  epochs=2)
      scores = model.evaluate(inputs[test], targets[test], verbose=0)
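
    The per-fold scores can then be collected and averaged, which is the usual way to report a cross-validation result. A minimal sketch of that pattern, with placeholder arrays standing in for the image data above:
    
    import numpy as np
    from sklearn.model_selection import KFold
    
    inputs = np.random.rand(20, 4)              # stand-in for the image array
    targets = np.random.randint(0, 4, size=20)  # stand-in for the labels
    
    kfold = KFold(n_splits=5, shuffle=True)
    fold_accuracies = []
    for train, test in kfold.split(inputs, targets):
        # ... build, compile and fit the model as above, then:
        scores = [0.0, 1.0]  # placeholder for model.evaluate -> [loss, accuracy]
        fold_accuracies.append(scores[1])
    print(f"mean CV accuracy: {np.mean(fold_accuracies):.4f}")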
    

    Or you can use tf.keras.utils.image_dataset_from_directory with a batch size of 1 and shuffle=False, though it is not very efficient:

    import tensorflow as tf
    import pathlib
    import numpy as np
    from sklearn.model_selection import KFold
    
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
    data_dir = pathlib.Path(data_dir)
    
    batch_size = 1
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="training",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size,
      shuffle = False)
    
    val_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.2,
      subset="validation",
      seed=123,
      image_size=(180, 180),
      batch_size=batch_size,
      shuffle = False)
    
    ds = train_ds.concatenate(val_ds)
    
    kfold = KFold(n_splits=5, shuffle=True)
    
    for train, test in kfold.split(np.arange(len(ds))):
      # skip(t).take(1) selects the single element at index t; the final map
      # strips the dummy batch dimension of 1. This relies on shuffle=False
      # above, which keeps the element order identical across iterations.
      train_ds = tf.data.Dataset.from_tensor_slices([ds.skip(t).take(1) for t in train]).flat_map(lambda x: x).map(lambda x, y: (x[0, ...], y[0, ...]))
      test_ds = tf.data.Dataset.from_tensor_slices([ds.skip(t).take(1) for t in test]).flat_map(lambda x: x).map(lambda x, y: (x[0, ...], y[0, ...]))
      train_ds = train_ds.batch(64, drop_remainder=True)
      test_ds = test_ds.batch(64, drop_remainder=True)
      
      model = tf.keras.Sequential([
          tf.keras.layers.Rescaling(1./255, input_shape=(180, 180, 3)),
          tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dense(5)])
    
      model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
      history = model.fit(train_ds,
                  epochs=2)
      scores = model.evaluate(test_ds, verbose=0)
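
    A note on why shuffle=False matters here: the skip/take indexing assumes the element order of ds is the same every time it is iterated; with shuffling enabled, the index-to-element mapping would change between passes. A toy illustration of how skip/take selects one element by index:
    
    import tensorflow as tf
    
    ds_toy = tf.data.Dataset.range(10)
    idx = 7
    # skip the first idx elements, then take one: the element at position idx
    element = next(iter(ds_toy.skip(idx).take(1)))
    print(element.numpy())  # 7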
    

    Another option is to use dictionaries to store the indices and tensors:

    #...
    ds = train_ds.concatenate(val_ds)
    
    # Build index -> tensor lookup tables, one entry per (size-1) batch
    lookup_images = {}
    lookup_labels = {}
    for i, (x, y) in enumerate(ds):
      lookup_images[i] = x
      lookup_labels[i] = y
    
    kfold = KFold(n_splits=5, shuffle=True)
    
    for train, test in kfold.split(np.arange(len(ds))):
    
      # Gather and stack the tensors belonging to this fold's indices
      images_train = np.concatenate(list(map(lookup_images.get, train)))
      labels_train = np.concatenate(list(map(lookup_labels.get, train)))
    
      images_test = np.concatenate(list(map(lookup_images.get, test)))
      labels_test = np.concatenate(list(map(lookup_labels.get, test)))
    
      model = tf.keras.Sequential([
          tf.keras.layers.Rescaling(1./255, input_shape=(180, 180, 3)),
          tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dense(5)])
    
      model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
      history = model.fit(images_train, labels_train, epochs=2)
      scores = model.evaluate(images_test, labels_test, verbose=0)
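
    Since this is a classification task, a stratified split that keeps the class proportions similar across folds may be preferable. A minimal sketch using sklearn's StratifiedKFold, assuming the lookup_labels dictionary built above (each value is a batch of size 1):
    
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    
    labels = np.array([y.numpy()[0] for y in lookup_labels.values()])
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    for train, test in skf.split(np.zeros(len(labels)), labels):
        # train/test are index arrays usable with the lookup dictionaries above
        pass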
    
    2 years ago · 0 comments