TensorFlow Data Validation (TFDV) fails on Google Cloud Dataflow with "Can't get attribute 'NumExamplesStatsGenerator'"

I am following the "Get started" TensorFlow tutorial on how to run TFDV with Apache Beam on Google Cloud Dataflow. My code is very similar to the code in the tutorial:

import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions

PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'

# downloaded locally with: pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
#(would be great to use it have it on cloud storage) PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'


# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'

setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2

print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")

The code above launches a Dataflow job, but unfortunately the job fails with the following error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>
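
For readers unfamiliar with this kind of failure, the following is a minimal sketch of how it can arise, using only the standard library `pickle` module rather than Beam's dill-based pickler (an assumption; both look classes up by module and name when loading). The module name `fake_stats_impl` is hypothetical and merely stands in for a library module whose contents differ between the machine that pickles and the machine that unpickles.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a library module such as stats_impl.
mod = types.ModuleType("fake_stats_impl")
sys.modules["fake_stats_impl"] = mod


class NumExamplesStatsGenerator:
    """Stand-in for a class that exists in only one release of a library."""


# Make the class look as if it were defined inside the fake module.
NumExamplesStatsGenerator.__module__ = "fake_stats_impl"
NumExamplesStatsGenerator.__qualname__ = "NumExamplesStatsGenerator"
mod.NumExamplesStatsGenerator = NumExamplesStatsGenerator

# "Submitting machine": the instance is pickled by reference to module + name.
payload = pickle.dumps(NumExamplesStatsGenerator())

# "Worker": the same module is installed, but without that class.
del mod.NumExamplesStatsGenerator

try:
    pickle.loads(payload)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The pickled bytes carry only the module path and class name, so loading succeeds only when the loading side has a module that still defines that name.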

I could not find anything similar on the internet. In case it helps, I have the following versions on my local machine (macOS):

Apache Beam version: 2.34.0
Tensorflow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0

On the cloud side, the job runs with the Apache Beam Python 3.8 SDK 2.34.0.

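Bonus question: my other question is about PATH_TO_WHL_FILE. I tried putting the wheel in a bucket, but Beam does not seem to be able to pick it up from there. It only works with a local file, which is a problem because it makes this code harder to distribute. What is a good practice for distributing this wheel file?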

Original link: https://stackoverflow.com//questions/71510943/tensorflow-data-validation-tfdv-fails-on-google-cloud-dataflow-with-cant-get-a

Replies

    ningk commented:

    Judging by its name, the attribute NumExamplesStatsGenerator is a generator that is not picklable.

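    However, I can no longer find that attribute in the module. A search shows that the module did contain this attribute in 1.4.0, so you may want to try a newer version of TFDV.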

    PATH_TO_WHL_FILE indicates a local file to stage/distribute to Dataflow for execution, so you can use a local file instead of a file on GCS.
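
The version hint in the reply can be checked against the values given in the question itself. The sketch below is only one way to do that: it reads the version field out of the wheel filename (the second dash-separated field in the standard wheel naming scheme) and compares it with the locally reported TFDV version. The helper `wheel_version` is a hypothetical name introduced for illustration, and the local version is hard-coded from the question rather than read from an installation.

```python
def wheel_version(wheel_path: str) -> str:
    """Return the version field of a wheel filename (name-version-...-.whl)."""
    filename = wheel_path.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # A wheel name has at least: distribution, version, python, abi, platform.
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel filename: {filename}")
    return parts[1]


# Values copied from the question.
PATH_TO_WHL_FILE = (
    "/Users/myuser/some-folder/"
    "tensorflow_data_validation-1.7.0-cp38-cp38-"
    "manylinux_2_12_x86_64.manylinux2010_x86_64.whl"
)
LOCAL_TFDV_VERSION = "1.4.0"  # "TFDV version" reported for the local machine

staged = wheel_version(PATH_TO_WHL_FILE)
versions_match = staged == LOCAL_TFDV_VERSION

print(f"wheel staged for workers: {staged}")
print(f"local TFDV version:       {LOCAL_TFDV_VERSION}")
print(f"versions match:           {versions_match}")
```

With the question's values, the staged wheel is 1.7.0 while the local installation is 1.4.0. If that mismatch is indeed the cause, aligning the two versions would be a reasonable first thing to try. In a real script the local version could be read with `importlib.metadata.version("tensorflow-data-validation")` instead of being hard-coded.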
