This repository was archived by the owner on Nov 11, 2022. It is now read-only.

Dataflow jobs using the SDK for Java 1.6.0 and reading compressed files from TextIO with compression mode set may be subject to data loss. #356

@dhalperi

We have identified an issue in which Dataflow jobs that read from TextIO with the compression type set to GZIP or BZIP2 may lose data during processing.

Specifically, using TextIO:

  • TextIO.from(...).withCompressionType(CompressionType.GZIP) or
  • TextIO.from(...).withCompressionType(CompressionType.BZIP2)

This issue is silent: you will not see any error messages or other visible symptoms. It occurs only when all of the following hold: the job uses the Dataflow SDK for Java 1.6.0, it reads compressed files, and the compression mode is set explicitly to GZIP or BZIP2 using withCompressionType.

Current known workarounds:

  • Recommended option: Use AUTO mode instead of GZIP or BZIP2 mode.

    Use withCompressionType(CompressionType.AUTO) with the TextIO source, or leave the compression type unset (AUTO is the default). NOTE: compressed files must have a .gz or .bz2 extension (case-insensitive) for this to work.

  • Switch to version 1.5.1 of the Dataflow SDK for Java. If you are using Maven, this can be done by specifying version 1.5.1 in your pom.xml.
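
Because the recommended workaround depends on file extensions, it may help to check input file names before switching to AUTO mode. The following is a minimal sketch in plain Java with no SDK dependency. The class name, method name, and sample paths are hypothetical, and the check only mirrors the rule stated above (.gz or .bz2, case-insensitive); it is not the SDK's actual detection code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class CompressionExtensionCheck {

    // Returns true when the name ends in .gz or .bz2, ignoring case.
    // This is the condition the issue gives for AUTO mode to treat
    // a file as compressed.
    static boolean hasRecognizedExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".bz2");
    }

    public static void main(String[] args) {
        // Sample paths are made up for the example.
        List<String> inputs = Arrays.asList(
            "gs://example-bucket/logs/part-0001.gz",
            "gs://example-bucket/logs/part-0002.BZ2",
            "gs://example-bucket/logs/part-0003.gzip");

        for (String name : inputs) {
            String verdict = hasRecognizedExtension(name)
                ? "extension recognized as compressed"
                : "extension not recognized as compressed";
            System.out.println(name + " -> " + verdict);
        }
    }
}
```

A compressed file whose name fails this check would need to be renamed before relying on AUTO mode; otherwise the second workaround (SDK version 1.5.1) may be the safer choice.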

We are actively working to resolve this and will update this issue with all developments.
