This repository was archived by the owner on Nov 11, 2022. It is now read-only.

Description
We have identified an issue in which Dataflow jobs that read from TextIO with the compression type set to GZIP or BZIP2 can lose data during processing.
Specifically, the issue affects TextIO sources configured with either of the following:
TextIO.from(...).withCompressionType(CompressionType.GZIP) or
TextIO.from(...).withCompressionType(CompressionType.BZIP2)
This issue is silent: you will not see any error messages or other visible symptoms. It occurs when all of the following are true: you are using the Dataflow SDK for Java 1.6.0, you are reading compressed files, and you set the compression mode with withCompressionType to either GZIP or BZIP2.
Current known workarounds:
- Recommended option: Use AUTO mode instead of GZIP or BZIP2 mode. Use withCompressionType(CompressionType.AUTO) with the TextIO source, or leave the compression type unset (AUTO is the default). NOTE: compressed files must have a .gz or .bz2 extension (case-insensitive) for this to work.
- Switch to version 1.5.1 of the Dataflow SDK for Java. If you are using Maven, you can do this by specifying version 1.5.1 in your pom.xml.
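Before relying on AUTO mode, it may help to confirm that every compressed input file has a .gz or .bz2 extension, since the note above says AUTO mode depends on it. The following is a minimal sketch of such a pre-flight check using only the Java standard library. The class and method names (CompressedInputCheck, hasDetectableExtension, withoutDetectableExtension) are hypothetical and are not part of the Dataflow SDK. The check assumes the rule is exactly "file name ends in .gz or .bz2, ignoring case", as stated in the note.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical pre-flight check (not part of the Dataflow SDK).
 * Assumption: AUTO mode recognizes a compressed file only when its name
 * ends in .gz or .bz2, compared case-insensitively.
 */
final class CompressedInputCheck {
    private CompressedInputCheck() {}

    /** Returns true if the file name ends in .gz or .bz2, ignoring case. */
    static boolean hasDetectableExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".bz2");
    }

    /** Returns the file names that do not end in .gz or .bz2, in input order. */
    static List<String> withoutDetectableExtension(List<String> fileNames) {
        List<String> flagged = new ArrayList<>();
        for (String name : fileNames) {
            if (!hasDetectableExtension(name)) {
                flagged.add(name);
            }
        }
        return flagged;
    }
}
```

Passing the list of compressed input file names to withoutDetectableExtension returns the names that would need to be renamed before switching to AUTO mode. An empty result means every name satisfies the extension rule assumed here.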
We are actively working to resolve this and will update this issue with all developments.