This repository was archived by the owner on Nov 11, 2022. It is now read-only.
Releases · GoogleCloudPlatform/DataflowJavaSDK
Version 1.6.1
- Fixed an issue with Dataflow jobs reading from `TextIO` with compression type set to `GZIP` or `BZIP2`. For more information, see Issue #356.
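For context, reading compressed text in the 1.x SDK looks roughly like the following sketch (the bucket path and pipeline setup are hypothetical):

```java
// Hedged sketch against the Dataflow Java SDK 1.x API; the GCS path is a
// placeholder. Reads gzipped text files with an explicit compression type.
PCollection<String> lines = pipeline.apply(
    TextIO.Read
        .from("gs://example-bucket/logs/*.gz")   // hypothetical input path
        .withCompressionType(TextIO.CompressionType.GZIP));
```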
Version 1.6.0
- Added `InProcessPipelineRunner`, an improvement over the `DirectPipelineRunner` that better implements the Dataflow model. `InProcessPipelineRunner` runs on a user's local machine and supports multithreaded execution, unbounded `PCollection`s, and triggers for speculative and late outputs.
- Added display data, which allows annotating user functions (`DoFn`, `CombineFn`, and `WindowFn`), `Source`s, and `Sink`s with static metadata to be displayed in the Dataflow Monitoring Interface. Display data has been implemented for core components and is automatically applied to all `PipelineOptions`.
- Added the ability to compose multiple `CombineFn`s into a single `CombineFn` using `CombineFns.compose` or `CombineFns.composeKeyed`.
- Added the methods `getSplitPointsConsumed` and `getSplitPointsRemaining` to the `BoundedReader` API to improve Dataflow's ability to automatically scale a job reading from these sources. Default implementations of these methods have been provided, but reader implementers should override them to provide better information when available.
- Improved performance of side inputs when using workers with many cores.
- Improved efficiency when using `CombineFnWithContext`.
- Fixed several issues related to stability in streaming mode.
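Composing combiners might look roughly like this sketch (the tags, the identity extract function, and the input collection are illustrative assumptions, not code from this release):

```java
// Hedged sketch: compute a max and a sum over the same input in one pass
// using a composed CombineFn. Values are later retrieved from the
// CoCombineResult by tag, e.g. result.get(maxTag).
TupleTag<Integer> maxTag = new TupleTag<Integer>();
TupleTag<Integer> sumTag = new TupleTag<Integer>();
SimpleFunction<Integer, Integer> identity =
    new SimpleFunction<Integer, Integer>() {
      @Override public Integer apply(Integer x) { return x; }
    };
PCollection<CoCombineResult> combined = numbers.apply(
    Combine.globally(CombineFns.compose()
        .with(identity, new Max.MaxIntegerFn(), maxTag)
        .with(identity, new Sum.SumIntegerFn(), sumTag)));
```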
Version 1.5.1
- Fixed an issue that hid `BigtableIO.Read.withRowFilter`, which allows Cloud Bigtable rows to be filtered in the `Read` transform.
- Fixed support for concatenated GZip files.
- Fixed an issue that prevented `Write.to` from being used with merging windows.
- Fixed an issue that caused excessive triggering with repeated composite triggers.
- Fixed an issue with merging windows and triggers that finish before the end of the window.
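A filtered Bigtable read might be sketched as follows (the table ID, regex, and surrounding configuration are hypothetical; the exact builder chain may differ by SDK minor version):

```java
// Hedged sketch: push a row-key filter into the Bigtable read so only
// matching rows are returned to the pipeline.
RowFilter filter = RowFilter.newBuilder()
    .setRowKeyRegexFilter(ByteString.copyFromUtf8("user-.*"))  // hypothetical pattern
    .build();
PCollection<Row> rows = pipeline.apply(
    BigtableIO.read()
        .withTableId("my-table")   // hypothetical table
        .withRowFilter(filter));
```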
Version 1.5.0
With this release, we have begun preparing the Dataflow SDK for Java for an eventual move to Apache Beam (incubating). Specifically, we have refactored a number of internal APIs and removed from the SDK classes used only within the worker, which will now be provided by the Google Cloud Dataflow Service during job execution. This refactoring should not affect any user code.
Additionally, the 1.5.0 release includes the following changes:
- Enabled an indexed side input format for batch pipelines executed on the Google Cloud Dataflow service. Indexed side inputs significantly increase performance for `View.asList`, `View.asMap`, `View.asMultimap`, and any non-globally-windowed `PCollectionView`s.
- Upgraded to Protocol Buffers version `3.0.0-beta-1`. If you use custom Protocol Buffers, you should recompile them with the corresponding version of the `protoc` compiler. You can continue using both version 2 and 3 of the Protocol Buffers syntax, and no user pipeline code needs to change.
- Added `ProtoCoder`, which is a `Coder` for Protocol Buffers messages that supports both version 2 and 3 of the Protocol Buffers syntax. This coder can detect when messages can be encoded deterministically. `Proto2Coder` is now deprecated; we recommend that all users switch to `ProtoCoder`.
- Added `withoutResultFlattening` to `BigQueryIO.Read` to disable flattening of query results when reading from BigQuery.
- Added `BigtableIO`, enabling support for reading from and writing to Google Cloud Bigtable.
- Improved `CompressedSource` to detect the compression format according to the file extension. Added support for reading `.gz` files that are transparently decompressed by the underlying transport logic.
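Attaching the new coder might be sketched as below; `MyMessage` stands in for any generated Protocol Buffers message class, and `ParseMessageFn` is a hypothetical `DoFn`:

```java
// Hedged sketch: ProtoCoder works with both proto2 and proto3 generated
// classes; setCoder attaches it to the resulting PCollection.
PCollection<MyMessage> messages = input
    .apply(ParDo.of(new ParseMessageFn()))        // hypothetical DoFn
    .setCoder(ProtoCoder.of(MyMessage.class));
```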
Version 1.4.0
- Added a series of batch and streaming example pipelines in a mobile gaming domain that illustrate some advanced topics, including windowing and triggers.
- Added support for `Combine` functions to access pipeline options and side inputs through a context. See `GlobalCombineFn` and `PerKeyCombineFn` for further details.
- Modified `ParDo.withSideInputs()` such that successive calls are cumulative.
- Modified automatic coder detection of Protocol Buffers messages; such classes now have their coders provided automatically.
- Added support for limiting the number of results returned by `DatastoreIO.Source`. However, when this limit is set, the operation that reads from Cloud Datastore is performed by a single worker rather than executing in parallel across the worker pool.
- Modified the definition of `PaneInfo.{EARLY, ON_TIME, LATE}` so that panes with only late data are always `LATE`, and an `ON_TIME` pane can never cause a later computation to yield a `LATE` pane.
- Modified `GroupByKey` to drop late data when that late data arrives for a window that has expired. A window has expired when the end of the window has passed by more than the allowed lateness.
- When using `GlobalWindows`, you are no longer required to specify `withAllowedLateness()`, since no data is ever dropped.
- Added support for obtaining the default project ID from the default project configuration produced by newer versions of the `gcloud` utility. If the default project configuration does not exist, Dataflow reverts to using the old project configuration generated by older versions of the `gcloud` utility.
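The cumulative `withSideInputs()` behavior can be sketched as follows (the views and the `DoFn` are hypothetical names for illustration):

```java
// Hedged sketch: successive withSideInputs() calls now accumulate, so the
// DoFn below can read both side inputs.
PCollection<String> out = words.apply(
    ParDo.withSideInputs(maxLengthView)
         .withSideInputs(prefixView)      // adds to, rather than replaces
         .of(new FilterByLengthFn()));    // hypothetical DoFn using both views
```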
Version 1.3.0
- Improved `IterableLikeCoder` to efficiently encode small values. This change is backward compatible; however, if you have a running pipeline that was constructed with SDK version 1.3.0 or later, it may not be possible to "update" that pipeline with a replacement that was constructed using SDK version 1.2.1 or older. Updating a running pipeline with a pipeline constructed using a newer SDK version, however, should be successful.
- When `TextIO.Write` or `AvroIO.Write` outputs to a fixed number of files, added a reshard (shuffle) step immediately prior to the write step. The cost of this reshard is often outweighed by the additional parallelism available to the preceding stage.
- Added support for RFC 3339 timestamps in `PubsubIO`. This allows reading from Cloud Pub/Sub topics published by Cloud Logging without losing timestamp information.
- Improved memory management to help prevent pipelines in streaming execution mode from stalling when running with high memory utilization. This particularly benefits pipelines with large `GroupByKey` results.
- Added the ability to customize timestamps of emitted windows. Previously, the watermark was held to the earliest timestamp of any buffered input. With this change, you can choose a later time to allow the watermark to progress further. For example, using the end of the window prevents long-lived sessions from holding up output. See `Window.Bound.withOutputTime()`.
- Added a simplified syntax for early and late firings with an `AfterWatermark` trigger: `AfterWatermark.pastEndOfWindow().withEarlyFirings(...).withLateFirings(...)`.
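In context, the simplified trigger syntax reads roughly as in this sketch; the window size, allowed lateness, and the particular early/late sub-triggers are illustrative choices:

```java
// Hedged sketch: fire an early pane one minute after the first element,
// an on-time pane at the watermark, and a late pane per late element.
words.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(10)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardHours(1))
    .accumulatingFiredPanes());
```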
Version 1.2.1
- Fixed a regression in `BigQueryIO` that unnecessarily printed a lot of messages when executed using `DirectPipelineRunner`.
Version 1.2.0
- Added Java 8 support. Added new `MapElements` and `FlatMapElements` transforms that accept Java 8 lambdas, for those cases when the full power of `ParDo` is not required. `Filter` and `Partition` accept lambdas as well. Java 8 functionality is demonstrated in a new `MinimalWordCountJava8` example.
- Enabled `@DefaultCoder` annotations for generic types. Previously, a `@DefaultCoder` annotation on a generic type was ignored, resulting in diminished functionality and confusing error messages. It now works as expected.
- `DatastoreIO` now supports (parallel) reads within namespaces. Entities can be written to namespaces by setting the namespace in the `Entity` key.
- Limited the `slf4j-jdk14` dependency to the `test` scope. When a Dataflow job is executing, the `slf4j-api`, `slf4j-jdk14`, `jcl-over-slf4j`, `log4j-over-slf4j`, and `log4j-to-slf4j` dependencies will be provided by the system.
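The lambda-based transforms can be sketched as follows (the input collection is assumed); `withOutputType` supplies the type information that Java 8 lambdas erase:

```java
// Hedged sketch: map each word to its length using a Java 8 lambda
// instead of a full DoFn.
PCollection<Integer> lengths = words.apply(
    MapElements.via((String word) -> word.length())
        .withOutputType(new TypeDescriptor<Integer>() {}));
```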
Version 1.1.0
- Added a coder for type `Set<T>` to the coder registry, for when type `T` has its own registered coder.
- Added `NullableCoder`, which can be used in conjunction with other coders to encode a `PCollection` whose elements may contain null values.
- Added `Filter` as a composite `PTransform`. Deprecated the static methods in the old `Filter` implementation that return `ParDo` transforms.
- Added `SourceTestUtils`, which is a set of helper functions and test harnesses for testing the correctness of `Source` implementations.
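Wrapping a coder with `NullableCoder` might look like this sketch (`LookupFn` is a hypothetical `DoFn` that can emit nulls):

```java
// Hedged sketch: wrap the element coder so that null elements can be
// encoded rather than failing at runtime.
PCollection<String> maybeNullStrings = input
    .apply(ParDo.of(new LookupFn()))                  // hypothetical DoFn
    .setCoder(NullableCoder.of(StringUtf8Coder.of()));
```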
Version 1.0.0
- The initial General Availability (GA) version, open to all developers, and considered stable and fully qualified for production use. It coincides with the General Availability of the Dataflow Service.
- Removed the default values for `numWorkers`, `maxNumWorkers`, and similar settings. If these are unspecified, the Dataflow Service will pick an appropriate value.
- Added checks to `DirectPipelineRunner` to help ensure that `DoFn`s obey the existing requirement that inputs and outputs must not be modified.
- Added support in `AvroCoder` for `@Nullable` fields with deterministic encoding.
- Added a requirement that anonymous `CustomCoder` subclasses override the `getEncodingId` method.
- Changed `Source.Reader`, `BoundedSource.BoundedReader`, and `UnboundedSource.UnboundedReader` to be abstract classes instead of interfaces. `AbstractBoundedReader` has been merged into `BoundedSource.BoundedReader`.
- Renamed `ByteOffsetBasedSource` and `ByteOffsetBasedReader` to `OffsetBasedSource` and `OffsetBasedReader`, introducing `getBytesPerOffset` as a translation layer.
- Changed `OffsetBasedReader` such that subclasses now override `startImpl` and `advanceImpl` rather than `start` and `advance`. The protected variable `rangeTracker` is now hidden and updated by the base class automatically. To indicate split points, use the method `isAtSplitPoint`.
- Removed methods for adjusting watermark triggers.
- Removed an unnecessary generic parameter from `TimeTrigger`.
- Removed generation of empty panes unless explicitly requested.
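An Avro-coded type with a nullable field can be sketched as below (the class and field names are hypothetical; `@Nullable` here is the Avro reflection annotation):

```java
// Hedged sketch: a type encodable deterministically by AvroCoder even
// though one field may be null.
@DefaultCoder(AvroCoder.class)
class UserRecord {
  String id;
  @Nullable String nickname;  // may be absent in the encoded record

  // Avro reflection requires a no-arg constructor.
  UserRecord() {}
}
```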