This repository was archived by the owner on Nov 11, 2022. It is now read-only.
Releases · GoogleCloudPlatform/DataflowJavaSDK
Version 1.6.1
- Fixed an issue with Dataflow jobs reading from `TextIO` with compression type set to `GZIP` or `BZIP2`. For more information, see Issue #356.
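For context, reading compressed text in the 1.x SDK looks roughly like the following sketch (the bucket path and pipeline setup are hypothetical):

```java
// Hedged sketch against the Dataflow Java SDK 1.x API; the GCS path is a
// placeholder. Reads gzipped text files with an explicit compression type.
PCollection<String> lines = pipeline.apply(
    TextIO.Read
        .from("gs://example-bucket/logs/*.gz")   // hypothetical input path
        .withCompressionType(TextIO.CompressionType.GZIP));
```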
Version 1.6.0
- Added `InProcessPipelineRunner`, an improvement over the `DirectPipelineRunner` that better implements the Dataflow model. `InProcessPipelineRunner` runs on a user's local machine and supports multithreaded execution, unbounded `PCollection`s, and triggers for speculative and late outputs.
- Added display data, which allows annotating user functions (`DoFn`, `CombineFn`, and `WindowFn`), `Source`s, and `Sink`s with static metadata to be displayed in the Dataflow Monitoring Interface. Display data has been implemented for core components and is automatically applied to all `PipelineOptions`.
- Added the ability to compose multiple `CombineFn`s into a single `CombineFn` using `CombineFns.compose` or `CombineFns.composeKeyed`.
- Added the methods `getSplitPointsConsumed` and `getSplitPointsRemaining` to the `BoundedReader` API to improve Dataflow's ability to automatically scale a job reading from these sources. Default implementations of these methods have been provided, but reader implementers should override them to provide better information when available.
- Improved performance of side inputs when using workers with many cores.
- Improved efficiency when using `CombineFnWithContext`.
- Fixed several issues related to stability in streaming mode.
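Composing combiners might look roughly like this sketch (the tags, the identity extract function, and the input collection are illustrative assumptions, not code from this release):

```java
// Hedged sketch: compute a max and a sum over the same input in one pass
// using a composed CombineFn. Values are later retrieved from the
// CoCombineResult by tag, e.g. result.get(maxTag).
TupleTag<Integer> maxTag = new TupleTag<Integer>();
TupleTag<Integer> sumTag = new TupleTag<Integer>();
SimpleFunction<Integer, Integer> identity =
    new SimpleFunction<Integer, Integer>() {
      @Override public Integer apply(Integer x) { return x; }
    };
PCollection<CoCombineResult> combined = numbers.apply(
    Combine.globally(CombineFns.compose()
        .with(identity, new Max.MaxIntegerFn(), maxTag)
        .with(identity, new Sum.SumIntegerFn(), sumTag)));
```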
Version 1.5.1
- Fixed an issue that hid `BigtableIO.Read.withRowFilter`, which allows Cloud Bigtable rows to be filtered in the `Read` transform.
- Fixed support for concatenated GZip files.
- Fixed an issue that prevented `Write.to` from being used with merging windows.
- Fixed an issue that caused excessive triggering with repeated composite triggers.
- Fixed an issue with merging windows and triggers that finish before the end of the window.
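A filtered Bigtable read might be sketched as follows (the table ID, regex, and surrounding configuration are hypothetical; the exact builder chain may differ by SDK minor version):

```java
// Hedged sketch: push a row-key filter into the Bigtable read so only
// matching rows are returned to the pipeline.
RowFilter filter = RowFilter.newBuilder()
    .setRowKeyRegexFilter(ByteString.copyFromUtf8("user-.*"))  // hypothetical pattern
    .build();
PCollection<Row> rows = pipeline.apply(
    BigtableIO.read()
        .withTableId("my-table")   // hypothetical table
        .withRowFilter(filter));
```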
Version 1.5.0
With this release, we have begun preparing the Dataflow SDK for Java for an eventual move to Apache Beam (incubating). Specifically, we have refactored a number of internal APIs and removed from the SDK classes used only within the worker, which will now be provided by the Google Cloud Dataflow Service during job execution. This refactoring should not affect any user code.
Additionally, the 1.5.0 release includes the following changes:
- Enabled an indexed side input format for batch pipelines executed on the Google Cloud Dataflow service. Indexed side inputs significantly increase performance for `View.asList`, `View.asMap`, `View.asMultimap`, and any non-globally-windowed `PCollectionView`s.
- Upgraded to Protocol Buffers version `3.0.0-beta-1`. If you use custom Protocol Buffers, you should recompile them with the corresponding version of the `protoc` compiler. You can continue using both version 2 and 3 of the Protocol Buffers syntax, and no user pipeline code needs to change.
- Added `ProtoCoder`, which is a `Coder` for Protocol Buffers messages that supports both version 2 and 3 of the Protocol Buffers syntax. This coder can detect when messages can be encoded deterministically. `Proto2Coder` is now deprecated; we recommend that all users switch to `ProtoCoder`.
- Added `withoutResultFlattening` to `BigQueryIO.Read` to disable flattening of query results when reading from BigQuery.
- Added `BigtableIO`, enabling support for reading from and writing to Google Cloud Bigtable.
- Improved `CompressedSource` to detect the compression format according to the file extension. Added support for reading `.gz` files that are transparently decompressed by the underlying transport logic.
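Attaching the new coder might be sketched as below; `MyMessage` stands in for any generated Protocol Buffers message class, and `ParseMessageFn` is a hypothetical `DoFn`:

```java
// Hedged sketch: ProtoCoder works with both proto2 and proto3 generated
// classes; setCoder attaches it to the resulting PCollection.
PCollection<MyMessage> messages = input
    .apply(ParDo.of(new ParseMessageFn()))        // hypothetical DoFn
    .setCoder(ProtoCoder.of(MyMessage.class));
```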
Version 1.4.0
- Added a series of batch and streaming example pipelines in a mobile gaming domain that illustrate some advanced topics, including windowing and triggers.
- Added support for `Combine` functions to access pipeline options and side inputs through a context. See `GlobalCombineFn` and `PerKeyCombineFn` for further details.
- Modified `ParDo.withSideInputs()` such that successive calls are cumulative.
- Modified automatic coder detection of Protocol Buffers messages; such classes now have their coders provided automatically.
- Added support for limiting the number of results returned by `DatastoreIO.Source`. However, when this limit is set, the operation that reads from Cloud Datastore is performed by a single worker rather than executing in parallel across the worker pool.
- Modified the definition of `PaneInfo.{EARLY, ON_TIME, LATE}` so that panes with only late data are always `LATE`, and an `ON_TIME` pane can never cause a later computation to yield a `LATE` pane.
- Modified `GroupByKey` to drop late data when that late data arrives for a window that has expired. A window has expired when the end of the window has passed by more than the allowed lateness.
- When using `GlobalWindows`, you are no longer required to specify `withAllowedLateness()`, since no data is ever dropped.
- Added support for obtaining the default project ID from the default project configuration produced by newer versions of the `gcloud` utility. If the default project configuration does not exist, Dataflow reverts to using the old project configuration generated by older versions of the `gcloud` utility.
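The cumulative `withSideInputs()` behavior can be sketched as follows (the views and the `DoFn` are hypothetical names for illustration):

```java
// Hedged sketch: successive withSideInputs() calls now accumulate, so the
// DoFn below can read both side inputs.
PCollection<String> out = words.apply(
    ParDo.withSideInputs(maxLengthView)
         .withSideInputs(prefixView)      // adds to, rather than replaces
         .of(new FilterByLengthFn()));    // hypothetical DoFn using both views
```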
Version 1.3.0
- Improved `IterableLikeCoder` to efficiently encode small values. This change is backward compatible; however, if you have a running pipeline that was constructed with SDK version 1.3.0 or later, it may not be possible to "update" that pipeline with a replacement that was constructed using SDK version 1.2.1 or older. Updating a running pipeline with a pipeline constructed using a newer SDK version, however, should be successful.
- When `TextIO.Write` or `AvroIO.Write` outputs to a fixed number of files, added a reshard (shuffle) step immediately prior to the write step. The cost of this reshard is often outweighed by the additional parallelism available to the preceding stage.
- Added support for RFC 3339 timestamps in `PubsubIO`. This allows reading from Cloud Pub/Sub topics published by Cloud Logging without losing timestamp information.
- Improved memory management to help prevent pipelines in streaming execution mode from stalling when running with high memory utilization. This particularly benefits pipelines with large `GroupByKey` results.
- Added the ability to customize timestamps of emitted windows. Previously, the watermark was held to the earliest timestamp of any buffered input. With this change, you can choose a later time to allow the watermark to progress further. For example, using the end of the window prevents long-lived sessions from holding up output. See `Window.Bound.withOutputTime()`.
- Added a simplified syntax for early and late firings with an `AfterWatermark` trigger: `AfterWatermark.pastEndOfWindow().withEarlyFirings(...).withLateFirings(...)`.
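In context, the simplified trigger syntax reads roughly as in this sketch; the window size, allowed lateness, and the particular early/late sub-triggers are illustrative choices:

```java
// Hedged sketch: fire an early pane one minute after the first element,
// an on-time pane at the watermark, and a late pane per late element.
words.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(10)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardHours(1))
    .accumulatingFiredPanes());
```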
Version 1.2.1
- Fixed a regression in `BigQueryIO` that unnecessarily printed a lot of messages when executed using `DirectPipelineRunner`.
Version 1.2.0
- Added Java 8 support. Added new `MapElements` and `FlatMapElements` transforms that accept Java 8 lambdas, for those cases when the full power of `ParDo` is not required. `Filter` and `Partition` accept lambdas as well. Java 8 functionality is demonstrated in a new `MinimalWordCountJava8` example.
- Enabled `@DefaultCoder` annotations for generic types. Previously, a `@DefaultCoder` annotation on a generic type was ignored, resulting in diminished functionality and confusing error messages. It now works as expected.
- `DatastoreIO` now supports (parallel) reads within namespaces. Entities can be written to namespaces by setting the namespace in the `Entity` key.
- Limited the `slf4j-jdk14` dependency to the `test` scope. When a Dataflow job is executing, the `slf4j-api`, `slf4j-jdk14`, `jcl-over-slf4j`, `log4j-over-slf4j`, and `log4j-to-slf4j` dependencies will be provided by the system.
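The lambda-based transforms can be sketched as follows (the input collection is assumed); `withOutputType` supplies the type information that Java 8 lambdas erase:

```java
// Hedged sketch: map each word to its length using a Java 8 lambda
// instead of a full DoFn.
PCollection<Integer> lengths = words.apply(
    MapElements.via((String word) -> word.length())
        .withOutputType(new TypeDescriptor<Integer>() {}));
```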
Version 1.1.0
- Added a coder for type `Set<T>` to the coder registry, for when type `T` has its own registered coder.
- Added `NullableCoder`, which can be used in conjunction with other coders to encode a `PCollection` whose elements may contain null values.
- Added `Filter` as a composite `PTransform`. Deprecated the static methods in the old `Filter` implementation that return `ParDo` transforms.
- Added `SourceTestUtils`, which is a set of helper functions and test harnesses for testing the correctness of `Source` implementations.
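Wrapping a coder with `NullableCoder` might look like this sketch (`LookupFn` is a hypothetical `DoFn` that can emit nulls):

```java
// Hedged sketch: wrap the element coder so that null elements can be
// encoded rather than failing at runtime.
PCollection<String> maybeNullStrings = input
    .apply(ParDo.of(new LookupFn()))                  // hypothetical DoFn
    .setCoder(NullableCoder.of(StringUtf8Coder.of()));
```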
Version 1.0.0
- The initial General Availability (GA) version, open to all developers, and considered stable and fully qualified for production use. It coincides with the General Availability of the Dataflow Service.
- Removed the default values for `numWorkers`, `maxNumWorkers`, and similar settings. If these are unspecified, the Dataflow Service will pick an appropriate value.
- Added checks to `DirectPipelineRunner` to help ensure that `DoFn`s obey the existing requirement that inputs and outputs must not be modified.
- Added support in `AvroCoder` for `@Nullable` fields with deterministic encoding.
- Added a requirement that anonymous `CustomCoder` subclasses override the `getEncodingId` method.
- Changed `Source.Reader`, `BoundedSource.BoundedReader`, and `UnboundedSource.UnboundedReader` to be abstract classes instead of interfaces. `AbstractBoundedReader` has been merged into `BoundedSource.BoundedReader`.
- Renamed `ByteOffsetBasedSource` and `ByteOffsetBasedReader` to `OffsetBasedSource` and `OffsetBasedReader`, introducing `getBytesPerOffset` as a translation layer.
- Changed `OffsetBasedReader` such that subclasses now override `startImpl` and `advanceImpl` rather than `start` and `advance`. The protected variable `rangeTracker` is now hidden and updated by the base class automatically. To indicate split points, use the method `isAtSplitPoint`.
- Removed methods for adjusting watermark triggers.
- Removed an unnecessary generic parameter from `TimeTrigger`.
- Removed generation of empty panes unless explicitly requested.
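An Avro-coded type with a nullable field can be sketched as below (the class and field names are hypothetical; `@Nullable` here is the Avro reflection annotation):

```java
// Hedged sketch: a type encodable deterministically by AvroCoder even
// though one field may be null.
@DefaultCoder(AvroCoder.class)
class UserRecord {
  String id;
  @Nullable String nickname;  // may be absent in the encoded record

  // Avro reflection requires a no-arg constructor.
  UserRecord() {}
}
```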