
MSSQLToGCSOperator throws PyArrow Error when exporting from a query with Bit Fields #57461

@juarezr

Description


Apache Airflow version

3.1.1

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

MSSQLToGCSOperator fails with ArrowTypeError: Expected bytes, got a 'bool' object when exporting MSSQL bit fields to Parquet format due to incorrect type mapping in the type_map property.

Actual Behavior

The task fails with the following error:

ArrowTypeError: Expected bytes, got a 'bool' object

Full stack trace:

File "/opt/airflow/pyarrow/table.pxi", line 1968, in pyarrow.lib._Tabular.from_pydict
File "/opt/airflow/pyarrow/table.pxi", line 6303, in pyarrow.lib._from_pydict
File "/opt/airflow/pyarrow/array.pxi", line 400, in pyarrow.lib.asarray
File "/opt/airflow/pyarrow/array.pxi", line 370, in pyarrow.lib.array
File "/opt/airflow/pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "/opt/airflow/pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "/opt/airflow/pyarrow/error.pxi", line 92, in pyarrow.lib.check_status

What you think should happen instead?

The operator should successfully export MSSQL bit fields to Parquet format, converting them to boolean types that PyArrow can handle.

How to reproduce

  1. Create a DAG using MSSQLToGCSOperator with a table containing bit fields
  2. Set export_format="parquet"
  3. Include bit fields in the bit_fields parameter
  4. Execute the DAG

Example DAG:

from airflow.providers.google.cloud.transfers.mssql_to_gcs import MSSQLToGCSOperator

extract_task = MSSQLToGCSOperator(
    task_id="extract_to_gcs",
    mssql_conn_id="mssql_default",
    sql="SELECT id, name, isactive FROM dbo.DriverGroup",  # isactive is BIT type
    gcp_conn_id="google_cloud_default",
    bucket="my-bucket",
    filename="data.parquet",
    export_format="parquet",
    bit_fields=["isactive"],  # This causes the error
)

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.12.4
apache-airflow-providers-common-compat==1.8.0
apache-airflow-providers-common-io==1.6.3
apache-airflow-providers-common-messaging==2.0.0
apache-airflow-providers-common-sql==1.28.1
apache-airflow-providers-docker==4.4.3
apache-airflow-providers-google==18.0.0
apache-airflow-providers-grpc==3.8.2
apache-airflow-providers-microsoft-azure==12.7.1
apache-airflow-providers-microsoft-mssql==4.3.2
apache-airflow-providers-odbc==4.10.2
apache-airflow-providers-openlineage==2.7.2
apache-airflow-providers-standard==1.9.0

Deployment

Docker-Compose

Deployment details

Tested in a local Docker container.

Root Cause Analysis

Cause

The issue is in the MSSQLToGCSOperator class at line 70:

# Current (incorrect) implementation:
class MSSQLToGCSOperator(BaseSQLToGCSOperator):
    type_map = {2: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
    #              ^^^^^^^^^
    #              This should be "BOOL"

The BaseSQLToGCSOperator._convert_parquet_schema() method only recognizes the key "BOOL" for boolean columns, but MSSQLToGCSOperator maps bit fields (type 2) to "BOOLEAN". Because "BOOLEAN" is not a key in the base class map, the bit column falls back to the default string type used for unmapped keys, and PyArrow then raises ArrowTypeError when the boolean row values are written into that string-typed column:

    # Type map in base class BaseSQLToGCSOperator:

    def _convert_parquet_schema(self, cursor):
        type_map = {
            "INTEGER": pa.int64(),
            "FLOAT": pa.float64(),
            "NUMERIC": pa.float64(),
            "BIGNUMERIC": pa.float64(),
            "BOOL": pa.bool_(),  ## This should be the correct key instead of BOOLEAN
            "STRING": pa.string(),
            "BYTES": pa.binary(),
            "DATE": pa.date32(),
            "DATETIME": pa.date64(),
            "TIMESTAMP": pa.timestamp("s"),
        }
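
Because the bit column ends up declared as a string in the Parquet schema, PyArrow fails as soon as it receives Python bool values for that column. A minimal standalone sketch of the underlying PyArrow behavior, outside Airflow and using the isactive bit column from the example above:

import pyarrow as pa

# Schema as produced by the string fallback: the bit column is declared as string
schema = pa.schema([("isactive", pa.string())])

# Row values coming back from MSSQL for a bit column are Python booleans
rows = {"isactive": [True, False, True]}

# Raises pyarrow.lib.ArrowTypeError: Expected bytes, got a 'bool' object
table = pa.Table.from_pydict(rows, schema=schema)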

Proposed Fix

Change line 70 in airflow/providers/google/cloud/transfers/mssql_to_gcs.py:

# Fix:
type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
#              ^^^^^^
#              Changed from "BOOLEAN" to "BOOL"
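
As a quick sketch of why this one-key change is sufficient, using only an excerpt of the base class map shown above:

import pyarrow as pa

# Excerpt of the BaseSQLToGCSOperator parquet type map shown above
parquet_type_map = {"BOOL": pa.bool_(), "INTEGER": pa.int64(), "STRING": pa.string()}

# Before the fix: "BOOLEAN" misses the map and falls back to the string default,
# which later triggers the ArrowTypeError when bool values are written
parquet_type_map.get("BOOLEAN", pa.string())  # -> string

# After the fix: "BOOL" resolves to a proper boolean Arrow type
parquet_type_map.get("BOOL", pa.string())     # -> bool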

This was tested with Airflow v3.1.1 and Python 3.12.

Anything else?

Workaround

Users can create a custom operator to override the type mapping:

from airflow.providers.google.cloud.transfers.mssql_to_gcs import MSSQLToGCSOperator

class FixedMSSQLToGCSOperator(MSSQLToGCSOperator):
    type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}

# Use FixedMSSQLToGCSOperator instead of MSSQLToGCSOperator
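
For example, the repro DAG above only needs to swap the operator class; the connection IDs, bucket, and table names below are the same illustrative values used in the example DAG:

extract_task = FixedMSSQLToGCSOperator(
    task_id="extract_to_gcs",
    mssql_conn_id="mssql_default",
    sql="SELECT id, name, isactive FROM dbo.DriverGroup",  # isactive is BIT type
    gcp_conn_id="google_cloud_default",
    bucket="my-bucket",
    filename="data.parquet",
    export_format="parquet",
    bit_fields=["isactive"],
)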

Impact

  • Severity: Low - Blocks Parquet export functionality for MSSQL bit fields
  • Affected Users: All users exporting MSSQL bit fields to Parquet
  • Workaround Available: Yes (custom operator)
  • Data Loss Risk: No (workarounds preserve data integrity)

Additional Context

This bug affects the core functionality of MSSQL to GCS data transfer when using the Parquet format. The issue is specific to bit field handling and doesn't affect other data types or export formats (CSV and JSON exports work correctly).

Files affected:

  • airflow/providers/google/cloud/transfers/mssql_to_gcs.py (line 70)

Related Issues and PRs:

Related components:

  • BaseSQLToGCSOperator._convert_parquet_schema()
  • PyArrow schema conversion
  • MSSQL bit field type mapping

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
