-
Notifications
You must be signed in to change notification settings - Fork 15.9k
Description
Apache Airflow version
3.1.1
If "Other Airflow 2/3 version" selected, which one?
No response
What happened?
MSSQLToGCSOperator fails with ArrowTypeError: Expected bytes, got a 'bool' object when exporting MSSQL bit fields to Parquet format due to incorrect type mapping in the type_map property.
Actual Behavior
The task fails with the following error:
ArrowTypeError: Expected bytes, got a 'bool' object
Full stack trace:
File "/opt/airflow/pyarrow/table.pxi", line 1968, in pyarrow.lib._Tabular.from_pydict
File "/opt/airflow/pyarrow/table.pxi", line 6303, in pyarrow.lib._from_pydict
File "/opt/airflow/pyarrow/array.pxi", line 400, in pyarrow.lib.asarray
File "/opt/airflow/pyarrow/array.pxi", line 370, in pyarrow.lib.array
File "/opt/airflow/pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "/opt/airflow/pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "/opt/airflow/pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
What you think should happen instead?
The operator should successfully export MSSQL bit fields to Parquet format, converting them to boolean types that PyArrow can handle.
How to reproduce
- Create a DAG using
MSSQLToGCSOperatorwith a table containing bit fields - Set
export_format="parquet" - Include bit fields in the
bit_fieldsparameter - Execute the DAG
Example DAG:
from airflow.providers.google.cloud.transfers.mssql_to_gcs import MSSQLToGCSOperator
extract_task = MSSQLToGCSOperator(
task_id="extract_to_gcs",
mssql_conn_id="mssql_default",
sql="SELECT id, name, isactive FROM dbo.DriverGroup", # isactive is BIT type
gcp_conn_id="google_cloud_default",
bucket="my-bucket",
filename="data.parquet",
export_format="parquet",
bit_fields=["isactive"], # This causes the error
)Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
apache-airflow-providers-celery==3.12.4
apache-airflow-providers-common-compat==1.8.0
apache-airflow-providers-common-io==1.6.3
apache-airflow-providers-common-messaging==2.0.0
apache-airflow-providers-common-sql==1.28.1
apache-airflow-providers-docker==4.4.3
apache-airflow-providers-google==18.0.0
apache-airflow-providers-grpc==3.8.2
apache-airflow-providers-microsoft-azure==12.7.1
apache-airflow-providers-microsoft-mssql==4.3.2
apache-airflow-providers-odbc==4.10.2
apache-airflow-providers-openlineage==2.7.2
apache-airflow-providers-standard==1.9.0
Deployment
Docker-Compose
Deployment details
Tested in local docker containter.
Root Cause Analysis
Cause
The issue is in the MSSQLToGCSOperator class at line 70:
# Current (incorrect) implementation:
class MSSQLToGCSOperator(BaseSQLToGCSOperator):
type_map = {2: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
# ^^^^^^^^
# This should be "BOOL"The BaseSQLToGCSOperator._convert_parquet_schema() method expects "BOOL" for boolean types, but MSSQLToGCSOperator maps bit fields (type 2) to "BOOLEAN". This mismatch causes PyArrow to fail when converting the schema:
# Type map in base class BaseSQLToGCSOperator:
def _convert_parquet_schema(self, cursor):
type_map = {
"INTEGER": pa.int64(),
"FLOAT": pa.float64(),
"NUMERIC": pa.float64(),
"BIGNUMERIC": pa.float64(),
"BOOL": pa.bool_(), ## This should be the correct key instead of BOOLEAN
"STRING": pa.string(),
"BYTES": pa.binary(),
"DATE": pa.date32(),
"DATETIME": pa.date64(),
"TIMESTAMP": pa.timestamp("s"),
}Proposed Fix
Change line 70 in airflow/providers/google/cloud/transfers/mssql_to_gcs.py:
# Fix:
type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
# ^^^^
# Changed from "BOOLEAN" to "BOOL"This was tested with airflow version v3.1.1 with python3.12.
Anything else?
Workaround
Users can create a custom operator to override the type mapping:
from airflow.providers.google.cloud.transfers.mssql_to_gcs import MSSQLToGCSOperator
class FixedMSSQLToGCSOperator(MSSQLToGCSOperator):
type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
# Use FixedMSSQLToGCSOperator instead of MSSQLToGCSOperatorImpact
- Severity: Low - Blocks Parquet export functionality for MSSQL bit fields
- Affected Users: All users exporting MSSQL bit fields to Parquet
- Workaround Available: Yes (custom operator)
- Data Loss Risk: No (workarounds preserve data integrity)
Additional Context
This bug affects the core functionality of MSSQL to GCS data transfer when using Parquet format. The issue is specific to bit field handling and doesn't affect other data types or export formats (CSV, JSON work correctly).
Files affected:
airflow/providers/google/cloud/transfers/mssql_to_gcs.py(line 70)
Related Issues and PRs:
- MSSQLToBigQuery incorrect type conversion bit-->integer #11874
- Add "BOOLEAN" to type map of MSSQLToGCSOperator and fix incorrect BIT->INT conversion #29902
Related components:
BaseSQLToGCSOperator._convert_parquet_schema()- PyArrow schema conversion
- MSSQL bit field type mapping
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct