HIVE_CURSOR_ERROR in Athena when reading parquet files written by pandas
In a recent project, a colleague asked me to look at a HIVE_CURSOR_ERROR in Athena that they weren't able to get rid of. Since the error message was not particularly helpful and the way this error arises is not that uncommon, I thought writing this up may help you, dear reader, and future me when I inevitably forget about it again.
The error message is as follows:
HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://bucket/table/file.parquet
It doesn’t tell us much, and neither does the AWS documentation at the time of writing this. Let’s figure out where it’s coming from.
Diagnosis
The original query was a simple select * from data_table limit 10, nothing fancy. This was a strong indicator that something in the table configuration didn't match the underlying data in the parquet file.
As a first step, we tried reading individual columns like this:
select column_a
from data_table
limit 10
Depending on which column we selected, the query either returned the same error or the expected result, which pointed to a mismatch between the data types in the underlying parquet file and the table definition.
The next step was downloading and inspecting the parquet file indicated in the error message. There's a very useful Python tool called parquet-tools that we can leverage to gain access to the parquet metadata.
$ pip install parquet-tools
$ parquet-tools inspect my_suspicious.parquet
[...]
############ Column(column_a) ############
name: column_a
path: column_a
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)
[...]
parquet-tools inspects the metadata and shows us the underlying data type of each column. Here it turned out that the columns were of type DOUBLE in parquet, whereas the Glue Data Catalog described them as BIGINT. That explains why Athena couldn't read the file, although the error message could certainly be improved.
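If you want to compare these parquet types against what the Glue Data Catalog expects without clicking through the console, a short boto3 sketch like the one below does the job; the database and table names here are placeholders, so substitute your own.
import boto3

# "my_database" is a placeholder - use your own database and table name
glue = boto3.client("glue")
response = glue.get_table(DatabaseName="my_database", Name="data_table")

# Print the column name and type the table definition expects
for column in response["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])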
This leaves the question of why there is a DOUBLE when the table defines a BIGINT. In this case, BIGINT was the correct data type because the column is supposed to contain nullable integers, i.e., integers or null. So what happened here?
Root Cause Analysis
The parquet files are written through pandas/pyarrow, and the pandas library is what's causing our problem here. When the data is requested from an API, it's converted into a list of dictionaries, where each attribute/column contains either an integer or null (None in Python). To turn this into a parquet file, we create a pandas DataFrame from that list of dictionaries.
Here, something interesting is happening. When pandas notices the None/null in the data, it replaces it with numpy.nan, which has the data type float, since integers in the underlying NumPy array can't be null.
>>> import numpy as np
>>> type(np.nan)
<class 'float'>
When building the DataFrame, pandas must choose a type that fits all the data in the column. Since there is at least one floating point number in there, and integers can be represented as floats, it chooses the underlying double (a more precise float) data type for the parquet.
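We can see this inference in action with a quick check in the interpreter, using the same sample values as in the script below:
>>> import pandas as pd
>>> pd.Series([1, None, 3, 4]).dtype
dtype('float64')
float64 is what ends up as DOUBLE in the parquet file.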
This is how we can create the my_suspicious.parquet file:
import pandas as pd

def main():
    data = {
        "column_a": [1, None, 3, 4],
    }
    frame = pd.DataFrame(data)
    # column_a is inferred as float64, so it ends up as DOUBLE in the parquet file
    frame.to_parquet("my_suspicious.parquet", index=False)

if __name__ == "__main__":
    main()
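If you'd rather not install parquet-tools, pyarrow (which pandas used to write the file) can also print the schema; a minimal sketch, with the output truncated to the relevant line:
>>> import pyarrow.parquet as pq
>>> pq.read_schema("my_suspicious.parquet")
column_a: double
[...]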
Fixing it
Since the desire to have nullable integers is not that uncommon, the pandas documentation offers ways to work around this. We just need to cast the column to pandas' special nullable integer type like this:
import pandas as pd

def main():
    data = {
        "column_a": [1, None, 3, 4],
    }
    frame = pd.DataFrame(data)
    # Cast to pandas' nullable integer type so the parquet column becomes INT64
    frame.column_a = frame.column_a.astype(pd.Int64Dtype())
    frame.to_parquet("my_suspicious.parquet", index=False)
    # Assert that our column is now of type integer
    assert pd.api.types.is_integer_dtype(frame.column_a)

if __name__ == "__main__":
    main()
If we run parquet-tools on the resulting file again, we can see that it now has the proper data type INT64, which is compatible with Athena's BIGINT.
$ parquet-tools inspect my_suspicious.parquet
[...]
############ Column(column_a) ############
name: column_a
path: column_a
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -1%)
Under the hood, pandas uses the pandas.NA value to represent the nulls, so the remaining integers can stay in an integer-typed numpy array instead of being upcast to float.
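A quick look at such a nullable column in the interpreter shows the sentinel in action:
>>> import pandas as pd
>>> s = pd.Series([1, None, 3, 4], dtype=pd.Int64Dtype())
>>> s.dtype
Int64Dtype()
>>> s[1]
<NA>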
Conclusion
This HIVE_CURSOR_ERROR may indicate a mismatch between the table's expected data type and the parquet file's underlying data type. I showed you how to diagnose and fix this issue if it's caused by nullable integers in pandas.
— Maurice