fix(expr-common): Coerce to Decimal(20, 0) when combining UInt64 with signed integers #14223
… signed integers Previously, when combining UInt64 with any signed integer, the resulting type would be Int64, which would result in lost information. Now, combining UInt64 with a signed integer results in a Decimal(20, 0), which is able to encode all (64-bit) integer types.
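The range argument can be checked with plain integer arithmetic. The sketch below is not DataFusion code: it models a Decimal128(20, 0) value as an `i128` whose magnitude is below 10^20, and the names `MAX_DECIMAL_20_0`, `fits_decimal_20_0` and `as_int64` are hypothetical helpers introduced only for illustration.

```rust
/// Largest magnitude representable with 20 decimal digits and scale 0
/// (hypothetical constant, used only in this sketch).
const MAX_DECIMAL_20_0: i128 = 10_i128.pow(20) - 1;

/// Hypothetical helper: true if `v` fits in a decimal with precision 20 and scale 0.
fn fits_decimal_20_0(v: i128) -> bool {
    v.abs() <= MAX_DECIMAL_20_0
}

/// Models the old behaviour of forcing a UInt64 value into Int64.
/// `None` means the value cannot be represented, i.e. information is lost.
fn as_int64(v: u64) -> Option<i64> {
    i64::try_from(v).ok()
}
```

Under these assumptions, `fits_decimal_20_0` holds for `i64::MIN`, `i64::MAX` and `u64::MAX`, while `as_int64(10_000_000_000_000_000_000)` returns `None`.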
// accommodates all values of both types. Note that to avoid information
// loss when combining UInt64 with signed integers we use Decimal128(20, 0).
(Decimal128(20, 0), _)
| (_, Decimal128(20, 0))
This does not look correct. For example, combining Decimal128(20, 0) with Decimal128(30, 0) should not result in Decimal128(20, 0).
I think that when both types are decimal, they are handled above, before this match, by the call to the decimal_coercion function.
I rechecked it, and decimal_coercion already covers them in
datafusion/datafusion/expr-common/src/type_coercion/binary.rs
Lines 932 to 940 in 2f28327
fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
    use arrow::datatypes::DataType::*;
    // This conversion rule is from spark
    // https://github.com/apache/spark/blob/1c81ad20296d34f137238dadd67cc6ae405944eb/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L127
    match numeric_type {
        Int8 => Some(Decimal128(3, 0)),
        Int16 => Some(Decimal128(5, 0)),
        Int32 => Some(Decimal128(10, 0)),
        Int64 => Some(Decimal128(20, 0)),
Although it doesn't handle unsigned integer types, we can supplement it there, maybe as a follow-up PR.
So I think we shouldn't combine decimal with integer types here because decimal_coercion has already handled it.
I tried that initially, but then it would not handle UInt64 and Decimal128(20, 0):
Cannot infer common argument type for comparison operation UInt64 = Decimal128(20, 0)
So maybe it would be best to add new arms to coerce_numeric_type_to_decimal to include unsigned integers as well?
match numeric_type {
Int8 => Some(Decimal128(3, 0)),
Int16 => Some(Decimal128(5, 0)),
Int32 => Some(Decimal128(10, 0)),
Int64 => Some(Decimal128(20, 0)),
Float32 => Some(Decimal128(14, 7)),
Float64 => Some(Decimal128(30, 15)),
_ => None,
}
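One way the extra arms could look is sketched below. This is only an illustration, not the code merged in the PR: `DataType` here is a minimal local stand-in for the Arrow enum, and the precisions chosen for the unsigned types are an assumption based on the number of decimal digits in each type's maximum value.

```rust
/// Minimal stand-in for Arrow's `DataType`, limited to the variants used here.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
    Utf8,
    Decimal128(u8, i8),
}

fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
    use DataType::*;
    match numeric_type {
        // Existing arms, as quoted in the thread.
        Int8 => Some(Decimal128(3, 0)),
        Int16 => Some(Decimal128(5, 0)),
        Int32 => Some(Decimal128(10, 0)),
        Int64 => Some(Decimal128(20, 0)),
        // Assumed unsigned arms: precision = decimal digits of the type's MAX.
        UInt8 => Some(Decimal128(3, 0)),   // 255
        UInt16 => Some(Decimal128(5, 0)),  // 65_535
        UInt32 => Some(Decimal128(10, 0)), // 4_294_967_295
        UInt64 => Some(Decimal128(20, 0)), // 18_446_744_073_709_551_615
        Float32 => Some(Decimal128(14, 7)),
        Float64 => Some(Decimal128(30, 15)),
        // Non-numeric types have no decimal equivalent.
        _ => None,
    }
}
```

With arms like these, UInt64 and Decimal128(20, 0) resolve to the same decimal type, which is what the failing comparison above needs.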
"So maybe it would be best to add new arms to coerce_numeric_type_to_decimal to include unsigned integers as well?"
I think so.
Done.
Perhaps we should add a sqllogictest test for #14208.
Done.
LGTM, thank you @nuno-faria
Thanks again @nuno-faria - it is great to see you contributing ❤️
Thanks for the contribution @nuno-faria and the review @jonahgao
However, this change doesn't seem like a good one to me
// for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer)
// accommodates all values of both types. Note that to avoid information
// loss when combining UInt64 with signed integers we use Decimal128(20, 0).
(UInt64, Int64 | Int32 | Int16 | Int8)
I think this has potentially (large) performance implications.
My understanding is that this means Int64+Int64 will always produce a 128-bit result?
So even though performing Int64+Int64 will never overflow, all queries will pay the price of 2x space (and some time) overhead?
No, Int64+Int64 is not affected; it uses mathematics_numerical_coercion.
I did some validation on this PR branch.
DataFusion CLI v44.0.0
> create table test(a bigint, b bigint unsigned) as values(1,1);
0 row(s) fetched.
Elapsed 0.008 seconds.
> select arrow_typeof(a+b), arrow_typeof(a+a), arrow_typeof(a), arrow_typeof(b) from test;
+-------------------------------+-------------------------------+----------------------+----------------------+
| arrow_typeof(test.a + test.b) | arrow_typeof(test.a + test.a) | arrow_typeof(test.a) | arrow_typeof(test.b) |
+-------------------------------+-------------------------------+----------------------+----------------------+
| Int64 | Int64 | Int64 | UInt64 |
+-------------------------------+-------------------------------+----------------------+----------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.
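The distinction shown by this output can be summarised in a toy model. The sketch below is hypothetical and only mirrors the behaviour reported in this thread for the Int64/UInt64 pair: arithmetic keeps Int64, while operations that need a common type (comparison, union) use Decimal128(20, 0). The names `Ty`, `OpKind` and `result_type_sketch` are invented for the example and do not exist in DataFusion.

```rust
/// Toy type universe for this sketch (not Arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int64,
    UInt64,
    Decimal128(u8, i8),
}

/// Hypothetical split between the two coercion paths discussed in the thread.
#[derive(Debug, Clone, Copy)]
enum OpKind {
    /// `+`, `-`, `*`, ... (the path that uses mathematics_numerical_coercion).
    Arithmetic,
    /// Comparisons and unions, which need one type that holds both operands.
    CommonType,
}

/// Returns the result type for the pairs covered in the thread, `None` otherwise.
fn result_type_sketch(op: OpKind, lhs: Ty, rhs: Ty) -> Option<Ty> {
    use Ty::*;
    match (op, lhs, rhs) {
        // Identical types need no coercion on either path.
        (_, l, r) if l == r => Some(l),
        // Arithmetic on mixed signedness is unchanged by this PR.
        (OpKind::Arithmetic, Int64, UInt64) | (OpKind::Arithmetic, UInt64, Int64) => Some(Int64),
        // Common-type coercion now widens to avoid losing values.
        (OpKind::CommonType, Int64, UInt64) | (OpKind::CommonType, UInt64, Int64) => {
            Some(Decimal128(20, 0))
        }
        // Everything else is out of scope for this sketch.
        _ => None,
    }
}
```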
I really liked this way to verify the behavior. I made a PR with this type of test and verified that the tests still pass with the changes in this PR:
 04)------TableScan: aggregate_test_100 projection=[c1, c9]
-05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Int64) AS c9
+05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Decimal128(20, 0)) AS c9
This seems like a regression to me (2x the space is now needed).
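The 2x figure follows from the value widths: Arrow stores Decimal128 values as 128-bit integers, against 64 bits for Int64. A rough sketch of the value-buffer cost is below; it ignores validity bitmaps and any compression, and `column_bytes` is a hypothetical helper.

```rust
/// Bytes needed for the value buffer of `rows` fixed-width values of type `T`.
/// Hypothetical helper; validity bitmaps and padding are ignored.
fn column_bytes<T>(rows: usize) -> usize {
    rows * std::mem::size_of::<T>()
}
```

For example, `column_bytes::<i64>(1_000)` is 8,000 bytes and `column_bytes::<i128>(1_000)` is 16,000 bytes.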
This query unions Int16 with UInt64. We need to find a common type that can accommodate all possible values of these two types, such as -1 and u64::MAX. Although this increases the space used, it makes the following query work:
create table t1(a smallint) as values(1);
create table t2(a bigint unsigned) as values(10000000000000000000);
select * from t1 union select * from t2;
3

query I rowsort
select * from unsigned_bigint_test
I agree that overflow is more likely when multiplying (*) two 64-bit numbers, and automatically coercing to Decimal128 may make sense.
However, I would argue that if users care about avoiding overflows when doing integer arithmetic, they should use Decimal128 in their input types.
Overflow is unexpected in this case, as all these values are valid unsigned bigint values.
The problem is that in values (10000000000000000001), (1), 10000000000000000001 is parsed as UInt64 and 1 is parsed as Int64. They were coerced to Int64, which can't accommodate 10000000000000000001.
This is very similar to the union case mentioned above.
Thank you @nuno-faria and @jonahgao for the explanation. I agree this PR makes sense and my concerns did not apply.
Thanks @nuno-faria and @alamb for the review.
When we combine a |
We always parse
This is because, during type coercion, we only consider the data types of the binary operands and do not take their actual data into account. Some binary operations, such as union and comparisons, require that the operands have the same data type, so the coerced type must be a superset of both operand types and therefore can accommodate all possible operand values. For this case, maybe we can use statistical information to examine the operands' data and perform some physical optimizations.
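As a rough illustration of the statistics idea, the check could be as simple as asking whether the known maximum of the unsigned column fits in Int64. This is a hypothetical sketch only: `UInt64ColumnStats` and `can_use_int64` are invented names, and nothing here reflects an existing DataFusion optimizer rule.

```rust
/// Hypothetical column statistics; only the maximum matters for this check.
struct UInt64ColumnStats {
    /// Largest value in the column, if known.
    max: Option<u64>,
}

/// True when every value of the unsigned column is known to fit in Int64,
/// so a physical plan could, in principle, avoid widening to Decimal128(20, 0).
fn can_use_int64(stats: &UInt64ColumnStats) -> bool {
    match stats.max {
        Some(max) => i64::try_from(max).is_ok(),
        // Unknown statistics: stay with the wide type to remain correct.
        None => false,
    }
}
```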
This is a problem with a large potential performance impact in many use cases. I do not think it is a good idea to defer its solution with an open-ended timeframe. I see two ways to proceed:
I am fine with both ways forward.
Previously, when combining UInt64 with any signed integer, the resulting type would be Int64, which would result in lost information. Now, combining UInt64 with a signed integer results in a Decimal(20, 0), which is able to encode all (64-bit) integer types. Thanks @jonahgao for the pointers.
The function bitwise_coercion remains the same, since it's probably not a good idea to introduce decimals when performing bitwise operations. In this case, it converts (UInt64 | _) to UInt64.
Which issue does this PR close?
Closes #14208.
What changes are included in this PR?
binary_numeric_coercion in expr-common/type_coercion/binary.rs.
expr-common/type_coercion/binary.rs.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.