User Defined Coercion Rules #14296

alamb · 2025-01-25T17:38:48Z

Is your feature request related to a problem or challenge?

Coercion is (TODO find definition)

At the moment DataFusion has one set of built in coercion rules. However, with a single set of coercion rules, we'll always end up with conflicting requirements such as

fix(expr-common): Coerce to Decimal(20, 0) when combining UInt64 with signed integers #14223
Coercion between floating point types and exact numeric types should result in floating point type as the common super type #14272
DataFusion Regression (Starting in v43): Type Coercion for UDF Arguments #14230 from @shehabgamin

It also makes it hard, as @findepi has pointed out several times, to extend DataFusion with new data types / logica types.

My conclusion is we will never be able to have the behavior that works for everyone, and any change in coercion is likely to make some other tradeoff.

Rather than having to go back/forth in the core, I think an API would give us an escape hatch.

Describe the solution you'd like

While the user can in theory supply their own coercion rules by adding a new AnalyzerRule instead of the current [TypeCoercion] rule

datafusion/datafusion/optimizer/src/analyzer/type_coercion.rs

Line 62 in 18f14ab

pub struct TypeCoercion {}

I think the coercion code is called in so many places this would only be a partial fix

Is it a crazy idea if we tried to implement "user defined coercion rules" -- that way we can have whatever coercion rules are in the DataFusion core but when they inevitably are not quite what users want the users can override them

It would also force us to design the type coercion APIs more coherently which I think would be good for the code quality in general

Describe alternatives you've considered

I was imagining something like

struct MyCoercer {}

/// Implement custom coercion ehre
impl TypeCoercion for MyCoercer { 
...
}

// register coercion rules like
let ctx: SessionContext = SessionStateBuilder::new()
  .with_type_coercion_rules(Arc::new(MyCoercer{}));
...

The trait might look like

/// Defines coercion rules
pub trait TypeCoercion {
  /// Given the types of arguments to a comparison operation (`=`, `<`, etc) what single type should all arguments be cast to
  fn coerce_comparison(args: &[DataType]) -> DataType ;
 ...
}

Maybe there should be methods for Field instead of DataType 🤔

Additional context

No response

The text was updated successfully, but these errors were encountered:

jayzhan211 · 2025-01-26T02:24:56Z

It would be great if we can take LogicalType into consideration as well

shehabgamin · 2025-01-26T04:00:24Z

I like the idea of "user-defined coercion rules" and think we should take it a step further by striving for a paradigm of "user-defined configurability" in DataFusion.

Just as SQLParser has GenericDialect, which is permissive and general-purpose, applying a similar principle to components of DataFusion would provide greater flexibility and adaptability.

Attaching with_type_coercion_rules to a SessionContext is an interesting idea, but I worry this approach might not be flexible enough for nuanced situations. Functions, expressions, and other components often have unique coercion rules specific to their context or intended use cases. A one-size-fits-all approach at the SessionContext level might not address these needs, so a more granular mechanism to define and override coercion rules at a localized scope could be necessary.

For instance, with the built-in UDFs that DataFusion offers, it would be powerful if users could customize various components of a UDF.

Take the AsciiFunc as an example, both of the following signatures could be valid depending on user requirements:

Signature::string(1, Volatility::Immutable)

Signature::coercible(
    vec![TypeSignatureClass::Native(logical_string())],
    Volatility::Immutable,
)

The default implementation of AsciiFunc would prioritize being permissive and general-purpose:

impl AsciiFunc {
    pub fn new() -> Self {
        Self {
            signature: Signature::one_of(
                vec![
                    TypeSignature::Coercible(vec![TypeSignatureClass::Native(
                        logical_string(),
                    )]),
                ],
                Volatility::Immutable,
            ),
        }
    }
}

However, what if users could further customize the behavior of AsciiFunc like this?

let ascii = AsciiFunc::new()
    .with_signature(Signature::string(1, Volatility::Immutable))
    // Define other customizations as well
    .with_display_name(vec![lit("Meow")])
    .with_is_nullable(false)
    ...
    .with_...
    ...
    ?;

jayzhan211 · 2025-01-26T07:20:08Z

We have similar mechanism for aggregate function -- AggregateExprBuilder

let agg_expr = AggregateExprBuilder::new(func.to_owned(), physical_args.to_vec())
                        .order_by(ordering_reqs)
                        .schema(Arc::new(physical_input_schema.to_owned()))
                        .alias(name)
                        .with_ignore_nulls(ignore_nulls)
                        .with_distinct(*distinct)
                        .build()
                        .map(Arc::new)?;

We can also introduced similar builder for ScalarFunction, it does not exist because there is no such requirement so far.

jayzhan211 · 2025-01-26T07:23:07Z

We have type coercion in logical plan now, consider the case where we want to separate logical types and physical types, should we add another type coercion layer in physical optimizer?

Logical layer handle coercion that doesn't care about the difference between the decoding, cast i.e. Integer to String, while physical layer handle coercion that care about the decoding, i.e. cast Utf8 to Utf8View.

alamb · 2025-01-26T10:56:36Z

We have type coercion in logical plan now, consider the case where we want to separate logical types and physical types, should we add another type coercion layer in physical optimizer?

I think it would make more sense to insist that the types line up exactly (have been resolved) in the Physical plans / physical types. But I haven't really thought through the implications of type coercion with a logical / physical type split

alamb · 2025-01-26T10:59:16Z

For instance, with the built-in UDFs that DataFusion offers, it would be powerful if users could customize various components of a UDF.

That is an interesting idea @shehabgamin

I do want to point out that users can customize the entire UDF, by implementing their own version of ascii (including coercion rules via ScalarUDFImpl::coerce_types) and registering that instead.

However, using their own functions brings a substantial burden on to the user as you now have to maintain the functions in your downstream project 🤔

shehabgamin · 2025-01-26T11:38:47Z

I do want to point out that users can customize the entire UDF, by implementing their own version of ascii (including coercion rules via ScalarUDFImpl::coerce_types) and registering that instead.

However, using their own functions brings a substantial burden on to the user as you now have to maintain the functions in your downstream project 🤔

@alamb Right, for UDFs that overlap between Spark and DataFusion, Sail already customizes several versions that differ only slightly from the DataFusion implementation, simply because we need to align with Spark's expectations.

More importantly, beyond the burden this imposes, having the default behavior of various DataFusion components be permissive and general-purpose, while also offering flexibility for customization, ensures we don't have to worry about conflicting requirements.

alamb · 2025-01-26T11:54:32Z

@alamb Right, for UDFs that overlap between Spark and DataFusion, Sail already customizes several versions that differ only slightly from the DataFusion implementation, simply because we need to align with Spark's expectations.

BTW I think there are many people interested in spark compatible functions. Maybe now is time to actually start collaborating on a spark compatible UDF function library given the interest. I tried to restart the conversation here:

[DISCUSSION] Add separate crate to cover spark builtin functions #5600 (comment)

shehabgamin · 2025-01-26T12:07:39Z

BTW I think there are many people interested in spark compatible functions. Maybe now is time to actually start collaborating on a spark compatible UDF function library given the interest. I tried to restart the conversation here:
* [[DISCUSSION] Add separate crate to cover spark builtin functions #5600 (comment)](https://github.com/apache/datafusion/issues/5600#issuecomment-2614342453)

Sounds great! I'll chime in after the weekend.

Omega359 · 2025-01-26T15:13:32Z

BTW I think there are many people interested in spark compatible functions. Maybe now is time to actually start collaborating on a spark compatible UDF function library given the interest. I tried to restart the conversation here:
* [[DISCUSSION] Add separate crate to cover spark builtin functions #5600 (comment)](https://github.com/apache/datafusion/issues/5600#issuecomment-2614342453)

To really be able to customize the behaviour of UDF's I believe DataFusion will really need #13519 such that some of it could possibly just be driven by configuration.

Omega359 · 2025-01-26T15:26:52Z

There was an issue that was encountered the other month where the type to be inferred was context specific. For example, to_timestamp('-1'::varchar) should not attempt to coerce the varchar back to an i32 or some other data type. Thus, I think that we may need additional information beyond just simple type -> type.

jayzhan211 · 2025-01-28T09:00:29Z

We have type coercion in logical plan now, consider the case where we want to separate logical types and physical types, should we add another type coercion layer in physical optimizer?

I think it would make more sense to insist that the types line up exactly (have been resolved) in the Physical plans / physical types. But I haven't really thought through the implications of type coercion with a logical / physical type split

I used to think type coercion for physical types (arrow::DataType) should be handled at the physical layer. However, if we consider it from another angle, the key distinction between the logical plan and the physical plan is that the logical plan has no knowledge of the actual data, only the metadata. At the logical level, we deal with abstractions like the Schema or scalar values, without interacting with the RecordBatch data itself.

From this perspective, it makes more sense for type coercion to be part of the logical layer. Logical types can be seen as a more abstract or categorized version of physical types. A better way to describe this might be to think of them as "DataFusionType" and "ArrowType"—both of which are essentially logical concepts!

In this context, I guess we don't need LogicalSchema from #12622 as well but interact with the arrow::Schema directly

findepi · 2025-01-29T17:28:01Z

I am OK having user-defined coercion rules as long as they are applied reasonably in the DataFusion core.
In particular, the coercion rules should be applied by the analyzer, during the logical plan construction.
The optimizers should never perform any coercions. Period.
In particular, for DataFusion usage as a library in a system like ~~SDF~~ dbt compiler, where frontend is separate and DataFusion is handed over fully resolved logical plan nodes, the DF core should not invoke any coercion-related logic during plan optimizations or query processing. This is substantial for DataFusion to be a reliable building block for other systems (#12723).

alamb · 2025-01-30T22:47:19Z

In particular, the coercion rules should be applied by the analyzer, during the logical plan construction.

I agree with this goal 100%

alamb added the enhancement New feature or request label Jan 25, 2025

alamb mentioned this issue Jan 25, 2025

Fix Float and Decimal coercion #14273

Open

shehabgamin mentioned this issue Jan 26, 2025

Fix Type Coercion for UDF Arguments #14268

Open

alamb mentioned this issue Jan 26, 2025

[DISCUSSION] Add separate crate to cover spark builtin functions #5600

Open

findepi mentioned this issue Jan 30, 2025

Core: Fix UNION field nullability tracking #14356

Merged

shehabgamin mentioned this issue Feb 2, 2025

feat: Add datafusion-spark crate #14392

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

User Defined Coercion Rules #14296

User Defined Coercion Rules #14296

alamb commented Jan 25, 2025 •

edited

Loading

jayzhan211 commented Jan 26, 2025

shehabgamin commented Jan 26, 2025

jayzhan211 commented Jan 26, 2025

jayzhan211 commented Jan 26, 2025

alamb commented Jan 26, 2025

alamb commented Jan 26, 2025

shehabgamin commented Jan 26, 2025 •

edited

Loading

alamb commented Jan 26, 2025

shehabgamin commented Jan 26, 2025

Omega359 commented Jan 26, 2025

Omega359 commented Jan 26, 2025

jayzhan211 commented Jan 28, 2025 •

edited

Loading

findepi commented Jan 29, 2025

alamb commented Jan 30, 2025

User Defined Coercion Rules #14296

User Defined Coercion Rules #14296

Comments

alamb commented Jan 25, 2025 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

jayzhan211 commented Jan 26, 2025

shehabgamin commented Jan 26, 2025

jayzhan211 commented Jan 26, 2025

jayzhan211 commented Jan 26, 2025

alamb commented Jan 26, 2025

alamb commented Jan 26, 2025

shehabgamin commented Jan 26, 2025 • edited Loading

alamb commented Jan 26, 2025

shehabgamin commented Jan 26, 2025

Omega359 commented Jan 26, 2025

Omega359 commented Jan 26, 2025

jayzhan211 commented Jan 28, 2025 • edited Loading

findepi commented Jan 29, 2025

alamb commented Jan 30, 2025

alamb commented Jan 25, 2025 •

edited

Loading

shehabgamin commented Jan 26, 2025 •

edited

Loading

jayzhan211 commented Jan 28, 2025 •

edited

Loading