Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User Defined Coercion Rules #14296

Open
alamb opened this issue Jan 25, 2025 · 14 comments
Open

User Defined Coercion Rules #14296

alamb opened this issue Jan 25, 2025 · 14 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jan 25, 2025

Is your feature request related to a problem or challenge?

Coercion is (TODO find definition)

At the moment DataFusion has one set of built in coercion rules. However, with a single set of coercion rules, we'll always end up with conflicting requirements such as

It also makes it hard, as @findepi has pointed out several times, to extend DataFusion with new data types / logica types.

My conclusion is we will never be able to have the behavior that works for everyone, and any change in coercion is likely to make some other tradeoff.

Rather than having to go back/forth in the core, I think an API would give us an escape hatch.

Describe the solution you'd like

While the user can in theory supply their own coercion rules by adding a new AnalyzerRule instead of the current [TypeCoercion] rule

I think the coercion code is called in so many places this would only be a partial fix

Is it a crazy idea if we tried to implement "user defined coercion rules" -- that way we can have whatever coercion rules are in the DataFusion core but when they inevitably are not quite what users want the users can override them

It would also force us to design the type coercion APIs more coherently which I think would be good for the code quality in general

Describe alternatives you've considered

I was imagining something like

struct MyCoercer {}

/// Implement custom coercion ehre
impl TypeCoercion for MyCoercer { 
...
}

// register coercion rules like
let ctx: SessionContext = SessionStateBuilder::new()
  .with_type_coercion_rules(Arc::new(MyCoercer{}));
...

The trait might look like

/// Defines coercion rules
pub trait TypeCoercion {
  /// Given the types of arguments to a comparison operation (`=`, `<`, etc) what single type should all arguments be cast to
  fn coerce_comparison(args: &[DataType]) -> DataType ;
 ...
}

Maybe there should be methods for Field instead of DataType 🤔

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jan 25, 2025
@jayzhan211
Copy link
Contributor

It would be great if we can take LogicalType into consideration as well

@shehabgamin
Copy link
Contributor

I like the idea of "user-defined coercion rules" and think we should take it a step further by striving for a paradigm of "user-defined configurability" in DataFusion.

Just as SQLParser has GenericDialect, which is permissive and general-purpose, applying a similar principle to components of DataFusion would provide greater flexibility and adaptability.

Attaching with_type_coercion_rules to a SessionContext is an interesting idea, but I worry this approach might not be flexible enough for nuanced situations. Functions, expressions, and other components often have unique coercion rules specific to their context or intended use cases. A one-size-fits-all approach at the SessionContext level might not address these needs, so a more granular mechanism to define and override coercion rules at a localized scope could be necessary.

For instance, with the built-in UDFs that DataFusion offers, it would be powerful if users could customize various components of a UDF.

Take the AsciiFunc as an example, both of the following signatures could be valid depending on user requirements:

Signature::string(1, Volatility::Immutable)

Signature::coercible(
    vec![TypeSignatureClass::Native(logical_string())],
    Volatility::Immutable,
)

The default implementation of AsciiFunc would prioritize being permissive and general-purpose:

impl AsciiFunc {
    pub fn new() -> Self {
        Self {
            signature: Signature::one_of(
                vec![
                    TypeSignature::Coercible(vec![TypeSignatureClass::Native(
                        logical_string(),
                    )]),
                ],
                Volatility::Immutable,
            ),
        }
    }
}

However, what if users could further customize the behavior of AsciiFunc like this?

let ascii = AsciiFunc::new()
    .with_signature(Signature::string(1, Volatility::Immutable))
    // Define other customizations as well
    .with_display_name(vec![lit("Meow")])
    .with_is_nullable(false)
    ...
    .with_...
    ...
    ?;

@jayzhan211
Copy link
Contributor

We have similar mechanism for aggregate function -- AggregateExprBuilder

let agg_expr = AggregateExprBuilder::new(func.to_owned(), physical_args.to_vec())
                        .order_by(ordering_reqs)
                        .schema(Arc::new(physical_input_schema.to_owned()))
                        .alias(name)
                        .with_ignore_nulls(ignore_nulls)
                        .with_distinct(*distinct)
                        .build()
                        .map(Arc::new)?;

We can also introduced similar builder for ScalarFunction, it does not exist because there is no such requirement so far.

@jayzhan211
Copy link
Contributor

We have type coercion in logical plan now, consider the case where we want to separate logical types and physical types, should we add another type coercion layer in physical optimizer?

Logical layer handle coercion that doesn't care about the difference between the decoding, cast i.e. Integer to String, while physical layer handle coercion that care about the decoding, i.e. cast Utf8 to Utf8View.

@alamb
Copy link
Contributor Author

alamb commented Jan 26, 2025

We have type coercion in logical plan now, consider the case where we want to separate logical types and physical types, should we add another type coercion layer in physical optimizer?

I think it would make more sense to insist that the types line up exactly (have been resolved) in the Physical plans / physical types. But I haven't really thought through the implications of type coercion with a logical / physical type split

@alamb
Copy link
Contributor Author

alamb commented Jan 26, 2025

For instance, with the built-in UDFs that DataFusion offers, it would be powerful if users could customize various components of a UDF.

That is an interesting idea @shehabgamin

I do want to point out that users can customize the entire UDF, by implementing their own version of ascii (including coercion rules via ScalarUDFImpl::coerce_types) and registering that instead.

However, using their own functions brings a substantial burden on to the user as you now have to maintain the functions in your downstream project 🤔

@shehabgamin
Copy link
Contributor

shehabgamin commented Jan 26, 2025

I do want to point out that users can customize the entire UDF, by implementing their own version of ascii (including coercion rules via ScalarUDFImpl::coerce_types) and registering that instead.

However, using their own functions brings a substantial burden on to the user as you now have to maintain the functions in your downstream project 🤔

@alamb Right, for UDFs that overlap between Spark and DataFusion, Sail already customizes several versions that differ only slightly from the DataFusion implementation, simply because we need to align with Spark's expectations.

More importantly, beyond the burden this imposes, having the default behavior of various DataFusion components be permissive and general-purpose, while also offering flexibility for customization, ensures we don't have to worry about conflicting requirements.

@alamb
Copy link
Contributor Author

alamb commented Jan 26, 2025

@alamb Right, for UDFs that overlap between Spark and DataFusion, Sail already customizes several versions that differ only slightly from the DataFusion implementation, simply because we need to align with Spark's expectations.

BTW I think there are many people interested in spark compatible functions. Maybe now is time to actually start collaborating on a spark compatible UDF function library given the interest. I tried to restart the conversation here:

@shehabgamin
Copy link
Contributor

BTW I think there are many people interested in spark compatible functions. Maybe now is time to actually start collaborating on a spark compatible UDF function library given the interest. I tried to restart the conversation here:

* [[DISCUSSION] Add separate crate to cover spark builtin functions #5600 (comment)](https://github.com/apache/datafusion/issues/5600#issuecomment-2614342453)

Sounds great! I'll chime in after the weekend.

@Omega359
Copy link
Contributor

BTW I think there are many people interested in spark compatible functions. Maybe now is time to actually start collaborating on a spark compatible UDF function library given the interest. I tried to restart the conversation here:

* [[DISCUSSION] Add separate crate to cover spark builtin functions #5600 (comment)](https://github.com/apache/datafusion/issues/5600#issuecomment-2614342453)

To really be able to customize the behaviour of UDF's I believe DataFusion will really need #13519 such that some of it could possibly just be driven by configuration.

@Omega359
Copy link
Contributor

There was an issue that was encountered the other month where the type to be inferred was context specific. For example, to_timestamp('-1'::varchar) should not attempt to coerce the varchar back to an i32 or some other data type. Thus, I think that we may need additional information beyond just simple type -> type.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jan 28, 2025

We have type coercion in logical plan now, consider the case where we want to separate logical types and physical types, should we add another type coercion layer in physical optimizer?

I think it would make more sense to insist that the types line up exactly (have been resolved) in the Physical plans / physical types. But I haven't really thought through the implications of type coercion with a logical / physical type split

I used to think type coercion for physical types (arrow::DataType) should be handled at the physical layer. However, if we consider it from another angle, the key distinction between the logical plan and the physical plan is that the logical plan has no knowledge of the actual data, only the metadata. At the logical level, we deal with abstractions like the Schema or scalar values, without interacting with the RecordBatch data itself.

From this perspective, it makes more sense for type coercion to be part of the logical layer. Logical types can be seen as a more abstract or categorized version of physical types. A better way to describe this might be to think of them as "DataFusionType" and "ArrowType"—both of which are essentially logical concepts!

In this context, I guess we don't need LogicalSchema from #12622 as well but interact with the arrow::Schema directly

@findepi
Copy link
Member

findepi commented Jan 29, 2025

I am OK having user-defined coercion rules as long as they are applied reasonably in the DataFusion core.
In particular, the coercion rules should be applied by the analyzer, during the logical plan construction.
The optimizers should never perform any coercions. Period.
In particular, for DataFusion usage as a library in a system like SDF dbt compiler, where frontend is separate and DataFusion is handed over fully resolved logical plan nodes, the DF core should not invoke any coercion-related logic during plan optimizations or query processing. This is substantial for DataFusion to be a reliable building block for other systems (#12723).

@alamb
Copy link
Contributor Author

alamb commented Jan 30, 2025

In particular, the coercion rules should be applied by the analyzer, during the logical plan construction.

I agree with this goal 100%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants