User Defined Coercion Rules #14296
Comments
It would be great if we could take LogicalType into consideration as well.
I like the idea of "user-defined coercion rules" and think we should take it a step further by striving for a paradigm of "user-defined configurability" in DataFusion, much like the customization points SQLParser already exposes. For instance, with the built-in UDFs that DataFusion offers, it would be powerful if users could customize various components of a UDF. Take the default implementation of one of these functions: what if users could further customize its behavior without having to reimplement the whole function?
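One concrete place this kind of per-function customization already shows up is argument coercion for scalar UDFs. Below is a minimal sketch, assuming a recent DataFusion release where `ScalarUDFImpl::coerce_types` exists and `Signature::user_defined` routes argument-type resolution through it (the exact set of invoke methods varies between versions); the function name `my_func` is purely illustrative:

```rust
use std::any::Any;

use arrow::datatypes::DataType;
use datafusion_common::Result;
use datafusion_expr::{ColumnarValue, ScalarUDF, ScalarUDFImpl, Signature, Volatility};

#[derive(Debug)]
struct MyFunc {
    signature: Signature,
}

impl MyFunc {
    fn new() -> Self {
        Self {
            // `user_defined` defers argument type resolution to `coerce_types`
            signature: Signature::user_defined(Volatility::Immutable),
        }
    }
}

impl ScalarUDFImpl for MyFunc {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "my_func"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    // Custom coercion: accept any single argument and ask the planner to
    // cast it to Utf8 before invocation.
    fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
        if arg_types.len() != 1 {
            return datafusion_common::plan_err!("my_func takes exactly one argument");
        }
        Ok(vec![DataType::Utf8])
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(DataType::Utf8)
    }

    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
        // Identity for illustration: the argument already arrives as Utf8.
        Ok(args[0].clone())
    }
}

fn my_func_udf() -> ScalarUDF {
    ScalarUDF::new_from_impl(MyFunc::new())
}
```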
We have a similar mechanism for aggregate functions:

```rust
let agg_expr = AggregateExprBuilder::new(func.to_owned(), physical_args.to_vec())
    .order_by(ordering_reqs)
    .schema(Arc::new(physical_input_schema.to_owned()))
    .alias(name)
    .with_ignore_nulls(ignore_nulls)
    .with_distinct(*distinct)
    .build()
    .map(Arc::new)?;
```

We could introduce a similar builder for ScalarFunction; it does not exist yet simply because there has been no such requirement so far.
We have type coercion in the logical plan now. If we consider the case where we want to separate logical types and physical types, should we add another type-coercion layer in the physical optimizer? The logical layer would handle coercions that do not care about the physical encoding (e.g. casting Integer to String), while the physical layer would handle coercions that do care about the encoding (e.g. casting Utf8 to Utf8View).
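For illustration, both kinds of cast can already be expressed with the arrow `cast` kernel; the split proposed above is about which plan layer decides to insert them. A small sketch, assuming the installed arrow version supports the Utf8 to Utf8View cast:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    // "Logical" coercion: the value changes type regardless of encoding,
    // e.g. Int32 -> Utf8.
    let ints: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let as_strings = cast(&ints, &DataType::Utf8)?;

    // "Physical" coercion: same logical value, different in-memory encoding,
    // e.g. Utf8 -> Utf8View.
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
    let as_views = cast(&strings, &DataType::Utf8View)?;

    println!("{:?} {:?}", as_strings.data_type(), as_views.data_type());
    Ok(())
}
```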
I think it would make more sense to insist that the types line up exactly (have been resolved) in the physical plans / physical types. But I haven't really thought through the implications of type coercion with a logical / physical type split.
That is an interesting idea @shehabgamin. I do want to point out that users can already customize the entire UDF by implementing their own version of the function. However, using their own functions brings a substantial burden onto the user, as you now have to maintain the functions in your downstream project 🤔
@alamb Right. For UDFs that overlap between Spark and DataFusion, Sail already maintains several customized versions that differ only slightly from the DataFusion implementation, simply because we need to align with Spark's expectations. More importantly, beyond the burden this imposes, having the default behavior of the various DataFusion components be permissive and general-purpose, while also offering the flexibility to customize them, ensures we don't have to worry about conflicting requirements.
BTW I think there are many people interested in Spark-compatible functions. Maybe now is the time to actually start collaborating on a Spark-compatible UDF library, given the interest. I tried to restart the conversation here:
Sounds great! I'll chime in after the weekend. |
To really be able to customize the behaviour of UDFs, I believe DataFusion will need #13519, so that some of it could possibly just be driven by configuration.
There was an issue encountered the other month where the type to be inferred was context specific. For example,
I used to think type coercion for physical types (arrow::DataType) should be handled at the physical layer. However, looking at it from another angle, the key distinction between the logical plan and the physical plan is that the logical plan has no knowledge of the actual data, only the metadata. At the logical level we deal with abstractions like the Schema or scalar values, without interacting with the RecordBatch data itself. From this perspective, it makes more sense for type coercion to be part of the logical layer.

Logical types can be seen as a more abstract or categorized version of physical types. A better way to describe this might be to think of them as "DataFusionType" and "ArrowType" -- both of which are essentially logical concepts! In this context, I guess we don't need LogicalSchema from #12622 either, but can interact with the arrow::Schema directly.
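As a hedged sketch of that "more abstract, categorized" view (the `DataFusionType` name and the mapping below are hypothetical, not an existing API):

```rust
use arrow::datatypes::DataType;

/// Hypothetical logical type: one "DataFusionType" groups several Arrow
/// encodings that carry the same logical value.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DataFusionType {
    Int,
    Float,
    String,
    Binary,
    // ...
}

/// Map a physical Arrow type to its hypothetical logical category.
fn logical_type(dt: &DataType) -> Option<DataFusionType> {
    use DataType::*;
    Some(match dt {
        Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 => DataFusionType::Int,
        Float16 | Float32 | Float64 => DataFusionType::Float,
        // Different encodings, same logical string type.
        Utf8 | LargeUtf8 | Utf8View => DataFusionType::String,
        Binary | LargeBinary | BinaryView => DataFusionType::Binary,
        _ => return None,
    })
}
```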
I am OK having user-defined coercion rules as long as they are applied reasonably in the DataFusion core. |
I agree with this goal 100% |
Is your feature request related to a problem or challenge?
Coercion is the process of automatically converting a value from one data type to another so that the arguments to a function or the operands of an expression have types the operation actually accepts (for example, evaluating `1 + 2.5` coerces the integer to a float before adding).
At the moment DataFusion has one set of built-in coercion rules. However, with a single set of coercion rules we'll always end up with conflicting requirements, such as
It also makes it hard, as @findepi has pointed out several times, to extend DataFusion with new data types / logical types.
My conclusion is that we will never be able to have behavior that works for everyone, and any change in coercion is likely to trade one set of problems for another.
Rather than having to go back and forth in the core, I think an API would give us an escape hatch.
Describe the solution you'd like
While the user can in theory supply their own coercion rules by adding a new AnalyzerRule instead of the current TypeCoercion rule (datafusion/datafusion/optimizer/src/analyzer/type_coercion.rs, line 62 in 18f14ab), doing so means replacing the entire analysis pass rather than overriding only the rules they care about.
Is it a crazy idea to try to implement "user defined coercion rules"? That way we can keep whatever coercion rules are in the DataFusion core, but when they inevitably are not quite what users want, users can override them.
It would also force us to design the type coercion APIs more coherently, which I think would be good for code quality in general.
Describe alternatives you've considered
I was imagining something like a user-supplied trait. The trait might look like:
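A hedged sketch of one possible shape: a hypothetical `CoercionRules` trait that the planner would consult instead of the built-in rules. The trait name and every method on it are assumptions for illustration, not an existing DataFusion API.

```rust
use arrow::datatypes::DataType;
use datafusion_common::Result;
use datafusion_expr::Operator;

/// Hypothetical user-defined coercion rules; nothing here exists in
/// DataFusion today, it only illustrates the proposed escape hatch.
pub trait CoercionRules: Send + Sync {
    /// Decide the common type both sides of a binary expression should be
    /// cast to, e.g. what `Int32 = Utf8` should resolve to.
    fn coerce_binary(
        &self,
        lhs: &DataType,
        op: &Operator,
        rhs: &DataType,
    ) -> Result<DataType>;

    /// Decide the argument types a function call should be cast to, given
    /// the types the user actually passed.
    fn coerce_function_args(
        &self,
        func_name: &str,
        arg_types: &[DataType],
    ) -> Result<Vec<DataType>>;
}

/// Example: rules that refuse implicit cross-type comparisons instead of
/// silently casting, which a Spark-compatible engine might want to change.
struct StrictRules;

impl CoercionRules for StrictRules {
    fn coerce_binary(&self, lhs: &DataType, _op: &Operator, rhs: &DataType) -> Result<DataType> {
        if lhs == rhs {
            Ok(lhs.clone())
        } else {
            datafusion_common::plan_err!("no implicit coercion from {lhs} to {rhs}")
        }
    }

    fn coerce_function_args(&self, _func_name: &str, arg_types: &[DataType]) -> Result<Vec<DataType>> {
        // Leave arguments untouched; the function must accept them as-is.
        Ok(arg_types.to_vec())
    }
}
```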
Maybe there should be methods for Field instead of DataType 🤔
Additional context
No response