Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of search_* DB views #1215

Open
LukasKalbertodt opened this issue Jul 25, 2024 · 0 comments
Open

Improve performance of search_* DB views #1215

LukasKalbertodt opened this issue Jul 25, 2024 · 0 comments
Labels

Comments

@LukasKalbertodt
Copy link
Member

(I'm writing this issue simply to write down all my current context about this issue. It's not urgent at all.)

We have a number of search_* views in our DB, which more or less mirror what data is stored in Meili. These are basically the corresponding tables (e.g. events for search_events) enriched with some additional information (e.g. ancestor_names for search_realms and host_realms for all others). There are the following use cases for these:

  • Rebuilding the search index: select * from view
  • Updating the search index: select * from view where id = any($1) (where $1 often will only be one or very few events).

Ideally, the view should be defined in a way that makes all of these cases fast. That turns out to be very tricky for the events, series and playlist views. It is easy to make either one use case fast, but not both at once. Sure, we can define two views, but we don't want to duplicate all that logic. The core issue here is calculating the ancestor_names for the host realms only once per needed realm.

The current definition of search_series looks roughly like this:

select
    series.*,
    coalesce(
        array_agg((
            select row(search_realms.*)::search_realms
            from search_realms
            where search_realms.id = blocks.realm
        )) filter(where blocks.realm is not null),
        '{}'
    ) as host_realms
from series
left join blocks on type = 'series' and blocks.series = series.id
group by series.id;

This first creates a join between series and all blocks including series, so series that are included in multiple places have multiple rows associated with them. These rows are collapsed via group by later. The complex expression in select retrieves the search realm data for each of those post-join rows. But since some realms list multiple series, the lookup is performed multiple times for those realms. In our test data, this is no problem for series as this duplication is very rare.

But for events, there is much more duplication, as a series block means all events from that series belong to that realm, so the calculations/lookup of that realm is done once per event. That's why our current search_events looks roughly like this:

select
    events.*,
    coalesce(
        array_agg(
            distinct
            row(search_realms.*)::search_realms
        ) filter(where search_realms.id is not null),
        '{}'
    ) as host_realms
from events
left join series on events.series = series.id
left join blocks on (
    type = 'series' and blocks.series = events.series
    or type = 'video' and blocks.video = events.id
)
left join search_realms on search_realms.id = blocks.realm
group by events.id, series.id;

Here, the search_realms is joined, which seems to always lead to basically loading all rows from that view before proceeding. This means that the ancestor_names calculation is only done once per realm (good), but also that it is always done once per realm, regardless of the realm is relevant (bad). Thus select * from search_events where id = $1 costs always as much as select * from search_realms plus a bit more. Which are like 250ms on my machine with our test data. Going with the approach of the search_series above, the query runs in roughly 1ms instead for id = $1. However, it takes 1000ms or so if all events are loaded, which is roughly twice as long as the current view takes. This gets worse with playlists, as even more duplicate work is done.

So how can this be improved? Ideally, PostgreSQL's query planner would switch between two different plans depending on the where clause. I don't see a good reason why PG always needs to load all search_realms in the join. The view is not treated as black box, as can be observed in many other situations!

I tried a few approaches with using with-queries. One major blocker is the fact that with-queries, just as views, do not retain the "functional dependencies" properties of the columns. All columns of a table are functionally dependent on the primary key column. Normally, when using aggregate functions, all selected fields with aggregate function need to be listed inside group by. But PG knows that its enough to list th primary key inside group by to be able to select all other columns, well, the functionally dependent ones. But this nifty property is not retained when using views or with-queries, which leads to writing out all selected fields twice.

Further, using search_realms is annoying, as it very very quickly just loads the whole table, even if there are primary key bounds it could use. So one has to duplicate that logic as well and directly join realms instead. Also, when using with-queries, it is difficult/impossible to already use the where id = ... part inside the subqueries. A view is not a function where we have access to that filter, but PG must understand the view definition and optimize the filter itself.

I have worked in this way too long already and will stop now. Here are some of my attempts:

This is meh bc the search_realms functionality needs to be duplicated, and ultimately fails because the where bound would need to be in the the_events query for it to be fast.

with the_events as (
    select events.*, series.title as series_title, 
        array_agg(distinct playlists.id) as containing_playlists,
        array_agg(distinct blocks.realm) as host_realm_ids
    FROM all_events events
    LEFT JOIN series ON events.series = series.id
    LEFT JOIN playlists ON ARRAY[events.opencast_id] <@ event_entry_ids(playlists.entries)
    left join blocks on
        blocks.type = 'series'::block_type AND blocks.series = events.series 
        OR blocks.type = 'video'::block_type AND blocks.video = events.id
        OR blocks.type = 'playlist'::block_type AND blocks.playlist = playlists.id
  group by events.id, series.id
),
realm_ids as (
    select distinct unnest(host_realm_ids) as realm_id, id as event_id
    from the_events
),
the_realms as (
    select event_id, array_agg((select full_path from search_realms where id = realm_id)) as host_realms
    from realm_ids
    group by event_id
)

select *
from the_events
left join the_realms on the_realms.event_id = the_events.id
where the_events.id = 1053531426970406865

The idea here was to only output the realm_ids from the view and leave the "resolve to search realms" to the code querying the view. Works great for the id = $1 case, but I haven't figured out a way to make it work well for the "load all" case. The code below runs in 900ms instead of 600ms of the current definition. Might be possible to do the last step faster in Rust but yeh mh. Joining search_realms should be the better version, but thats even slower for some reason and that would require us to group by all columns.

 with foo as (
    SELECT events.id, series.title AS series_title, events.title,
     coalesce(array_agg(distinct blocks.realm) filter (where blocks.realm is not null), '{}') as host_realms,
    COALESCE(array_agg(playlists.id) FILTER (WHERE playlists.id IS NOT NULL), '{}'::bigint[]) AS containing_playlists
   FROM all_events events
     LEFT JOIN series ON events.series = series.id
     LEFT JOIN playlists ON ARRAY[events.opencast_id] <@ event_entry_ids(playlists.entries)
     LEFT JOIN blocks ON blocks.type = 'series'::block_type AND blocks.series = events.series 
        OR blocks.type = 'video'::block_type AND blocks.video = events.id 
         OR blocks.type = 'playlist'::block_type AND blocks.playlist = playlists.id
         
     --LEFT JOIN realms ON realms.id = blocks.realm
  
  where events.id = 1053531426970406865
  GROUP BY events.id, series.id
)

select title,
    array(select row(search_realms.*) from search_realms where id = any(host_realms))
from foo

select substring(title from 0 for 30), 
    array((select full_path from search_realms where search_realms.id = any(foo.host_realms)))
from foo

One could test this again in newer PG versions, maybe the optimize properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant