Skip to content

Latest commit

 

History

History
398 lines (310 loc) · 15 KB

README.md

File metadata and controls

398 lines (310 loc) · 15 KB

postgres-searchbox adds full text search to your existing Postgres tables, using Postgres itself as a (good enough) search engine. You don't need an external search index (such as Elastic), which is often tedious to set up, keep synchronized with Postgres, and operate. With Postgres keeping both your data and your search index, everything is always guaranteed to be up-to-date, and there's only one server to maintain.

Setting up Postgres as a search index is easy with postgres-searchbox: you tell it which table you want to be searchable, and it gives you SQL commands to execute to create the index.

Implementing a web page with a searchbox is also easy with postgres-searchbox, which connects your Postgres search index with the excellent search-UI library React InstantSearch Hooks. Here is a working rudimentary search page using this approach:

import { InstantSearch, SearchBox, Hits } from 'react-instantsearch-hooks-web';
import { make_client } from 'postgres-searchbox/client';

const client = make_client('api/search');

function Hit({ hit }) {
  return (
    <article>
      <h1>{hit.primarytitle}</h1>
      <p>
        {hit.titletype}, {hit.startyear}, {hit.runtimeminutes} min
      </p>
      <p>{hit.genres}</p>
    </article>
  );
}

export default function Home() {
  return (
    <div>
      <main>
        <h1>Please enter your search terms here:</h1>
        <InstantSearch searchClient={client} indexName="table_name_here">
          <SearchBox />
          <Hits hitComponent={Hit} />
        </InstantSearch>
      </main>
    </div>
  );
}

You hook up a couple of React components, tell them your table's name, and voila -- you have a web interface to search your Postgres data! Please read on for detailed instructions.

Usage Details

Here is how you can make your Postgres data searchable in three easy steps:

Create a Search Index for Your Postgres Table

Install the package to your project with yarn add postgres-searchbox.

postgres-searchbox includes a script that can generate the SQL commands for creating a search index on the table you want to search. The script is at scripts/create-index.js; it reads the table definition and creates a search index in thePostgres database. This index will cover all text columns in the table, allowing a single searchbox to match against all the text the table contains.

From the postgres-searchbox/package folder run PG_SB_TABLE_NAME=table_name yarn script:create-index You should replace table_name with your table name.

When executed, this script will create a new column in your table that serves as a text-search target, plus an index that significantly speeds up matching queries against this new column. Executing this script will likely take a while, depending on the size and nature of your data. Postgres will automatically update the index every time you modify your data; as soon as a database modification completes, the new content will be searchable.

IMPORTANT: For the Postgres connection to work, you must set the values of some environment variables, so the handler can find the right Postgres host, database, user, and password. At a minimum the following should be set

  • PGHOST
  • PGUSER
  • PGPASSWORD
  • PGDATABASE

Set up a Search API Route

For InstantSearch to work with our Postgres client, you need one new route in your web server. This new route accepts search queries and executes them against your Postgres database. When you instantiate the postgres-searchbox client in the <InstantSearch> component in the next step, you'll need to provide the new endpoint's URL.

postgres-searchbox provides a handy way to implement this endpoint. For example, if your server is NextJS, you can simply put this in the file pages/api/search.ts:

import { getSearchHandler } from 'postgres-searchbox';
export default getSearchHandler();

Note that the relative URL of this page is api/search, which is what we'll use in the next step.

IMPORTANT: The above note about environment variables applies here too.

Implement a Search Page

To put up a web page with a searchbox for your table's contents, use the InstantSearch React components, as illustrated in the example at the beginning of this document. Provide the URL from the last step to the make_client function and the table name to the indexName parameter.

In the <Hit> component, you can access any row field using {hit.<fieldName>}, like in the example.

Compatibility

The following components should work.:

  • SearchBox
  • Hits
  • HitsPerPage
  • InfiniteHits
  • Pagination
  • SortBy
  • Highlight
  • DynamicWidgets
  • HierarchicalMenu

Sorting by columns is supported. Use the syntax ?column_name(+asc|+desc)?(+nulls+last)?,column_name_2(+asc|+desc)....

By default Postgres sorts asc and returns null values first. So they can be left off, e.g.

<SortBy
  items={[
    { label: 'Relevance', value: 'table_name_here' },
    { label: 'Title (asc)', value: 'table_name_here?sort=column_name' },
    {
      label: 'Title (desc)',
      value: 'table_name_here?sort=column_name+desc+nulls+last',
    },
  ]}
/>

The Highlight widget works, only because it does not use all properties of the usual Algolia response. If you use a custom UI that relies on properties { matchedWords, matchLevel, fullyHighlighted } then it wont work correctly. See the issue #8

Highlight requires some config to work correctly. See Configuring section or a full explanation.

const client = make_client('api/search');
// ...
<Highlight hit={hit} attribute="column_name_here" className="Hit-label" />;
import { getSearchHandler } from 'postgres-searchbox';
export default getSearchHandler({
  settings: {
    attributesToHighlight: ['column_name_here'],
  },
});

For the HierarchicalMenu component, data should be saved in your table in the following format. The column names are unimportant,

  • Use null values where a row is not categorized at that level.
  • Column values should contain the names of parent categories. e.g. Appliances > Fans
  • If you row is categorized 2 levels deep, it's still required to include values in columns level 0 and level 1.
name your_label_0 your_label_1 your_label_2
An appliance Appliances null null
A generic fan Appliances Appliances > Fans null
A box fan Appliances Appliances > Fans Appliances > Fans > Box fans
const client = make_client('api/search');
// ...
<HierarchicalMenu
  attributes={['your_label_0', 'your_label_1', 'your_label_2']}
/>;

On server side config, include only level zero of hierarchal categories in the renderingContent parameter.

import { getSearchHandler } from 'postgres-searchbox';
export default getSearchHandler({
  settings: {
    renderingContent: {
      facetOrdering: {
        facets: {
          order: [
            'brand'
            'your_label_0', // Don't include 'your_label_1', 'your_label_2' etc.
            'price'
          ],
        },
      },
    },
  },
});

Configuring

There are two ways to configure the behavior of postgres-searchbox: server-side and client-side. You can get started with zero-config, but configuring will alow for a faster and more secure setup.

Important terminology. With instantsearch, Algolia have used terms that don't exactly translate to Postgres or SQL

  • indexName means the source of the search results, this is the Postgres table name.
  • attribute means a value associated with a search result, e.g. id, name, url etc. it's similar to a column value but it's not an exact translation, for example in Postgres a search results attribute could be in a different table. For now, postgres-searchbox does not work with cross-table attributes.
  • facets, these are the attribute keys like color, price etc. In Postgres, they're column names.

Server side

The default server-side config is at [/package/src/constants.ts]. The defaults are fine during development, but they fetch all attributes as facets and return all attributes in the search response.

The defaults should not be used in production for 2 reasons.

  1. Security. You may be exposing data that should not leave the server.
  2. Performance. Returning all attributes in an extra load on the server and network.

This can be addressed by explicitly setting attributesToRetrieve when instantiating getSearchHandler like:

import { getSearchHandler } from 'postgres-searchbox';
export default getSearchHandler({
  settings: { attributesToRetrieve: ['name'] },
});

The settings property map directly to Algolia Settings API Parameters, but are only a subset of Algolia. They can be set with type-safety and autofill ctrl + space in VSCode.

If your searchHandler should handle multiple indexes, instead of passing one config object you can pass in an array of configs like this. Make sure to set the indexName property for each config - and that each index has already been created with the create-index script.

[
  {
    indexName: 'postgres_searchbox_movies',
  },
  {
    indexName: 'bestbuy_product',
    settings: { attributesToRetrieve: ['name'] },
  },
];

Sometimes server-side config is not flexible enough, maybe you have an app and website hitting the same endpoint. And, the app and website need different attributes. In this case, use the client config as explained below, but set some server-side validation with the clientValidation property.

export default getSearchHandler({
  clientValidation: {
    validAttributesToRetrieve: [
      'id',
      'name',
      'price',
      'description',
      'mobile_column',
      'web_column',
    ],
    validAttributesToHighlight: ['column_name_here', 'column_2', 'column_3'],
  },
});

Client side

The client-side options map directly to Algolia Search API Parameters, but are only a subset of Algolia.

They can be set with type-safety and autofill.

import { Configure } from 'react-instantsearch-hooks-web';
import { make_client } from 'postgres-searchbox/client';
import type { SearchOptions } from 'postgres-searchbox/client.types';
const client = make_client('api/search');
const configureProps: SearchOptions = {
  validAttributesToRetrieve: [
    'id',
    'name',
    'price',
    'description',
    'web_column',
  ],
  attributesToHighlight: ['column_name_here', 'column_2'],
};
<InstantSearch searchClient={client} indexName="table_name_here">
  <Configure {...configureProps} />
  <SearchBox />
  <Hits hitComponent={Hit} />
</InstantSearch>;

Limitations

This package is a work in progress, so not all InstantSearch components work yet. Most notably, the highlight components isn't ready for prime-time.

Postgres text-search limitations

Postgres is not quite at the Elastic level of functionality yet. For example, it doesn't offer spell-corrections for mistyped terms, and its multi-language support is uneven.

The search index created by postgres-searchbox is the general search index, whose performance isn't necessarily optimal for all possible use cases. There are other indexing options, which require customization by an experienced developer.

It's also worth mentioning that postgres-searchbox currently requires a precise match for diacritics (accents on non-ASCII letters). This will be remedied in the future by using the unaccent dictionary.

Contributing

Starting with tests

A config for VSCode dev containers and docker-compose file are included for developer convenience, but they don't have to be used.

Getting started with VSCode

  • Open the project in VSCode.
  • In command pallet: Dev Containers: Reopen in Container
  • cd package
  • yarn install
  • yarn test:watch

Getting started with Docker Compose

  • Run docker-compose up from project root
  • In a new terminal, get bash access to the container with docker-compose exec bash
  • /home/default/package
  • yarn install
  • yarn test:watch
  • Stop with ``docker-compose stop`

Getting started without docker

  • Start up a Postgres instance for testing
  • Go to this project cd package
  • yarn install
  • The test scripts expect the following environment variables
    • PGHOST
    • PGUSER
    • PGPASSWORD
    • PGDATABASE
  • yarn test:watch

Real-world data

To work with a modest dataset of 20K rows. You can import an Algolia dataset algolia/instant-search-demo collected from the bestbuy API. A helper script to create a table, download, insert, and index the data is at packages/scripts/create-store.ts. To run this script yarn install and yarn script:create-store, the database is around 20MB.

To work with a dataset of 10M rows. You can import https://datasets.imdbws.com/title.basics.tsv.gz from IMDB. A helper script to create a table, download, insert, and index the data is at packages/scripts/create-movies.ts. To run this script yarn install and yarn script:create-movies this could take 5-10 minutes.

Local development with example(s)

During development it may be useful to see postgres-searchbox in context of a website. In the folder examples/with-nextjs is a default React (NextJS) install with postgres-searchbox installed. See the 3 files:

  • examples/with-nextjs/pages/api/search.ts
  • examples/with-nextjs/pages/movies.tsx
  • examples/with-nextjs/pages/store.tsx

In examples/with-nextjs you can yarn && yarn dev to get the dev. server running.

NextJS pages can import the locally developed postgres-searchbox; see their commented-out imports from package/build. Keep in mind, however, that package/build has to be periodically generated from package/src. You can automate this by running a second yarn dev in the package directory in another terminal. This will keep package/build always up-to-date with package/src.

Using swc here is orders of magnitude faster than tsc. The downside is that it doesn't check for type correctness.

Publishing to npm

As the project source is written in Typescript it's necessary to compile before publishing to npm

  • Ensure al tests are passing with yarn test
  • yarn build this will use tsc to output to package/build. It check type correctness and fail on any Typescript errors.
  • Update the version number in package.json
  • npm publish