Brightball

PostgreSQL functions with Elixir Ecto

PostgreSQL | Elixir | Phoenix | Ecto | - July 27, 2016 // Barry

Functions within PostgreSQL can be set up to return rows and included in queries just like any other table. Continuing with our theme of pushing Elixir and Phoenix a little on this site rebuild, we will move our site search into a database function and experiment with different ways to call it from Ecto.

In our last article I skimmed over the details of setting up PostgreSQL search with Ecto. In the interest of being thorough, let's cover the entire setup process for PostgreSQL full text search. The documentation provides excellent step-by-step instructions, but here is how we set up ours.

Our Database Migration

mix ecto.gen.migration setup_full_text_search
* creating priv/repo/migrations
* creating priv/repo/migrations/20160618041203_setup_full_text_search.exs

Add the tsvector column

When we run our search we compare a tsquery to a tsvector, which you can do on the fly like this...

WHERE to_tsvector('english', title || ' ' || introduction || ' ' || main_body) @@ to_tsquery('english', 'postgres')

That requires running to_tsvector on every row in a sequential scan. You can also create an expression index on the result of to_tsvector like this...

CREATE INDEX search_index ON articles 
USING GIN (to_tsvector('english', title || ' ' || introduction || ' ' || main_body));

Now when I include the above expression in the WHERE clause of a query it will use the index, but the downside is that I have to repeat that entire expression in the query every single time for the index to get picked up. If you're only including a single column in your search, that's probably not a big deal. If you're including multiple columns and potentially weighting them for priority, it's going to get out of hand fast.
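
To make that caveat concrete, here is a quick sketch: the expression index is only usable when the WHERE clause repeats the indexed expression verbatim, and EXPLAIN on each query makes the difference obvious.

```sql
-- Can use the GIN expression index: the expression matches the index definition
SELECT id FROM articles
WHERE to_tsvector('english', title || ' ' || introduction || ' ' || main_body)
      @@ to_tsquery('english', 'postgres');

-- Falls back to a sequential scan: the expression no longer matches the index
SELECT id FROM articles
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'postgres');
```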

To solve the complexity problem and remove any ambiguity, we create a tsvector column in our migration and set an index on that column. Ecto doesn't have built-in support for every piece of PostgreSQL functionality, such as GIN indexes, so for those we fall back to raw SQL with execute.

def up do
  alter table(:articles) do
    add :search_vector, :tsvector
  end

  execute "CREATE INDEX article_search_index ON articles USING GIN(search_vector)"
end
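
Since the migration defines up explicitly, it also needs a matching down to stay reversible; a minimal sketch that undoes both steps in reverse order:

```elixir
def down do
  # Drop the index before removing the column it covers
  execute "DROP INDEX article_search_index"

  alter table(:articles) do
    remove :search_vector
  end
end
```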

This allows us to compare against the column directly, knowing the proper index will be used once we have some data in there, like so...

WHERE search_vector @@ to_tsquery('english','postgres')

Weighting the Fields of Our Search

Now, my search includes four different fields: title, introduction, main_body and tags. If a word or term appears in the title or tags, that's probably more important than if it just happens to appear in the body of the text, so I want to weight the fields in the search_vector. I also have a boolean field to indicate presentations that I've given; if somebody searches for "presentation", I'd like that field to be given more weight.

As an added twist, my tags column is a PostgreSQL array, which we need to convert to a string to include in our search. We can do that with the array_to_string function... but that function isn't marked IMMUTABLE, and PostgreSQL requires functions used to build an index to be immutable. So we're going to create an additional immutable_array_to_string wrapper function, which I pulled from Stack Overflow. Here's the result in the migration.

execute "CREATE OR REPLACE FUNCTION immutable_array_to_string(arr ANYARRAY, sep TEXT) 
  RETURNS text AS $$
    SELECT array_to_string(arr, sep);
$$ LANGUAGE SQL IMMUTABLE"

execute "CREATE OR REPLACE FUNCTION article_search_trigger() RETURNS trigger AS $$
  begin
      new.search_vector :=
        setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
        setweight(to_tsvector('pg_catalog.english', coalesce(immutable_array_to_string(new.tags,' '),'')), 'B') ||
        setweight(to_tsvector('pg_catalog.english', coalesce(new.introduction,'')), 'D') ||
        setweight(to_tsvector('pg_catalog.english', coalesce(new.main_body,'')), 'D') ||
        setweight(to_tsvector('pg_catalog.english', CASE WHEN new.presentation THEN 'presentation' ELSE '' END),'B');
    return new;
  end
$$ LANGUAGE plpgsql"

execute "CREATE TRIGGER article_search_update 
  BEFORE INSERT OR UPDATE OF title, introduction, main_body, tags, presentation 
  ON articles 
  FOR EACH ROW EXECUTE PROCEDURE article_search_trigger()"

The key part here is the setweight function that takes a vector and a weight of A, B, C or D for priority. The vector should also have a dictionary specified, in this case 'english'.

# For explanation only, real code above
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', array_to_string(tags, ' ')), 'B') ||
setweight(to_tsvector('english', introduction), 'D') ||
setweight(to_tsvector('english', main_body), 'D') ||
setweight(to_tsvector('english', CASE WHEN presentation THEN 'presentation' ELSE '' END), 'B')

We're creating tsvectors out of all of our fields, giving them weights, and inserting the word 'presentation' into the index when presentation is true. Our TRIGGER will ensure that this index is always in sync, avoiding the delta management, out-of-sync indexes and infrastructure costs involved in setting up a third-party search tool.

Aliases table for ts_rewrite

In our example query, we also used the ts_rewrite function to include synonyms and alternate spellings of common terms that people may search for but that mean the same thing, such as 'pg', 'postgres' and 'postgresql'. Let's create, populate and index our aliases table like this (including an UPSERT, since I'd been tinkering locally).

execute "CREATE TABLE IF NOT EXISTS aliases (t tsquery PRIMARY KEY, s tsquery)"
execute "INSERT INTO aliases VALUES(to_tsquery('postgres'), to_tsquery('postgresql|postgres|pg')) ON CONFLICT (t) DO UPDATE SET s = EXCLUDED.s"
execute "INSERT INTO aliases VALUES(to_tsquery('pg'), to_tsquery('postgresql|postgres|pg')) ON CONFLICT (t) DO UPDATE SET s = EXCLUDED.s"
execute "INSERT INTO aliases VALUES(to_tsquery('postgresql'), to_tsquery('postgresql|postgres|pg')) ON CONFLICT (t) DO UPDATE SET s = EXCLUDED.s"
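
To sanity-check the aliases, you can run ts_rewrite by hand. Given the rows above, searching for any one spelling should expand to the OR of all three:

```sql
-- Rewrite the incoming query using every matching row in aliases
SELECT ts_rewrite(
  plainto_tsquery('pg'),
  'SELECT t, s FROM aliases'
);
-- expands 'pg' into something like 'postgresql' | 'postgres' | 'pg'
```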

Now just run the migration and we have fully functioning search. That may look like a lot, but in the end it's just an expanded way of asking:

  1. What fields do I want to search on?
    - tsvector
  2. Which ones matter more?
    - setweight
  3. How do I keep this up to date automatically?
    - database trigger
  4. What if people can't find what they're looking for?
    - aliases / ts_rewrite

Ecto Schemaless Query

Our initial query looked like this, using an Ecto schemaless query.

articles = Repo.all(
  from c in "articles",
    join: sq in fragment("(SELECT ts_rewrite(plainto_tsquery(?), 'SELECT * FROM aliases') search_query)", ^term),
    on: fragment("search_vector @@ ?", sq.search_query),
    where: not is_nil(c.publish_at)  and c.static == false,
    order_by: fragment("rank DESC"),
    select: %{
      id: c.id, title: c.title, slug: c.slug, display_date: c.display_date, 
      rank: fragment("ts_rank_cd(search_vector, ?, 32) AS rank", sq.search_query),
      introduction: fragment("ts_headline('english',CONCAT(introduction,' ',main_body), search_query, 'StartSel=,StopSel=,MinWords=50,MaxWords=100') AS headline"),
      tags: c.tags
    })

Look closely at that ts_rewrite. We're passing in a query that returns the entire aliases table to the rewrite. That can't be efficient long term as the table grows, and as it turns out, it isn't. That's why it's recommended to include a WHERE clause to trim it down.

ts_rewrite(plainto_tsquery(?), 'SELECT * FROM aliases WHERE plainto_tsquery(?) @> t')

There's just one problem: every way I tried to include the second parameter for the subquery gave me an Ecto error. I tried a lot of different approaches, from using CONCAT to executing raw SQL. Nothing allowed me to pass the parameter inside the quoted query, so I left it out of the last article entirely. If anybody figures it out, I'll be happy to update the article.

"When life shuts a door, open it again. It's a door. That's how they work."

What if there were some way to do this without having to use a schemaless query at all?

Search as a PostgreSQL Function

PostgreSQL functions are pretty broad things. You can use them with simple inputs and a single return value of a specific type, but you can also have them return rows just like querying a table or view. Our search function is an excellent use case to try that out and then see if we can get the results into an Ecto schema.

# Search function migration
def change do
  execute """
    CREATE OR REPLACE FUNCTION article_search(term varchar)
    RETURNS table(id int, title varchar, slug varchar, display_date date, tags varchar[], rank real, introduction text)
    AS $$
      SELECT ar.id, ar.title, ar.slug, ar.display_date, ar.tags, 
        ts_rank_cd(search_vector, search_query, 32) AS rank, 
        ts_headline('english',
          CONCAT(introduction,' ',main_body), 
          search_query, 
          'StartSel=<mark>,StopSel=</mark>,MinWords=50,MaxWords=100'
        ) AS introduction
      FROM articles ar
        INNER JOIN (
          SELECT 
            ts_rewrite(
              plainto_tsquery(term), 
              CONCAT('SELECT * FROM aliases WHERE plainto_tsquery(''',term,''') @> t')
            ) search_query
        ) sq ON ar.search_vector @@ search_query
      WHERE ar.publish_at IS NOT NULL AND ar.static = FALSE
      ORDER BY rank DESC;
  $$ language SQL;
  """
end

We're taking in the search term as an argument, specifying the return type for the rows that will come back, and just running our query. I haven't included pagination here either, but you'll probably want to for anything larger than my blog. Now we can see our function in action.

SELECT * FROM article_search('postgres')

Using our function from Ecto

At this point we're ready to go...now we just have to figure out how to work it into a schema. To start with I figured I'd try using it in a schemaless query like we did with our join fragment.

Repo.all(
  from r in fragment("article_search(?) results", ^term)
  ....
)

But it doesn't work! The fragment is fine in a join but, at the moment, you can't use a fragment in an Ecto from clause. Ecto may have a direct solution for this at some point in the future. I put in a feature request here, which got a positive response from José, including a workaround. I think we can do better, though.
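
In the meantime, one workaround in that spirit (sketched here as an assumption, not necessarily José's exact suggestion) is to keep a real table in the from clause and join the function's rows back in through a fragment, which Ecto does allow:

```elixir
# Hypothetical workaround: join the set-returning function instead of
# selecting from it directly. Assumes article_search/1 returns an id
# column matching articles.id, as the function above does.
articles = Repo.all(
  from a in "articles",
    join: s in fragment("article_search(?)", ^term),
    on: a.id == s.id,
    order_by: [desc: s.rank],
    select: %{id: s.id, title: s.title, slug: s.slug, rank: s.rank}
)
```

It works, but it drags the articles table back into a query the function already handles on its own.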

Creating a Search Module

Since we can't use the function in a from and we can't specify a table with arguments in a schema, we're going to create our own struct. We'll use Ecto to execute raw SQL and map the results to our struct. Here's the end result.

defmodule Brightball.ArticleSearch do
  defstruct id: 0, title: "", slug: "", display_date: nil, tags: [], rank: 0.0, introduction: ""

  def search(term) do
    Ecto.Adapters.SQL.query!(Brightball.Repo, "SELECT * FROM article_search($1)", [term]) 
    |> result_to_struct
  end

  # Convert the raw query result into a list of ArticleSearch structs
  # by zipping the returned column names with each row's values
  defp result_to_struct(res) do
    cols = Enum.map(res.columns, &String.to_atom/1)
    Enum.map(res.rows, fn row -> struct(__MODULE__, Enum.zip(cols, row)) end)
  end
end

Plus, as a bonus, this setup will make it easy for us to experiment with multiple types of searches using the same data structure. Now we can call it easily from anywhere.

articles = ArticleSearch.search(term)

Closing Up

We explored using functions to return results from Ecto, and even though there were a few roadblocks, they were reasonably simple to work around. We also created a nifty search module that gives us a basis to experiment with future search techniques (hint, hint). Hope you liked it!