
Rails Gems to Unlock Advanced PostgreSQL Features

Ruby | Rails | PostgreSQL | PostGIS | September 9, 2014 // Barry

If you've spent any amount of time on this site you may have noticed that I'm fond of PostgreSQL...and Ruby on Rails...and that I dislike the general trend among Rails developers to ignore all of the amazing features in PostgreSQL that make your application better in favor of risking data integrity just so that all logic can remain in Rails. So here's my top collection of Rails gems to get at all that untapped power in PostgreSQL that you didn't know you had.

Rails itself does a lot to expose underlying database functionality, including PostgreSQL-specific features like the hstore, json, and array datatypes, as well as helper methods to install PostgreSQL extensions. ActiveRecord's transaction blocks are excellent, and scopes let you easily assemble and reuse complicated pieces of queries. It doesn't do it all, unfortunately, so here are some additional tools to unlock even more.
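For reference, here's roughly what that built-in support looks like in a Rails 4 migration (the table and columns are hypothetical):

```ruby
class AddProfileFieldsToUsers < ActiveRecord::Migration
  def change
    enable_extension 'hstore'                  # Rails' helper for CREATE EXTENSION
    add_column :users, :settings, :hstore      # schemaless key/value store
    add_column :users, :payload,  :json        # JSON document
    add_column :users, :tags,     :string, array: true, default: []
  end
end
```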

pg_search

As you may have heard, PostgreSQL has some pretty phenomenal full-text search capabilities. You can get document search engine speed and quality, complete with custom dictionaries and thesauruses, weighting of different field values, full control, and index updates via simple triggers so your search is never out of sync with your data...the list goes on. It also has many other types of search that are effective in different scenarios, including trigram and double metaphone ("sounds like") matching. Best of all, because of automatic multi-index queries you can combine one (or multiple) searches with other parameters in a query to get exactly the result you're looking for. The pg_search gem developed by Pivotal Labs unlocks ALL of these capabilities in a very natural way for Rails developers.
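Here's a minimal sketch of what that looks like in a model; the scope name, fields, and weights are illustrative:

```ruby
# Gemfile: gem 'pg_search'
class Article < ActiveRecord::Base
  include PgSearch

  # Weight title matches ('A') above body matches ('B') and combine
  # full-text search with trigram matching (requires the pg_trgm extension).
  pg_search_scope :search_content,
                  against: { title: 'A', body: 'B' },
                  using: {
                    tsearch: { dictionary: 'english' },
                    trigram: {}
                  }
end

Article.search_content('postgres full text')  # returns an ActiveRecord scope
```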

schema_plus

When you start to become dependent on a framework, it's very natural to forgo a lot of the things databases already did for you. The problem is that as soon as you start putting logic that maintains data integrity in your application code, you introduce race conditions. Race conditions inevitably corrupt your data, creating support tickets and forcing you to spend time tracking down the cause and then figuring out how to correct the issues on a live system. That means you'll have to write code to clean up ALL of the bad data and then fix the problem so it doesn't happen again.

Or you could just use your database and save yourself all that trouble. The schema_plus gem helps you do that. While this gem is not only for PostgreSQL (it supports MySQL and SQLite3 too), it does include a lot of additional features if you're using PostgreSQL. On all of them you get extra migration functionality that makes it faster and easier to create indexes, use foreign keys and control their behavior, drop tables conditionally and cascade dependencies, create views (from SQL or a standard ActiveRecord query), and drop views. You can also instruct ActiveRecord to explicitly use table defaults for field values.

If you're using PostgreSQL, this gem also lets you easily create conditional indexes, build indexes from expressions, specify the indexing method (hash, gin, gist, etc.), create case-insensitive indexes, and control index sort orders. You can also specify :if_exists when removing an index. You can set column defaults to constant values or to a database function (such as NOW()), and even create, alter, and drop enums.
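A rough migration sketch using a few of those helpers; the option names follow the schema_plus README of the time, and the table is hypothetical:

```ruby
class CreateComments < ActiveRecord::Migration
  def change
    create_table :comments do |t|
      # Foreign key with cascading delete, declared right on the column
      t.integer  :post_id, foreign_key: { on_delete: :cascade }
      t.string   :email
      # Column default driven by a database function
      t.datetime :created_at, default: { expr: 'NOW()' }
    end

    # Conditional (partial), case-insensitive unique index
    add_index :comments, :email, unique: true, case_sensitive: false,
              conditions: 'email IS NOT NULL'
  end
end
```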

The same people who brought you this wonderful gem have also released two more gems that further extend its functionality: schema_associations and schema_validations.

schema_associations

If you're a big fan of DRY (Don't Repeat Yourself), then you've probably been frustrated that after taking the time to define all of your tables and relations according to ActiveRecord naming conventions, you still had to go into each model and add all of the relationships (in both directions). Schema Associations fixes that completely. Correctly named relationships are all created automatically when the application starts, so your database schema is the sole keeper of your data relationships. You can, of course, create and modify any of these as needed, even with additional shorthand, and you can turn the behavior on or off in specific cases. It's very flexible.
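For example, given a comments table with a post_id column referencing posts, both models pick up their associations on their own (a sketch of the gem's behavior):

```ruby
# No association declarations needed; schema_associations derives them
# from the foreign keys in the schema at startup.
class Post < ActiveRecord::Base; end     # gains has_many :comments
class Comment < ActiveRecord::Base; end  # gains belongs_to :post

Post.first.comments.count  # works without an explicit has_many
```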

schema_validations

In the same vein as associations...why should you have to take the time to create basic validation rules within ActiveRecord when your database already knows all of them, right there in the schema? This gem creates basic validations based on your database schema, such as maximum field length and null allowance. It's also fully customizable.
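As a sketch, if your schema defines users.email as VARCHAR(100) NOT NULL with a unique index, the gem effectively behaves as if you had written:

```ruby
# schema_validations leaves the model empty; at load time it acts as if
# this illustrative equivalent had been declared by hand:
class User < ActiveRecord::Base
  validates :email, presence: true,
                    length: { maximum: 100 },
                    uniqueness: true
end
```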

postgres_ext

PostgreSQL has a lot of custom datatypes. Rails has added support for many of them since the first iteration of this gem, but the latest version adds even more. For the PostgreSQL array datatype, you get methods for querying overlap, contains, any, or all. PostgreSQL also has INET/CIDR datatypes for handling network addresses like IP addresses and netmasks. Now you gain the ability to run containment operations against those types, such as checking to see if an IP address is contained within 192.168.1.1/24.
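A few query sketches; the method names follow the postgres_ext README of the time, and the models and columns are hypothetical:

```ruby
# Array query helpers added to where:
Article.where.overlap(tags: ['rails', 'postgres'])  # && operator
Article.where.contains(tags: ['rails'])             # @> operator
Article.where.any(tags: 'rails')                    # 'rails' = ANY(tags)

# INET/CIDR containment: find users whose IP falls within a subnet
User.where.contained_within(ip: '192.168.1.1/24')   # << operator
```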

It also adds ActiveRecord support for Common Table Expressions (CTEs), which let you define a named result set, much like a temporary table, for use within a single query. You get the ability to use them in queries as well as to create temporary models that can have further commands run against them.
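A sketch of what that looks like, again per the postgres_ext README of the time, with hypothetical models:

```ruby
# WITH my_games AS (SELECT * FROM games WHERE user_id = 1) SELECT ...
Score.with(my_games: Game.where(user_id: 1))
     .joins('INNER JOIN my_games ON my_games.id = scores.game_id')

# Or treat a CTE as a temporary model and query against it:
Game.from_cte('high_scores', Game.where('score > 1000')).where(round: 2)
```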

squirm_rails

One thing that ActiveRecord has never gotten around to supporting in any database is stored procedures. Stored procedures are just functions in your database that do things with your data, often involving multipart operations. A lot of people shy away from writing that type of logic in the database simply because it used to be a pain to do so. PostgreSQL, however, has changed this significantly by allowing database functions to be written in Python, Perl, Java, JavaScript, R, and many other languages. It's also the only place where you can easily combine logic from multiple languages into a single operation. Combine that with additional data structures, asynchronous operations, and the ability to notify listening external processes of events...and stored procedures get quite a bit more powerful. Previously, the main thing you gained was the ability to centralize data logic shared between multiple applications (which was still valuable).

Despite all of those capabilities, using them was still very difficult. The Squirm::Rails gem has assembled an extremely well-thought-out approach to adding these capabilities to Rails. It adds a procedure class method to ActiveRecord that lets you define method wrappers around procedures. Wrappers are also added as instance methods that automatically call the procedure with properly named arguments taken from the instance.

Rather than storing procedure code in migrations, Squirm adds a db/functions.sql file to your project that is run with db:schema:load or directly with db:functions:load, so your database functions can be managed as project code just like the rest of your application, as long as they're defined with CREATE OR REPLACE.
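A sketch of how the pieces fit together; the function name and arguments are hypothetical, and the exact wrapper call signature may differ slightly from what's shown, so check the gem's README:

```ruby
# db/functions.sql (illustrative):
#   CREATE OR REPLACE FUNCTION promote_employee(id integer, new_title text)
#   RETURNS void AS $$
#     UPDATE employees SET title = new_title WHERE employees.id = promote_employee.id;
#   $$ LANGUAGE SQL;

class Employee < ActiveRecord::Base
  procedure :promote_employee   # defines class- and instance-level wrappers
end

Employee.promote_employee(id: 42, new_title: 'Manager')
Employee.find(42).promote_employee(new_title: 'Manager')  # id from the instance
```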

postgresql_cursor

With a very large dataset, you may need to do intensive background processing on a large result set. ActiveRecord already provides some constructs for doing this with find_in_batches / find_each, but there are drawbacks to these approaches. Specifically, a query that takes a very long time to run will be re-run for each batch being processed, which significantly limits what you can do with it. Cursors allow you to run a complex query once and then iterate over the large result set in your Rails code while the database keeps track of where you are in it. You don't have to load 100,000 rows into your Ruby code and worry about the RAM impact, and likewise you don't have to worry about running a 30-minute query multiple times. This gem exposes that functionality via the each_instance (ActiveRecord instances) and each_row (hashes only, approximately 4 times faster) methods.
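A quick sketch; the model is hypothetical and the block_size option (rows fetched per round trip) is my reading of the gem's docs, so verify the name against its README:

```ruby
# Stream a large result set through a server-side cursor:
Product.where(active: true).each_instance(block_size: 500) do |product|
  product.recalculate_metrics!   # hypothetical per-row processing
end

# Faster variant that yields plain hashes instead of model instances:
Product.where(active: true).each_row { |row| puts row['name'] }
```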

partitioned

Another one for the large-dataset books is table partitioning. PostgreSQL table partitions basically let you create child tables based on constraints from parent tables (because you can do that in PostgreSQL), shrinking the dataset size and the associated index sizes. The partitioned gem allows you to define, create, read, and update partition rules in your models, including bulk inserts and updates.
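A very rough outline based on the gem's README; treat the method names here as a starting point rather than a drop-in example:

```ruby
# Partition rows into child tables by their created_at timestamp
class Event < Partitioned::ByCreatedAt
end

Event.create_infrastructure  # sets up the parent table plumbing
# Create child tables covering a date range:
dates = Event.partition_generate_range(Date.parse('2014-01-01'),
                                       Date.parse('2014-12-31'))
Event.create_new_partition_tables(dates)

Event.create(name: 'signup')  # routed to the appropriate child table
```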

activerecord-postgis-adapter

There are a lot of options for working with geospatial data in PostgreSQL. You can use straight SQL via Geokit, which is actually fairly quick thanks to multi-index queries. There are also some basic geometric datatypes which you can create and index via extensions like earthdistance.

Nothing (in any open source database) really even comes close to PostGIS in terms of the capabilities it provides for working with geographic data. PostGIS can be a bit daunting because it has so much crammed into it: 3rd party tools, the ability to slurp down public mapping datasets, raster image generation, integration with mapping systems, and I'm barely scratching the surface. If you just want to optimally store and query geographic data, the activerecord-postgis-adapter will get you up and running in no time. It will help you upgrade an existing PostgreSQL database, add spatial migration capabilities to define and index data based on geographic and geometric datatypes, recognize and cast those data types properly, and let you use spatial data in queries. It's the simplest way to scratch the surface of PostGIS for your geospatial query needs. And remember, you can query spatial data and text search data in the same query efficiently thanks to automatic multi-index queries.
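A sketch to give you the flavor; the spatial column helpers follow the adapter's README of the time, and the table and coordinates are hypothetical:

```ruby
# database.yml: adapter: postgis (instead of postgresql)
class AddLonlatToPlaces < ActiveRecord::Migration
  def change
    add_column :places, :lonlat, :point, geographic: true
    add_index  :places, :lonlat, using: :gist   # spatial index
  end
end

# Find places within 1km of a point (PostGIS ST_DWithin on geography)
Place.where('ST_DWithin(lonlat, ST_GeographyFromText(?), 1000)',
            'POINT(-82.39 34.85)')
```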

PostgreSQL Queueing

There are a lot of different queueing systems out there for Rails. The three most well known are DelayedJob, Resque, and Sidekiq. DelayedJob will let you store your jobs in your application database, while the other two use Redis for performance. The upcoming ActiveJob in Rails 4.2 will roll these and several others up under a fairly unified API. DelayedJob suffers from database abstraction, which prevents it from utilizing many features of PostgreSQL for greater performance and efficiency. Luckily, there are two options out there that address this problem.

queue_classic

The guys from Heroku wrote queue_classic to ensure stability, support a huge number of concurrent workers, and make sure that no job is ever checked out by more than one worker. It uses a forking approach to task execution, similar to the Unicorn web server, which ensures that a job crash won't negatively affect other running jobs. It uses row locking to SELECT, LOCK, and UPDATE the next job in the list so that a job can't be selected by another worker while one is in the process of locking it. It also uses PostgreSQL's LISTEN/NOTIFY functionality to let hundreds of workers wait for new jobs without having to constantly poll for them; the database tells the waiting workers when they have something new to process. It additionally makes use of the PostgreSQL JSON datatype to store job data.
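A minimal sketch of the producer and worker sides; Emails.welcome is a hypothetical job method:

```ruby
# Producer: QC.enqueue takes a "Class.method" string plus arguments,
# which are serialized to JSON in the jobs table.
QC.enqueue('Emails.welcome', user.id)

# Worker side (run with `rake qc:work`): blocks on LISTEN/NOTIFY until
# the database announces new work, instead of polling.
module Emails
  def self.welcome(user_id)
    # ...deliver the welcome email...
  end
end
```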

que

A relatively new player in the queueing space for Rails is Que. The SELECT, LOCK, and UPDATE process is the bottleneck of queue_classic. PostgreSQL has something called advisory locks, which basically means an application can define its own locking rules and PostgreSQL will manage them internally. This is tremendously more efficient than using a locked_at column to flag jobs in progress, and if for some reason a worker dies mid-job, the lock is released so the job can be tried again. Que also uses threading to achieve greater concurrency in a single worker. The two combined allow Que to achieve over 20x the job throughput (10,000 vs 350 jobs/second) of DelayedJob or queue_classic. It's important to keep in mind that this type of concurrency is most useful for tasks constrained by I/O rather than the processor alone. Que is one to keep an eye on though, as it appears to be the Sidekiq to queue_classic's Resque.
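A minimal Que job sketch; the job class and payment logic are hypothetical:

```ruby
# Jobs subclass Que::Job and define #run; an advisory lock marks the
# job in-progress while it works.
class ChargeCreditCard < Que::Job
  def run(card_id, amount)
    ActiveRecord::Base.transaction do
      # ...charge the card and record the payment...
      destroy  # deletes the job row atomically with the work above
    end
  end
end

ChargeCreditCard.enqueue(42, 100_00)  # card id, amount in cents
```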

Order Query

Pagination is something many of us take for granted, but the standard LIMIT + OFFSET approach can seriously degrade performance the farther back you go in a large dataset. Here is an extremely detailed explanation of why, including some great slides, as well as a better way of doing things. I also found it very interesting to see the features required to implement this solution and why PostgreSQL is the only major database that actually lets you do so (vs MySQL, SQLite, SQL Server, and Oracle). The summarized version: with OFFSET you still have to sort every record that matched the WHERE clause before you can apply your limit and offset, so the more efficient way is to narrow the query so that you're only sorting the records that will actually come back in your result set. You can do this if you have some information about the previous page (and again, I highly encourage reading that link/slides for a better understanding). The Order Query gem implements this approach in Rails for you, solving a complicated performance problem.
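Here's a sketch of the gem's API, closely following its README; the columns are illustrative:

```ruby
class Post < ActiveRecord::Base
  include OrderQuery
  # Defines the total ordering used for keyset-style pagination
  order_query :order_home,
              [:pinned, [true, false]],
              [:published_at, :desc],
              [:id, :desc]
end

Post.order_home.limit(20)   # first page, fully ordered
point = post.order_home     # wrap a known record (e.g. the last one shown)
point.next                  # next record via WHERE conditions, no OFFSET
```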

So in closing...

Have I mentioned how wonderful Rails and PostgreSQL are together?