Brightball

Why should you learn PostgreSQL?

PostgreSQL | PostGIS | July 21, 2014 // Barry

Nearly a year ago I put together an hour-long presentation on PostgreSQL to give an overview of all of the benefits it provides over other options in the database space. In hindsight, that wasn't nearly enough time, because it's capable of replacing almost your entire application stack outside of the web server. In any case, here is an attempt to summarize all of the amazing functionality that you're cheating yourself out of by not choosing PostgreSQL.

Multi-Index Queries

This is big. Huge, really, because it's the core piece of functionality that enables much of what follows. PostgreSQL does multi-index queries: if you are running a query with conditions on several different columns that each have their own index, it will combine all of them (via bitmap index scans). That removes the need to create an index spanning three different columns just because you have a query that checks all three. Just create one index per column and PostgreSQL will figure out the rest. The only remaining need for a multi-column index is to enforce uniqueness.
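
Here's a minimal sketch of what that looks like, assuming a hypothetical orders table with customer_id and status columns:

    -- Two independent single-column indexes; no composite index needed
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    CREATE INDEX idx_orders_status   ON orders (status);

    -- The planner can combine both indexes (a bitmap AND) for this query
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';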

Custom Data Types like JSON, XML, hstore

There are a lot of datatypes built into PostgreSQL, including JSON, XML, hstore, monetary, arrays, intervals, geometric types, UUID, ranges and more. Each datatype comes with its own set of functionality and rules. For example, the JSON and hstore datatypes allow querying against individual pieces of the data stored in them, and XML allows XPath queries. There are two index types, GIN (Generalized Inverted Index) and GiST (Generalized Search Tree), built specifically to work with these datatypes, allowing the datatype itself to define its own rules for indexing. You can define your own datatypes as well. Because of multi-index queries, you can put conditions on a custom datatype in a WHERE clause right alongside the rest of your conditions and PostgreSQL will use all of the indexes to find that data fast. The JSON and hstore datatypes alone eliminate the need to use a NoSQL database just to work with schemaless data. Simply define the consistent parts of your data and drop the schemaless aspects into a JSON column.
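
As a rough sketch, assuming a hypothetical events table that mixes a fixed, indexed column with a schemaless JSON payload:

    CREATE TABLE events (
        id         serial PRIMARY KEY,
        account_id integer NOT NULL,
        payload    json
    );
    CREATE INDEX idx_events_account ON events (account_id);

    -- Pull a field out of the JSON right in the WHERE clause,
    -- alongside an ordinary indexed condition
    SELECT *
    FROM events
    WHERE account_id = 7
      AND payload->>'type' = 'signup';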

Automatic Data Compression with TOAST

The Oversized Attribute Storage Technique (TOAST) is automatically applied to large values of types like TEXT or XML wherever possible. Effectively, it compresses the data. To try it out, I copied in a 2.2MB XML document and verified that it was stored as 81KB.
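
If you want to check it yourself, one quick way (assuming a hypothetical documents table with a large TEXT column called body) is to compare the raw length of a value with the bytes TOAST actually stores:

    SELECT octet_length(body)  AS raw_bytes,     -- logical size of the value
           pg_column_size(body) AS stored_bytes  -- on-disk size after TOAST compression
    FROM documents
    WHERE id = 1;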

Full-Text Search

The full-text search capabilities in PostgreSQL are truly excellent, even when compared with standalone search engines like Solr, Elasticsearch and Sphinx. You can load in numerous pieces of data and weight them by importance in your queries. There are many dictionaries available for filtering out stop words or automatically handling transformations like plurality in different languages. You can define multiple search vectors on the same data set with different requirements by simply adding more vector columns. There are no headaches around syncing data with outside systems or managing deltas (changes); you just add a TRIGGER that updates the vector whenever the data changes. Search conditions can be intermingled with other conditions thanks to multi-index queries, too. Additionally, you save the wasted time of querying a search engine only to get a result list and then query the database again anyway. There's very little remaining need for an outside search engine unless search is a bottlenecking feature of your application or you're doing real-time analysis with Elasticsearch.
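
A minimal sketch, assuming a hypothetical articles table with title, body and published columns:

    ALTER TABLE articles ADD COLUMN search_vector tsvector;
    CREATE INDEX idx_articles_search ON articles USING GIN (search_vector);

    -- The built-in trigger helper keeps the vector in sync on every write;
    -- a custom trigger function would let you weight title above body
    CREATE TRIGGER articles_search_update
        BEFORE INSERT OR UPDATE ON articles
        FOR EACH ROW EXECUTE PROCEDURE
        tsvector_update_trigger(search_vector, 'pg_catalog.english', title, body);

    -- Search conditions mix with ordinary indexed conditions
    SELECT id, title
    FROM articles
    WHERE search_vector @@ plainto_tsquery('english', 'multi index queries')
      AND published = true;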

Range Datatypes

Among the datatypes in PostgreSQL are several range types covering numeric and time formats. Range datatypes let you ask, at the database level, whether a value lies within a range, or whether two ranges overlap and where. You can also create an exclusion constraint backed by a GiST index that prevents two range values from overlapping each other, which is very useful for scheduling systems, as the example below shows.
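
For example, a hypothetical bookings table where the exclusion constraint rejects any two rows for the same room whose time ranges overlap:

    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for the = operator in the constraint

    CREATE TABLE bookings (
        room_id integer NOT NULL,
        during  tsrange NOT NULL,
        EXCLUDE USING GIST (room_id WITH =, during WITH &&)
    );

    INSERT INTO bookings VALUES (1, '[2014-07-21 09:00, 2014-07-21 10:00)');
    -- Overlaps the first booking for the same room, so it fails with a conflict error:
    INSERT INTO bookings VALUES (1, '[2014-07-21 09:30, 2014-07-21 10:30)');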

Asynchronous Database Logic

Asynchronous functionality is very useful in certain situations and PostgreSQL provides several ways to take advantage of it. Set an entire connection to use asynchronous commits, or use fine-grained control within functions. You can even build indexes concurrently to avoid locking large tables while a new index is added.
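
A quick example of each (the events table and index name here are hypothetical):

    -- Turn off synchronous commit for this session (or a single transaction);
    -- commits return before the WAL is flushed to disk
    SET synchronous_commit = off;

    -- Build an index without holding a long write lock on the table
    -- (must be run outside of a transaction block)
    CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);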

Database Code with JavaScript, Perl or Python

Stored procedures and functions in PostgreSQL can be written in PL/pgSQL, Tcl, Perl, Python, and V8 (JavaScript). So if you have logic that is better suited to running inside your database, you're not limited to the somewhat clunky and painful syntax that most databases require for stored procedures.
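
As a toy example, here's a small JavaScript function, assuming the plv8 extension is installed:

    CREATE EXTENSION IF NOT EXISTS plv8;

    CREATE OR REPLACE FUNCTION fizzbuzz(n integer) RETURNS text AS $$
        if (n % 15 === 0) return 'fizzbuzz';
        if (n % 3  === 0) return 'fizz';
        if (n % 5  === 0) return 'buzz';
        return String(n);
    $$ LANGUAGE plv8 IMMUTABLE;

    SELECT fizzbuzz(i) FROM generate_series(1, 15) AS i;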

Use Functions Everywhere

So remember those database functions from the last section? Well, as it turns out, functions are a lot more useful in PostgreSQL. You can use functions to define default values for table columns. You can use them within stored procedures and triggers. You can use them all over your queries too. You can even create an index on the result of a function, and it will be used whenever the same expression appears in a WHERE clause. That's major. It means you could create a unique index based on the result of a function, such as lowercased usernames. Or remember the XML datatype mentioned above? You can't index XML directly, but you can index the result of an XPath function.
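
The lowercased username example looks something like this (assuming a hypothetical users table):

    -- Unique, case-insensitive usernames enforced by the database itself
    CREATE UNIQUE INDEX idx_users_username_lower ON users (lower(username));

    -- The index is used because the WHERE clause repeats the same expression
    SELECT * FROM users WHERE lower(username) = lower('Barry');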

Control Your Primary Keys

PostgreSQL uses sequences to manage primary keys on a table. In most cases this is just an incrementing integer, but as a sequence it gives control back to you, the developer. You get to define the logic behind how those numbers are generated if you want. If you're pulling in data from a legacy database, you could import everything and then start the sequence where your legacy data left off. If you wanted to create a rolling table (similar to MongoDB's capped collections) for something like log data, you could set the sequence to cycle at a maximum value so it keeps reusing the same set of ids.
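
Both cases are a one-liner (the table and sequence names here are hypothetical):

    -- After a legacy import, pick up the sequence where the old ids left off
    SELECT setval('users_id_seq', (SELECT max(id) FROM users));

    -- A cycling sequence for a rolling log table: ids wrap back around at one million
    CREATE SEQUENCE log_id_seq MAXVALUE 1000000 CYCLE;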

LISTEN/NOTIFY with External Processes

You can let your external code listen for events that PostgreSQL will tell it about. This is the type of thing you previously had to handle with after-save hooks or constant polling, since database triggers didn't do much good outside of the database. That's no longer the case. You can have a job queue that doesn't need to poll for new jobs; it just connects and waits for the database to say one is there. Have a third-party system that you're trying to keep in sync with the database via an API, statsd or a standalone search engine? No more polling or expiry times on the data. As soon as it changes, a listening process can make the update. Using memcached to offload read traffic? Your cache management just got simpler, because it can change the instant your data does. And the best part is that none of these require you to go through an API written in a single language. You just connect to the database, make the change, and everything that depends on those changes will know about it.
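
A minimal sketch of the job queue case, assuming a hypothetical jobs table:

    CREATE OR REPLACE FUNCTION notify_new_job() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('jobs', NEW.id::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER jobs_notify
        AFTER INSERT ON jobs
        FOR EACH ROW EXECUTE PROCEDURE notify_new_job();

    -- Any worker that has run "LISTEN jobs;" on its connection is woken up
    -- with the new job's id as the payload, no polling required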

Customize LOCKING Behavior

We already talked about custom datatypes and indexes, but PostgreSQL even lets you define how to handle database locks. There's an innovative queueing system for Ruby called Que that uses PostgreSQL advisory locks to manage concurrent workers pulling jobs without conflicts. Initial benchmarks show this approach yielding roughly 20-30 times the throughput (about 10,000 jobs/sec vs about 350 jobs/sec) of delayed_job and queue_classic, the other two prominent database-backed queues, which depend on updating a column to lock a job.
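
The basic building block is a session-level advisory lock keyed by the job's id; roughly (this is a sketch, not Que's actual code):

    -- Claim job 42 only if no other session already holds its lock
    SELECT pg_try_advisory_lock(42);   -- true if this session got the lock

    -- ...process the job, delete the row...

    SELECT pg_advisory_unlock(42);     -- released automatically if the session drops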

Database Schemas

If you've been working in a database without schemas, think of them as sub-databases. A schema contains its own tables, can have its own permissions, and a user can join across any schemas they have access to within a single query. This comes in very handy for the multi-tenant architecture common in SaaS setups where you assign a subdomain to each account. You can compartmentalize each account's tables within a schema, which keeps the indexes on all of those tables smaller and avoids the need to filter every query by account. You can also set up a global schema for shared data to avoid duplication.
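
A small sketch of that multi-tenant layout (the schema and table names are hypothetical):

    CREATE SCHEMA shared;
    CREATE SCHEMA account_1;

    CREATE TABLE shared.plans     (id serial PRIMARY KEY, name text);
    CREATE TABLE account_1.orders (id serial PRIMARY KEY, total numeric);

    -- Point a connection at one tenant; unqualified names resolve to its schema first
    SET search_path TO account_1, shared;
    SELECT * FROM orders;   -- reads account_1.orders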

Foreign Data Wrappers for Outside Systems

PostgreSQL can connect to other systems. Not just other PostgreSQL databases, but other systems entirely: it can connect to Redis, MySQL or even Memcached. You can run a query in PostgreSQL that pulls some data from outside sources and some from within PostgreSQL. You can even create a view that transforms some of this data so that future queries against it are more convenient. With Memcached, you can set a trigger in PostgreSQL that directly updates the related cache data without any outside process needing to be involved!
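
Here's what that looks like with postgres_fdw, which ships with PostgreSQL (the host, credentials and table names are hypothetical); wrappers for Redis, MySQL and Memcached follow the same pattern:

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER legacy_db FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'legacy.example.com', dbname 'legacy');

    CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_db
        OPTIONS (user 'readonly', password 'secret');

    CREATE FOREIGN TABLE legacy_customers (id integer, name text)
        SERVER legacy_db OPTIONS (table_name 'customers');

    -- Join remote data with a local table in one query
    SELECT c.name, o.total
    FROM legacy_customers c
    JOIN orders o ON o.customer_id = c.id;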

Tons of Extensions

There are extensions for everything: additional datatypes, dictionaries for search, data connectors. That Memcached thing we just described? There's an extension for that. You can manipulate images inside the database. There are plugins for columnar datastores used for analytics and even full vendor extensions like Infobright. There's even complex scale-out functionality that rivals MongoDB for WRITES (not a typo) using Postgres-XC.

PostGIS for Spatial and Geographic Objects

One of the most comprehensive extensions out there is PostGIS, which adds data structures and index types for working with geospatial data, as well as image manipulation logic to generate map images from that data. The features are extensive.
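
A tiny taste, assuming a hypothetical places table of points:

    CREATE EXTENSION IF NOT EXISTS postgis;

    CREATE TABLE places (
        id   serial PRIMARY KEY,
        name text,
        geom geography(Point, 4326)
    );
    CREATE INDEX idx_places_geom ON places USING GIST (geom);

    -- Everything within 5km of a point, answered with the spatial index
    SELECT name
    FROM places
    WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(-82.40, 34.85), 4326)::geography, 5000);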

So what does all of this add up to?

When you realize that your database is capable of all of these things, it fundamentally changes your view of application design. The only thing capable of maintaining your data integrity is your database; it's not your application code or your framework code. When you rely on application code to manage data integrity, all you do is introduce race conditions and an entire additional API layer that you must go through whenever you need to reach that data from another language.

The database is the backbone of every web application, and when it has limitations, that is the point at which your architecture gets more complicated. Systems get added just for queueing, search, asynchronous writes, handling unstructured data, etc., when the reality is that you don't really need any of them outside of very specific use cases. For those niche scenarios, use those systems for that specific part of the data. In MOST applications that comes down to one table or datatype with very heavy write traffic.

For everything else, use PostgreSQL and make your life easier.