Organizing Background Worker Queues

DevOps | queue | redis | sidekiq | scaling | resque | - April 23, 2015 // Barry

At work earlier today I ran across an issue where one of our application queues got backed up, and it got me thinking about how queues are organized in general. The TLDR answer: use urgency and intensity.

If you've ever had to scale an application with a lot of traffic, you'll find that one of your first tasks is trimming web requests down to exactly what the user needs to get a response, while offloading everything else to background workers via some type of queueing system.

There are many options out there for such systems, but the common bit of functionality that matters for the sake of this article is naming the queues. Inevitably, as you put more and more things in your queue, you stop dumping them all into the same pipe and start naming them so that you can handle them differently.

Sometimes your workers will all listen to every queue, with a level of priority assigned to each. Other times you'll have workers that ONLY listen to specific queues that need dedicated resources allocated to them.

This is easy to identify when you know everything that's being backgrounded. Unfortunately, when your application is being iterated on, this tends to be something that just sort of evolves over time. With that in mind, here's a simple way to name your background queues from the very beginning so that you'll be ready to deal with resource constraints later.


Urgency

Urgency is basically just another way of saying priority, but in practice it should mean "user urgency". Will users notice and care if this takes a little while to run? Most things that can be backgrounded at all are not urgent; otherwise you would have handled them immediately. Sending an email is a good example. If the user doesn't get the email within a few minutes, it's not a big deal. If they don't get it after a few hours, it's a problem.


Intensity

Intensity is a measure of resource usage, NOT speed. A background job that takes a long time to complete because it contacts 3rd party systems is slow, but its effect on your server is negligible. A background job that runs reporting queries on your database, could lock tables, or involves a lot of processing like video...that is intense. The distinction is important because your application can handle thousands of slow but non-intensive jobs running at the same time. It might only be able to handle one intensive job at a time.

Basic Naming Scheme

Clearly, every application is different, and the bigger things get, the more specific queues will become in their focus. Here's the structure I tend to use for organizing queues; it handles growth pretty smoothly.

  1. urgent-light
  2. normal-light
  3. eventually-light
  4. urgent-medium
  5. normal-medium
  6. eventually-medium
  7. urgent-heavy
  8. normal-heavy
  9. eventually-heavy
Hopefully these are self-explanatory, but I'll break it down.
  • URGENT - The user needs these ASAP or they will think something is wrong.
  • NORMAL - The user needs this relatively soon, but the timing can vary by close to an hour without creating a problem.
  • EVENTUALLY - Sometime today.
  • LIGHT - Almost non-existent server impact; could run with virtually unlimited concurrency without impacting users.
  • MEDIUM - Involves a minimal amount of server resources in isolation, but as a group can have a more noticeable impact.
  • HEAVY - So resource intensive that if your IT department had their way, only one would run at a time.
The combination of these classifications will make scaling and adjusting resources easy in the long run. When you're just starting with background queues, you'll probably have one worker listening on all of them, so these classifications won't mean much. But organizing your jobs this way means that when you do need to allocate resources, you'll have a much simpler time.
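The two axes compose mechanically, so it's worth guarding against typos that would silently route jobs to a tenth queue nobody listens to. Here's a minimal Ruby sketch; the QueueNaming module and its method name are mine, not from any library:

```ruby
# Hypothetical helper: compose a queue name from the two axes.
module QueueNaming
  URGENCIES   = %w[urgent normal eventually].freeze
  INTENSITIES = %w[light medium heavy].freeze

  # Returns e.g. "urgent-light"; raises on unknown values so a typo
  # can't silently create a queue no worker is listening to.
  def self.queue_for(urgency, intensity)
    urgency   = urgency.to_s
    intensity = intensity.to_s
    unless URGENCIES.include?(urgency) && INTENSITIES.include?(intensity)
      raise ArgumentError, "unknown queue axis: #{urgency}-#{intensity}"
    end
    "#{urgency}-#{intensity}"
  end
end

QueueNaming.queue_for(:urgent, :light)  # => "urgent-light"
```

With something like this in place, every job class declares its urgency and intensity instead of hardcoding queue strings.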
For the light jobs, you provide a threaded worker with very high concurrency, executing in order of priority. This is everything from API calls to sending emails. These queues should always be at or near empty. Most emails would fall under something like "normal-light", but emails meant to get new customers on board should probably be bumped up to urgent for engagement reasons. Emails sent by a daily job, on the other hand, would fall under eventually, since the user isn't actively waiting on them.
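If you happen to be using Sidekiq, one way to realize this is a dedicated process for the light pool with high concurrency and weighted queue polling (the file name and weights below are illustrative, not prescriptive):

```yaml
# sidekiq-light.yml -- hypothetical config for the light worker pool.
# Weighted queues mean urgent-light is checked most often, but nothing starves.
:concurrency: 50
:queues:
  - [urgent-light, 6]
  - [normal-light, 3]
  - [eventually-light, 1]
```

You'd boot it with something like `sidekiq -C sidekiq-light.yml`.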
For the medium jobs, you can still use threaded workers, but you'll want to scale back the concurrency so that you can control the impact on your live traffic. Think counter cache updates, live analytics, or other activities that might involve a lot of database writes with index updates, which could have disk I/O implications if too many of them run at once.
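The same idea with the dial turned down covers the medium pool; again, a sketch assuming Sidekiq-style YAML config:

```yaml
# sidekiq-medium.yml -- hypothetical config; same queue weighting, but far
# fewer threads so database-heavy jobs can't swamp live traffic.
:concurrency: 5
:queues:
  - [urgent-medium, 6]
  - [normal-medium, 3]
  - [eventually-medium, 1]
```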
For the heavy jobs, urgency is the critical measure. If the urgency is minimal, you just run them sequentially to minimize the impact on everything else, or even schedule their execution for known low-traffic hours. For higher degrees of urgency you may need to allocate dedicated resources to help these process faster without impacting the rest of the system; a database read replica for generating reports, for example.
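For the low-urgency case, a single-threaded pool gives you sequential execution for free. If Sidekiq is your queue library, listing queues without weights drains them in strict order (again, a hedged sketch):

```yaml
# sidekiq-heavy.yml -- hypothetical config. Concurrency of 1 means heavy
# jobs run one at a time; unweighted queues are drained in strict order,
# so urgent-heavy always wins.
:concurrency: 1
:queues:
  - urgent-heavy
  - normal-heavy
  - eventually-heavy
```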
As your application grows, you'll eventually reach a point where certain jobs need their own explicit queues, with priority measured for even the small details. Until you hit that point, the structure outlined above should make your life a lot easier.