2013-04-22

I need a dedicated message queue... or do I?

I'm starting the next step of a project at work, and that's presented me with a problem.  I need a job queue.  That in itself is not too problematic, but unfortunately I've been paralyzed by indecision over the last few days.  I've been reading about these new-fangled message queues for a while now, so in an effort to determine the best technology for the task I started surveying the information I could find about message queues and how they apply to my problem.  Unfortunately, there's a little voice in the back of my head saying "This isn't hard, just make a table and write some helpers for your ORM, you've done this before...", and I'm having trouble faulting that logic.

The problem is, while I really want to use the best technology for the task here, as the only technology based employee at a small (< 10 employees) company, I don't really have the time to spend days or weeks testing out implementations.  I need to bring my own experience (and where that fails the wisdom of the crowd) to bear.  As such, I'm going to try to present this problem in a general way, so it's useful to a wider audience, and in exchange hopefully I'll get some useful information back in comments and discussion.

With that out of the way let's start with a definition of my needs.  This is a small company, but they are moving heavily into analysis of a growing dataset to help make better decisions, streamline employee tasks and reduce labor.  As such, there's a need for a general job queue that can:

  • Track jobs that have no assigned employee
  • Allow a job to be assigned/unassigned to/from an employee
  • Allow an employee to mark a job as finished, or delay until a later date
  • Jobs must be able to be shifted to other states such as pending review
  • For some subset of jobs that are currently performed by an employee, jobs will eventually be mostly, and then fully performed automatically

That's a fairly open set of requirements.  With those, I think I have my pick of any number of message queue back-ends (brokers) with fewer, but still numerous options for a specialized job queue on top.  Then again, the draw of a RDBMS/SQL solution is strong here.  I can see lots of real tangible benefits to using a native SQL solution, but at the same time, I can see the benefits (immediate and future) of dedicated queue software.  So far my thought process has broken down like so for the benefits of staying within the RDBMS and making my own system:

  • Tighter coupling with data - better/easier statistical querying
  • Familiar - less likely to find surprising behavior
  • Known performance & implementation (I already know how I would implement this)
  • Reliability and persistence are well known, and non-issues
  • Keeps system complexity down by not introducing other failure points

In looking into this, came across a good post on implementing queues in SQL, https://blog.engineyard.com/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you.  That all rings true, but other than some interesting ideas on how to hack up a performant polling solution, it's not really anything I didn't already know.  This isn't the first time I've ended up with a queue in in an RDBMS, I know a lot (but I doubt all!) of the pitfalls to avoid.  In truth, I think if I had a bit more support from the database in certain areas, such as polling/notify support or built in key-value stores, I wouldn't be as conflicted.  Alas, I'm using MySQL (Percona's variant), and while I generally like it, these are lacking areas.  Where this gets interesting is when I start comparing against the inherent and implicit strengths of implementing a queue on top of a RDBMS against the strengths of using a dedicated queuing system:

  • Optimized for queue performance
  • Subscription/notify system (not all, but this is useful to me)
  • Less likely to accidentally code data dependency in queue implementation, complicating future use
  • More opportunity to use fast queues as a general purpose structure for IPC and scalability
  • Ability to use and learn from a new technology

To be fair, I think the last point for an RDBMS implementation and the last two points for a dedicated queue solution don't really apply in this case.  I can see a use for a dedicated queue in the future in other parts of the system, so I imagine I'll have one sooner or later anyways, but I do think they are useful points in general, especially since I think they outline my current problem fairly well.  What trade-off do I take, something better suited to the data it's representing and the usages cases it needs, or something that keeps complexity low and allows easier data correlation?

I think what I need here is more information from people that have chosen one route or the other, and advantages and disadvantages they think that brought.  There may be techniques that mitigate some of the downsides I perceive for either solution that I'm unaware of.

Update: I decided to implement in MySQL as an experiment to learn the needs and requirements of message queue's intimately.  You can see the first part here.

No comments:

Post a Comment