Pirates of AMPS: Dead Man's Queue

  Jul 19, 2017   |      Ray Imber

queues error handling

burning letter with skull and cross-bones

Queues are the bread and butter of a good messaging system. AMPS provides a powerful queue system that is fast, resilient, and flexible, taking work from publishers and feeding them to consumers as fast as your network will allow.

The real world, unfortunately, has time constraints. AMPS Queues are extremely performant and are often used in very time sensitive use cases such as the processing of market data. It is extremely important that consumers of these time sensitive queues don’t receive stale data or poisoned data that can cause the system to get stuck. I will talk about what I mean by poisoned data later, but lets take the basic case of stale data first. AMPS solves this problem by allowing an expiration to be set on queue messages.

If a message outlives its expiration it is forced to walk the plank and is removed before another client can receive it. This is great! It assures that consumers only have the latest messages, and everything is smooth sailing on the AMPS seas.

Normal Queue and Dead Letter Flow

While keeping only the latest messages in the queue is important for time sensitive applications, those stale messages can still be useful! Statistics on those stale messages can contain important clues about possible problems and improvements to consumer clients. In use cases where consumer speed is of the upmost importance, these statistics are a map to hidden treasure for your business. If only there was a way to save those expired messages…

The Dead Letter Queue

A secondary queue that tracks expired messages from one or more primary work queues is known as a dead letter queue. The powerful automated actions platform of AMPS allows you to not only build a dead letter queue, but using AMPS views, you can create conflated statistics of your dead letter queue. You can use this queue and view setup to gain unpreceded insight into your AMPS queue performance!

This whole system hinges on the on-sow-expire-message action. This powerful action acts like a life boat for those stale messages walking the plank.

You might be saying to yourself, “on-sow-expire-message? But I thought we were talking about queues not sows?” Yes! But this is part of the power of the AMPS platform. Queues are implemented as a view over an underlying topic or set of topics that are backed by a transaction log. Among other things, this allows AMPS actions to receive events from Queues just like they were a SOW. It requires no extra effort on your part to get this functionality.

The AMPS Actions Platform

Just a quick aside. The AMPS actions platform is extremely flexible and modular. This dead letter queue is just one particular (rather simple) example of an AMPS action design pattern. The full functionality of AMPS actions is available for you to use with the expired messages. You even have full access to the message data through amps-action-do-extract-values. You could trigger a log rotation, a SOW compaction, write to a log, turn on your smart toaster, etc… See the AMPS User Guide chapter on Actions for all the gory details.

AMPS Views: the secret to unlocking the treasures of the Dead Letter Queue

Other platforms offer dead letter queues. It is a common design pattern. The innovative differentiator of our queue is that it can show more than just queue depth. The AMPS dead letter queue is a full fledged queue, with complete access to its data. You can perform arbitrary queries and views over the data in the queue.

Here is an example of the simple case: a view that shows the queue depth, last received stale message, and the time that message was received.

<View>
  <Name>DeadLetterStats</Name>
  <UnderlyingTopic>DeadLetters</UnderlyingTopic>
  <MessageType>json</MessageType>
  <Projection>
    <Field>COUNT(/data/id) AS /totalDeadLetters</Field>
    <Field>/data/id AS /lastProcessedId</Field>
    <Field>/received AS /lastReceived</Field>
  </Projection>
  <Grouping>
    <Field>/totalDeadLetters</Field>
  </Grouping>
</View>

Let’s see how we can spice it up! We have full access to the message data, so how about using GROUPBY to nicely group our dead messages:

<View>
  <Name>DeadLetterStatsByRegion</Name>
  <UnderlyingTopic>DeadLetters</UnderlyingTopic>
  <MessageType>json</MessageType>
  <Projection>
    <Field>/data/region AS /region</Field>
    <Field>SUM(/data/rumPrice * /data/rumQty) AS /totalLostRumCost</Field>
  </Projection>
  <Grouping>
    <Field>/data/region</Field>
  </Grouping>
</View>

With this aggregated the view, the Royal Navy could very easily see the total value of Rum lost, grouped by each region.

Use-Case Ideas

Any kind of SQL style query you can think of can be performed over your dead letter queue. Here are some possible examples to get your gears turning:

  • Your queue receives messages from different regions. GROUPBY the region identifier to determine which region is losing the most messages from the queue.

  • In a market order system, you can aggregate the monetary value of all the dead order messages with SUM to give an aggregated value for the lost messages. (This can be used as a powerful incentive metric for your team)

  • If SUM is too simplistic, you can get the average value of /order_price * /quantity over all the orders in the dead letter queue.

  • If you use variable length messages in your system, you can compute the AVG and STDDEV_POP over the byte size of the dead messages to get an idea of the variance of the size of the problem messages.

A little bit of self promotion: check out my earlier blog post on AMPS actions for an in depth discussion of the kinds of processing you can do over the data in your dead letter queue.

Poison! a.k.a Forced Expiration in AMPS Queues

A good pirate always checks their rum!

While stale messages are an important consideration for queues, there are other cases that can necessitate the forced expiration of a message from a queue. This could be the extreme case of a message with corrupted data, or simply a message containing a request that could not be fulfilled by a consumer, such as being unable to commit the message to an external database for some reason. We refer to these as poisoned messages, and as of AMPS 5.2.1.0, we have set of features that allow you to safely handle them, making your AMPS queue’s even more robust!

Poisoned Letter Disrupting a Queue

Forced Expiration Features

Let’s break down the new features added to AMPS queues and explain why you will love them. First, there are two new options that have been added to the AMPS queue configuration:

How they add safety to your queues? - MaxDeliveries: an upper bound to the number of times AMPS may deliver a queue message before automatically expiring it.

  • MaxCancels: a limit to the number of times a subscriber may cancel a lease on a message before it is expired.

At first it might seem like these two limits are redundant, but they work in conjunction. MaxDeliveries is checked when a message is submitted to a consumer from AMPS. MaxCancels is checked when a message is returned to the queue.

This gives you the maximum flexibility to determine the behavior that best fits your use-case. For some use-cases it’s more important to check canceled messages before they hit the queue again, and for other use cases it may be fine that they re-enter the queue, but they should only be allow so many delivery attempts.

You can define both MaxDeliveries and MaxCancels on the same queue. Note: Delivery is counted each time a message is delivered, so a message that is delivered and then cancelled counts both a delivery and a cancellation. In this case, a message is initially sent out, which counts as a delivery. When the message is then canceled, it will increment its cancel count right before it hits the queue again. MaxDeliveries is checked when a message leaves, MaxCancels is checked when a message returns.

See the AMPS User Guide for more details about MaxDeliveries and MaxCancels

These new Queue limits are extremely valuable to keeping your queues operating safely and at peak performance, but that’s just one part of the solution! If things do go wrong and we get slow queues or poisoned messages, we need as much data as we can get to solve the problem!

We got you covered on that end too! The on-sow-expire-message action has been supercharged with a new context variable:

  • AMPS_REASON: A comma-delimited string indicating one or more reason(s) the message was expired. Here are the values that AMPS_REASON can contain:

  • time_limit: This is the standard Queue timeout condition.

  • max_cancels: The message exceeded the maxCancels limit.

  • max_deliveries: The message exceeded the maxDeliveries limit.

  • forced_expire: A consumer forced the message to expire from the queue immediately.

AMPS_REASON gives a detailed picture of why messages are being expired from your queue. And remember, you can use the full power of AMPS views and aggregation with AMPS_REASON inside your dead letter queue. Not only can you see which messages are dying, you can now see why they are dying, and calculate a full range of queries and statistics over that data!

forced_expire is particularly powerful because it gives you a feedback mechanism between your consumers and AMPS. If a consumer detects some kind of problem with a message, it can force_expire the message, giving you a discrete flag that you can immediately take action on. For a more in depth discussion of handling poisoned messages in your consumer code, check out this excellent blog from my colleague Dirk.

A message that hit max_cancels could just be the result of a down-stream network slowdown, but a message with force_expire was a client very explicitly telling you that it could not process that message. This kind of granularity brings a whole new level of power to your monitoring infrastructure, allowing you to respond to problems faster and keep your queues running longer.

Let’s view some sample snippets from a config that implements this:

1) The main queue definition:

<Queue>
    <Name>WorkToDo</Name>
    <MessageType>json</MessageType>
    <Semantics>at-least-once</Semantics>
    <UnderlyingTopic>Work</UnderlyingTopic>
    <Expiration>60s</Expiration>
    <MaxDeliveries>2</MaxDeliveries>
    <MaxCancels>2</MaxCancels>
   </Queue>

2) Let’s make a DeadLetters queue to store these poor dead messages:

<Queue>
    <Name>DeadLetters</Name>
    <MessageType>json</MessageType>
    <Semantics>at-least-once</Semantics>
   </Queue>

3) A view over the DeadLetters to give you those awesome metrics:

<View>
    <Name>DeadLettersByReason</Name>
    <UnderlyingTopic>DeadLetters</UnderlyingTopic>
    <MessageType>json</MessageType>
    <Projection>
      <Field>/reason AS /reason</Field>
      <Field>COUNT(/data/id) AS /numMessages</Field>
      <Field>/received AS /lastReceived</Field>
    </Projection>
    <Grouping>
      <Field>/reason</Field>
    </Grouping>
   </View>

4) The action definitions to tie it all together:

<Actions>
     <Action>
        <On>
            <Module>amps-action-on-sow-expire-message</Module>
            <Options>
                <Topic>WorkToDo</Topic>
                <MessageType>json</MessageType>
            </Options>
        </On>
        <Do>
            <Module>amps-action-do-publish-message</Module>
            <Options>
                <Topic>DeadLetters</Topic>
                <MessageType>json</MessageType>
                <Data>{ "data" : {{AMPS_DATA}}, "received" : "{{AMPS_DATETIME}}", "reason" : "{{AMPS_REASON}}" }</Data>
            </Options>
        </Do>
      </Action>
   </Actions>

See the AMPS User Guide for more details about AMPS_REASON

Queue Semantics

As another important aside, AMPS queues provide two different modes of delivery semantics. These two modes act very differently, and it’s important to understand how queue performance and message expiration is affected by these modes:

  • at-least-once semantics
    • Messages are leased to a consumer, which must acknowledge the message or the message will be automatically returned to the queue.
    • It is equivalent to a don-destructive get from a traditional queue.

at-least-once semantics are most susceptible to poisoned messages, because the bad messages can be automatically returned to the queue, continuing to cause processing bottlenecks. This semantic mode is why the forced expiration features described above were created. The advantage of at-least-once mode is reliability, and the force expiration AMPS_REASON feature allows more detailed analysis of why messages are failing.

  • at-most-once semantics
    • The simplest way to remember these semantics is: “fire and forget”. Messages that are sent to a consumer are immediately removed from the queue and are never returned.
    • It is equivalent to a destructive get from a traditional queue.

at-most-once semantics favors performance over reliability, and thus does not attempt to make any guarantees about whether a message was successfully processed once it has been sent. This makes it less susceptible to poisoned messages, but it also gives you less insight into why messages may have failed to be consumed properly.

In fact, this mode does not support the MaxDeliveries or MaxCancels options. at-most-once only supports timeout expiration. This makes intuitive sense since a message in an at-most-once queue will only ever be delivered once, and can never be canceled by a consumer.

The on-sow-expire-message action still works with at-most-once queues, and dead letter queues work the same way.

The important note is that there is only one condition in which a message will be sent to the dead letter queue: the message timed out.

Note: There are many more subtle differences in behavior between these two modes of operation. See the manual for more details

Sample Dead Letter Queue Config

Finally, here is a complete working sample configuration for a Dead Letter Queue implementation in AMPS 5.2.1.0:

<AMPSConfig>
    <!-- Name of the AMPS instance -->
    <Name>dead-letter-queue-server</Name>
    <ProcessName>ampServer</ProcessName>
    <ConfigIncludeCommentDefault>enabled</ConfigIncludeCommentDefault>

    <SOWStatsInterval>1s</SOWStatsInterval>

    <Admin>
        <FileName>./dead-letter-queue/stats.db</FileName>
        <InetAddr>localhost:8085</InetAddr>
        <Interval>10s</Interval>
        <SQLTransport>websocket-any</SQLTransport>
    </Admin>

    <MessageTypes>
        <MessageType>
            <Name>json</Name>
            <Module>json</Module>
            <AMPSVersionCompliance>5</AMPSVersionCompliance>
        </MessageType>
    </MessageTypes>

    <Transports>
        <Transport>
            <Name>nvfix-tcp</Name>
            <Type>tcp</Type>
            <InetAddr>19090</InetAddr>
            <MessageType>json</MessageType>
            <Protocol>amps</Protocol>
        </Transport>
        <Transport>
            <Name>websocket-any</Name>
            <Protocol>websocket</Protocol>
            <Type>tcp</Type>
            <InetAddr>9008</InetAddr>
        </Transport>
    </Transports>

    <Logging>
        <Target>
            <Protocol>file</Protocol>
            <Level>trace</Level>
            <FileName>./dead-letter-queue/server.log</FileName>
        </Target>
    </Logging>

    <TransactionLog>
      <JournalDirectory>./dead-letter-queue/journals</JournalDirectory>
      <Topic>
        <Name>Work</Name>
        <MessageType>json</MessageType>
      </Topic>
      <Topic>
        <Name>DeadLetters</Name>
        <MessageType>json</MessageType>
      </Topic>
    </TransactionLog>

    <SOW>
      <Topic>
        <Name>Work</Name>
        <MessageType>json</MessageType>
        <FileName>./dead-letter-queue/work.sow</FileName>
        <Key>/id</Key>
      </Topic>
      <Queue>
        <Name>WorkToDo</Name>
        <MessageType>json</MessageType>
        <Semantics>at-least-once</Semantics>
        <UnderlyingTopic>Work</UnderlyingTopic>
        <Expiration>60s</Expiration>
        <MaxDeliveries>2</MaxDeliveries>
        <MaxCancels>2</MaxCancels>
      </Queue>
      <Queue>
        <Name>DeadLetters</Name>
        <MessageType>json</MessageType>
        <Semantics>at-least-once</Semantics>
      </Queue>

      <View>
        <Name>DeadLetterStats</Name>
        <UnderlyingTopic>DeadLetters</UnderlyingTopic>
        <MessageType>json</MessageType>
        <Projection>
          <Field>COUNT(/data/id) AS /totalDeadLetters</Field>
          <Field>/data/id AS /lastProcessedId</Field>
          <Field>/received AS /lastReceived</Field>
          <Field>/reasion AS /lastReason</Field>
        </Projection>
        <Grouping>
          <Field>/totalDeadLetters</Field>
        </Grouping>
      </View>
      <View>
        <Name>DeadLettersByReason</Name>
        <UnderlyingTopic>DeadLetters</UnderlyingTopic>
        <MessageType>json</MessageType>
        <Projection>
          <Field>/reason AS /reason</Field>
          <Field>COUNT(/data/id) AS /numMessages</Field>
          <Field>/received AS /lastReceived</Field>
        </Projection>
        <Grouping>
          <Field>/reason</Field>
        </Grouping>
      </View>
    </SOW>

    <Actions>
        <Action>
            <On>
                <Module>amps-action-on-sow-expire-message</Module>
                <Options>
                    <Topic>WorkToDo</Topic>
                    <MessageType>json</MessageType>
                </Options>
            </On>
            <Do>
                <Module>amps-action-do-publish-message</Module>
                <Options>
                    <Topic>DeadLetters</Topic>
                    <MessageType>json</MessageType>
                <Data>{ "data" : {{AMPS_DATA}}, "received" : "{{AMPS_DATETIME}}", "reason" : "{{AMPS_REASON}}" }</Data>
                </Options>
            </Do>
        </Action>
    </Actions>

</AMPSConfig>

How will you use this new features of AMPS to sail the seas and find the treasure? Submit your use case for this capability here or via email and we may send you a cool t-shirt!


Read Next:   Do-It-Yourself SOW Keys