
General Approach

Through the Plumbr Server API, it is possible to expose the insights captured by Plumbr to any system that can make an HTTP call. One of the more common uses for this is sending out alerts to your on-call team so they can immediately respond to a degraded service level. Let us go over some of the use cases that you might face.

Example 1:

Suppose that there is an e-shop application monitored by Plumbr at shop.example.com. What’s the most crucial metric for this application that can directly show if the business is going well? There may be many answers to that, depending on the business model, but “is anything being sold” would probably be close to the top of the list.

Since the application is monitored by Plumbr, each click on the “CHECK OUT” button on the cart is tracked, and the outcome is recorded. In the user interface, it could appear like this:

Screenshot from Plumbr UI

Looks like we have hundreds of users successfully checking out their carts. This means that revenue is being generated, and the e-shop can keep going. However, we’d like to make sure that these deals keep happening 24/7, so let us use the Plumbr API by passing in the serviceId and applicationName seen in the screenshot above.

$ curl -s -u admin@example.com "https://app.plumbr.io/api/v4/users/summary?context=serviceId%3D1234567890abcdef,applicationName%3Dshop.example.com&last=4h"

[
    {
        "failed": 1, 
        "onlySlow": 0, 
        "success": 249, 
        "total": 261, 
        "verySlow": 11
    }
]

These values can then be compared against some thresholds or other triggers. For instance, if there are zero sales during the last 4 hours, then something is probably broken (or it’s January the 1st). As we’ll see a bit later, it is very simple to codify such rules and send out alerts when needed.
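To illustrate, here is a minimal sketch of such a rule as a shell script. It assumes jq is installed, that the credentials are supplied via the hypothetical PLUMBR_USER and PLUMBR_PASS environment variables, and that notify_oncall stands in for whatever alerting hook you use (a Slack webhook, PagerDuty, etc.):

#!/usr/bin/env bash
# Alert if no successful check-outs were recorded during the last 4 hours.
# PLUMBR_USER, PLUMBR_PASS and notify_oncall are placeholders for your own setup.

SUMMARY=$(curl -s -u "$PLUMBR_USER:$PLUMBR_PASS" \
  "https://app.plumbr.io/api/v4/users/summary?context=serviceId%3D1234567890abcdef,applicationName%3Dshop.example.com&last=4h")

# Extract the "success" count from the first (and only) element of the response array.
SUCCESS=$(echo "$SUMMARY" | jq '.[0].success')

if [ "$SUCCESS" -eq 0 ]; then
  notify_oncall "No successful check-outs at shop.example.com during the last 4 hours"
fi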

Example 2:

Another important metric for tracking the well-being of the e-business is how the users are experiencing the e-shop. If they are forced to wait for spinning wheels, or, worse, are facing errors while flowing through the shop, then the long-term prospects of the application are gloomy. In such a competitive market, people can simply find a different e-shop that works for them.

With Plumbr, we can directly track the status of all the interactions in the e-shop:

$ curl -s -u admin@example.com "https://app.plumbr.io/api/v4/users/summary?context=applicationName%3Dshop.example.com&last=4h"

[
   {
      "total" : 609,
      "failed" : 3,
      "success" : 586,
      "verySlow" : 20,
      "onlySlow" : 0
   }
]

Dividing the number of non-successful interactions (“total” minus “success”) by the “total”, we see that about 4% of the customers have received a sub-par digital user experience. If that number goes up, then it’s definitely a good cause for an alert.
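The same pattern as before applies: fetch the summary, compute the share of non-successful interactions, and compare it against a threshold. The sketch below uses an arbitrary 5% threshold and the same placeholder credentials and notify_oncall hook; jq and bc are assumed to be available:

#!/usr/bin/env bash
# Alert if more than 5% of users had a sub-par experience during the last 4 hours.
# The 5% threshold is an example value; tune it to your own SLO.

SUMMARY=$(curl -s -u "$PLUMBR_USER:$PLUMBR_PASS" \
  "https://app.plumbr.io/api/v4/users/summary?context=applicationName%3Dshop.example.com&last=4h")

# Percentage of interactions that were not successful (failed, very slow or slow).
SUBPAR_PCT=$(echo "$SUMMARY" | jq '.[0] | 100 * (.total - .success) / .total')

if (( $(echo "$SUBPAR_PCT > 5" | bc -l) )); then
  notify_oncall "shop.example.com: ${SUBPAR_PCT}% of users had a sub-par experience in the last 4 hours"
fi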

Example 3:

For a slightly more complex example, you could use Plumbr to track the longer-term behaviour of your application. For instance, in some cases it may be a good idea to track spikes in the error rate of a particular API call. A straightforward (albeit naive) approach to this would be using moving average crossovers. To do that using the Plumbr Server API, you would need to make two calls for different time windows:

$ curl -s -u admin@example.com "https://app.plumbr.io/api/v4/transactions/summary?context=applicationName%3Dsearch.example.com,serviceId=examplequicksearch1234567890&last=24h"

[
   {
      "total" : 2997918,
      "failed" : 20361,
      "success" : 2957453,
      "verySlow" : 0,
      "onlySlow" : 104
   }
]

$ curl -s -u admin@example.com "https://app.plumbr.io/api/v4/transactions/summary?context=applicationName%3Dsearch.example.com,serviceId=examplequicksearch1234567890&last=1h"

[
   {
      "total" : 125001,
      "failed" : 19117,
      "success" : 105884,
      "verySlow" : 0,
      "onlySlow" : 0
   }
]

From here, we can see that the error rate over 24 hours is under 1%, which may still be within the SLO and perhaps not a reason to trigger an alert just yet. However, almost all of these errors occurred in the last hour, with the error rate spiking to over 15%. This clearly indicates an issue: if nothing is done quickly, the SLOs will be violated in no time.
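A minimal sketch of this crossover check could look as follows. The factor of three is an arbitrary example threshold; PLUMBR_USER, PLUMBR_PASS and notify_oncall are again placeholders, and jq and bc are assumed to be installed:

#!/usr/bin/env bash
# Alert when the 1-hour error rate exceeds the 24-hour error rate by a factor of 3.

BASE="https://app.plumbr.io/api/v4/transactions/summary?context=applicationName%3Dsearch.example.com,serviceId%3Dexamplequicksearch1234567890"

# Fetch the summary for a given time window and return failed/total as a fraction.
error_rate() {
  curl -s -u "$PLUMBR_USER:$PLUMBR_PASS" "$BASE&last=$1" | jq '.[0] | .failed / .total'
}

RATE_24H=$(error_rate 24h)
RATE_1H=$(error_rate 1h)

if (( $(echo "$RATE_1H > 3 * $RATE_24H" | bc -l) )); then
  notify_oncall "Error rate spike for quick search: ${RATE_1H} (1h) vs ${RATE_24H} (24h)"
fi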

The next step would be to set up regular monitoring of these values and to send out alerts based on them. We will cover that in the subsequent sections.