How to automate the monitoring of our 200 customer stacks in one deployment? | Toucan Toco

To deploy a new stack quickly, we built a suite of Ansible playbooks. They provision the target machine with the stack and all its dependencies and adjust the environment (fail2ban rules, etc.), but nothing was in place for our monitoring.

Rather than monitoring the different bricks of each instance with an agent-based tool like Nagios (which has to be deployed, configured and updated on each server), the first version of our monitoring used the free trial offered by StatusCake. This service pings a given URL at regular intervals and sends alerts via Slack or email.

The problem: doing everything manually

As this part was not automated, the StatusCake tests were created manually through the web interface, as described in our documentation for initializing a new stack. Creating checks in the [StatusCake](https://www.statuscake.com/) interface is relatively simple but becomes tedious as the number of projects grows.

Each time, you have to log in to the interface, find the credentials, make sure you create the test with the same options as the other tests, then click, click, click and click again… As a result, no one wants to do it and mistakes creep in.

[Image: Charlie Chaplin, Modern Times]

The same pain applies when the tests have to be updated (for example, to add conditions or parameters). This approach quickly stops scaling.

The solution with Ansible!

As customers and projects multiplied, the need to automate the creation and updating of our monitoring quickly became apparent. Since the deployment and update of our stacks are already automated by our Ansible scripts, why not delegate the creation of these checks to Ansible as well?

In our case, we would like to have different types of checks concerning the health of our services:

  • a check to make sure that our service is up
  • a check to make sure that this service can communicate with the different technologies on which it depends (MongoDB, Redis, etc.)

These two points match the distinction between liveness and readiness, well described in an Octo article, [Liveness and readiness probes: Put intelligence into your clusters](https://blog.octo.com/liveness-et-readiness-probes-mettez-de-lintelligence-dans-vos-clusters/):

  • liveness: if the check gets no answer, an alert is raised and the service is considered KO
  • readiness: the content of the check's answer tells us whether our service is “ready to be used”; getting an answer does not by itself mean that the service is OK

Implementing these checks requires exposing two dedicated routes on our service:

import pymongo
import redis
from flask import Flask, g

app = Flask(__name__)

@app.route('/liveness')
def liveness():
    # The process is up and able to answer HTTP requests.
    return "OK", 200

@app.route('/readiness')
def readiness():
    # g.redis_connection and g.mongo_connection are assumed to be set up elsewhere in the application.
    try:
        g.redis_connection.ping()
        g.mongo_connection.server_info()
    except (pymongo.errors.ConnectionFailure, redis.ConnectionError):
        return "KO", 500
    return "OK", 200

  • liveness: if the check on /liveness gets no answer, our service is considered KO, no matter what else happens
  • readiness: if the check on /readiness returns 200, the stack is considered OK because the connections to the third-party services Redis and MongoDB are working
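
As a rough illustration of how the answers of the two probes can be combined into a single verdict, here is a minimal sketch. The function name `classify_stack` and the status labels are hypothetical; they are not part of our service or of StatusCake.

```python
from typing import Optional


def classify_stack(liveness_status: Optional[int], readiness_status: Optional[int]) -> str:
    """Combine the results of the two probes into a single verdict.

    Each argument is the HTTP status code returned by the probe,
    or None when the probe got no answer at all (timeout, refused connection).
    """
    # No answer (or an error) on /liveness: the service itself is down.
    if liveness_status != 200:
        return "KO"
    # The service answers, but a dependency (Redis, MongoDB) is unreachable.
    if readiness_status != 200:
        return "NOT_READY"
    return "OK"
```

For example, `classify_stack(200, 500)` gives `"NOT_READY"`: the service is alive but one of its dependencies cannot be reached.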

On the Ansible side, we created a custom Python module, ansible-statuscake (forked from p404/ansible-statuscake, which is no longer maintained). This module is used as follows:

- name: Create StatusCake test
  local_action:
    module:        status_cake_test
    username:      "my-user"
    api_key:       "my-api-key"
    name:          "My service check"
    url:           "https://myservice.example.com"
    state:         "present"
    test_type:     "HTTP"
    check_rate:    60  # check every minute

For Ansible to find this custom module, you must place the Python script status_cake_test.py in the library folder at the root of the playbook.
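
The `state: "present"` parameter suggests idempotent behavior: running the playbook twice should not create the same test twice. The sketch below shows one way such a decision could be made. It is not the code of the `status_cake_test` module; the function name `plan_action` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def plan_action(existing_tests: List[Dict], desired: Dict, state: str = "present") -> str:
    """Decide what to do so that the remote tests match the desired state.

    existing_tests: tests already known to the monitoring service (hypothetical layout).
    desired: the test described in the playbook, matched on its "name" key.
    Returns one of "create", "update", "delete" or "noop".
    """
    current: Optional[Dict] = next(
        (t for t in existing_tests if t["name"] == desired["name"]), None
    )
    if state == "absent":
        return "delete" if current is not None else "noop"
    if current is None:
        return "create"
    # Compare only the fields the playbook declares; ignore server-side extras.
    changed = any(current.get(key) != value for key, value in desired.items())
    return "update" if changed else "noop"
```

With this kind of logic, a deployment that finds the test already in place with the same options reports nothing to do, and changing a variable such as `check_rate` results in an update rather than a duplicate.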

*Technical note:* since we use a `local_action`, execution takes place on the machine that starts the deployment, not on the target machine. In this context we can use the Python interpreter and pip packages of our choice, without depending on the target machine. This was useful when we created another Ansible module in Python 3 / asyncio, which let us rewrite some tasks using a concurrent model and thus speed up our deployments.
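
To give an idea of what a concurrent model buys, here is a minimal asyncio sketch. It is not our module: `fake_api_call` is a hypothetical stand-in that only sleeps, where a real task would call a remote API.

```python
import asyncio
import time
from typing import List


async def fake_api_call(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a slow network call."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_all(names: List[str]) -> List[str]:
    """Start every call at once and wait for all of them together."""
    return await asyncio.gather(*(fake_api_call(n) for n in names))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(run_all(["liveness", "readiness", "performance"]))
    elapsed = time.perf_counter() - start
    # The calls overlap, so the total is close to one delay rather than three.
    print(results, f"{elapsed:.2f}s")
```

`asyncio.gather` returns the results in the order the calls were passed in, even though they run concurrently.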

Some stats to finish…

Today, thanks to this approach:

  • we have created and automatically maintain about 1000 tests for our 200 stacks
  • test configuration is done via [Ansible](https://www.ansible.com/) variables, which are committed, reviewed and automatically applied at each stack deployment
  • when a project is decommissioned, all associated tests are automatically deleted, so our monitoring always matches our production
  • StatusCake lets us raise alerts on HTTP codes, patterns found in the response, response times… and run several different types of checks (performance, healthchecks, …)
  • it lets us compute the SLA of each instance from an external source that sees the same conditions as our customers: we only have to extract the data collected during the tests
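
As an illustration of the SLA computation, here is a minimal sketch that turns a list of check results into an availability percentage. The `CheckResult` structure and its fields are assumptions; the data StatusCake actually exports may be shaped differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CheckResult:
    """One probe result (hypothetical structure)."""
    timestamp: int      # Unix time of the check
    status_code: int    # HTTP status returned, 0 when there was no answer


def availability(results: Iterable[CheckResult]) -> float:
    """Return the percentage of checks that answered with HTTP 200."""
    results = list(results)
    if not results:
        raise ValueError("no check results to compute an SLA from")
    up = sum(1 for r in results if r.status_code == 200)
    return 100.0 * up / len(results)
```

With one check per minute, each failed check counts for roughly one minute of downtime in this simplified model.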
