Application errors monitoring with HipChat and Bash

The classical support flow – an error occurs, a user opens a ticket / phones support,  company investigates, a fix is put in place, ticket is updated / user is called back – is very reactive. Here’s how we made it a bit proactive with HipChat and Bash for free to increase user satisfaction.

The Story behind

At one of the startups I’m involved with, almost an year ago we launched our Alpha product and invited some users to pilot it.  We’ve put some basic monitoring to check if all our apps  (as we use micro services) are up and running. We’ve integrated a cloud support service with a help bubble at the bottom right corner of our product and there we go.

Without any surprise, errors started popping up in the application logs every now and then, while our pilots were emailing and calling us to complain about them. As result we open an internal ticket, investigate, fix and get back to them with a solution. Sounds familiar, right? Like the classic support model you can see in almost every company – very reactive. At the end of the day most of the errors were caused by invalid user data to which we didn’t provide proper error messages or just required a quick db update – things that take us seconds but causing frustration to our users. Then we asked ourselves “Can we be proactive and turn that flow the other way around?”. An error occurs, we get notified, investigate, fix and call the user with something like “Hey, we have noticed you just had a problem and we are calling to help you / let you know we’ve fixed it.”

Our Solution

There are many application monitoring tools out there that claim to have easy integration but at the time we needed something the idea for HipChat popped up and we didn’t even go into evaluating any other products because that solution proved to be very simple and working for us.

We use HipChat for team communication as it integrates well with the other Atlassian products we use. So the idea is to tie up the application logs with HipChat, so when an error occurs, our team will get notified as soon as possible. Once we know that something has happened it’s very easy to investigate and take an action to resolve it. There’s a dedicated person that contacts users and in most of the cases issues get resolved in a few minutes, giving sense of fulfilment to us and building trust with our users.

Before we start here is an inventory list that will be needed

  • Source of errors – Application error logs
  • Access to application logs
  • Error detector – bash script (code provided below)
  • Task scheduler – Crontab on Unix
  • Error reporter – HipChat Integration
  • Admin access to a HipChat room
  • Humans to solve the errors and call the users

Let’s do it

Error logs

Let’s assume we have two applications running in Unix environment storing their log files in /var/log/mycompany. The full path to their files is:

/var/log/mycomapny/app1.log
/var/log/mycompany/app2.log

We want to get notified only about Warnings and Errors and here’s how they look like in the logs:

2017-06-24 00:02:22.039 ERROR 27518 qtp10707771-16 --- An exception has occurred while calling YOUR_METHOD ... YOUR_ERROR_MESSAGE
2017-06-22 15:09:00.788 WARN 32256 qtp31749278-16 --- o.h.engine.jdbc.spi.SqlExceptionHelper: SQL Warning Code: 1292, SQLState: HY000 ... YOUR_WARNING_MESSAGE

Error detector

The next step is to detect when something that we want to get notified about happens. For that purpose, we use a bash script (detector) that goes through all log records generated in the last 10 minutes and searches for ones, that we are interested in. The run the detector every 10 minutes, we use the default task scheduler of Unix – Crontab – with the following setup:

# m h  dom mon dow   command parameters
59 * * * * /home/user/bin/detector.sh 5
10 * * * * /home/user/bin/detector.sh 0
20 * * * * /home/user/bin/detector.sh 1
30 * * * * /home/user/bin/detector.sh 2
40 * * * * /home/user/bin/detector.sh 3
50 * * * * /home/user/bin/detector.sh 4

The columns are:

  • m – minute of the hour the command will be executed at
  • h – hour the command will be executed at
  • dom – day of month the command will be executed at
  • mon – month the command will be executed at
  • dow – day of week the command will be executed at
  • command to be executed
  • parameters to the command

By running the detector every 10 minutes, the maxim time between an error and its notification is 10 minutes. That works well for us but it’s up to you to change it according to your needs.

Detector script source

#!/bin/bash

# Verify input parameters
if [[ "$1" < 0 ]] || [[ "$1" > 5 ]]
then
    echo Invalid input! Shall be one of 0,1,2,3,4,5
    exit 1;
fi
minutes=$1;

# Set up where and which log files to monitor
log_files=(
    '/var/log/mycomapny/app1.log'
    '/var/log/mycomapny/app2.log'
'
);

# Needed to escape symbols that would break a JSON structure
json_escape () {
  # As this requires "python" install (out of the box on most unix systems) feel free to replace it with any other JSON formatting library available on your system.
  echo -n "$1" | python -c 'import json,sys; print json.dumps(sys.stdin.read())'
}

# Prime function that sends to HipChat
send_to_hipchat () {
    path=$1
    message=$(json_escape "$2");
    message=$(echo $message | sed -e 's#^"##');
    message=$(echo $message | sed -e 's#"$##');
    json='{"color":"red","message":"(fu) An error has happened on ENVIRONMENT in '${path}\\n${message}'","notify":true,"message_format":"text"}';

    # Take that URL from HipChat room integration (see instructions below)
    curl -d "$json" \
        -H 'Content-Type: application/json' \
        https://YOUR_COMPANY.hipchat.com/v2/room/ROOM_NUMBER/notification?auth_token=YOUR_TOKEN_GOES_HERE

}

# Enable regular expression search
grep_opts='-P';

# Change according to timestamp format in your logs. Consult 'man date' for details
time_pattern=$(date --date today "+%Y-%m-%d %H");

for fpath in "${log_files[@]}"
do
    # Change the pattern to match your logs. This one works for Spring Logback
    results=$(grep $grep_opts "${time_pattern}:${minutes}\d:\d\d\.\d+\s+(ERROR|WARN)" $fpath);
    if [[ $results != "" ]]
    then
        send_to_hipchat "$fpath" "$results";
    fi
done

The most important thing here is the pattern used to detect if a log record is of an interest to us, so change lines 43 and 48 accordingly. Together these two patterns will match the log lines given in Error Logs section above.

"+%Y-%m-%d %H" -> produces i.e. "2017-06-22 15"
"${time_pattern}:${minutes}\d:\d\d\.\d+\s+(ERROR|WARN)" - produces i.e. "2017-06-22 15:1\d:\d\d\.\d+\s+(ERROR|WARN)"

Copy the code, open your favourite editor, put it in and save it as i.e. – ~/bin/detector.sh
Don’t forget to make it executable as it will not run – chmod 755 ~/bin/detector.sh

HipChat setup

The next step is to setup HipChat, so that detector can access it through the API.
I would strongly recommend to create a new room that is dedicated to Errors only, so when a new notification comes everyone knows that something has happened and an action is needed now.

Room setup

For this you will need admin access to HipChat or to a specific room.

  • Go to hipchat.com (or your domain if hosted) and login.
  • Then go to Rooms.
  • Create or Select a Room.
  • Select Integrations from the menu
  • Select the first option – Build Your Own integration
  • Name your integration – i.e. Error reporting – and Save

The next page should show something like this:

Screenshot from HipChat room integration

Take the url from the first box and put it inside your detector script – line 35

Let’s test it

put a fake error line at the end of your log file and run detector manually.

That’s it

Happy Error hunting!

 

Presented live at Sofia Atlassian User Group meet up, 5th of April 2017

Be social and share