Simple website analytics using GoAccess and Caddy server

06-06-2020

Hi everyone! In this post I will show you how to use GoAccess, a web log analyzer, on the logs emitted by Caddy. This is not as trivial as it sounds: Caddy's logging system is quite different from the one you are probably used to.

If you use Nginx or Apache, you have probably seen logs in the Common Log Format (CLF):

127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326

Caddy, instead, can output logs as structured data, for example JSON. Here is a simple example of an entry:

{
  "level": "info",
  "ts": 1591447409.21408,
  "logger": "http.log.access.log0",
  "msg": "handled request",
  "request": {
    "method": "GET",
    "uri": "/contacts/",
    "proto": "HTTP/2.0",
    "remote_addr": "127.0.0.1:12345",
    "host": "alexmv12.xyz",
    "headers": {
      "Referer": [
        "https://alexmv12.xyz/"
      ],
      "Cookie": [
        ....
      ],
      "Upgrade-Insecure-Requests": [
        "1"
      ],
      "If-Modified-Since": [
        "Tue, 02 Jun 2020 15:59:48 GMT"
      ],
      "Accept-Language": [
        "en-US,en;q=0.5"
      ],
      "Accept-Encoding": [
        "gzip, deflate, br"
      ],
      "Dnt": [
        "1"
      ],
      "Te": [
        "trailers"
      ],
      "User-Agent": [
        "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"
      ],
      "Accept": [
        "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
      ]
    },
    "tls": {
      "resumed": false,
      "version": 772,
      "ciphersuite": 4867,
      "proto": "h2",
      "proto_mutual": true,
      "server_name": "alexmv12.xyz"
    }
  },
  "common_log": "127.0.0.1 - - [06/Jun/2020:14:43:29 +0200] \"GET /contacts/ HTTP/2.0\" 200 939",
  "duration": 0.001050597,
  "size": 939,
  "status": 200,
  "resp_headers": {
    "X-Content-Type-Options": [
      "nosniff"
    ],
    "Content-Security-Policy": [
      "default-src 'none'; script-src 'self'; connect-src 'self'; img-src 'self'; style-src 'self'; base-uri 'self'; form-action 'self';"
    ],
    "Referrer-Policy": [
      "no-referrer-when-downgrade"
    ],
    "Strict-Transport-Security": [
      "max-age=31536000;"
    ],
    "X-Frame-Options": [
      "DENY"
    ],
    "Etag": [
      "..."
    ],
    "Content-Type": [
      "text/html; charset=utf-8"
    ],
    "Last-Modified": [
      "Sat, 06 Jun 2020 12:43:22 GMT"
    ],
    "Accept-Ranges": [
      "bytes"
    ],
    "Server": [
      "Caddy"
    ],
    "Content-Length": [
      "939"
    ]
  }
}

As you can see, there is A LOT of information. There is even a "common_log" entry, which simulates a log line in the CLF format; however, it lacks many useful fields.

If we can manage all this information, we can get some useful analytics about our website without relying on third-party services!

GoAccess is very useful from this point of view. For example, it can determine the number of hits, visitors, and bandwidth used, as well as metrics for the slowest requests, broken down by hour or date.

It can run in the terminal, output a static HTML report, or even update an HTML page in real time, if you want a dashboard you can visit whenever you like.
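As a sketch of these three modes (the file names access.log and report.html are placeholders; this assumes a standard CLF/combined log):

```shell
# Interactive report in the terminal:
goaccess access.log --log-format=COMBINED

# Static HTML report, written once:
goaccess access.log --log-format=COMBINED -o report.html

# HTML report that keeps updating itself in real time:
goaccess access.log --log-format=COMBINED -o report.html --real-time-html
```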

The problem

GoAccess does not support structured log formats yet. Managing logs like Caddy's directly in GoAccess would obviously be the ideal solution; for now, though, I found a decent workaround for using Caddy and GoAccess together.

jq

jq is a useful tool that lets us filter and transform JSON files, producing plain text we can feed to other programs.

Here is a very simple example, directly from their website. If we have a JSON file like:

{
    "foo": 42, 
    "bar": "less interesting data"
} 

We can obtain foo by calling:

cat file.json | jq '.foo'

Which will output 42.

We can use jq to take the JSON logs, parse them, filter them (for example, keeping only today's visits) and feed them to GoAccess.
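As a sketch of the idea (the log entry below is a trimmed, hypothetical Caddy line), we can select entries and pull out their CLF representation:

```shell
# A single, trimmed-down Caddy log entry (hypothetical sample data).
entry='{"ts":1591447409.2,"status":200,"common_log":"127.0.0.1 - - [06/Jun/2020:14:43:29 +0200] \"GET /contacts/ HTTP/2.0\" 200 939"}'

# Keep only successful requests and print their CLF line as raw text.
echo "$entry" | jq --raw-output 'select(.status == 200) | .common_log'
```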

Code

A simple script that does all of this:

#!/bin/bash

today_date=$(date -u +"%Y-%m-%d")
today_ts=$(date -d "$today_date" +%s)

goaccess <(cat alexmv12xyz.log | jq --raw-output '
   .request.remote_addr |= .[:-6] | 
   select(.request.remote_addr != "1.1.1.1") | 
   select(.request.remote_addr != "2.2.2.2") | 
   select(.ts >= '$today_ts') | 
   [
      .common_log,
      .request.headers.Referer[0] // "-",
      .request.headers."User-Agent"[0],
      .duration
   ] | @csv') \
   --log-format='"%h - - [%d:%t %^] ""%m %r %H"" %s %b","%R","%u",%T' --time-format='%H:%M:%S' --date-format='%d/%b/%Y'

Let's analyze it:

  1. <(cat ... | jq ...) is the process substitution operator: bash runs the command, makes its output available as a temporary file, and passes that file to goaccess
  2. the --raw-output option makes jq print raw strings instead of JSON-encoded ones, so the " symbols are not escaped
  3. .request.remote_addr |= .[:-6] strips the port from the remote_addr field, so instead of 1.2.3.4:23456 we get 1.2.3.4. This is useful for filtering out some IPs (which we do in the next point)
  4. we may want to filter out some IPs. For example, here we drop 1.1.1.1 and 2.2.2.2: in my case they are IPs that I use myself, and I don't want them in the analytics
  5. we use today_ts, the Unix timestamp of the start of today, to select only today's accesses. Obviously, we can remove or modify this filter as needed
  6. then, we build an array with the common_log field, followed by some information that it does not contain: the Referer header, the User-Agent, and the time taken to serve the request. The // "-" operator provides a default value when the Referer is missing
  7. the final pipe into @csv joins the items into a comma-separated line. Notice that the common log is a single item, with its own internal syntax
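One caveat on point 3: .[:-6] assumes the port is always five digits. A more robust variant (shown here on a hypothetical one-field entry) uses jq's sub with a regex to drop whatever follows the last colon:

```shell
# Strip the port regardless of its length, using a regex instead of a fixed slice.
echo '{"request":{"remote_addr":"1.2.3.4:23456"}}' |
jq --raw-output '.request.remote_addr | sub(":[0-9]+$"; "")'
```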

Now, we have to tell GoAccess the log format. Basically, we "rewrite" a line using placeholders instead of the actual values. It is easier to understand with an example.

This is an example output of the cat ... | jq ... command:

"1.2.3.4 - - [06/Jun/2020:14:43:29 +0200] ""GET /contacts/ HTTP/2.0"" 200 939","https://alexmv12.xyz/","Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",0.001050597

And this is the log format (%^ ignores the token):

"%h - - [%d:%t %^] ""%m %r %H"" %s %b","%R","%u",%T

The tokens used here are:

  %h  the client's IP address (host)
  %d  the date
  %t  the time
  %^  ignore this token
  %m  the request method
  %r  the request
  %H  the request protocol
  %s  the HTTP status code
  %b  the size of the response, in bytes
  %R  the Referer
  %u  the User-Agent
  %T  the time taken to serve the request, in seconds

See GoAccess's manual for reference.

This is the result of the script:

Screenshot of GoAccess

Considerations

The first consideration is that we could avoid using the .common_log entry entirely. I tried that, but I could not make it work because of the Unix timestamp format; I don't know whether it is my fault or whether GoAccess does not like that particular time format. So, I found it easier to use that entry and just append the missing information.

The second is that we are flattening structured data (JSON) into a sequence of text lines just to feed GoAccess. Honestly, this is not clean at all, but it is a decent solution until GoAccess implements direct parsing of structured logs.

Let me know if you find it useful or if you have suggestions! You can propose them on GitHub.