Prometheus, PostGIS and Phoenix Part 2

Posted by Alex Koutmos on Monday, September 16, 2019

Intro

In Prometheus, PostGIS and Phoenix Part 1 we covered what exactly monitoring is and what its benefits are. We also put together the foundation for the application that we will be monitoring with Prometheus and Grafana. In this final part of the series, we’ll be setting up our monitoring stack, instrumenting our application, and writing a simple stress tester to make our graphs dance. Part 2 assumes that you have already gone through Part 1 and have the code and database at a point where we can jump right in. If you want to check out the companion code and fast forward to this point, do the following:

$ git clone https://github.com/akoutmos/elixir_monitoring_prom.git
$ cd elixir_monitoring_prom
$ git checkout 070bee0ad72a594fe230c197f560c39c0c4c5b01

Without further ado, let’s dive right into things!

Step 5: Add Prometheus & Grafana to Docker Compose - commit

With our application functioning and returning results for queries, it’s time for us to shift our focus to our Docker Compose file so that we can add some monitoring related services. Specifically, we’ll need to add 3 services:

  • Prometheus: At its core, Prometheus is a time series database. You configure Prometheus to scrape various sources at a particular interval, and the data that it fetches at each of those intervals makes up your time series data. It is important to note here that Prometheus is a “pull” type monitoring solution. In other words, it reaches out to its configured sources and requests data. Naturally the other monitoring type is “push” and that works by applications pushing their metrics to a database. Discussing the pros and cons of push versus pull is out of scope for this post but feel free to read [2] for more information.
  • Postgres Exporter: In order for Prometheus to fetch metrics from a particular source, it needs an HTTP endpoint to hit which will expose said metrics. Naturally, Postgres does not expose metrics in a Prometheus-supported format, as it should not be coupled to any one particular monitoring tool. This is where the Postgres Exporter comes in. The Postgres Exporter will fetch metrics out of Postgres, and will then expose a /metrics endpoint which Prometheus can scrape. In other words, the Exporter acts as the intermediary between Postgres and Prometheus. You will actually see this pattern a lot in the Prometheus ecosystem. There are exporters for RabbitMQ, MongoDB, CouchDB, Consul and many more services (see [3] for more information).
  • Grafana: Grafana is a tool which you can use to set up dashboards and visualize the time series data that is in Prometheus (it is of course not limited to only Prometheus and can also query InfluxDB, Loki, Graphite and many other data sources). You can also use Grafana to issue alerts when metrics surpass certain thresholds so that you can proactively address issues as they occur.

Now that we have some background as to the tools that we will be using, let’s dive right in and add some additional entries to our docker-compose.yml file. We’ll add to our services and volumes in our docker-compose.yml file as follows:

version: '3.7'

services:
  ...

  grafana:
    image: grafana/grafana:6.3.3
    depends_on:
      - prometheus
    ports:
      - '3000:3000'
    volumes:
      - grafana-data:/var/lib/grafana
      - ./docker/grafana/:/etc/grafana/provisioning/
    env_file:
      - ./docker/grafana/.env

  prometheus:
    image: prom/prometheus:v2.11.2
    ports:
      - '9090:9090'
    volumes:
      - ./docker/prometheus/:/etc/prometheus/
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/config.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'

  postgres_exporter:
    image: wrouesnel/postgres_exporter:v0.5.1
    ports:
      - '9187:9187'
    depends_on:
      - postgres
    environment:
      DATA_SOURCE_USER: postgres
      DATA_SOURCE_PASS: postgres
      DATA_SOURCE_URI: postgres:5432/?sslmode=disable

volumes:
  ...

  prometheus-data: {}
  grafana-data: {}

You’ll probably notice that there are some references to a ./docker/... directory in the docker-compose.yml service volumes sections. I won’t go too far into the details of how I put together the Prometheus & Grafana configurations as that could be an article of its own. So for now, feel free to copy the docker directory directly into your project (if you do want me to deep dive into Prometheus & Grafana provisioning/configuration leave a comment below :)). The docker directory contents can be found here: https://github.com/akoutmos/elixir_monitoring_prom/tree/master/docker. After this step, your top level directory tree should look something like this:

$ tree -L 1
.
├── _build
├── config
├── deps
├── docker
├── docker-compose.yml
├── lib
├── mix.exs
├── mix.lock
├── priv
├── README.md
└── test

With the docker-compose.yml file up to date and our new monitoring services in place, running docker-compose up should bring up all the containers. If all goes well, you should be able to go to http://localhost:3000 where you will find Grafana (username: admin, password: admin)! If you navigate to the Postgres Stats dashboard you should see some pretty graphs and stats:

Image of postgres grafana dashboard

While the Postgres dashboard has data available to display, the remaining dashboards are empty :(. The reason for this is that our Phoenix app is not exposing a metrics endpoint at /metrics for Prometheus to scrape. In fact, if you look at the Docker Compose logs (or at Prometheus' target status page at http://localhost:9090/targets), you'll see that Prometheus is trying to scrape our application, but our app is returning a 404:

[debug] ** (Phoenix.Router.NoRouteError) no route found for GET /metrics (ElixirMonitoringPromWeb.Router)
    (elixir_monitoring_prom) lib/phoenix/router.ex:324: ElixirMonitoringPromWeb.Router.call/2
    (elixir_monitoring_prom) lib/elixir_monitoring_prom_web/endpoint.ex:1: ElixirMonitoringPromWeb.Endpoint.plug_builder_call/2
    (elixir_monitoring_prom) lib/plug/debugger.ex:122: ElixirMonitoringPromWeb.Endpoint."call (overridable 3)"/2
    (elixir_monitoring_prom) lib/elixir_monitoring_prom_web/endpoint.ex:1: ElixirMonitoringPromWeb.Endpoint.call/2
    (phoenix) lib/phoenix/endpoint/cowboy2_handler.ex:40: Phoenix.Endpoint.Cowboy2Handler.init/2
    (cowboy) /app/deps/cowboy/src/cowboy_handler.erl:41: :cowboy_handler.execute/2
    (cowboy) /app/deps/cowboy/src/cowboy_stream_h.erl:296: :cowboy_stream_h.execute/3
    (cowboy) /app/deps/cowboy/src/cowboy_stream_h.erl:274: :cowboy_stream_h.request_process/3
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

Let’s fix this in the next step by instrumenting our application! You can kill the Docker Compose stack for now as we’ll need to add additional dependencies and update the mix.exs + mix.lock files (the Elixir service only fetches new dependencies on container startup).

Step 6: Instrument Phoenix application - commit

Let’s add a couple of dependencies to our project (a huge thanks to deadtrickster for all his hard work on the Elixir Prometheus libraries https://github.com/deadtrickster):

defp deps do
  [
    ...
    {:prometheus, "~> 4.4.1"},
    {:prometheus_ex, "~> 3.0.5"},
    {:prometheus_ecto, "~> 1.4.3"},
    {:prometheus_phoenix, "~> 1.3.0"},
    {:prometheus_plugs, "~> 1.1.5"},
    {:prometheus_process_collector, "~> 1.4.3"}
  ]
end

After running mix deps.get on the host machine, we can get to work configuring our Phoenix app to utilize these new dependencies. Let’s start by creating a file at lib/instrumenters.ex. The instrumenters defined within this file will allow us to take metrics on Phoenix, Ecto, and Plug internals. Within that file put the following (feel free to split this up into different files, I just find it easier to have all the Prometheus related instrumenters in the same place):

defmodule ElixirMonitoringProm.PhoenixInstrumenter do
  @moduledoc "Prometheus instrmenter for Phoenix"

  use Prometheus.PhoenixInstrumenter
end

defmodule ElixirMonitoringProm.PipelineInstrumenter do
  @moduledoc "Prometheus instrmenter for Phoenix"

  use Prometheus.PlugPipelineInstrumenter
end

defmodule ElixirMonitoringProm.RepoInstrumenter do
  @moduledoc "Prometheus instrmenter for Phoenix"

  use Prometheus.EctoInstrumenter
end

defmodule ElixirMonitoringProm.PrometheusExporter do
  @moduledoc "Prometheus instrmenter for Phoenix"

  use Prometheus.PlugExporter
end

With that in place, let’s turn our attention to lib/elixir_monitoring_prom/application.ex, set up the instrumenters, and hook into the Ecto telemetry events (when using Ecto 3.1+). Within the start/2 function add the following lines:

def start(_type, _args) do
  # Start all the instrumenters
  ElixirMonitoringProm.PhoenixInstrumenter.setup()
  ElixirMonitoringProm.PipelineInstrumenter.setup()
  ElixirMonitoringProm.RepoInstrumenter.setup()
  ElixirMonitoringProm.PrometheusExporter.setup()

  # NOTE: Only for FreeBSD, Linux and OSX (experimental)
  # https://github.com/deadtrickster/prometheus_process_collector
  Prometheus.Registry.register_collector(:prometheus_process_collector)

  :telemetry.attach(
    "prometheus-ecto",
    [:elixir_monitoring_prom, :repo, :query],
    &ElixirMonitoringProm.RepoInstrumenter.handle_event/4,
    nil
  )

  ...
end

Now that the instrumenters are set up, we’ll need to hook into the Phoenix plug chain in endpoint.ex so that we can expose our /metrics endpoint to Prometheus and also take metrics on the various calls coming into our API. At the bottom of lib/elixir_monitoring_prom_web/endpoint.ex, but before plug ElixirMonitoringPromWeb.Router, add the following (the exporter plug needs to sit in front of the router so that requests to /metrics are answered before the router gets a chance to 404 them):

defmodule ElixirMonitoringPromWeb.Endpoint do
  ...

  # Creates the /metrics endpoint for prometheus & collect stats
  plug ElixirMonitoringProm.PrometheusExporter
  plug ElixirMonitoringProm.PipelineInstrumenter

  ...
end

We also need to hook into Phoenix’s built-in instrumentation hooks (https://hexdocs.pm/phoenix/Phoenix.Endpoint.html#module-instrumentation) and configure the metrics collection. To do so, open up config/config.exs and update the following entry (as a side note, please don’t ever commit your secret_key_base to source control…this is just a tutorial and thus I get a pass :)):

# Configures the endpoint
config :elixir_monitoring_prom, ElixirMonitoringPromWeb.Endpoint,
  ...
  instrumenters: [ElixirMonitoringProm.PhoenixInstrumenter]

config :prometheus, ElixirMonitoringProm.PipelineInstrumenter,
  labels: [:status_class, :method, :host, :scheme, :request_path],
  duration_buckets: [
    10,
    100,
    1_000,
    10_000,
    100_000,
    300_000,
    500_000,
    750_000,
    1_000_000,
    1_500_000,
    2_000_000,
    3_000_000
  ],
  registry: :default,
  duration_unit: :microseconds

  ...

Finally, the last update we will make is to lib/instrumenters.ex. Without this change, our http_requests_total counter (a metric collected by the Prometheus libraries we pulled down) will not have the correct endpoint path for the requests coming in. To fix this, open up lib/instrumenters.ex and update the ElixirMonitoringProm.PipelineInstrumenter module so it looks like this:

defmodule ElixirMonitoringProm.PipelineInstrumenter do
  @moduledoc "Prometheus instrmenter for Phoenix"

  use Prometheus.PlugPipelineInstrumenter

  def label_value(:request_path, conn) do
    case Phoenix.Router.route_info(
           ElixirMonitoringPromWeb.Router,
           conn.method,
           conn.request_path,
           ""
         ) do
      %{route: path} -> path
      _ -> "unkown"
    end
  end
end
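
To see why this override matters, you can poke at Phoenix.Router.route_info/4 in an iex -S mix session. It maps a concrete request path back onto the route pattern that matched it, which keeps the request_path label down to one value per route instead of one per unique URL (the exact keys in the returned map vary a bit between Phoenix versions; :route is the one we rely on here):

Phoenix.Router.route_info(ElixirMonitoringPromWeb.Router, "GET", "/api/breweries", "")
# => %{route: "/api/breweries", path_params: %{}, ...} (other keys elided)

Phoenix.Router.route_info(ElixirMonitoringPromWeb.Router, "GET", "/not/a/route", "")
# => :error, which falls through to the "unknown" clause above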

With all these changes in place, bring up the Docker Compose stack with docker-compose up and run the following in a separate terminal to perform a quick and dirty stress test (or run docker-compose up -d to keep the stack in the background as done in Part 1 - Step 4). The [1-100] range makes curl issue 100 requests, one for each mile_radius value:

# We'll pipe to /dev/null since we don't really care about the output at the moment
$ curl -s "http://localhost:4000/api/breweries?zip_code=98052&mile_radius=[1-100]" > /dev/null

After running that command, if you make a request to http://localhost:4000/metrics (either via curl or in the browser), you should see the raw metrics that your application is exposing. For example, here are some of the metrics that you will see after running the previous curl stress test:

# TYPE http_requests_total counter
# HELP http_requests_total Total number of HTTP requests made.
http_requests_total{status_class="success",method="GET",host="localhost",scheme="http",request_path="/api/breweries"} 100
# TYPE phoenix_channel_receive_duration_microseconds histogram
# HELP phoenix_channel_receive_duration_microseconds Phoenix channel receive handler time in microseconds
# TYPE phoenix_controller_call_duration_microseconds histogram
# HELP phoenix_controller_call_duration_microseconds Whole controller pipeline execution time in microseconds.
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="10"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="25"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="50"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="100"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="250"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="500"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="1000"} 0
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="2500"} 2
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="5000"} 16
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="10000"} 54
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="25000"} 99
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="50000"} 99
phoenix_controller_call_duration_microseconds_bucket{action="index",controller="ElixirMonitoringPromWeb.BreweryController",le="100000"} 100

...

The output you see here from hitting the /metrics endpoint is what Prometheus will scrape every 5 seconds (or whatever interval you configured in your docker/prometheus/config.yml file). Without getting too far into Prometheus, as that is a huge topic on its own, each one of these entries in the output is effectively its own time series that Prometheus will track over time for you. Prometheus supports several metric types (counters, gauges, histograms, and summaries), each meant to capture a particular kind of data [8]. If this is something you would like me to deep dive into, feel free to leave me a message below and I’ll see what I can put together :).

With everything in place, it is now time to create a stress tester that we can run in order to make our pretty graphs do shiny things!

Step 7: Write a stress test tool - commit

Now that everything is in place and working as expected, it is time to throw together a quick and dirty stress tester and hammer the backend service to make those graphs dance! At the root of your project, create a file called stress_test.exs. Within that file, add the following (feel free to change the values up at the top and adjust the stress tester as you desire):

wait_time_per_query_ms = 75
number_of_queries = 1..10_000
miles_to_search = 1..50

valid_zips =
  "#{__DIR__}/priv/repo/wa_zip_codes.csv"
  |> File.read!()
  |> String.split("\n")
  |> Enum.filter(fn line -> String.trim(line) != "" end)
  |> Enum.map(fn csv_line ->
    [zip, _city, _state, _lat, _long, _tz, _dst] =
      csv_line
      |> String.replace("\"", "")
      |> String.replace("\n", "")
      |> String.split(",")

    zip
  end)

:inets.start()

Enum.each(number_of_queries, fn count ->
  radius = Enum.random(miles_to_search)
  zip = Enum.random(valid_zips)

  url =
    String.to_charlist(
      "http://localhost:4000/api/breweries?zip_code=#{zip}&mile_radius=#{radius}"
    )

  :httpc.request(:get, {url, []}, [], [])

  if rem(count, 100) == 0, do: IO.puts("Made #{count} zip code requests")

  :timer.sleep(wait_time_per_query_ms)
end)
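
If you would like to generate heavier load than one request every 75 ms, here is a sketch of a drop-in replacement for the Enum.each/2 loop above that fires requests concurrently with Task.async_stream/3 (the concurrency level and timeout are arbitrary values, so tune them to taste):

# Run up to 10 requests at a time instead of one sequential request per iteration
number_of_queries
|> Task.async_stream(
  fn _count ->
    radius = Enum.random(miles_to_search)
    zip = Enum.random(valid_zips)

    url =
      String.to_charlist(
        "http://localhost:4000/api/breweries?zip_code=#{zip}&mile_radius=#{radius}"
      )

    :httpc.request(:get, {url, []}, [], [])
  end,
  max_concurrency: 10,
  timeout: 30_000
)
|> Stream.run()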

The idea with the stress tester is to generate a wide range of requests so that we see real-world performance rather than sending the same request over and over again (and just hitting the database cache). In order to run the stress tester, run the following in the terminal while your Docker Compose stack is running in the background (if it isn’t already running, use docker-compose up):

$ elixir stress_test.exs
Made 100 zip code requests
Made 200 zip code requests
Made 300 zip code requests
...

With the stress tester running, we can log into our Grafana server and see how our application and services are handling the load. If you navigate to http://localhost:3000 and then view the Phoenix Stats dashboard, you can see how long HTTP requests and database queries are taking. You can also navigate to the PostgreSQL Stats and BEAM Stats dashboards to see how the database and the BEAM are behaving under load.

Image of Phoenix grafana dashboard

Step 8: Adding additional metrics - commit

A monitoring system isn’t terribly useful if we can’t also add our own metrics that we would like to keep track of. For example, if the business side of the operation wanted to know how your users are interacting with your system, giving them stats on the BEAM isn’t going to give them much information. Luckily, using the libraries we pulled into our project, we can create our own metrics and visualize them in Grafana. For this effort we’ll be creating two metrics:

  1. A counter for zip code searches so we can keep track of which zip codes are searched most often.
  2. A histogram so we can keep track of the most common search radii that we receive.

These types of metrics can help business people quantify various hypotheses about the customer base and how the application/service is being used. This in turn allows the business to make data-driven decisions. To get started adding the aforementioned metrics to our application, we’ll need to create a new file at lib/metrics.ex (you can really create this file anywhere, and oftentimes I actually have a Phoenix context named Metrics where all my metrics reside). Within lib/metrics.ex add the following:

defmodule ElixirMonitoringProm.Metrics do
  use Prometheus.Metric

  alias ElixirMonitoringProm.ZipCodes.ZipCode

  def setup do
    # Our counter for the number of searches against a particular zip code.
    # Since 'zip_code' and 'geohash' are labels, each zip code gets its own time series.
    Counter.new(
      name: :elixir_app_zip_code_search,
      help: "Counter for brewery zip code searches",
      labels: [:zip_code, :geohash]
    )

    # The provided radius will fall into one of the defined buckets
    Histogram.new(
      name: :elixir_app_radius_search,
      buckets: [1, 2, 5, 10, 20, 50],
      help: "The radius that people are searching within for breweries"
    )
  end

  def increment_zip_code_search(zip_code) do
    zip_code
    |> ElixirMonitoringProm.ZipCodes.get_zip_code_info()
    |> case do
      %ZipCode{point: %Geo.Point{coordinates: {long, lat}}} ->
        geo_hash = Geohash.encode(lat, long)
        Counter.inc(name: :elixir_app_zip_code_search, labels: [zip_code, geo_hash])

      _ ->
        :noop
    end
  end

  def observe_radius_search(radius) do
    Histogram.observe([name: :elixir_app_radius_search], String.to_integer(radius))
  end
end
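
Once the remaining pieces below (the Geohash dependency and the get_zip_code_info/1 context function) are in place, you can poke at these metrics from an iex -S mix session by calling ElixirMonitoringProm.Metrics.setup() once in the session. The prometheus_ex metric modules also expose value/1 for reading a metric back out of the registry; the return shape shown in the comment is illustrative:

alias Prometheus.Metric.Histogram

ElixirMonitoringProm.Metrics.setup()
ElixirMonitoringProm.Metrics.observe_radius_search("10")

# Returns the per-bucket counts plus the sum of all observations, something
# like {[0, 0, 0, 1, 0, 0, 0], 10} after a single observation of 10
Histogram.value(name: :elixir_app_radius_search)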

You’ll probably notice that we make use of a call to Geohash.encode/2 in order to encode the geohash for the zip code’s lat and long. We’ll need to pull down that dependency by adding an entry to our mix.exs file (make sure to kill your Docker Compose stack since we are installing a new dependency, and run mix deps.get on the host machine to update the mix.lock file):

defp deps do
  [
    ...
    {:geohash, "~> 1.0.0"}
  ]
end

We’ll also need to add the get_zip_code_info/1 function to our zip codes context. The implementation of that function is as follows (be sure to also add the import Ecto.Query line at the top):

defmodule ElixirMonitoringProm.ZipCodes do
  alias ElixirMonitoringProm.{Repo, ZipCodes.ZipCode}

  import Ecto.Query

  ...

  def get_zip_code_info(zip) do
    Repo.one(from zip_code in ZipCode, where: zip_code.zip_code == ^zip)
  end

  ...

end
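
As a side note, since this is a straightforward lookup on a single column, Repo.get_by/2 is an equivalent alternative if you prefer to avoid the query expression (shown here purely as a sketch; the behavior is the same):

def get_zip_code_info(zip) do
  # Same result as the query version above: a single %ZipCode{} or nil
  Repo.get_by(ZipCode, zip_code: zip)
end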

Now that the implementations for the two pieces of data we want are in place, we’ll need to invoke the observe_radius_search/1 and increment_zip_code_search/1 functions from our controller to keep track of the stats (for the purposes of this tutorial this will suffice, but for a production application you may not want to introduce additional latency to your API by having to make an additional DB call). Open up lib/elixir_monitoring_prom_web/controllers/brewery_controller.ex and add the two Metrics module calls to the index/2 function (make sure you also update the alias up at the top):

alias ElixirMonitoringProm.{Breweries, Metrics}

def index(conn, %{"zip_code" => zip_code, "mile_radius" => radius}) do
  results = Breweries.get_breweries_in_radius(zip_code, String.to_integer(radius))
  Metrics.increment_zip_code_search(zip_code)
  Metrics.observe_radius_search(radius)

  json(conn, results)
end
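
As mentioned above, the zip code lookup inside increment_zip_code_search/1 adds an extra database call to every request. One way to keep that work off the request path (a sketch only; the ElixirMonitoringProm.TaskSupervisor name is hypothetical, so you would need to add {Task.Supervisor, name: ElixirMonitoringProm.TaskSupervisor} to your supervision tree) is to record the metrics in a fire-and-forget task:

def index(conn, %{"zip_code" => zip_code, "mile_radius" => radius}) do
  results = Breweries.get_breweries_in_radius(zip_code, String.to_integer(radius))

  # Record metrics outside of the request/response cycle so the extra DB call
  # in increment_zip_code_search/1 adds no latency to the API response
  Task.Supervisor.start_child(ElixirMonitoringProm.TaskSupervisor, fn ->
    Metrics.increment_zip_code_search(zip_code)
    Metrics.observe_radius_search(radius)
  end)

  json(conn, results)
end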

Our last step is to initialize the metrics during application startup in the lib/elixir_monitoring_prom/application.ex file. Add the following line after the other Prometheus initialization calls:

def start(_type, _args) do
  ...

  ElixirMonitoringProm.Metrics.setup()

  children = [
  ...
end

With all that in place, start the Docker Compose stack one last time. Once you see the following lines:

elixir_app_1         | [info] Running ElixirMonitoringPromWeb.Endpoint with cowboy 2.6.3 at 0.0.0.0:4000 (http)
elixir_app_1         | [info] Access ElixirMonitoringPromWeb.Endpoint at http://localhost:4000

Navigate to Grafana at localhost:3000 and go to the Brewery Search dashboard. Once there, run the stress tester again in a terminal via elixir stress_test.exs and watch the magic unfold (the GIF screen recording below has been sped up, but yours should be refreshing every 5 seconds)!

Image of application grafana dashboard

Closing thoughts

Well done and thanks for sticking with me to the end! We covered quite a lot of ground and hopefully you picked up a couple of cool tips and tricks along the way. In Part 2 of this tutorial we covered how to set up Prometheus and Grafana within a Docker Compose stack. We also got Prometheus to fetch metrics from our Phoenix application and even added 2 of our own business related metrics. We even threw together a quick stress tester using Erlang’s built-in HTTP client. As you can see, getting your application to expose metrics is not too difficult and can yield many rewards. We didn’t have a chance to deep dive into Grafana or Prometheus this time around, but perhaps next time!

A few caveats before closing out. When dealing with time series data, it is recommended not to use labels with high cardinality. We kind of broke that rule given that our zip code search metric has the zip code itself as a label, which means we could end up with 700+ time series to track. Of course, this is a tutorial and we are allowed to take some shortcuts (mostly to make some cool visuals). If you were to track something geographic like this, I would recommend dealing with a label space of no more than 200 values (i.e. use countries or states as labels). In particular, make sure not to use unbounded values such as user IDs as labels, or your Prometheus instance is going to have a bad time. We also did not cover how to get a stack like this ready for a production environment, so make sure you do your due diligence before running this in a production setting.

One last point I would like to cover is how monitoring differs from logging. We could have accomplished something similar using a logging solution like ELK by counting log entries that match some pattern. In my experience, though, it is a lot easier to maintain a codebase where metrics are an explicit call (versus an implicit log statement which is aggregated elsewhere), and your logging solution deals with a lot less stress and noise if it only has to handle log events. In addition, given that Prometheus (and other monitoring databases for that matter) is fundamentally a time series database, ELK won’t be able to compete in terms of query performance when you are exploring data in your dashboards. As a result, it is best to use each tool for its specific use case: keep application and system event logs in your logging solution, and use a monitoring solution for application and system metrics.

As a reward for making it this far, I’ll give you a bonus and show you the BEAM dashboard during the stress test :).

Image of BEAM Grafana dashboard

Feel free to leave comments or feedback or even what you would like to see in the next tutorial. Till next time!

Additional Resources

Below are some additional resources if you would like to deep dive into any of the topics covered in the post.

