Going CPU bound
(An optimisation journey in Elixir)

I’m going to present findings from four months of performance optimisations where our team was tasked with creating a transaction service that can handle 40_000 transactions per second on a single node.

I'll present our team's process and the insights we gained along the way. Much of the performance came from general DB optimisations (Postgres), but I'll focus on simulating traffic and on critical characteristics of standard library functions.

Before we could achieve the desired performance on the main server, we first had to simulate an appropriate amount of traffic, with a distribution similar to our current one.

Starting point

  1. Experience with a medium-traffic server (2000 req/s)
  2. Experience with monitoring traffic and errors via telemetry / Prometheus / Grafana
  3. A mocking tool (built with Elixir) to simulate traffic following a Pareto distribution (the 80/20 rule; see the sketch below)
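
As a hedged illustration (not the real mock tool), the 80/20 traffic shaping can be as simple as sending 80% of the requests to the hottest 20% of the keys:

defmodule TrafficShape do
  # Illustrative only: assumes `keys` is sorted hottest-first
  # and contains at least five entries.
  def pick_key(keys) do
    hot_count = max(div(length(keys), 5), 1)

    if :rand.uniform() < 0.8 do
      # 80% of the traffic goes to the hot 20% of keys
      keys |> Enum.take(hot_count) |> Enum.random()
    else
      # the remaining 20% is spread over the long tail
      keys |> Enum.drop(hot_count) |> Enum.random()
    end
  end
end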

Monitoring setup (dummy endpoint)

Key measurements

  1. Requests per second
  2. BEAM VM memory consumption
  3. CPU utilization
  4. Disk usage
  5. Memory utilization
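
A minimal sketch of how such measurements can be wired up with Telemetry.Metrics and a Prometheus reporter (module and metric names are illustrative; the VM metrics assume telemetry_poller is running):

defmodule MockTool.Telemetry do
  import Telemetry.Metrics

  def metrics do
    [
      # emitted by our request-handling code
      counter("http.request.count"),
      # emitted periodically by telemetry_poller
      last_value("vm.memory.total", unit: :byte),
      last_value("vm.total_run_queue_lengths.total")
    ]
  end
end

# In the application supervision tree: exposes /metrics for Prometheus
children = [
  {TelemetryMetricsPrometheus, metrics: MockTool.Telemetry.metrics()}
]

The Finch metrics shown later plug into the same list.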


Methodology

We started by upgrading and optimizing our internal mock tool, so we could measure performance impacts from various code changes and also get a lower bound on hardware requirements.

The starting test bench was a production-grade server with a single endpoint that did nothing but return 200 OK, driven by our first mock implementation, which sent requests and waited for the replies. This served as a best-case scenario and as a check that everything worked as expected (it did not).
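
For illustration, such a dummy endpoint amounts to little more than this (a sketch, not our actual router):

defmodule DummyEndpoint do
  use Plug.Router

  plug :match
  plug :dispatch

  # Answer everything with 200 OK: no parsing, no DB, no logging
  match _ do
    send_resp(conn, 200, "OK")
  end
end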


Logging

Latency numbers every programmer should know

Operation                                   Category      Typical latency
Register / increment (1 CPU cycle)          CPU           0.3 ns
L1 cache reference                          CPU           1 ns
Branch mispredict                           CPU           3 ns
L2 cache reference                          CPU           4 ns
Main memory (RAM) access                    Memory / OS   100 ns
Context switch                              Memory / OS   1–3 μs
NVMe SSD read (4 KB)                        Local I/O     100 μs
Roundtrip in same datacenter                Local I/O     500 μs
HTTP call, same region                      Network       10–50 ms
Third-party HTTPS API (Stripe, Twilio)      Network       100–500 ms

Since logging is an I/O operation, the question is: how much are we paying for it?

Legacy logging setup

Our mocking tool used to have three loggers:

  • the default logger (standard IO)
  • a file logger (from the days when we ran the mock as a daemon, without systemd)
  • a Mnesia logger (so we could see logs on our frontend)

Logging quick test

To answer the question of whether logging has an impact on performance, we simply turned it OFF.

The result was a nice 40% performance gain!
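
In config terms, "OFF" can be as simple as the following (a sketch; not the exact diff we shipped):

# config/prod.exs
import Config

# silence the Logger entirely
config :logger, level: :none

# or, at runtime, drop the default handler via the Erlang/OTP logger API:
:logger.remove_handler(:default)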

HTTP client

We started with the built-in :httpc module, which was a good choice.

Some performance improvements can be made by using HTTP/2, so we switched to the Finch library.
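
A minimal sketch of the switch (pool name and sizes are illustrative, not our production values):

# In the application supervision tree
children = [
  {Finch,
   name: MockFinch,
   pools: %{
     default: [protocols: [:http2], count: 8]
   }}
]

# Issuing a request through the named pool
Finch.build(:get, "https://example.com/health")
|> Finch.request(MockFinch)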

Sidenote (HTTP/2)

With both HTTP/1.1 and HTTP/2, the pool configuration should take the target concurrency into account:

# HTTP/1.1
pool_count * pool_size > concurrency

HTTP/2 multiplexes up to 100 requests in a single connection - therefore the minimum pool_count is:

# HTTP/2
min_pool_count = concurrency / 100

The following formula is a good starting point for experimentation (at our target concurrency of 20_000 it yields 3 * 20_000 / 100 = 600 pools):

pool_count = (3 * concurrency / 100) |> round()

Finch telemetry

We added start_pool_metrics?: true to the Finch pool configuration in application.ex:

%{
  :default => [size: 12000, protocols: [:http1], start_pool_metrics?: true]
}
# Finch metrics (Metrics below is `alias Telemetry.Metrics`)
Metrics.distribution(
  "finch.request.duration.milliseconds",
  event_name: [:finch, :request, :stop],
  unit: {:native, :millisecond},
  tags: [:name],
  measurement: :duration,
  reporter_options: [
    buckets: [50, 100, 200, 400, 800, 1600]
  ]
),
Metrics.counter(
  "finch.request.error.count",
  event_name: [:finch, :request, :exception],
  tags: [:name, :kind]
)

Receiving Finch telemetry data

GOOD - we can see the telemetry in Grafana.

But ... enabling the pool metrics cost us throughput:

[Chart: Req/s with Finch telemetry OFF vs ON]

Speed test

Things were progressing, but then we hit a bottleneck.

Performance started dropping as we increased HTTP request concurrency; at the same time, we could not utilize more than 50% of the CPU.
[Chart: Requests per second by concurrency]

CPU utilization

[Chart: CPU utilization plateauing around 50%]

Task.async_stream

This function has an interesting option, :max_concurrency. From the Elixir documentation:

Options

  • :max_concurrency - sets the maximum number of tasks to run at the same time. Defaults to System.schedulers_online/0.

Example

stream = Task.async_stream(
  collection, 
  Mod, 
  :expensive_fun, 
  [], 
  max_concurrency: max_concurrency
)
Enum.to_list(stream)

Code Beam async_stream

Concurrency Test

defmodule ConcurrencyTest do
  def async_stream_test({range, max_concurrency}) do
    range
    |> Task.async_stream(
      fn _ ->
        Process.sleep(10)
        1
      end,
      max_concurrency: max_concurrency,
      ordered: false,
      timeout: :infinity
    )
    |> Stream.run()
  end
end
tests = 100..2000//100
  |> Enum.map(fn concurrency -> {1..1_000_000, concurrency} end)
  |> Enum.map(fn {_, concur} = args -> 
    start = System.system_time(:millisecond)
    ConcurrencyTest.async_stream_test(args)
    stop = System.system_time(:millisecond)
    %{concurrency: concur, duration: stop - start}
  end)
[Chart: Duration by concurrency level]
[Chart: Duration and average max run queue length by concurrency]

nested_async_stream

The queue lengths above hint at the cause: a single Task.async_stream caller both spawns the tasks and collects every reply, so at high concurrency it becomes a serialization point. Nesting the streams spreads that coordination across many chunk-level processes:

defmodule ConcurrencyTest do
  def nested_async_stream({range, concurrency}) do
    chunk_size = 100
    outer_concurrency = div(concurrency, chunk_size)

    range
    |> Stream.chunk_every(chunk_size)
    |> Task.async_stream(
      fn chunk ->
        chunk
        |> Task.async_stream(
          fn _ ->
            Process.sleep(10)
            1
          end,
          max_concurrency: chunk_size,
          ordered: false,
          timeout: :infinity
        )
        |> Stream.run()
      end,
      max_concurrency: outer_concurrency,
      ordered: false,
      timeout: :infinity
    )
    |> Stream.run()
  end
end
tests = 100..2000//100
  |> Enum.map(fn concurrency -> {1..1_000_000, concurrency} end)
  |> Enum.map(fn {_, concur} = args -> 
    start = System.system_time(:millisecond)
    ConcurrencyTest.nested_async_stream(args)
    stop = System.system_time(:millisecond)
    %{concurrency: concur, duration: stop - start}
  end)
[Chart: Duration and average max run queue length by concurrency (nested async stream)]

Flow

Computational flows with stages.

Flow allows developers to express computations on collections, similar to the Enum and Stream modules, although computations will be executed in parallel using multiple GenStages.

Nested async_stream took us to a couple of thousand concurrent requests. To go further, we switched to Flow, which allowed us to simulate our target concurrency of 20_000.
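
A sketch of the shape this took (stage counts are illustrative, and Process.sleep/1 stands in for the real HTTP call):

1..1_000_000
|> Flow.from_enumerable(stages: 200, max_demand: 100)
|> Flow.map(fn _ ->
  # stand-in for the real request via Finch
  Process.sleep(10)
  1
end)
|> Flow.run()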

CPU utilization after Flow

[Chart: CPU utilization after switching to Flow]

Encoding JSON

We have multiple options for JSON encoding:

  • JSON.encode!/2
  • JSON.encode_to_iodata!/2
    This is the most efficient format if the JSON is going to be used for IO purposes.
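
The iodata variant is cheaper because the encoder can return a nested list of binaries instead of building one flat binary, and sockets (as well as Plug's send_resp/3) accept iodata directly:

payload = JSON.encode_to_iodata!(%{ok: true})
# flattening is only needed if a single binary is required:
IO.iodata_to_binary(payload)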

But since the data format we are using to simulate traffic is not going to change, there is another option:

  • EEx

Benchee setup EEx JSON encoder

defmodule EExJsonEncoder do
  require EEx

  defp bool(true), do: "true"
  defp bool(false), do: "false"

  template = """
  {"param_a":"<%= t.param_a %>",\
  "param_b":"<%= t.param_b %>",\
  "param_c":"<%= t.param_c %>",\
  "param_d":<%= bool(t.param_d) %>,\
  "param_e":"<%= t.param_e %>",\
  "param_f":"<%= t.param_f %>",\
  "param_g":"<%= t.param_g %>",\
  "param_h":"<%= t.param_h %>",\
  "param_i":<%= bool(t.param_i) %>,\
  "param_j":<%= bool(t.param_j) %>,\
  "param_k":<%= bool(t.param_k) %>,\
  "param_l":<%= bool(t.param_l) %>,\
  "param_m":<%= bool(t.param_m) %>,\
  "param_n":"<%= t.param_n %>",\
  "param_o":"<%= t.param_o %>",\
  "param_p":"<%= t.param_p %>",\
  "param_r":"<%= t.param_r %>",\
  "param_s":<%= Integer.to_string(t.param_s) %>,\
  "param_t":"<%= t.param_t %>",\
  "param_u":"<%= t.param_u %>",\
  "param_v":"<%= t.param_v %>"}\
  """
  
  EEx.function_from_string(:def, :encode, template, [:t])
end

Benchee run

Benchee.run(
  %{
    "JSON.encode!" => &JSON.encode!/1,
    "JSON.encode_to_iodata!" => &JSON.encode_to_iodata!/1,
    "EEx template" => &EExJsonEncoder.encode/1
  },
  inputs: %{
    "sample" => TestData.sample()
  },
  time: 5,
  warmup: 2,
  memory_time: 2,
  reduction_time: 2,
  print: [fast_warning: false]
)

JSON benchee results

Name                             ips        average
EEx template                719.37 K        1.39 μs
JSON.encode_to_iodata!      116.67 K        8.57 μs
JSON.encode!                109.08 K        9.17 μs

Comparison: 
EEx template                719.37 K
JSON.encode_to_iodata!      116.67 K - 6.17x slower +7.18 μs
JSON.encode!                109.08 K - 6.59x slower +7.78 μs

Memory usage statistics:

Name                      Memory usage
EEx template                 0.0859 KB
JSON.encode_to_iodata!         4.47 KB - 52.00x memory usage +4.38 KB
JSON.encode!                   4.53 KB - 52.73x memory usage +4.45 KB

Lessons learned

  • Track and document your progress and regressions, with dates (and commits)
  • When simulating traffic, simulate a lot (in our case we simulate 500M requests; the full simulation lasts over 1h!)
    • (a queuing system can be very helpful)
  • Do not log everything: every IO operation has a cost!
  • Do not put telemetry on everything (measure its cost)!
  • The server generating the traffic must be much faster than the expected speed of the server consuming it (in our case 250_000 req/s vs 40_000 req/s)

Feedback
