Such a setup was good enough to handle the expected traffic with an autoscaling mechanism, but we wanted to be sure that our system would handle all these users even when a part of it broke. The application tier was already covered by a Disaster Recovery plan, since each broken instance of the app could be easily and automatically replaced within the autoscaling group. The more interesting part was the data tier. To simulate a crash of one of the DB instances, we used Amazon's built-in failover mechanism. Unfortunately, after failover the application still kept its TCP sessions open, even with the reaping_frequency setting enabled. This left plenty of application instances with many open but unused TCP sessions that filled up the connection pool, which caused timeouts on the client side.
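For context, reaping_frequency is an ActiveRecord connection-pool setting configured in config/database.yml; it controls how often the reaper checks for dead connections. A minimal sketch with illustrative values (not our production configuration):

```yaml
# config/database.yml -- illustrative values only
production:
  adapter: postgresql
  pool: 5                 # max connections held by each application process
  reaping_frequency: 10   # seconds between reaper runs that reclaim dead connections
```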
Sidenote: To keep this story reasonably short, please refer to the following stories, where others bumped into a similar problem:
This was unacceptable, so after some debugging we decided to abandon the makara gem and offload connection pooling to third-party middleware.
Before we move on to the middleware itself, we need to understand what connection pooling actually is and why it is so important.
You can get an outstanding understanding of connection pools from this great article, but to keep things short I will just paste a quote here so you know what we are talking about:
A pool is an object that maintains a set of connections internally, without allowing direct access or usage. These connections are given out by the pool when communication with the database needs to happen, and returned to the pool when the communication is finished. The pool may be initialized with a configured number of connections, or it may be filled up lazily on-demand. The ideal usage of a connection pool would be that code requests a connection from the pool (called a checkout) just when it needs to use it, uses it, and puts it back in the pool (a release) immediately. This way, the code is not holding on to the connection while all the other non-connection related work is being done, greatly increasing efficiency. This allows many pieces of work to be done using one or a few connections. If all the connections in a pool are in use when a new checkout is requested, the requester will usually be made to wait (will block) until a connection has been released.
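The checkout/release cycle described above can be sketched in a few lines of Ruby. This is a deliberately minimal illustration, not makara's or ActiveRecord's actual implementation; the pool hands out lazily created connections and blocks when all of them are in use:

```ruby
# Minimal connection-pool sketch: connections are created lazily, checked out
# for use, and released back; checkout blocks when every connection is in use.
class SimplePool
  def initialize(size:, &factory)
    @factory = factory            # builds a new "connection" on demand
    @queue   = Thread::Queue.new  # thread-safe store of free slots/connections
    size.times { @queue << nil }  # nil slots mean "not created yet"
  end

  # Blocks until a slot is free, then hands out a (lazily created) connection.
  def checkout
    slot = @queue.pop             # blocks if the pool is exhausted
    slot || @factory.call
  end

  # Returns the connection to the pool so other callers can reuse it.
  def release(conn)
    @queue << conn
  end

  # Checkout/release bracketed around a block -- the ideal usage pattern.
  def with_connection
    conn = checkout
    yield conn
  ensure
    release(conn)
  end
end
```

Using with_connection keeps the hold time short, which is exactly the "greatly increasing efficiency" point from the quote: the connection is only held while it is actually needed.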
Since our system had become more complex and the application didn't handle DB failover, we decided to offload connection pooling to third-party software. There are two popular proxies dedicated to PostgreSQL:
These systems allow you to make as many database connections as you want without worrying about management, because the connections they give you are cheap simulated connections that they handle with low overhead. When you attempt to use one of these simulated connections, they pull a real connection out of an internal pool and map your fake connection onto the real one. Once the proxy sees that you've finished using the connection, it keeps your fake connection open but aggressively releases and reuses the real connection. The connection counts and release-aggressiveness settings are configurable and help you tune for gotchas like transactions, prepared statements, and locks.
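The multiplexing idea can be sketched as follows: many cheap client "sessions" share a small pool of real connections, and a real connection is borrowed only for the duration of each statement, then aggressively released. The Session class and its query API are hypothetical stand-ins, not PgPool's or PgBouncer's actual internals:

```ruby
# Proxy-style multiplexing sketch: clients hold cheap session handles, while a
# real connection is borrowed only per statement and released immediately after.
class Session
  def initialize(real_pool)
    @real_pool = real_pool        # shared pool of expensive real connections
  end

  # Borrow a real connection, "execute" the statement, release it right away,
  # so a few real connections can serve many concurrently open sessions.
  def query(sql)
    conn = @real_pool.pop         # borrow a real connection (blocks if busy)
    "#{conn}: #{sql}"             # pretend to execute the statement
  ensure
    @real_pool << conn if conn    # aggressively release it for reuse
  end
end
```

With this shape, ten open sessions backed by two real connections never exceed two server-side backends, which is the whole point of putting the proxy in front of the database.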
For a scalable Amazon Aurora database, only PgPool fits well, since PgBouncer does not support READ/WRITE SQL query splitting. The configuration of PgPool can be overwhelming at the beginning, but there is a set of settings dedicated to Aurora that can be a good starting point. It's especially worth focusing on the max_pool parameter, which determines how many connections can be opened to a database. More information about the relationships between the parameters and how to set them according to your needs can be found here.
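As a rough orientation, the relevant PgPool parameters look like this (values are illustrative, not our production configuration; the total number of backend connections is bounded by num_init_children * max_pool):

```ini
# pgpool.conf -- illustrative excerpt
num_init_children = 32   # pgpool child processes (max concurrent client sessions)
max_pool = 4             # cached backend connection pairs per child process
load_balance_mode = on   # spread read-only queries across replicas
```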
After some configuration tuning and a couple more performance tests we finally managed to achieve the goal:
As you can see, connections are spread evenly across the replicas, and during failover, sessions from the failed instance are released and switched over to the other replicas instantaneously.
After a few twists and turns, we ended up with a bullet-proof, scalable infrastructure that is resilient at both the DB and application levels. The system will handle even more than 2.5k simultaneous users in a cost-efficient way, since it can scale out automatically based on the system load. This journey was intended to present a real-life example of how an infrastructure system can be designed and improved, with all the ups and downs along the way.
This story is not over yet, since there is much more we can tune, replica lag for example. So I would like to leave you with a quote that I think this post illustrates:
It is not how it ends that matters, but the journey it takes to get there.