Quality of Service Mirroring Timeout Configuration
The Quality of Service (QoS) timeout determines how long a Mirror configuration of Caché will tolerate a loss of connectivity between its members. However, the exact method by which it does this and what that means for someone looking to configure this setting on their own system is not entirely obvious. For example, InterSystems released an alert in April of 2015 detailing the consequences of having a shorter QoS timeout on virtualized systems and our decision to change the default value of QoS from 2 seconds to 8 seconds in newer versions. To start with, however, I want to discuss how exactly QoS is used on any given system. For the purpose of this discussion, I am only going to refer to the machines and mirror members involved in mirror failover; asynchronous mirror members (which cannot automatically take over in any failure scenario) play no part in mirror failover.
Agent Controlled Failover Mode (all systems pre 2015.1, and some 2015.1+)
All mirrors running on systems before version 2015.1 of Caché, and some mirrors on later systems, make use of only two machines to determine failover; they lack the third machine used as an arbiter in later mirrors. These two machine setups operate in what is known as Agent Controlled Failover Mode. Therefore, understanding what will occur in any given scenario when operating under Agent Controlled Failover Mode requires us to look only at the primary and backup failover machines.
On the primary machine, the QoS timeout determines how long the primary may wait for acknowledgment from the backup that the backup has received journal data. Although data can be entered (and stored in buffers) while the primary waits for an acknowledgment, Caché will not return from operations such as a journal sync. While waiting on the QoS timeout, the primary will not write to the database without an acknowledgment, but will write to the journals. After the QoS timeout, the primary sends a message to the backup telling it to disconnect and then enters the trouble state, in which the primary writes to neither the journals nor the databases.
If the QoS+Trouble Timeout time limit is exceeded waiting on a response from the backup, then the primary will continue without acknowledgement from the backup. Trouble Timeout is three times the length of the QoS through version 2015.1, and two times the length of the QoS in 2015.2+. The primary will continue in this way until the backup reconnects, catches up, and then is marked active. When the backup is once more marked active, the primary will again wait on acknowledgements before committing journal writes to disk.
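The arithmetic above can be sketched in a few lines. This is only an illustration of the multipliers quoted in the text, not anything Caché exposes: the function names and the (major, minor) version tuple are assumptions made for this example.

```python
from typing import Tuple


def trouble_timeout(qos_seconds: float, version: Tuple[int, int]) -> float:
    """Trouble timeout per the text: 3x QoS through 2015.1, 2x QoS in 2015.2+."""
    multiplier = 3 if version <= (2015, 1) else 2
    return multiplier * qos_seconds


def max_wait_before_continuing(qos_seconds: float, version: Tuple[int, int]) -> float:
    """Longest the primary waits on the backup before continuing alone."""
    return qos_seconds + trouble_timeout(qos_seconds, version)


# Old default of 2 seconds on a pre-2015.2 system: 2 + 3*2 seconds.
print(max_wait_before_continuing(2, (2014, 1)))
# New default of 8 seconds on 2015.2: 8 + 2*8 seconds.
print(max_wait_before_continuing(8, (2015, 2)))
```

Note that with the defaults described in this article, the total wait grows from 8 seconds to 24 seconds even though the multiplier shrinks.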
On the other side of the connection, if the backup gets no message from the primary within half the QoS timeout period, it will send a message to make sure the primary is still alive. This ensures that, on systems where the primary goes for a long period without sending data, the backup does not disconnect or erroneously mark itself as behind. The backup then waits an additional ½ QoS for a response from the primary machine.
If the primary tells the backup that the primary is still up, the backup will continue normally. If the backup does not receive such a message, it will check with the ISCAgent on the primary machine to see if the primary instance is down or hung. If it gets confirmation that the primary is down or hung, it will force down the primary and take over itself as the new primary. If the backup gets no message from the primary after requesting one, it has no way of knowing whether the primary is running normally and has just lost connection with the backup, or if the primary is in fact down. Therefore the backup will not take over as primary, as two mirror members running as primary simultaneously (a condition known as split brain) can lead to unpredictable and unrecoverable results.
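The backup's choices in this paragraph reduce to a small decision function. The sketch below only models the prose; the enum and the two boolean inputs are hypothetical names chosen for illustration and do not correspond to any Caché or ISCAgent API.

```python
from enum import Enum


class BackupAction(Enum):
    CONTINUE = "continue normally as backup"
    TAKE_OVER = "force the primary down and become primary"
    STAY_BACKUP = "do not take over (primary state unknown)"


def backup_action(primary_replied: bool, agent_confirms_primary_down: bool) -> BackupAction:
    """Agent controlled mode: what the backup does after pinging the primary."""
    if primary_replied:
        return BackupAction.CONTINUE
    # No reply from the primary, so the backup asks the ISCAgent on the primary machine.
    if agent_confirms_primary_down:
        return BackupAction.TAKE_OVER
    # Without confirmation, taking over would risk split brain.
    return BackupAction.STAY_BACKUP


print(backup_action(primary_replied=False, agent_confirms_primary_down=False).value)
```

The important property is that a takeover requires positive confirmation; silence alone is never enough.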
This process of checking on the primary after only ½ QoS without a message means that the backup will wait for close to the full QoS timeout on a busy system after an actual disconnect before taking over. On a system with low activity, the backup might respond in as little as half QoS, as the primary may not have sent a message for up to half QoS before there was any actual loss of connection.
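As a rough illustration of the timing just described, the window between an actual disconnect and the moment the backup stops waiting on the primary can be computed as follows. The function name is an assumption for this example, and the result is only the bound implied by the text; what the backup does next still depends on the ISCAgent check.

```python
from typing import Tuple


def backup_detection_window(qos_seconds: float) -> Tuple[float, float]:
    """Return (shortest, longest) seconds from a real disconnect until the
    backup finishes waiting for the primary's reply."""
    half = qos_seconds / 2
    # Idle system: the primary may already have been silent for nearly half
    # QoS, so only the half-QoS wait for a reply remains.
    shortest = half
    # Busy system: a message arrived just before the disconnect, so the backup
    # waits half QoS before pinging and half QoS more for the reply.
    longest = 2 * half
    return shortest, longest


print(backup_detection_window(8))
```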
Arbiter Controlled Failover Mode (2015.1+ only)
The alternative to Agent Controlled Failover Mode is Arbiter Controlled Failover Mode. It is available only in Caché versions 2015.1 and later. In a system with an arbiter, both mirror members also maintain connection with a third, arbiter machine. The arbiter dictates failover behavior when the whole primary machine loses connection or goes down. It is important to note that whenever the two failover members in a mirror configured to use an arbiter are connected to each other and to the arbiter, the system will switch to Arbiter Controlled Failover Mode automatically.
When using an arbiter, the primary behaves much as it does when there is no arbiter, with two exceptions. The first exception is when the primary unexpectedly loses connection to the backup. When such a loss of connection occurs, the primary will enter trouble state and send a message to the arbiter to check on the arbiter’s connection with the backup. The primary will wait in trouble state until the arbiter sends a return message, preventing all writes during this time. If the arbiter responds and says that it also lost connection to the backup, the primary will switch to agent controlled mode and continue writes.
If the arbiter responds saying that it is still connected to the backup, then the primary will send a message to the backup (via the arbiter) to initiate a switch to agent controlled mode. Once the backup switches to agent controlled mode, it will send a message to the primary (also via the arbiter) saying it is safe to switch to agent controlled mode. The primary will then switch to agent controlled mode and continue writes.
If the arbiter does not report back at all, then the primary will continue blocking updates, as it knows that the backup and arbiter may believe the primary to be down. This is an important feature for preventing split brain: without it, the logic the arbiter uses to decide when the backup should take over would be invalid. The primary will wait indefinitely in this trouble state for acknowledgment from either the backup or the arbiter, and upon getting such a signal will not continue as primary unless it gets confirmation that the backup did not take over during the break in communication.
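The three cases for a primary that has lost its backup in arbiter controlled mode can be summarized in a short sketch. The enum names and the backup_confirmed_switch flag are hypothetical; they model the messages described in the prose rather than any real interface.

```python
from enum import Enum
from typing import Optional


class ArbiterReply(Enum):
    LOST_BACKUP = "arbiter has also lost the backup"
    SEES_BACKUP = "arbiter is still connected to the backup"


class PrimaryState(Enum):
    TROUBLE = "trouble state, writes blocked"
    AGENT_CONTROLLED = "agent controlled mode, writes resume"


def primary_after_losing_backup(
    arbiter_reply: Optional[ArbiterReply],
    backup_confirmed_switch: bool = False,
) -> PrimaryState:
    """State of the primary after it loses the backup in arbiter controlled mode."""
    if arbiter_reply is None:
        # No word from the arbiter: the primary may be isolated, so it keeps blocking.
        return PrimaryState.TROUBLE
    if arbiter_reply is ArbiterReply.LOST_BACKUP:
        return PrimaryState.AGENT_CONTROLLED
    # Arbiter still sees the backup: wait until the backup confirms, via the
    # arbiter, that it has switched to agent controlled mode.
    if backup_confirmed_switch:
        return PrimaryState.AGENT_CONTROLLED
    return PrimaryState.TROUBLE


print(primary_after_losing_backup(None).value)
```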
The second exception is that the primary will wait indefinitely on the backup to respond to the intentional disconnect message sent when entering trouble state. If the backup receives the disconnect instruction, it will switch to agent controlled and then instruct the primary to switch to agent controlled. The primary will then continue after switching. However, if the primary does not receive the signal to switch, for instance due to some other disconnect, it will continue in trouble state until it does receive such a signal.
The backup in arbiter controlled mode will send a message to the arbiter if it loses connection with the primary. If it gets no return message from the arbiter, it will wait indefinitely, so as to not take over if it is isolated. The arbiter may send either of two different messages to the backup to end this behavior.
First the primary may have sent an instruction to initiate a switch to agent controlled mode via the arbiter as described above. When the arbiter passes this message along to the backup, the backup will switch to agent controlled mode and then send a message via the arbiter for the primary to do the same.
Second, the arbiter may send a message to the backup saying that the arbiter has also lost connection with the primary. If this occurs, then the backup will take over as primary. This is known to be safe (prevent split brain) due to the primary pausing writes whenever it loses connection to both arbiter and backup in arbiter controlled mode.
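The backup's side of arbiter controlled mode can be sketched the same way. The message and action names below are assumptions made for this illustration; only the mapping between them comes from the text.

```python
from enum import Enum
from typing import Optional


class ArbiterMessage(Enum):
    SWITCH_REQUEST_FROM_PRIMARY = "primary asks to switch to agent controlled mode"
    ARBITER_LOST_PRIMARY = "arbiter has also lost the primary"


class BackupAction(Enum):
    WAIT = "wait indefinitely (backup may be isolated)"
    SWITCH_TO_AGENT_CONTROLLED = "switch to agent controlled mode, then tell the primary via the arbiter"
    TAKE_OVER = "take over as primary"


def backup_after_losing_primary(message: Optional[ArbiterMessage]) -> BackupAction:
    """Arbiter controlled mode: the backup's reaction after losing the primary."""
    if message is None:
        return BackupAction.WAIT
    if message is ArbiterMessage.SWITCH_REQUEST_FROM_PRIMARY:
        return BackupAction.SWITCH_TO_AGENT_CONTROLLED
    # Both backup and arbiter have lost the primary. The primary blocks writes
    # in exactly this situation, so taking over cannot cause split brain.
    return BackupAction.TAKE_OVER


print(backup_after_losing_primary(ArbiterMessage.ARBITER_LOST_PRIMARY).value)
```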
It is important to note that unless all three machines lose connectivity nearly simultaneously, any connection loss will result in a switch to agent controlled mode. This ensures there are no problems when the machines eventually reconnect or if multiple disconnects are staggered.
Pitfalls and Concerns when Configuring your Quality of Service Timeout
In order to determine what QoS is best for your system, you need to weigh two competing concerns: how long you are willing to wait for a failover when the primary has actually gone down, and how much time you want to give the primary and network to send new information when they are running slowly. These concerns are partly subjective, but there are some guidelines to keep in mind, the most common of which concerns backup utilities.
VM snapshots and other backup utilities can often take several seconds to run. Additionally, many of these freeze networking for some or all of their duration. If the length of a network outage caused by a snapshot exceeds the QoS timeout period, then you will experience a failover. This is the primary reason that InterSystems changed the default QoS from 2 seconds to 8 seconds in versions 2015.2+. In fact, any event that prevents the primary from talking to other machines for the QoS timeout or longer will cause failovers, even if the primary could have continued with no problem. Benchmarking tests on your VM performance, disk performance, and network performance will help determine just how long these pauses in communication may last.
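Once you have benchmark numbers, a quick comparison against a candidate QoS shows which pauses are long enough to risk an unwanted failover. This is a minimal sketch: the function name and the sample pause durations are made up for illustration, and a real measurement would come from your own snapshot and network tests.

```python
from typing import List


def pauses_that_risk_failover(pause_seconds: List[float], qos_seconds: float) -> List[float]:
    """Return the measured pauses that last for the QoS timeout or longer."""
    return [p for p in pause_seconds if p >= qos_seconds]


# Hypothetical pause lengths (seconds) observed during VM snapshots.
snapshot_pauses = [0.4, 3.1, 1.2, 5.0]

print(pauses_that_risk_failover(snapshot_pauses, 2))
print(pauses_that_risk_failover(snapshot_pauses, 8))
```

With these sample numbers, two pauses would be a problem at the old 2 second default and none at the 8 second default.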
On the other hand, long QoS timeouts mean that data may not be written to the database for longer periods of time. This means that users may be more likely to notice delays in functionality as the primary is willing to wait longer on journal syncs for a response from the backup. It also means that in an event where the primary becomes disconnected from the backup but not data processing, more data will potentially be lost as processes send asynchronous writes to the primary that are then never sent to the backup before failover. Remember that in the latter situation no data that were guaranteed to be durable, such as those checked by journal syncs, will be lost. Any write that cares about whether it was successful is forced to wait for the backup to acknowledge receipt of that journal information before committing any changes to the database.
Conceptually, the QoS timeout is the length of time a system is willing to wait on making writes to the database should the primary and backup lose connection, and long periods may not be desirable. Excessive QoS timeouts may even result in system freezes due to a lack of journal writes, as Caché will freeze if it waits too long on a journal write. The primary freezes all writes while waiting on the QoS timeout in order to make sure no data is guaranteed as stored until both mirror members have that data, but this means that if the QoS timeout is too long, the journal inactivity timeout could be exceeded and cause the whole system to freeze while the journal daemon is waiting on the QoS timeout. This might even result in a failover onto the backup that was having trouble receiving updates in the first place! On earlier versions, the default timeout for journal inactivity is 10 seconds, whereas in Caché 2013.1.4+ the journal inactivity timeout was increased to 30 seconds. Therefore, it is imperative that QoS not be higher than those values.
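The limit described above is easy to check mechanically. In the sketch below the two constants are the journal inactivity defaults quoted in this article; the constant and function names are assumptions for the example, and you should confirm the value that applies to your own version.

```python
# Journal inactivity defaults quoted in the text, in seconds.
JOURNAL_INACTIVITY_BEFORE_2013_1_4 = 10
JOURNAL_INACTIVITY_2013_1_4_AND_LATER = 30


def qos_within_journal_limit(qos_seconds: float, journal_inactivity_seconds: float) -> bool:
    """True if the QoS timeout is not higher than the journal inactivity timeout."""
    return qos_seconds <= journal_inactivity_seconds


print(qos_within_journal_limit(8, JOURNAL_INACTIVITY_2013_1_4_AND_LATER))
print(qos_within_journal_limit(12, JOURNAL_INACTIVITY_BEFORE_2013_1_4))
```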
Overall, the QoS timeout must be small enough that you can tolerate waiting that long for a failover in a primary failure event, but large enough to not cause unwanted failovers due to normal losses of connection. Since QoS timeout is not generally modified before and after VM snapshots and similar events, it is typically best to look at your longest expected breaks in connectivity for a metric on how long your QoS should be. You can then configure QoS to be long enough to be able to handle those events, but not too much longer. This will minimize the time to failover in an actual primary member failure while simultaneously preventing unwanted failover events.
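One way to turn this guideline into a number is sketched below: take the longest expected break in connectivity, add some headroom, and refuse any result above an upper limit such as the journal inactivity timeout. The 25% headroom, the function name, and the sample pauses are all assumptions for illustration; the article gives no specific margin.

```python
import math
from typing import Iterable


def suggest_qos(expected_pauses: Iterable[float], upper_limit: float, headroom: float = 1.25) -> int:
    """Suggest a whole-second QoS slightly above the longest expected pause."""
    longest = max(expected_pauses)
    candidate = math.ceil(longest * headroom)
    if candidate > upper_limit:
        raise ValueError(
            f"suggested QoS of {candidate}s exceeds the limit of {upper_limit}s; "
            "shorten the pauses rather than raising QoS further"
        )
    return candidate


# Hypothetical longest pauses (seconds) from benchmarking, limit of 30 seconds.
print(suggest_qos([0.4, 3.1, 5.0], upper_limit=30))
```

Raising an error instead of silently capping the value is deliberate: if the pauses cannot be covered safely, the environment needs attention, not the timeout.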
Additional Documentation and Information
Configuring the Quality of Service (QoS) Timeout Setting Documentation