Clustering and High Availability

From MiRTA PBX documentation

Several MiRTA PBX servers can work cooperatively as a cluster, providing better performance and high availability. MiRTA PBX cannot perform load balancing without an external tool, but it can be used to build a load-sharing cluster. The best way to set up the system is with DNS SRV, which is often referred to as a way to provide high availability. A DNS SRV record is a special DNS record that lists the servers providing a service. Each entry carries a “priority” and a “weight”, so the load can be shared among several servers. In each record, the four values after SRV are the priority, the weight, the port and the target host. A typical set of DNS SRV records has the following format (from Wikipedia):

_sip._udp.example.com. 86400 IN SRV 10 60 5060 bigbox.example.com.
_sip._udp.example.com. 86400 IN SRV 10 20 5060 smallbox1.example.com.
_sip._udp.example.com. 86400 IN SRV 10 10 5060 smallbox2.example.com.
_sip._udp.example.com. 86400 IN SRV 10 10 5066 smallbox2.example.com.
_sip._udp.example.com. 86400 IN SRV 20 0 5060 backupbox.example.com.

The first four records share a priority of 10, so the weight field's value will be used by clients to determine which server (host and port combination) to contact. The sum of all four values is 100, so bigbox.example.com will be used 60% of the time. The two hosts smallbox1 and smallbox2 will be used for 20% of requests each, with half of the requests that are sent to smallbox2 (i.e. 10% of the total requests) going to port 5060 and the remaining half to port 5066. If bigbox is unavailable, these two remaining machines will share the load equally, since they will each be selected 50% of the time. If all four servers with priority 10 are unavailable, the record with the next lowest priority value will be chosen, which is backupbox.example.com. This might be a machine in another physical location, presumably not vulnerable to anything that would cause the first four hosts to become unavailable.
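The selection rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the code any particular phone runs: the names SrvRecord and pick_record are hypothetical, and real clients follow the fuller selection procedure of RFC 2782 and typically retry the remaining records when a server does not answer.

```python
import random
from collections import Counter
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


# The five example records listed above.
RECORDS = [
    SrvRecord(10, 60, 5060, "bigbox.example.com."),
    SrvRecord(10, 20, 5060, "smallbox1.example.com."),
    SrvRecord(10, 10, 5060, "smallbox2.example.com."),
    SrvRecord(10, 10, 5066, "smallbox2.example.com."),
    SrvRecord(20, 0, 5060, "backupbox.example.com."),
]


def pick_record(records, down=frozenset(), rng=random):
    """Return one reachable record, or None if every target is down.

    Only the records with the lowest priority value among the reachable
    ones are considered; within that group the choice is random and
    proportional to the weight.
    """
    alive = [r for r in records if r.target not in down]
    if not alive:
        return None
    best = min(r.priority for r in alive)
    group = [r for r in alive if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        # random.choices rejects all-zero weights, so pick uniformly.
        return rng.choice(group)
    return rng.choices(group, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter(pick_record(RECORDS, rng=rng).target for _ in range(10_000))
    for target, hits in counts.most_common():
        print(f"{target:28s} {hits / 100:.1f}%")
```

Passing a set of host names as down simulates an outage: with bigbox in that set, the two smallbox hosts split the selections roughly evenly, and with all priority 10 hosts down only backupbox is returned.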

The load balancing provided by SRV records is inherently limited, since the information is essentially static: the current load of the servers is not taken into account. In addition, when a phone deregisters from one server to register on another, there is a short delay during which the phone is unreachable. Furthermore, if the phone was in a call when the switch happened, its status (INUSE) is lost, and another call may be delivered to the phone while it is still in use.

The most common MiRTA PBX setup comprises two servers, each acting as Asterisk, web and database server. A possible set of DNS SRV records for this setup is the following:

_sip._udp.pbx.domain.com. 86400 IN SRV 10 10 5060 voip1.domain.com.
_sip._udp.pbx.domain.com. 86400 IN SRV 20 10 5060 voip2.domain.com.

In this way all the phones will register on voip1.domain.com and, in case of any problem, will move to voip2.domain.com. If a phone is registered on voip2 and a call arrives on voip1, the system routes the call accordingly and the client will not notice any difference. A tenant can have half of its phones on one server and half on the other without noticing any difference. Although this configuration is possible, it is not really advisable, because routing calls between the servers adds load. It is better to aim at keeping all the phones of a tenant on the same server. A more advanced setup consists of creating two pools of servers, as follows:

_sip._udp.pbxA.domain.com. 86400 IN SRV 10 10 5060 voip1.domain.com.
_sip._udp.pbxA.domain.com. 86400 IN SRV 20 10 5060 voip2.domain.com.
_sip._udp.pbxB.domain.com. 86400 IN SRV 20 10 5060 voip1.domain.com.
_sip._udp.pbxB.domain.com. 86400 IN SRV 10 10 5060 voip2.domain.com.

The first pool, pbxA.domain.com, lists voip1.domain.com as primary server and voip2.domain.com as secondary server. The second pool, pbxB.domain.com, lists voip2.domain.com as primary and voip1.domain.com as secondary. All the phones using pbxA as DNS SRV address will normally connect to voip1. It is perfectly normal to find around 10% of the phones connected to the secondary server due to normal packet loss. All the phones using pbxB as DNS SRV address will use voip2.domain.com as primary. By carefully choosing which pool to configure on each tenant's phones, the load of the system can be effectively shared among multiple servers while providing resilience.
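One possible way to choose a pool for each tenant is sketched below. It is only an illustration under stated assumptions: the function assign_pools, the tenant names and the phone counts are hypothetical, and the greedy rule (largest tenant first, into the pool with the fewest phones so far) is one reasonable heuristic, not a feature of MiRTA PBX. Each tenant gets a single pool, so all of its phones normally register on the same primary server.

```python
from typing import Dict, Sequence

# Pool names taken from the DNS SRV example above.
POOLS = ("pbxA.domain.com", "pbxB.domain.com")


def assign_pools(
    phones_per_tenant: Dict[str, int],
    pools: Sequence[str] = POOLS,
) -> Dict[str, str]:
    """Map each tenant to one pool, keeping the phone totals balanced.

    Tenants are processed from largest to smallest (ties broken by
    name), and each one goes to the pool that currently has the fewest
    phones. On a tie between pools, the first pool in `pools` wins.
    """
    load = {pool: 0 for pool in pools}
    assignment: Dict[str, str] = {}
    ordered = sorted(phones_per_tenant.items(), key=lambda kv: (-kv[1], kv[0]))
    for tenant, phones in ordered:
        pool = min(pools, key=lambda p: load[p])
        assignment[tenant] = pool
        load[pool] += phones
    return assignment


if __name__ == "__main__":
    # Hypothetical tenants and phone counts, for illustration only.
    tenants = {"acme": 40, "globex": 30, "initech": 20, "umbrella": 10}
    for tenant, pool in assign_pools(tenants).items():
        print(f"{tenant}: provision phones with {pool}")
```

The phone count is only a rough proxy for load; call volume per tenant would be a better weight where that figure is available.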

You may be tempted to create a single pool listing all your servers with the same priority and weight. In this way, the phones will register randomly on any of the listed servers. This will work, but there are a few drawbacks. There will be a short delay when a phone moves from one server to another, and during this delay the phone can be unreachable from both servers. How long this lasts depends on how the phone handles the handover between the two servers. The other drawback affects call transfer. If a phone registered on server A places a call and a few seconds later moves to server B, the user can put the first call on hold and start a new call, this time from server B. If the user then asks to bridge the two calls together to make a transfer, the transfer will fail because the calls are on different servers.