In RulePoint, you can create a primary node and multiple backup nodes for the application services, such as the event processor, the source controller, and the responder controller.
When you configure the application services, you must set the deployment mode to high availability. In high-availability mode, you distribute the primary and backup instances across different nodes. The nodes can reside on the same host machine or on multiple host machines. If a service instance becomes unavailable on a node, the grid manager ensures that the backup instance on another node takes over as the primary, without any loss of events.
The grid manager controls the primary and backup runtime components and manages state changes between the primary and the backup. The primary service instance on a node handles all activity, whereas the backup instance on another node functions as the standby. In a failover situation, the grid manager detects a problem with the primary component and promotes the backup to primary. To ensure that no events are lost, the Ultra Messaging (UM) layer picks up events that occur during the failover from the primary to the backup instance. After the failover, the previous primary instance becomes the backup instance.
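The role swap described above can be sketched in a few lines. This is a minimal illustration only, not RulePoint code: the `ServiceInstance` class, its fields, and the `fail_over` function are hypothetical names introduced here, and a real grid manager would also handle health detection and event replay.

```python
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    """Hypothetical model of one service instance on a node."""
    node: str
    role: str            # "primary" or "backup"
    healthy: bool = True


def fail_over(primary, backup):
    """Swap roles if the primary is unhealthy.

    Returns the (primary, backup) pair that is in effect after the check.
    """
    if primary.healthy:
        return primary, backup      # nothing to do
    backup.role = "primary"         # the standby takes over
    primary.role = "backup"         # the previous primary becomes the backup
    return backup, primary
```

For example, if the instance on `node1` is the primary and is marked unhealthy, `fail_over` returns the `node2` instance as the new primary and the `node1` instance as the new backup.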
You can add multiple hosts to a high-availability setup to run multiple instances of the grid manager. When you start the RulePoint topology on a particular host, the grid manager on that host is designated as the leader, or primary, and the grid manager instances on the other hosts are designated as backups. Only the primary instance is active at any one time. When the primary grid manager fails, one of the backups is designated as the primary through leader election, which uses a common lock in the run-time database.
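To illustrate the general idea of leader election through a shared database lock, here is a minimal sketch. It is not the RulePoint implementation: the `grid_lock` table, its columns, and the function names are assumptions made for this example, and an in-memory SQLite database stands in for the run-time database.

```python
import sqlite3


def init_lock(conn):
    """Create the single-row lock table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grid_lock (id INTEGER PRIMARY KEY, holder TEXT)"
    )
    conn.execute("INSERT OR IGNORE INTO grid_lock (id, holder) VALUES (1, NULL)")
    conn.commit()


def try_acquire(conn, host):
    """Try to become the leader; succeeds only if nobody holds the lock."""
    cur = conn.execute(
        "UPDATE grid_lock SET holder = ? WHERE id = 1 AND holder IS NULL", (host,)
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one row updated means this host won


def release(conn, host):
    """Give up the lock, for example when the leader shuts down or is evicted."""
    conn.execute(
        "UPDATE grid_lock SET holder = NULL WHERE id = 1 AND holder = ?", (host,)
    )
    conn.commit()


def current_leader(conn):
    """Return the host that currently holds the lock, or None."""
    return conn.execute("SELECT holder FROM grid_lock WHERE id = 1").fetchone()[0]
```

Because the conditional `UPDATE` succeeds for only one caller, at most one host holds the lock at a time. A production design would also need a way to expire the lock when the leader fails without releasing it, which this sketch omits.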
You can also create multiple instances of the UM store and the UM lbmrd for high availability. Always configure an odd number of stores so that they can form a quorum for fault tolerance and reliability. You must maintain the quorum so that the components send and receive messages without failure. A message is considered stable after a quorum of the stores has acknowledged it as stable. For the persistent store to function in a high-availability scenario, a majority of the UM stores must be operating, for example two out of three. A majority is half of the UM stores, rounded down, plus one. If a majority of the stores fail, UM cannot guarantee message delivery and the topology goes down.
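The quorum arithmetic can be checked with a short sketch. The function names below are hypothetical helpers written for this example and are not part of UM or RulePoint.

```python
def majority(store_count):
    """Smallest number of stores that is more than half of store_count."""
    return store_count // 2 + 1


def has_quorum(store_count, running):
    """True if enough stores are running to form a majority."""
    return running >= majority(store_count)


def tolerated_failures(store_count):
    """Number of stores that can fail while a majority remains."""
    return store_count - majority(store_count)
```

With three stores, `majority(3)` is 2, so one store can fail. With four stores, `majority(4)` is 3, so still only one store can fail, which is one reason to configure an odd number of stores.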