Network communication and latency. MySQL Cluster requires communication between data
nodes and API nodes (including SQL nodes), as well as between data nodes and other data nodes,
to execute queries and updates. Communication latency between these processes can directly affect
the observed performance and latency of user queries. In addition, to maintain consistency and
service despite the silent failure of nodes, MySQL Cluster uses heartbeating and timeout mechanisms
which treat an extended loss of communication from a node as node failure. This can lead to reduced
redundancy. Recall that, to maintain data consistency, a MySQL Cluster shuts down when the
last node in a node group fails. Thus, to avoid increasing the risk of a forced shutdown, breaks in
communication between nodes should be avoided wherever possible.
The failure of a data or API node results in the abort of all uncommitted transactions involving the
failed node. Data node recovery requires synchronization of the failed node's data from a surviving
data node, and re-establishment of disk-based redo and checkpoint logs, before the data node
returns to service. This recovery can take some time, during which the Cluster operates with reduced
redundancy.
Heartbeating relies on timely generation of heartbeat signals by all nodes. This may not be possible
if the node is overloaded, has insufficient machine CPU due to sharing with other programs, or is
experiencing delays due to swapping. If heartbeat generation is sufficiently delayed, other nodes treat
the node that is slow to respond as failed.
This treatment of a slow node as a failed one may or may not be desirable, depending on the
impact of the node's slowed operation on the rest of the cluster. When setting timeout
values such as HeartbeatIntervalDbDb and HeartbeatIntervalDbApi for MySQL Cluster,
care must be taken to achieve quick detection, failover, and return to service,
while avoiding potentially expensive false positives.
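
The trade-off between quick detection and false positives can be illustrated with a minimal
sketch of a heartbeat monitor. This is not MySQL Cluster's implementation. The class name
HeartbeatMonitor, the missed_limit threshold, and the values 1500 ms and 4 are assumptions
chosen for illustration; the actual number of missed intervals tolerated depends on the
MySQL Cluster version and should be taken from its documentation.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor for one peer node (illustration only)."""

    interval_ms: int       # how often a heartbeat is expected from the peer
    missed_limit: int      # assumed: consecutive missed intervals that count as failure
    last_seen_ms: int = 0  # arrival time of the most recent heartbeat

    def on_heartbeat(self, now_ms: int) -> None:
        # Any heartbeat, however late, resets the miss counter.
        self.last_seen_ms = now_ms

    def missed_intervals(self, now_ms: int) -> int:
        # Whole intervals that have elapsed without a heartbeat.
        return max(0, (now_ms - self.last_seen_ms) // self.interval_ms)

    def is_failed(self, now_ms: int) -> bool:
        # The monitor cannot tell a dead node from a slow one:
        # both simply stop producing heartbeats in time.
        return self.missed_intervals(now_ms) >= self.missed_limit


if __name__ == "__main__":
    monitor = HeartbeatMonitor(interval_ms=1500, missed_limit=4)
    monitor.on_heartbeat(3000)

    # The peer starts swapping and generates no heartbeats for a while.
    for now_ms in (4500, 6000, 7500, 8999, 9000):
        print(now_ms, monitor.missed_intervals(now_ms), monitor.is_failed(now_ms))
```

In this sketch a node that stalls for 6 seconds is treated as failed even though it is only
slow. Raising interval_ms or missed_limit tolerates longer stalls, at the cost of slower
detection of real failures.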
Where communication latencies between data nodes are expected to be higher than would be
expected in a LAN environment (on the order of 100 μs), timeout parameters must be increased to
ensure that any allowed periods of latency are well within configured timeouts. Increasing
timeouts in this way has a corresponding effect on the worst-case time to detect failure and therefore
on the time to service recovery.
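
The arithmetic behind this trade-off can be sketched as follows, under a simplified model in
which a node is treated as failed after a fixed number of consecutive missed heartbeat
intervals. The function names, the missed_limit parameter, the safety_factor margin, and all
numeric values are assumptions for illustration, not values taken from MySQL Cluster.

```python
def worst_case_detection_ms(interval_ms: int, missed_limit: int) -> int:
    """Upper bound on the time to detect a silent failure in the assumed model.

    A node can fail immediately after sending a heartbeat, so up to one
    extra interval may pass before the first miss is counted.
    """
    return (missed_limit + 1) * interval_ms


def latency_within_timeout(
    max_latency_ms: float,
    interval_ms: int,
    missed_limit: int,
    safety_factor: float = 2.0,  # hypothetical margin for "well within"
) -> bool:
    """True if the allowed latency, with margin, stays below the failure threshold."""
    return max_latency_ms * safety_factor < missed_limit * interval_ms


if __name__ == "__main__":
    missed_limit = 4       # assumed threshold
    max_latency_ms = 4000  # assumed worst allowed communication stall

    for interval_ms in (1500, 5000):
        ok = latency_within_timeout(max_latency_ms, interval_ms, missed_limit)
        detect = worst_case_detection_ms(interval_ms, missed_limit)
        print(f"interval={interval_ms} ms  latency ok={ok}  worst-case detection={detect} ms")
```

With these assumed numbers, a 1500 ms interval does not leave enough margin for a 4-second
stall, while a 5000 ms interval does, but it raises the worst-case detection time from
7.5 seconds to 25 seconds.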
LAN environments can typically be configured with stable low latency, and such that they can provide
redundancy with fast failover. Individual link failures can be recovered from with minimal and controlled ...