Idempotent interfaces and deterministic data loss mitigation

pavel.kirienko · October 12, 2019, 3:28pm

UAVCAN is designed around the assumption that the likelihood of packet loss by the underlying transport (CAN, UDP/IP, serial, …) is acceptably low for the target application. The available methods of communication – namely, message publication and service invocation – are affected by packet loss differently. This post assesses the implications of data loss for different categories of application interfaces and proposes a simple method of data loss mitigation which is expected to be effective in a certain class of applications.

Idempotent application interfaces

At the application level, real-time publish/subscribe systems often leverage idempotent interfaces, where the state of the application is a function of the last received message and it is invariant to the prior message exchanges. Practical examples include setpoint values for a control loop (the most recent command is acted upon) or sensor readings (the most recently sampled value is of interest).

Non-idempotent interfaces are not invariant to the prior interactions. Participating agents make assumptions about the current state of the other members of the process; the correct behavior becomes conditional on the correctness of such state assumptions which may render the system more fragile and may complicate the failure mode analysis. The positive property of state-sharing in non-idempotent interfaces is that it permits the communication between agents to contain less context, rendering it less redundant and less burdensome for the underlying communication system.

In theory, it is possible to transform any non-idempotent interaction protocol into a behaviorally equivalent idempotent protocol by extending the information exchanged between agents with a sufficient context to enable the involved agents to determine the state expected by the opposite side of the communication from any of the exchanged atomic messages. In practice, this is sometimes infeasible or undesirable depending on the constraints of the application.

Example 1. Consider a system comprised of a server and two clients. The server contains an integer counter. Client A commands the server to increment or decrement the counter by a specified amount. Client B reads the counter and acts upon its value. The correct functioning of Client B is conditional on the correctness of the model of the counter maintained by Client A. An equivalent idempotent interface would be one where Client A commands the desired value of the counter directly instead of relying on incremental modification.
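
The difference between the two interfaces in Example 1 can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, and a lost message is modeled by simply not delivering it to the server.

```python
class CounterServer:
    """Hypothetical server holding the integer counter from Example 1."""

    def __init__(self) -> None:
        self.value = 0

    def apply_increment(self, delta: int) -> None:
        # Non-idempotent: the outcome depends on every prior message.
        self.value += delta

    def apply_setpoint(self, target: int) -> None:
        # Idempotent: the outcome depends on the last received message only.
        self.value = target


def run_incremental(deltas, lost):
    """Deliver increments, skipping the message indices listed in `lost`."""
    server = CounterServer()
    for i, delta in enumerate(deltas):
        if i not in lost:
            server.apply_increment(delta)
    return server.value


def run_absolute(deltas, lost):
    """Client A tracks the desired value itself and sends it in full."""
    server = CounterServer()
    desired = 0
    for i, delta in enumerate(deltas):
        desired += delta
        if i not in lost:
            server.apply_setpoint(desired)
    return server.value


deltas = [5, -2, 4]
print(run_incremental(deltas, lost={1}))  # 9: the lost decrement is never recovered
print(run_absolute(deltas, lost={1}))     # 7: the next message restores the state
```

With the incremental interface, the server stays wrong indefinitely after a single loss; with the absolute interface, the error lasts only until the next delivered message.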

Example 1 is admittedly synthetic, but it illustrates the general principle: the transformation is done by relocating the process context from the private state of an agent into the protocol, where it is explicit to all participants.

Counter-example 2. Agent A is running a probabilistic state estimator (e.g., a Kalman filter) based on measurements supplied in real-time from Agent B. The state maintained by Agent A is a function of the full history of samples supplied by Agent B. Loss of a sample may deteriorate the process modeling capabilities of Agent A, which is a particular (specific to this application) manifestation of the state-fragility argument introduced earlier. The issue is theoretically solvable by extending the communication context as shown above; i.e., by requiring Agent B to include the full history of measurements into every message sent to Agent A. While that change would clearly render the interface idempotent (practically: the system model maintained by Agent A will be rendered invariant to the history of samples), it is not feasible for obvious reasons.

Example 3. Consider a USB-CAN adapter connected to a host system. The host system may command the adapter to open the channel with particular parameters, send a CAN frame to the bus, and expect the adapter to report incoming CAN frames received from the bus. Suppose that whenever the host desires the adapter to change its state, it emits a reconfiguration command message; having emitted such a message, the host expects the adapter to retain the new state until explicitly commanded otherwise. In particular, if the adapter is commanded to open the channel, the host will expect that outgoing CAN frames will be normally delivered to the bus, and vice versa. This expectation is a manifestation of the fragile cross-agent state sharing specific to this application. An equivalent idempotent interface would be one where the host and the adapter exchange periodic messages, where each message contains the full context necessary to recreate the full image of the state of the other participant. A very practical, in-depth review of this specific problem is available in this post: On CAN adapters and stateless interfaces.
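
A minimal sketch of the idempotent variant of Example 3 follows. It is not taken from any real adapter protocol: the `DesiredState` fields and the `Adapter` class are hypothetical and serve only to show that repeating the full desired state makes lost or duplicated messages harmless.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DesiredState:
    """Hypothetical full desired state of the adapter, sent periodically."""
    channel_open: bool
    bitrate: int


class Adapter:
    """Toy adapter that converges to whatever state the host last requested."""

    def __init__(self) -> None:
        self.state: Optional[DesiredState] = None
        self.reconfigurations = 0

    def on_desired_state(self, desired: DesiredState) -> None:
        # Receiving the same message again has no further effect.
        if desired != self.state:
            self.state = desired
            self.reconfigurations += 1


adapter = Adapter()
desired = DesiredState(channel_open=True, bitrate=1_000_000)

# The host repeats the full state every period; here the first two are lost.
for delivered in [False, False, True, True, True]:
    if delivered:
        adapter.on_desired_state(desired)

print(adapter.state, adapter.reconfigurations)
```

The adapter reaches the desired state as soon as any one of the periodic messages gets through, and the later copies cause no additional reconfiguration.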

Counter-example 4. Consider the electric drive power state machine defined in CANopen DS402. In order to bring the drive into the desired state, the controller may be required to step through a set of intermediate states. A stateless interface would be trivial to set up, but it may go against the safety-related design objectives of the protocol.

(Brief intermission: the counter-example of DS402 shows how different design objectives of the system manifest in communication protocols. Industrial drives are often fail-safe: the safest state is reached when the system is disengaged. In comparison, a propulsion drive on an aircraft would normally be fail-operational instead, since the state where the propulsion system of an airborne vehicle is turned off is likely to be hazardous).

Many of the communication patterns that one encounters in other domains, particularly in general ICT, are built to accommodate design objectives that are very different from those of safety-critical real-time intravehicular networks. I observe that adopters of UAVCAN often carry the design mindset and approaches that work well in a conventional Internet application over to a vehicular bus, which may result in suboptimal performance of the protocol and of the system as a whole.

UAVCAN somewhat discourages the reliance on stateful non-idempotent interfaces by offering a very limited set of service identifiers, although the idempotent/non-idempotent and message/service distinctions are, strictly speaking, orthogonal.

The takeaway message is that one designing a mission-critical intravehicular network may benefit from unlearning some of the principles that are commonly found in general-purpose networked applications.

Temporal redundancy, or deterministic data loss mitigation

As shown in the previous section, some applications may benefit from idempotent (stateless) interfaces, and it is theoretically possible to transform any (NB: formal proof?) non-idempotent interface into a behaviorally equivalent idempotent interface. In practice, this is not always feasible; therefore, applications should be provided with adequate means of ensuring sufficient reliability guarantees for stateful interfaces where any single event of losing an atomic data unit (such as a message or a request) poses risks or other expenses to the application.

In the following discussion, “data loss” and “data corruption” are to be considered equivalent, because the robust data integrity checking built into the underlying transport layer causes a corrupted data unit to be discarded, which makes it indistinguishable from a lost one.

Suppose that one of the system design inputs is the requirement that atomic data units exchanged between participants of a certain mission-critical process are to be successfully delivered with the probability of P_t. Further, suppose that the underlying physical network is characterized to successfully deliver atomic data units with the probability of P_n.

The latter assumption is valid for a typical specialized network such as those found in various vehicular or industrial deployments; it may not hold, however, for a typical domestic or business LAN where the parameters of the environment may vary wildly and therefore the probability of a successful delivery may be hard to predict statically.

If P_n ≥ P_t, the network meets the reliability goals and no further adjustments are necessary. Otherwise, assuming that P_n is time-invariant (the probability of losing an atomic data unit is not strongly correlated with that of its neighbors), the target probability of successful delivery can be obtained by repeating each atomic data unit transmission M times, which results in the adjusted success probability P_n’ = 1 - (1 - P_n)^M.

For example, given a network that successfully delivers 99% of atomic data units and where the probability of data corruption is time-invariant, a multiplication factor of 2 increases the probability of successful delivery (network reliability) to 100% - (100% - 99%)² = 99.99%.
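
The arithmetic above is easy to automate. The following sketch evaluates P_n' for a given M and searches for the smallest M that meets a target P_t; the function names and the `m_max` search bound are my own choices for illustration, and the result is only as valid as the independence assumption behind the formula.

```python
def adjusted_success_probability(p_n: float, m: int) -> float:
    """P_n' = 1 - (1 - P_n)^M, assuming independent losses."""
    if not 0.0 <= p_n <= 1.0:
        raise ValueError("p_n must be within [0, 1]")
    if m < 1:
        raise ValueError("m must be a positive integer")
    return 1.0 - (1.0 - p_n) ** m


def min_multiplication_factor(p_n: float, p_t: float, m_max: int = 64) -> int:
    """Smallest M in [1, m_max] such that P_n' >= P_t."""
    for m in range(1, m_max + 1):
        if adjusted_success_probability(p_n, m) >= p_t:
            return m
    raise ValueError("target is not reachable within m_max repetitions")


print(adjusted_success_probability(0.99, 2))   # approximately 0.9999
print(min_multiplication_factor(0.9, 0.995))   # 3
```

A plain linear search is used instead of a closed-form logarithm so that floating-point rounding at the boundary cannot produce an M that narrowly misses the target.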

The extent to which the assumption that the probability of successful delivery is time-invariant is valid is a matter of separate discussion.

Practically, deterministic data loss mitigation is to be implemented by emitting redundant copies of the transfer immediately following the original. Redundant transport frames and/or complete transfers arriving at the other end of the link will be removed over the course of the standard incoming frame processing pipeline. No additional logic is needed on the receiving end because automatic removal of repeated frames and/or transfers is guaranteed by the UAVCAN specification.
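
The sender and receiver roles described above can be modeled with a toy example. This is not the PyUAVCAN API and not the full duplicate-removal logic of the specification: the transfer is reduced to a hypothetical `(transfer_id, payload)` pair, and the receiver simply ignores a transfer whose ID equals the last accepted one.

```python
import random
from typing import Callable, List, Optional, Tuple

Transfer = Tuple[int, bytes]  # (transfer_id, payload); a simplified model


def send_with_redundancy(transfer_id: int, payload: bytes, m: int,
                         emit: Callable[[Transfer], None]) -> None:
    """Emit the same transfer M times back to back."""
    for _ in range(m):
        emit((transfer_id, payload))


class DeduplicatingReceiver:
    """Toy receiver that drops a transfer repeating the last accepted ID."""

    def __init__(self) -> None:
        self.last_id: Optional[int] = None
        self.accepted: List[bytes] = []

    def on_transfer(self, transfer: Transfer) -> None:
        transfer_id, payload = transfer
        if transfer_id == self.last_id:
            return  # redundant copy of an already accepted transfer
        self.last_id = transfer_id
        self.accepted.append(payload)


def lossy_link(receiver: DeduplicatingReceiver, p_n: float,
               rng: random.Random) -> Callable[[Transfer], None]:
    """Return an emit function that delivers each transfer with probability p_n."""
    def emit(transfer: Transfer) -> None:
        if rng.random() < p_n:
            receiver.on_transfer(transfer)
    return emit


receiver = DeduplicatingReceiver()
emit = lossy_link(receiver, p_n=0.9, rng=random.Random(0))
for transfer_id, payload in enumerate([b"a", b"b", b"c"]):
    send_with_redundancy(transfer_id, payload, m=2, emit=emit)
print(receiver.accepted)
```

Note that all of the redundancy logic lives in `send_with_redundancy`; the receiver contains nothing beyond the duplicate removal it would need anyway.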

To meet the performance predicted by this model with multi-frame transfers, receiving agents should support reassembly of multi-frame transfers whose frames are delivered out of order. As can be seen upon closer inspection, lack of support for out-of-order delivery may somewhat hinder the performance of the method.

For example, suppose that a UAVCAN transfer contains three frames, F0 to F2, and the multiplication factor M = 2, then the resulting frame sequence would be as follows:

F0      F1      F2      F0      F1      F2
\_______________/       \_______________/
    main copy             redundant copy
------------------ time ------------------>

In the provided example, the transport network may lose up to three frames without affecting the application, provided that at least one copy of each unique frame (F0, F1, F2) is delivered. In the following example, the frames F0 and F2 of the main copy are lost, but the transfer survives:

F0 F1 F2 F0 F1 F2
|  |  |  |  |  |
x  |  x  |  |  \_____ F2 __________________________
   |     |  \________ F1 (redundant, discarded) x  \
   |     \___________ F0 ________________________  |
   \_________________ F1 ______________________  \ |
                                               \ | |
----- time ----->                              v v v
                                            reassembled
                                            multi-frame
                                             transfer
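
The scenario in the diagram, and the effect of out-of-order reassembly support, can be checked with a small model. The functions below are hypothetical simplifications: frames are reduced to their indices, and the in-order-only receiver is modeled as accepting a frame only when it is the next expected one.

```python
from typing import List, Sequence, Set


def surviving_frames(lost_positions: Set[int], frame_count: int = 3,
                     m: int = 2) -> List[int]:
    """Frame indices that arrive after removing the lost positions (0-based)."""
    emitted = list(range(frame_count)) * m  # F0 F1 F2 F0 F1 F2 for M = 2
    return [f for pos, f in enumerate(emitted) if pos not in lost_positions]


def reassemble_any_order(received: Sequence[int], frame_count: int) -> bool:
    """True if every unique frame index was seen at least once."""
    seen = set(received)  # repeated indices are redundant copies
    return seen == set(range(frame_count))


def reassemble_in_order_only(received: Sequence[int], frame_count: int) -> bool:
    """True if frames 0..N-1 can be accepted strictly in sequence."""
    expected = 0
    for index in received:
        if index == expected:
            expected += 1
            if expected == frame_count:
                return True
        # Any other index is dropped by this simplified receiver.
    return False


# The case from the diagram: F0 and F2 of the main copy are lost.
diagram = surviving_frames({0, 2})
print(diagram, reassemble_any_order(diagram, 3))  # [1, 0, 1, 2] True

# F0 of the main copy and F1 of the redundant copy are lost.
other = surviving_frames({0, 4})
print(reassemble_any_order(other, 3))      # True
print(reassemble_in_order_only(other, 3))  # False
```

The second case shows why out-of-order reassembly matters: every unique frame arrived, yet a receiver that insists on strict ordering still loses the transfer.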

The described temporal redundancy method provides the network designer with a tunable reliability control for each application-level link within the network. Compared to the commonly used alternatives based on explicit confirmation with retry by timeout (e.g., TCP, DDS/RTPS, QUIC), the method has important shortcomings and advantages:

Pro: The network load and latency implications that follow from the use of this method are trivial to model accurately.
Con: The model breaks if the assumption that the probability of losing an atomic data unit is not strongly correlated with that of its neighbors does not hold. Data losses caused by head-of-line blocking, excessive buffering, priority inversion, or other types of congestion somewhere along the network route may only be exacerbated by the described method. However, such issues may be considered atypical for real-time networks.
Pro: The method is entirely stateless from the standpoint of the network. The state information pertaining to the data loss mitigation process is confined to the sending agent entirely; other participants remain well-isolated from this activity. The result is that deployment of this method does not complicate the implementations or the failure mode analysis, unlike stateful alternatives based on explicit confirmation and retry.
Con: The method creates a constant overhead even if the network is functioning perfectly.
Pro: The method is equally applicable to unicast and broadcast/multicast exchanges (it is conceptually similar to the rebroadcasting scheme implemented in various wireless protocols such as IEEE 802.15.4 or IEEE 802.11).
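
The sensitivity to the independence assumption noted in the second item can be illustrated with a small Monte Carlo experiment. The fully correlated case below, where all M copies share the same fate, is a deliberately extreme assumption of mine and not a model of any particular network; real links would fall somewhere between the two cases.

```python
import random


def delivery_rate(p_n: float, m: int, trials: int, correlated: bool,
                  rng: random.Random) -> float:
    """Fraction of transfers for which at least one of M copies arrives."""
    delivered = 0
    for _ in range(trials):
        if correlated:
            # Extreme case: one draw decides the fate of all M copies.
            ok = rng.random() < p_n
        else:
            # Independent losses: each copy gets its own draw.
            ok = any(rng.random() < p_n for _ in range(m))
        delivered += ok
    return delivered / trials


rng = random.Random(1)
independent = delivery_rate(0.9, m=2, trials=20_000, correlated=False, rng=rng)
correlated = delivery_rate(0.9, m=2, trials=20_000, correlated=True, rng=rng)
print(f"independent: {independent:.4f}")  # expected to be close to 0.99
print(f"correlated:  {correlated:.4f}")   # expected to be close to 0.90
```

Under independent losses the measured rate follows 1 - (1 - P_n)^M, whereas under fully correlated losses the repetition buys nothing and only the overhead remains.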

At the time of writing, an experimental implementation of the described method is available in PyUAVCAN.