Best practices for managing long task execution requests

pavel.kirienko · December 14, 2023, 10:47am

Generally, one should avoid making application-level communications dependent on the identity of the node. As covered in the Guide, this introduces undesirable coupling between the consumer of the network service and the way it is implemented, which may complicate verification and cause issues in the longer term as the system evolves. The RPC services are essential for low-level network maintenance tasks, but if you’re dealing with application-level design, I recommend focusing only on messages. This will not only isolate the consumers of your network service from irrelevant details of its implementation (specifically, which node provides the service), but also remove the response timing constraints imposed by service calls, and provide a more approachable way to avoid statefulness.

The mention of ROS is relevant, but one should keep in mind that the RPC in ROS is not directly comparable to that of Cyphal; it is a higher-level communication primitive because it allows the interacting nodes to be unaware of each other’s identity (you don’t need the name of the server node to invoke a ROS service).

When designing a message-based interface, I find it helpful to think of it as a shared-memory interface between different processes in an OS, where writes to the shared memory are similar to message publications, with the added limitation that writes may sometimes fail because messages can be lost. (The analogy breaks down somewhat when you consider that the view of the shared memory may not be consistent for all participants.) Given the possibility of message loss, each participant should publish its state periodically to ensure that the desired state is eventually communicated to all participants; this obviates the need for the more complex (and stateful) request-retry logic. Based on this, one can formulate some of the key features of the interface:

- The interface involves at least two topics (subjects):
  - the command topic, which can be used to request task execution with specific options;
  - the status topic, which can be used to monitor the state of the task (whether it is running and with what parameters, or if it’s being canceled).
- The action executor should publish its state and the options of the task being executed (if execution is underway) periodically. This will ensure that all nodes interested in commanding and/or monitoring the state of task execution will converge eventually to the same view of the executor’s state.
- A task itself becomes a first-class entity in this design, equipped with a unique identifier (a name or a number). A node that desires the task to be commenced chooses a unique identifier and publishes it on the command topic. If another task is already in progress, the command will simply be ignored (other handling policies may also be implemented, such as canceling the already running task, enqueueing the request, or using some form of prioritization). The requester will know whether its request is accepted by checking the next status message.
- Cancellation of a task is done by publishing the appropriate command with the same task identifier. The usage of the unique task identifier protects against the obvious race condition where a task would be substituted with another one while the cancellation request is in transit. If the task with the specified ID is not running, no action will be taken.
- Full idempotence can and should be achieved by checking whether a task with the specified unique identifier has already been executed.
- The design is robust, or can be made robust with trivial adjustments, against the unscheduled restart of any of the participants; more on this in the section titled Statelessness of the Guide.
- It is possible to enforce that the requesting node has the latest state update from the executor by adding an epoch counter to the status message and requiring that the same value is present in the command message. Similar techniques are used in distributed consensus algorithms; for example, see the Raft Consensus Algorithm.
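
To make the rules in the list above more concrete, here is a minimal, transport-agnostic Python sketch of the executor-side logic. It is only an illustration under my own assumptions: the names `Command`, `Status`, `Executor`, `on_command`, `on_task_done`, and `status` are hypothetical, no Cyphal library is involved, the "ignore while busy" policy is the one chosen, and the epoch counter is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Command:
    task_id: str
    options: Optional[dict]  # None means "cancel"; a dict means "execute with these options"


@dataclass(frozen=True)
class Status:
    current_task_id: str     # empty string when idle
    options: Optional[dict]  # None when idle
    canceling: bool


@dataclass
class Executor:
    current: Optional[Command] = None
    canceling: bool = False
    finished: set = field(default_factory=set)  # IDs of tasks that already ran or were canceled

    def on_command(self, cmd: Command) -> None:
        if cmd.options is None:
            # Cancellation applies only to the task named in the request,
            # so a late cancel cannot hit a task that replaced the original one.
            if self.current is not None and self.current.task_id == cmd.task_id:
                self.canceling = True
            return
        if cmd.task_id in self.finished:
            return  # idempotence: a repeated command for a finished task is a no-op
        if self.current is not None:
            return  # busy: ignore (this also covers periodic repeats of the running task)
        self.current = cmd
        self.canceling = False

    def on_task_done(self) -> None:
        # Hypothetical hook: called when the task completes or its cancellation finishes.
        if self.current is not None:
            self.finished.add(self.current.task_id)
        self.current = None
        self.canceling = False

    def status(self) -> Status:
        # Intended to be published periodically, regardless of whether anything changed.
        if self.current is None:
            return Status("", None, False)
        return Status(self.current.task_id, self.current.options, self.canceling)
```

Because every command is either applied or silently dropped, the requester can keep republishing the same command without side effects; the only feedback channel is the periodically published status.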

A toy implementation might look like this:

# Command message published over the command topic.
# It should be published periodically unless no task needs to be run.
uavcan.primitive.String.1.0 task_id
CommandParams.1.0           params

# CommandParams
@union
uavcan.primitive.Empty.1.0 cancel
CommandOptions.1.0         execute

# CommandOptions
float32    option_one
float64[9] option_two

# Status message published over the status topic.
# Published always at a fixed rate, plus possibly on change.
float32 PUBLICATION_PERIOD = 0.1  # [second]
uavcan.primitive.String.1.0           current_task_id
CommandParams.1.0[<=1]                params           # empty if idle; "cancel" option means that it is being canceled
uavcan.time.SynchronizedTimestamp.1.0 eta              # done/canceled by this time
uint7                                 completion_pct
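
For the requesting side, a rough Python sketch of how a node might drive these messages is shown below. It is an assumption-laden illustration, not part of the toy definitions: the `Requester` class and its methods are hypothetical, the task ID is generated with `uuid`, and only the `current_task_id` field of the status is inspected.

```python
import uuid
from typing import Optional, Tuple


class Requester:
    def __init__(self, options: dict) -> None:
        self.task_id = uuid.uuid4().hex  # unique task identifier chosen by the requester
        self.options = options
        self.accepted = False  # becomes True once a status message names our task
        self.done = False      # becomes True once our task disappears from the status

    def command_to_publish(self) -> Optional[Tuple[str, dict]]:
        # Called on every publication tick; returns None when nothing needs to be sent.
        if self.done:
            return None
        return (self.task_id, self.options)

    def on_status(self, current_task_id: str) -> None:
        # Acceptance is inferred purely from the status topic; there is no response message.
        if current_task_id == self.task_id:
            self.accepted = True
        elif self.accepted:
            self.done = True
```

Note that this sketch cannot distinguish completion from cancellation, and it would miss a task that starts and finishes between two status messages; handling those cases would need more information in the status than is inspected here.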