Best practices for managing long task execution requests

I’m currently developing a system with a variety of devices communicating via Cyphal/CAN, and I’m seeking advice on optimizing communication strategies for specific tasks. The system configuration is as follows:

  • Device Type A: Executes an atomic long task (lasting 10-20 seconds) with options. (For example, a robotic mechanism drops a number of coils, determined by an option, into a combustor.)
  • Device Type B: Initiates the long task in response to user requests.
  • Device Type C: Triggers the long task when a sensor reading crosses a defined threshold.

As the system expands, we expect an increase in the number of devices of each type. My primary challenge revolves around managing the execution of these long tasks, particularly concerning task cancellation, where the time required to cancel is proportional to the time already spent on the task.

I have outlined some key considerations and questions:

  1. Using Services:

    • How can the initiating device effectively track the progress of the task?
    • In scenarios where a second device requests the same service during its execution, and consequently receives an error, what is the best approach? Should the device continuously poll to determine when the task has been completed?
  2. Using a Combination of Messages:

    • What are the best practices for managing and synchronizing tasks of such duration?
      • How should the system communicate to the first requester that their task (with specific option values) is in progress?
      • How can the first requester detect if their request has been lost and needs to be resent?
      • How can the system inform subsequent requesters that their requests (with different option values) will not be executed because another request is already being processed?
      • etc.

I would greatly appreciate insights on any existing patterns or best practices that address similar use cases in the Cyphal/CAN environment.

Thank you in advance for your assistance and guidance.

I’d personally do a little tip-of-the-hat to the way ROS does things with action servers. In case you aren’t familiar, ROS has a package called actionlib that lets a node send a request to perform a task and then receive feedback on the state of the task, as well as a message when it succeeds or fails. Basically, exactly what you’re looking for. My intuition would be to have an RPC service that starts the action, after which the executing node begins publishing messages about the state of the action. Your “caller” node can then subscribe to those messages and stay up to date with what is going on. With regard to cancellation, the obvious answer to me is a separate “cancel” RPC for stopping the task. This is mostly because of my personal philosophy of minimizing state on the network, which I think is also one of the reasons why Cyphal works the way it does. Polling just introduces state, and doing everything with messages creates a lot of unnecessary overhead, so I think using services to start/stop and messages to update is a nice compromise.

Generally, one should avoid making application-level communications dependent on the identity of the node. As covered in the Guide, this introduces undesirable coupling between the consumer of the network service and the way it is implemented, which may complicate verification and cause issues in the longer term as the system evolves. The RPC services are essential for low-level network maintenance tasks, but if you’re dealing with application-level design, I recommend focusing only on messages. This will not only isolate the consumers of your network service from irrelevant details of its implementation (specifically, which node provides the service), but also remove the response timing constraints imposed by service calls, and provide a more approachable way to avoid statefulness.

The mention of ROS is relevant, but one should keep in mind that the RPC in ROS is not directly comparable to that of Cyphal; it is a higher-level communication primitive because it allows the interacting nodes to be unaware of each other’s identity (you don’t need the name of the server node to invoke a ROS service).

When designing a message-based interface, I find it helpful to think of it as a shared-memory interface between different processes in an OS, where writes to the shared memory are similar to message publications, with the added limitation that writes may sometimes fail (messages being lost; although the analogy breaks somewhat when you consider that the view of the shared memory may not be consistent for all participants). Considering the possibility of message loss, each participant should publish its state periodically to ensure that the desired state is eventually communicated to all participants; this obviates the need for the more complex (and stateful) request-retry logic. Based on this, one can formulate some of the key features of the interface:

  • The interface involves at least two topics (subjects):

    • the command topic, which can be used to request task execution with specific options;
    • the status topic, which can be used to monitor the state of the task (whether it is running and with what parameters, or if it’s being canceled).
  • The action executor should publish its state and the options of the task being executed (if execution is underway) periodically. This will ensure that all nodes interested in commanding and/or monitoring the state of task execution will converge eventually to the same view of the executor’s state.

  • A task itself becomes a first-class entity in this design, equipped with a unique identifier (a name or a number). A node that wants a task to be started chooses a unique identifier and publishes it on the command topic. If another task is already in progress, the command will simply be ignored (other handling policies may also be implemented, such as canceling the already running task, enqueueing the request, or using some form of prioritization). The requester will know whether its request has been accepted by checking the next status message.

  • Cancellation of a task is done by publishing the appropriate command with the same task identifier. The usage of the unique task identifier protects against the obvious race condition where a task would be substituted with another one while the cancellation request is in transit. If the task with the specified ID is not running, no action will be taken.

  • Full idempotence can and should be achieved by checking whether a task with the specified unique identifier has already been executed.

  • The design is robust, or can be made robust with trivial adjustments, against the unscheduled restart of any of the participants; more on this in the section titled Statelessness of the Guide.

  • It is possible to enforce that the requesting node has the latest state update from the executor by adding an epoch counter to the status message and requiring that the same value is present in the command message. Similar techniques are used in distributed consensus algorithms; for example, see the Raft Consensus Algorithm.
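
The executor-side rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of any Cyphal library: the `Executor` and `Status` classes, their method names, and the choice to require the epoch only for execute commands are all hypothetical, and transport, timing, and the periodic publication itself are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Status:
    """Hypothetical snapshot that the executor would publish periodically."""
    epoch: int                # bumped on every state change
    current_task_id: str      # "" when idle
    options: Optional[dict]   # options of the running task, None when idle
    canceling: bool = False


@dataclass
class Executor:
    """Hypothetical policy: one task at a time, extra commands are ignored."""
    epoch: int = 0
    task_id: str = ""
    options: Optional[dict] = None
    canceling: bool = False
    finished: set = field(default_factory=set)  # IDs already executed or canceled

    def on_execute(self, task_id: str, options: dict, epoch: int) -> None:
        if not task_id or epoch != self.epoch:
            return  # malformed, or the requester acted on a stale status
        if self.task_id or task_id in self.finished:
            return  # busy, or a repeat of an old request (idempotence)
        self.task_id, self.options = task_id, options
        self.epoch += 1

    def on_cancel(self, task_id: str) -> None:
        # The ID check guards against canceling a task that replaced ours.
        if task_id and self.task_id == task_id and not self.canceling:
            self.canceling = True
            self.epoch += 1

    def on_done(self) -> None:
        """Called by the application once the task or its cancellation ends."""
        if self.task_id:
            self.finished.add(self.task_id)
            self.task_id, self.options, self.canceling = "", None, False
            self.epoch += 1

    def status(self) -> Status:
        return Status(self.epoch, self.task_id, self.options, self.canceling)
```

Because the requester keeps re-publishing its command, duplicates arriving while the task runs or after it has finished are dropped by the busy check and the `finished` set. In a real node that set would need bounding, and it is lost on an unscheduled restart unless persisted.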

A toy implementation might look like this:

# Command message published over the command topic.
# It should be published periodically unless no task needs to be run.
uavcan.primitive.String.1.0 task_id
CommandParams.1.0           params

# CommandParams (a tagged union: exactly one of the fields is set)
@union
uavcan.primitive.Empty.1.0 cancel
CommandOptions.1.0         execute

# CommandOptions
float32    option_one
float64[9] option_two

# Status message published over the status topic.
# Published always at a fixed rate, plus possibly on change.
float32 PUBLICATION_PERIOD = 0.1  # [second]
uavcan.primitive.String.1.0           current_task_id
CommandParams.1.0[<=1]                params           # empty if idle; "cancel" option means that it is being canceled
uavcan.time.SynchronizedTimestamp.1.0 eta              # done/canceled by this time
uint7                                 completion_pct
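
On the requesting side, the status message alone is enough to tell whether a command was accepted, ignored, or possibly lost. The Python sketch below shows one way to interpret it, assuming simplified stand-ins for the toy messages: `StatusMsg`, `Params`, `Outcome`, and `Requester` are hypothetical names, and `params=None` models the empty array that means the executor is idle.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Params(Enum):
    """Stand-in for the CommandParams options in the toy definition."""
    CANCEL = auto()
    EXECUTE = auto()


@dataclass
class StatusMsg:
    current_task_id: str
    params: Optional[Params]  # None models the empty array (executor idle)
    completion_pct: int = 0


class Outcome(Enum):
    PENDING = auto()    # not acknowledged yet: keep publishing the command
    RUNNING = auto()
    CANCELING = auto()
    REJECTED = auto()   # another task occupies the executor
    FINISHED = auto()   # our task was seen earlier and is no longer reported


class Requester:
    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.seen_active = False  # True once a status has shown our task

    def on_status(self, msg: StatusMsg) -> Outcome:
        if msg.params is not None and msg.current_task_id == self.task_id:
            self.seen_active = True
            if msg.params is Params.CANCEL:
                return Outcome.CANCELING
            return Outcome.RUNNING
        if self.seen_active:
            return Outcome.FINISHED  # executor went idle or moved on
        # Idle executor: the command may have been lost, so keep publishing.
        return Outcome.PENDING if msg.params is None else Outcome.REJECTED
```

While the outcome is `PENDING`, the requester simply keeps publishing the same command with the same `task_id`, so no explicit retry logic is needed. Note that `FINISHED` here cannot distinguish completion from cancellation, since the toy status message carries no result field.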