Standard Device Error Protocol

As indicated in the UDRAL Planning Document, a robust, machine-comprehensible error protocol would be a powerful addition to the new UDRAL standard, allowing devices such as flight controllers to intelligently monitor for errors and handle component failures. While UAVCANv1 defines some protocols related to error handling, namely uavcan.node.Health (published as part of uavcan.node.Heartbeat) and uavcan.diagnostic.Record, these alone seem insufficient to provide robust error monitoring and handling.

This thread is intended to organize discussion around two questions:

  • How should UDRAL service classes report errors?
  • How will UDRAL handle error reporting in an efficient and robust manner?

The planning document currently proposes publishing a Health message at a low, regular rate for each service class implemented by a node, in addition to the Health carried in the node heartbeat. This proposal allows for the monitoring and handling of health and errors on each service, but it has a few disadvantages:

  • If a node implements only one service, then the service health would be redundant with the node heartbeat health.
  • The Health message contains little information on the actual type of error; it merely reports that an error has happened with a certain severity. This means that any type of error handling based on Health reports would need to rely on inference based on service class type, which does not seem like a particularly robust or flexible method to handle failures.

With that in mind, this thread is intended to collect better solutions. If anyone has a better error-handling approach, please chime in.

I propose structural polymorphism (subtyping).

We already have uavcan.node.Health. We could publish that as-is, but, as you said, it’s uninformative.

We could instead use it as the bare minimum health indicator extended with service-specific context provided in the subtypes. Say, you have service Foo and service Bar, which then would define something roughly like reg.udral.service.foo.Feedback and reg.udral.service.bar.Feedback. Both of these Feedbacks would be structural subtypes of uavcan.node.Health.

A subscriber that requires only general status information will be able to subscribe using just uavcan.node.Health. A more sophisticated subscriber that is able to act depending on the health of Foo would subscribe using reg.udral.service.foo.Feedback.
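
To make the idea concrete, here is a minimal Python sketch of how such subtyping could look at the byte level. It assumes a hypothetical reg.udral.service.foo.Feedback whose first field is uavcan.node.Health, followed by a made-up 16-bit fault_flags field; all class and function names below are illustrative and not part of any published type. It also assumes that the health value sits in the low two bits of the first byte and that a subscriber using the base type simply ignores the trailing bytes it does not know about.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class Health(IntEnum):
    """Mirrors the four severity levels of uavcan.node.Health."""
    NOMINAL = 0
    ADVISORY = 1
    CAUTION = 2
    WARNING = 3


@dataclass
class FooFeedback:
    """Hypothetical subtype: Health first, then service-specific context."""
    health: Health
    fault_flags: int  # hypothetical uint16 bit mask, for illustration only

    def serialize(self) -> bytes:
        # Little-endian: one byte holding the 2-bit health, then the uint16.
        return struct.pack("<BH", int(self.health) & 0x03, self.fault_flags)


def decode_health(payload: bytes) -> Health:
    """Generic subscriber: reads only the base-type prefix, ignores the rest."""
    return Health(payload[0] & 0x03)


def decode_foo_feedback(payload: bytes) -> FooFeedback:
    """Service-aware subscriber: reads the full subtype."""
    health_byte, fault_flags = struct.unpack_from("<BH", payload)
    return FooFeedback(Health(health_byte & 0x03), fault_flags)


payload = FooFeedback(Health.CAUTION, 0x0104).serialize()
generic_view = decode_health(payload)
specific_view = decode_foo_feedback(payload)
```

Both subscribers consume the same published message; only the amount of the payload they interpret differs.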

Similar approaches for data slicing are encouraged by DDS XDR by the way. DS-015 used to employ this approach as well, which you can see, e.g., here:

Base type: https://github.com/UAVCAN/public_regulated_data_types/blob/0a773b93ce5c94e1d2791b180058cb9897fab7e1/reg/drone/service/common/Heartbeat.0.1.uavcan

Derived type: https://github.com/UAVCAN/public_regulated_data_types/blob/0a773b93ce5c94e1d2791b180058cb9897fab7e1/reg/drone/service/actuator/common/Feedback.0.1.uavcan

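Also, regarding the concern that a single-service node's service health would be redundant with the node heartbeat health: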

It may be redundant in the sense of network bandwidth but not in the sense of abstraction. The node health is published over a fixed port-ID and models the health of the node overall. A service health is published over a non-fixed port-ID and models the state of a specific service. If the two are identical, this is an implementation detail of that specific node that consumers of its services need not be aware of. Additionally, dropping the per-service health in that case would not be compatible with the polymorphic proposal I just shared here.
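
As a rough illustration of that separation, the sketch below keeps node health and service health in two independent tables on the consumer side. The HealthMonitor class, its method names, and the subject-ID 1234 are hypothetical; the only value taken from the standard is 7509, the fixed subject-ID of uavcan.node.Heartbeat. Messages are assumed to be already decoded down to their health value.

```python
from enum import IntEnum
from typing import Dict

HEARTBEAT_SUBJECT_ID = 7509  # fixed subject-ID of uavcan.node.Heartbeat


class Health(IntEnum):
    NOMINAL = 0
    ADVISORY = 1
    CAUTION = 2
    WARNING = 3


class HealthMonitor:
    """Hypothetical consumer that tracks node and service health separately."""

    def __init__(self, service_subject_ids: Dict[str, int]):
        # Non-fixed subject-IDs are assigned per system, so they are passed in.
        self._service_by_subject = {
            sid: name for name, sid in service_subject_ids.items()
        }
        self.node_health: Dict[int, Health] = {}     # keyed by node-ID
        self.service_health: Dict[str, Health] = {}  # keyed by service name

    def on_message(self, subject_id: int, source_node_id: int, health: Health) -> None:
        if subject_id == HEARTBEAT_SUBJECT_ID:
            self.node_health[source_node_id] = health
        elif subject_id in self._service_by_subject:
            self.service_health[self._service_by_subject[subject_id]] = health
        # Messages on other subjects are not health reports and are ignored.


monitor = HealthMonitor({"foo": 1234})  # 1234 is an arbitrary example subject-ID
monitor.on_message(HEARTBEAT_SUBJECT_ID, 42, Health.NOMINAL)
monitor.on_message(1234, 42, Health.WARNING)
```

The consumer of service foo never needs to know which node provides it, or whether that node happens to implement only one service.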