UAVCAN v1.0 and ArduPilot

pavel.kirienko · November 8, 2019, 3:34am

Tridge, thank you for the extensive post. The importance of such direct feedback from industry leaders is hard to overstate.

I think that we are well-aligned on the subject of v0’s deficiencies. The release of v1 is the direct result of our attempts to resolve them, and we are certainly aware that the breaking transition will do some damage to the protocol and the surrounding ecosystem. We expect, however, that the short-term damage will be far outweighed by the long-term benefits of the new v1 specification, because its design is based not only on theoretical assumptions but also on our practical experience with v0. I think we covered this reasonably well in the recently published roadmap, in the July article, and in the Stockholm Summit recap, so those who are looking for more details will know where to find them.

I understand that our approaches might seem non-obvious to someone with deep experience in the domain of small unmanned vehicles, but that should be attributed to the fact that the popularity of UAVCAN is growing in other domains, such as space vehicles and manned electric aircraft. I use the term “software-defined vehicles” to describe the meta-domain. I think it reflects the core principles well, as well as the role of UAVCAN in it, which is to serve as a medium-level protocol that is both deterministic and abstract. Extending the protocol to new domains should not reduce its utility for small unmanned systems. Quite the contrary: considering the existing development practices and the growing regulatory pressure, UAV systems could benefit from adopting practices from other fields.

I think that transferring existing methods and approaches from protocols designed to address different problems based on incompatible models and assumptions, such as MAVLink (2), is a serious mistake. One needs to keep in mind that the core design goals of UAVCAN include statelessness (low-context communication) and decentralization (no super-node, module-serves-the-network). Here is a relevant excerpt from one of the lengthy online discussions that shaped v1:

In order to steer this conversation away from dead-end paths, let me say now that any design decisions that focus on bus masters, centralized activities of any kind, or stateful/context-dependent communication go directly against the core design principles of the protocol. As such, things like protocol version negotiation at the time of dynamic node ID allocation, or centralized data type compatibility checking are not going to happen.

[…]

The reason why statefulness and context-dependency […] are evil and are to be avoided is that they introduce significant complexity and make node behaviors harder to design, validate, and predict. Each independent interaction between agents shall have as few dependencies on the past states as possible. This simplifies the analysis, makes the overall system more robust, and makes it tolerant to a sudden loss of state (e.g., unexpected restart/reconnection of a node). Additionally, in a decentralized setting, maintenance of a synchronized shared state information can be a severe challenge. Decentralization by itself is extremely important as it allows the network to implement complex behaviors while avoiding excessive concentration of decision-making logic in a single node, thus contributing to overall robustness and ease of system analysis.

The above should be sufficient motivation for complete avoidance of network initialization procedures of any kind. There will be no mandatory data types besides the already existing NodeStatus. Any node shall be able to immediately receive and interpret any transfer from the bus without any preparatory stages or special network initialization routines. This implies that the protocol will be redundant since all of the information necessary to interpret a given transfer must be directly attached to the transfer.
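To make the stateless, low-context idea in the excerpt above more concrete, here is a minimal Python sketch. It is not the UAVCAN wire format or any library's API: the `Transfer` fields, the `Node` class, and the subject number 100 are hypothetical names chosen for illustration. The point is only that the receiver keeps no per-peer state and performs no handshake, so a node that has just (re)started handles its first transfer exactly as it would handle any later one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Transfer:
    # Hypothetical fields: everything the receiver needs travels with the transfer.
    subject_id: int
    source_node_id: int
    payload: bytes


class Node:
    def __init__(self, decoders: Dict[int, Callable[[bytes], object]]) -> None:
        # Static configuration supplied by the integrator; nothing is negotiated at runtime.
        self._decoders = decoders

    def on_transfer(self, transfer: Transfer) -> Optional[object]:
        # No session lookup, no history: the result depends only on this transfer.
        decoder = self._decoders.get(transfer.subject_id)
        if decoder is None:
            return None  # Not subscribed to this subject; ignore the transfer.
        return decoder(transfer.payload)


def decode_u16(payload: bytes) -> int:
    # Hypothetical message: a single little-endian unsigned 16-bit value.
    return int.from_bytes(payload[:2], "little")


config = {100: decode_u16}
node = Node(config)
restarted_node = Node(config)  # Models a node that lost all runtime state.
```

Because `on_transfer` depends on nothing but the static configuration and the transfer itself, `node` and `restarted_node` produce identical results for identical input.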

Some of the issues in v0 were caused by wrong base assumptions about how the ecosystem is going to operate. Lack of built-in means of advancing data type definitions was the direct result of the assumption that one who is defining a data type is able to model its usage in great detail, relieving the implementers and users from compatibility-related issues completely. Further, it was also assumed that the protocol maintainers will be able to foresee the most common application-level use cases and provide an adequate set of standard data types to address them.

As you are absolutely correct to point out, the result was unsatisfactory: we ended up with dozens of poorly specified, overly generic definitions that were impossible to advance. Support for vendor-specific types was lacking as well, so one looking to avoid dealing with the poor set of standard types could not do that easily, especially considering the fact that the application-level types were co-existing in the same type library (i.e., namespace) with the types that are essential for the operation of the protocol, such as NodeStatus.

We solved the problem by shifting the responsibility from the author of the data type to the integrator, requiring the latter to ensure that the equipment is configured to use the correct data type versions. As a result, data type authors now have the ability to introduce changes into data types, both breaking and backward-compatible. The v1 specification provides a set of strict, well-defined rules that allows one to reason constructively about breaking changes. Backward-compatible changes, such as the addition of new fields or some minor modifications of existing fields, are possible as well, provided that the memory footprint of the object is not affected. Specifically, when defining a new data type, the original author leaves some space unused, which can be utilized for additional fields in newer versions of the data type later. We opted out of supporting arbitrary extensions at the end because they complicate the use of the protocol in deterministic hard real-time systems, where the worst-case memory footprint of the object must be known statically.
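The reserved-space approach described above can be sketched as follows. This is a minimal Python illustration and not the DSDL serialization rules: the message layout, the field names, and the `struct` format strings are all hypothetical. Version 1.0 leaves two bytes unused; version 1.1 uses them for a new field, so the serialized size does not change and nodes built against either version can read messages from the other.

```python
import struct
from typing import Tuple

# Hypothetical layouts, little-endian, 8 bytes in both versions.
# v1.0: uint32 timestamp, int16 value, 2 reserved (padding) bytes
# v1.1: uint32 timestamp, int16 value, uint16 variance (occupies the reserved bytes)
V1_0 = struct.Struct("<Ih2x")
V1_1 = struct.Struct("<IhH")


def serialize_v1_0(timestamp: int, value: int) -> bytes:
    # The reserved bytes are emitted as zeros.
    return V1_0.pack(timestamp, value)


def serialize_v1_1(timestamp: int, value: int, variance: int) -> bytes:
    return V1_1.pack(timestamp, value, variance)


def deserialize_v1_0(data: bytes) -> Tuple[int, int]:
    # An old node skips the reserved bytes, whatever they contain.
    return V1_0.unpack(data)


def deserialize_v1_1(data: bytes) -> Tuple[int, int, int]:
    # A new node reading an old message sees variance == 0.
    return V1_1.unpack(data)
```

Since both layouts occupy the same number of bytes, the worst-case memory footprint is known statically regardless of which version a peer uses, which is the property the hard real-time argument relies on.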

The next obvious step that we took was to remove all application-specific types from the standard namespace. You will not find anything about GNSS receivers, IMUs, or ESCs there, and they are never coming back. Instead, we are delegating the task of defining and maintaining domain-specific data types to vendors, who are presumed to be far more qualified for that than UAVCAN maintainers are. Such definitions are still stored in the same repository but under a different namespace (not uavcan).

The seeding of the transfer CRC with a data type signature was a mistake. This behavior was dropped in v1 (along with several other simplifications of the protocol such as removal of tail array optimizations and data type identifiers); now, the CRC is invariant to the kind of data contained in the transfer. This change eliminated a leaky abstraction, providing a much cleaner design and layering, and removed the serious logical inconsistency that multi-frame transfers were protected against a data type mismatch while single-frame transfers were not. Further, it makes this case a non-issue:

We’ve found it quite common that a message gets added by a vendor, let’s call them COMPANYX, and then later we want to adopt that message into a different namespace, say ardupilot namespace or uavcan namespace. Right now if we renamed a DSDL from org/COMPANYX/equipment/foo/2000X.FooBar.uavcan to org/ardupilot/equipment/foo/200YY.FooBar.uavcan then the signature would change, which means the original vendors equipment would no longer be compatible. This makes it really painful to do the natural migration of new messages from vendor namespaces into more widely used namespaces. The original vendor needs to carry patches against the upstream code that has adopted their message in order to use it with their equipment.
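The renaming problem in the quote above, and why dropping the signature seed removes it, can be sketched in Python. The CRC routine is the standard CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF). The rest is an assumption made for illustration: the two 64-bit signatures are made-up values, and feeding the signature into the CRC as eight little-endian bytes ahead of the payload is my reading of the v0 behavior, so treat it as a sketch rather than a reference implementation.

```python
def crc16_ccitt_false(data: bytes, crc: int = 0xFFFF) -> int:
    # Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, MSB first, no final XOR.
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def transfer_crc_v0_style(signature: int, payload: bytes) -> int:
    # Assumption: the 64-bit signature seeds the CRC before the payload is processed.
    seed = crc16_ccitt_false(signature.to_bytes(8, "little"))
    return crc16_ccitt_false(payload, seed)


def transfer_crc_v1_style(payload: bytes) -> int:
    # The CRC depends on the payload bytes only, not on the data type.
    return crc16_ccitt_false(payload)


# Hypothetical signatures of the same definition under two namespaces.
SIG_VENDOR = 0x1122334455667788
SIG_ADOPTED = 0x1122334455669999

payload = b"identical serialized message"
v0_vendor = transfer_crc_v0_style(SIG_VENDOR, payload)
v0_adopted = transfer_crc_v0_style(SIG_ADOPTED, payload)
v1_crc = transfer_crc_v1_style(payload)
```

With the signature seed, `v0_vendor` and `v0_adopted` differ even though the payload bytes are identical, so a receiver built against the renamed definition rejects the vendor's transfers. Without the seed, the check value is a function of the payload alone and moving a definition between namespaces has no effect on it.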

The described case warrants special attention for a different reason: it seems to show that you might be managing the ecosystem in a suboptimal way. If you are supporting a piece of equipment from a third-party vendor, you are supposed to use the message definitions provided by that vendor. I don’t think you have valid reasons for creating verbatim copies of those definitions in your namespace; the utility of that action is negative because you add new entities to replicate existing functionality with no added value, increasing the burden of maintenance and confusing the adopters. I understand that you might have been induced to implement this approach by the fact that supporting vendor-specific namespaces in v0 was hard, but in v1 it should no longer be the case and so ArduPilot should avoid practicing that in the future.

If you are interested in the future of the standard, particularly in the sense of supporting new protocols (Ethernet, serial, wireless), then you will probably want to know that we are resurrecting the runtime compatibility enforcement in more capable transports that can tolerate the resulting overhead (Ethernet and serial). The new mechanism is not going to affect the CAN (FD) transport; it is purely a matter of future development. We call it “data type hash”, and it is roughly outlined here: Alternative transport protocols in UAVCAN. I am not yet entirely happy with that proposal, because it moves the responsibility of ensuring compatibility from the integrator back to the data type developer, which could be seen as undoing some of the progress we have made in v1. It is also not entirely compatible with the vague idea of polymorphic types that I have been thinking about lately. At any rate, this is a matter of ongoing research, and it should be crystal clear that it does not, and will not, affect any existing or future v1-over-CAN deployments, because it simply does not map onto the CAN transport at all.

I think the best way to see how the issues you have outlined are addressed in v1 is to read the specification. I would also be delighted to have a call with you if you are willing to tolerate my bad English.

I am entirely sympathetic to the ROM footprint issues. I can’t say it with confidence, but I think that it might be possible to squeeze a dual-stack implementation into your bootloaders within the 5K budget, especially so if you are comfortable with optimizing libcanard heavily for your specific use case. We say that v0 and v1 are different, but for a bootloader, the difference amounts to slightly different bit layouts here and there, which should be easy to generalize over.

There exists an alternative which is not great but you may want to consider it anyway: update the bootloader together with the application when the version is changed. This is done trivially by embedding the bootloader image into the application binary. The major drawback of this approach is that there exists a brick-prone window between the point where the original bootloader is erased and the new one is installed.

Yet another alternative is to rely on the default bootloader supplied by STMicroelectronics. Those come with serious limitations, but they work as a last resort if the original bootloader is unusable and a JTAG/SWD probe is not available.

The PX4 project has a BSD-licensed UAVCAN v0 bootloader for STM32 targets that fits into 8K of ROM (made by David Sidrane and Ben Dyer); perhaps you could benefit from that as well.

(Opinion: not sure if pertinent here, but I would say that designing a new part for a generic UAVCAN node with less than 512K of ROM is a mistake. Flash memory is cheap and saving pennies on it is unlikely to pay off considering the great difficulties of managing ROM-constrained environments in the long term. Software is getting more complex, and the hardware should evolve to suit the growing demands.)

The v1 version of Libcanard still requires work, and we could use all the help we can get. Kjetil and Åsmund did a great job already, but it’s not quite done yet. If anybody from the ArduPilot team would like to help, let’s coordinate here on the forum or on the dev call.