Alternative transport protocols in UAVCAN

Hi Pavel. I’m well into my attempts to implement UAVCAN on the Espressif ESP8266 / ESP32 microcontrollers and I think I’m at the point where I can ask some semi-sensible questions about certain details regarding the serial and UDP transports, which I intend to use.

First, I thought I’d give an overview of my use case so you can see why I’m making certain decisions: I’m working on “Maker-grade” laboratory equipment for citizen-science level home/small labs. Some of these may turn into vendor products in time, but they’re intended to be DIY kits published for anyone to build and modify. A typical lab can be considered to be a “room scale” robot with sensors and actuators that must work together to run an experiment. I’ve built prototype examples of:

  • Scales (microgram sensitivity)
  • Power meters & data loggers.
  • Mini Centrifuges (50,000 RPM or >1000G)
  • Robotic (“Digital”) Pipettes for dispensing millilitre to microlitre quantities.
  • Robotic Microscope (with digital camera, lighting calibration, CNC slide stage/focus)
  • Temperature controllers for hotplates
  • Magnetic stirrers
  • Liquid & Air pumps
  • Lights & Lasers
  • Pick & Place robot (to hold the pipette for automated tasks)

The devices can work independently in ‘manual mode’, but there are good reasons to network them:

  • Calibrating/Configuring devices through a UI that’s easier to use than attempting complex setups with a knob, two buttons and a half-inch OLED screen. (If that)

  • Central management/recording of an experimental protocol from a computer running Labview (or similar) or bespoke Python code.
    eg: When starting a new experiment, configuring the pipette to a particular dispense mode, setting the centrifuge to an appropriate speed/duration, weighing a centrifuge sample and then preparing a ‘counterweight’ sample, starting data recorders.

  • Remote monitoring of long-running experiments from outside the lab.
    eg: overnight cell culturing, plant growth. That implies multiple ‘management’ computers temporarily connecting to devices over the network, probably via TCP/IP. (since WiFi/UDP broadcasts are not usually routed outside the subnet)

  • Remote control of the protocol to avoid contamination.
    eg: If the experimenter is already holding their favorite electronic pipette, some of the buttons could activate the centrifuge / microscope / tray loaders to avoid touching other control knobs.

  • Emergency stop / safe mode buttons. In the case of a lab accident (eg, a sample exploding in the centrifuge, unbalanced centrifuge ‘going for a walk’, beaker boiling over on the hotplate, motor jamming, liquid spill, or even a fire) a big red button that shuts down all the equipment is preferable to running around the lab to turn off individual devices. Several E-Stop buttons might be distributed around the room (of different severity) and these must function even (especially!) if the management computer crashes or the WiFi access point fails - hence a desire for redundant or multiple transports.
    eg: If the building fire evacuation alarm goes off, a “Safe Mode” button at the door should put all the robots into standby / turn off the hotplates and centrifuges until the user returns.

Some pieces of equipment are able to be tethered, but others must be wireless (like the pipette) using network protocols like WiFi, Zigbee, LoRa or the “ESP NOW” mesh API which Espressif modules can use to send WiFi packets directly to each other without an Access Point. Other modules (like stepper motor controllers) can benefit from hard real-time links over CAN within a single device, which the ESP32 supports. Both Espressif chips have integrated WiFi and the ‘Arduino’ boards have USB serial interfaces.

For these reasons I’m especially interested in the UDP protocol over WiFi, and the serial protocol both over TCP/IP and USB serial connection. A typical use would be plugging a device into the USB port of the management computer for initial configuration so that it can then connect to a nominated WiFi access point / E-Stop button controller. Then it can be unplugged, moved to the bench, and be remotely monitored/operated over WiFi.

OK so given all of that, here are my specific questions about the alternative transports:

  1. Is it appropriate for the different network interfaces to act as independent “nodes”? If I plug in a USB serial cable it makes sense to use PNP to allocate the node ID for the ‘temporary’ serial transport. (I can’t predict the host ID / what other serial devices might be connected in advance, especially on out-of-the-box first use) Once the device’s WiFi interface is configured and brought up it will be allocated an address by DHCP and there’s no guarantee that will match the node ID already obtained for the USB serial connection. (And changing node ID’s is explicitly prohibited by the spec.)

  2. It would seem that it’s not possible to treat the USB serial and WiFi links (either UDP or TCP) as redundant transports for the above reasons, since the USB serial should always be available first. Does that sound right?

  3. Will pyuavcan have the ability to automatically detect new USB serial connections? Or is it left to the application (eg Yukon) to detect port changes and configure pyuavcan with the new transports? Are multiple hardware serial ports considered to be part of the same UAVCAN ‘network’? (And if so, will pyuavcan re-transmit messages from one serial port to the others?)

  4. Is it appropriate for multiple TCP/IP tunneled serial connections to a single device to act as connections to the same node, or should they also be instantiated as new nodes? Basically, should the Transfer ID’s be shared across multiple connections or should a new TCP/IP session see counters starting from zero? (and potentially different Node ID’s)

  5. Am I correct in thinking that a separate Transfer ID counter needs to be maintained for each ‘session specifier’ even on transports which have monotonic ID’s? This seems to cause a proliferation of large counters that need to be kept indefinitely for an arbitrary matrix of transports/sessions, even though a single monotonic ID per transport (or node) is enough to guarantee uniqueness.
    eg: Would the Heartbeat function need to maintain Transfer ID’s for every serial/wireless transport, potentially including multiple TCP/IP serial connections, even though they may have equal-sized monotonic IDs? Or can they all be classified as Redundant Transports and share a single ID per function? Or can I just keep one ID per transport? Or node?
    The ESP chips are fairly roomy as microcontrollers go (80K to 200K of RAM) but memory is still tight.

  6. It makes sense that each Subject message be broadcast over all available transports in parallel, but should Service Response messages also be sent over all transports, or only the transport on which the Service Request originated? (I can’t find a clear rule in the spec. but I might have missed it.)

I’ve also got a few questions regarding the serial protocol when being used on ‘noisy’ hardware ports and software (XON/XOFF) flow control, but I might save those for now.

Hi Jeremy,

Thank you for the detailed description of your case. I think UAVCAN should suit it well even though this use case is somewhat unconventional.

Yes, it is acceptable and natural for a unit to expose independent UAVCAN nodes. Redundant transports only make sense if they are used in a similar configuration at runtime (regardless of whether the redundancy is homogeneous or heterogeneous).

Seeing as in your case USB and IP serve different physical networks, they should expose independent UAVCAN nodes, so yes, your assumption is correct.

PyUAVCAN does not involve itself with managing OS resources such as serial ports or sockets; that is the responsibility of the higher-level logic implemented in the application. How the OS resources are mapped to UAVCAN nodes is determined entirely by the application. It is possible to set up a redundant transport that leverages multiple ports concurrently, or several independent transports where each works with a dedicated (or multiple) serial port. I perceive you’ve already read this but I will post the link anyway for the benefit of other readers:

Both approaches are viable. If your TCP/IP connections leverage the same underlying L2 network, then it doesn’t seem to make sense to have several of them. If they operate on top of completely different networks (e.g., one is lab-local, another is used to interface with a remote client, which is a very made-up example), you should use independent nodes per transport.

In general, the following rule applies: if the objective of an additional transport is to increase the reliability of the system, use the same node in a redundant transport configuration. If the objective of the transport is to expand connectivity to other (sub-)systems, use a dedicated node.

The objective of transfer-IDs is not only uniqueness but also detection of missing data, so they shall be sequential. Concerning your question, however, you seem to have misunderstood the fact that transfer-IDs are computed at the presentation layer and then shared among all available transports. Under your example, the correct option is “or node”.

Section Transmission over redundant transports states that all transports shall be utilized concurrently for any outgoing transfer. Service response transfers are not exempted.
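To make the two points above concrete, here is a minimal sketch of a presentation layer that keeps one transfer-ID counter per session specifier and emits each transfer on every transport of a redundant group. The `Presentation` class, the string session keys and the plain lists standing in for transports are all hypothetical; none of this is PyUAVCAN's API.

```python
from collections import defaultdict


class Presentation:
    """Hypothetical presentation layer: one counter per session specifier."""

    def __init__(self, transports):
        self.transports = list(transports)  # redundant group of one node
        self.counters = defaultdict(int)    # session specifier -> next transfer-ID

    def send(self, session, payload):
        tid = self.counters[session]        # counters start at zero
        self.counters[session] = tid + 1    # sequential, so gaps reveal lost data
        for transport in self.transports:   # every transport carries the same transfer
            transport.append((session, tid, payload))
        return tid


usb, udp = [], []                           # plain lists stand in for real transports
node = Presentation([usb, udp])
node.send("heartbeat", b"up")
node.send("heartbeat", b"up")
node.send("temperature", b"21C")
```

The counter table grows with the number of session specifiers, not with the number of transports: adding a third transport to the group would not add any counters.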

Thanks for the reply! That helped a lot, though I might need to clarify some of my questions a little, especially about the Transfer ID.

Thank you for the detailed description of your case. I think UAVCAN should suit it well even though this use case is somewhat unconventional.

I agree… I’ve gone through almost every industrial automation protocol that exists, and MODBUS(+extensions) is the only other one that even comes close, but UAVCAN beats it by having the SI datatypes as part of the spec… pretty essential with scientific equipment. Every other spec. usually fails by having an onerous ‘vendor’ process (to get an ID that allows you to transmit on the bus) that essentially excludes hobbyist-level makers.

The objective of transfer-IDs is not only uniqueness but also detection of missing data, so they shall be sequential. Concerning your question, however, you seem to have misunderstood the fact that transfer-IDs are computed at the presentation layer and then shared among all available transports. Under your example, the correct option is “or node”.

Right, this is the part I should clarify… so what that means is that in the case where a device has multiple interfaces which act as separate nodes (USB, UDP, Zigbee, etc) and might be running dozens of application-level functions (say 100 for a fairly complex app) Then each ‘node interface’ has to keep a fairly large table (of 800 bytes for 64 bit monotonic IDs) for the Subjects.

If the unit wishes to invoke Services on other nodes (eg. as part of a discovery process that lists available devices for some app service) the spec seems to say that a counter also needs to be kept per local node interface + remote node + service port ‘session’. If there’s 100 devices on a network that appear over time, that’s another 800 bytes per service port. If I’m invoking a couple of services (even if they don’t respond) then I’m quickly using more memory to store transfer ID’s than actual network buffers, and I can’t ever deallocate them.

And that’s just the ID storage. Add overhead for the keys and table management and that’s likely to double. It could easily consume the majority of a microcontroller’s memory, especially if arbitrary numbers of virtual transports are allowed.
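The arithmetic behind these estimates can be written out as a short sketch. The counts of 100 subjects, 100 remote nodes and 8-byte counters come from the paragraphs above; the figures of two service ports and three interfaces are assumptions chosen only for illustration.

```python
COUNTER_BYTES = 8            # one 64-bit monotonic transfer-ID

subjects = 100               # published subjects in a "fairly complex app"
remote_nodes = 100           # peers that appear on the network over time
service_ports = 2            # assumed: "a couple of services" invoked on those peers
interfaces = 3               # assumed: e.g. USB, UDP, Zigbee, each its own node

subject_bytes = subjects * COUNTER_BYTES
per_service_port_bytes = remote_nodes * COUNTER_BYTES
per_interface_bytes = subject_bytes + service_ports * per_service_port_bytes
total_bytes = interfaces * per_interface_bytes
with_overhead_bytes = 2 * total_bytes   # keys and table management roughly double it
```

Under these assumptions each interface needs 2400 bytes of raw counters, the unit as a whole needs 7200 bytes, and about 14400 bytes once table overhead is included.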

Now let’s say I violate the spec. and keep ONE global Transfer ID counter for the entire node, (ie: one for USB, one for UDP etc.) shared between all Subjects and Service Requests. Since missing data is a typical event, it’s not going to affect much besides incrementing a ‘missing data’ counter on other nodes, if they even care about that. Right?

For apps that don’t have a hard reliability requirement, it won’t make any practical difference. It’s not nice, but it’s basically optional. My unit just gets flagged as a ‘low reliability device’.

For serial transports where each transfer must be decoded regardless (no filtering hardware) a single monotonic counter that increments per transfer satisfies the uniqueness requirement, and allows the transport to detect missing data (at the transport level) although it will not know the port that data relates to. However if the transport is over a ‘reliable’ link such as TCP/IP tunneled serial, PPP, xmodem etc. there’s basically no chance of that happening anyway.

It seems that for some of the alternate transports, maintaining a large table of Transfer ID’s has no benefit over a single global Transfer ID shared by all subjects, and is a detriment on memory-constrained devices.

Similarly, the requirement to maintain a Transfer ID counter per Service Request session specifier seems excessive. A single global ID per node (or even per unit) would suffice. The responding node simply copies the Transfer ID regardless… it doesn’t seem to care if the ID’s are non-sequential. There is no requirement in the spec. for the node running the Service to maintain a list of ‘last request Transfer ID’ that I can see. If services are idempotent, it shouldn’t even matter. The spec. doesn’t even seem to say what should happen if the requests are out-of-order. (I assume because Cyclic IDs on CAN would be problematic if enough frames were lost)

So if the Service Responder doesn’t really care about sequential Transfer ID’s per port and the Requester also doesn’t care (it can detect ‘lost data’ anyway because the response never comes) why not use a single global Transfer ID counter for all Requests for monotonic ID transports? (Of course the situation remains very different for Cyclic IDs)

I understand this is probably giving you a really bad feeling and violates some of your design goals, but I’d encourage you to have a think about whether sequential monotonic ID’s per session specifier is really an absolute hard requirement for all transports, or can be relaxed to ‘optional’ given how much of an overhead it can be. Especially for Service Requests.

Section Transmission over redundant transports states that all transports shall be utilized concurrently for any outgoing transfer. Service response transfers are not exempted.

Sorry, I should have clarified… consider the case where multiple serial TCP/IP connections and a UDP transport all route to a single node instance, which you indicated was a viable approach. (It would certainly save memory) They’re not really ‘redundant’, so it’s a bit of a grey area. I’m asking if a request that comes in over a TCP/IP serial tunnel should cause the response to be sent back only to that tunnel, or should it be sent to ALL the TCP tunnels connected to that node and also via UDP.

At first glance that would seem to be wasting bandwidth on interfaces that aren’t involved in the request, but perhaps there’s a good reason for it.

The resource utilization issues you are describing seem to be rooted in the fact that your networks are highly dynamic. A typical vehicular bus is unlikely to encounter nodes or transports that are added or removed at runtime so the specification requires that once a transfer-ID counter is allocated, it should not be removed. This is described in section Transfer-ID:

The initial value of a transfer-ID counter shall be zero. Once a new transfer-ID counter is created, it shall be kept at least as long as the node remains connected to the transport network; destruction of transfer-ID counter states is prohibited.

Footnote: The number of unique session specifiers is bounded and can be determined statically per application, so this requirement does not introduce non-deterministic features into the application even if it leverages aperiodic/ad-hoc transfers.

In v0 we had a provision for dynamically reconfigurable networks where we allowed transfer-ID counters to be dropped by timeout to reclaim the memory back. If you consider this measure sufficient to support your case, we could consider re-introducing that provision back into v1 by lowering the requirement level of the above text to “destruction of transfer-ID counter states is not recommended”.

Note, however, that such optimization requires the node to make assumptions about the maximum transfer-ID timeout setting on the remote (receiving) nodes.

Yes. However, observe that a sequence counting mechanism that enables detection of missing data is required by safety-critical vehicular databus design guidelines (e.g., FAA CAST-16). This capability is provided per subject/service. I understand that it may not be relevant for your specific case but we should explore other solutions before introducing special provisions for uncommon use cases in the specification.

This is specified explicitly but the wording may be suboptimal. Observe, section 4.1.4 Transfer reception:

An ordered transfer sequence is a sequence of transfers whose temporal order is covariant with their transfer-ID values.

Reassembled transfers shall form an ordered transfer sequence.

Therefore, an out-of-order transfer-ID indicates that the transfer shall be discarded unless the previous transfer under this session specifier was received more than transfer-ID timeout units of time ago:

For a given session specifier, a successfully reassembled transfer that is temporally separated from any other successfully reassembled transfer under the same session specifier by more than the transfer-ID timeout is considered unique regardless of its transfer-ID value.
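A rough sketch of how a receiver might apply these two rules for a transport with monotonic transfer-IDs follows. The `Session` class and its `accept` method are hypothetical names, and the comparison is deliberately simplified: cyclic transfer-IDs, as used on CAN, would need a modular comparison instead.

```python
TRANSFER_ID_TIMEOUT = 2.0   # seconds; the default value quoted in this thread


class Session:
    """Receiver state for one session specifier (monotonic transfer-IDs)."""

    def __init__(self):
        self.last_tid = None
        self.last_time = None

    def accept(self, tid, now):
        fresh = (
            self.last_tid is None                           # first transfer seen
            or now - self.last_time > TRANSFER_ID_TIMEOUT   # sender may have restarted
            or tid > self.last_tid                          # keeps the sequence ordered
        )
        if fresh:
            self.last_tid, self.last_time = tid, now
        return fresh


s = Session()
results = [
    s.accept(5, 0.0),   # first transfer
    s.accept(6, 0.5),   # next in sequence
    s.accept(6, 0.6),   # duplicate
    s.accept(4, 0.7),   # out of order
    s.accept(0, 3.0),   # counter went back, but the timeout has expired
]
```

Note that the receiver holds only the last accepted transfer-ID and its timestamp per session, not a history of every ID seen within the timeout window.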

If the interfaces are managed by a single logical node instance, then by definition they form a redundant group, hence it is required to emit every outgoing transfer once per interface. This behavior does not incur undesirable side effects because (section 4.1 Abstract concepts):

Redundant transports are designed for increased fault tolerance, not for load sharing.

The objective […] is to guarantee that a redundant transport remains fully functional as long as at least one transport in the redundant group is functional.

I understood that in the case you described you should run a dedicated logical node per transport. As a similar example, I am currently working on a UAVCAN bootloader for deeply embedded systems that supports UAVCAN/CAN alongside with UAVCAN/serial over UART or USB CDC ACM. The bootloader exposes an independent logical node instance per transport interface, so they do not form a redundant group. Hence, a request received over USB is responded to using USB only, on the assumption that the interfaces interconnect completely different networks (e.g., the CAN may be connected to the vehicular bus while the USB may interconnect only the local node and the technician’s laptop).
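The routing consequence of the two layouts can be sketched as follows. `LogicalNode`, `node_for` and the interface names are hypothetical and exist only to illustrate the rule; they do not correspond to any real library.

```python
class LogicalNode:
    """Hypothetical logical node that owns one or more transport interfaces."""

    def __init__(self, name, interfaces):
        self.name = name
        self.interfaces = list(interfaces)

    def respond(self, request):
        # A response leaves through every interface of the node that received
        # the request, and through no interface owned by another node.
        return [(iface, "response to " + request) for iface in self.interfaces]


def node_for(nodes, interface):
    """Find the logical node that owns the interface a request arrived on."""
    return next(n for n in nodes if interface in n.interfaces)


# Bootloader-style layout: one independent node per transport.
independent = [LogicalNode("can-node", ["can0"]), LogicalNode("usb-node", ["usb0"])]
# Redundant layout: one node with two interfaces on the same network.
redundant = [LogicalNode("node", ["can0", "can1"])]

usb_only = node_for(independent, "usb0").respond("GetInfo")
both = node_for(redundant, "can0").respond("GetInfo")
```

In the first layout a request arriving on `usb0` is answered on `usb0` alone; in the second, a request arriving on `can0` is answered on both members of the redundant group.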

The resource utilization issues you are describing seem to be rooted in the fact that your networks are highly dynamic.

Oh yes, definitely. I’m very aware I’m porting this to an application domain which is a little bit ‘next door’ to what UAVCAN was originally intended for, but I’m hoping my experiences can be helpful to make the spec. more widely used in those domains. I see a lot of potential there, and the work on alternative transport protocols is an indication that UAVCAN is trying to expand in those directions.

I’ve had extensive experience writing microcontroller firmware over the years for PICs, Atmels (Arduinos), and now ESP, and a constant issue has been finding ways to link them together. Protocols are either so heavy-weight that they won’t fit the memory constraints, or so light-weight that they fail to provide enough capability. (eg. classic MODBUS doesn’t do strings or floating point numbers) IoT is all the rage now, and while RESTful services are easier to implement on modern chips, their Achilles heel is the need for centralized servers.

In v0 we had a provision for dynamically reconfigurable networks where we allowed transfer-ID counters to be dropped by timeout to reclaim the memory back. If you consider this measure sufficient to support your case, we could consider re-introducing that provision back into v1 by lowering the requirement level of the above text to “ destruction of transfer-ID counter states is not recommended ”.

The last thing I want to do is cause changes to the spec. which either make it more complicated, or which degrade the original focus on high-reliability intra-vehicle networks. What I suspect is happening here is a conflict between two core design goals of UAVCAN: high reliability for hard real-time systems, and minimal shared context.

The Transfer ID’s are the ‘battleground’ between those two goals - they are the minimum shared state required to create high reliability, and in static networks with fixed numbers of nodes and transports they do that with minimal overhead. All good.

But yes, in more dynamic implementations with arbitrary numbers of peers and alternate transports, that minimum shared state starts getting quite large. One design goal starts losing the conflict with the other.

Worse, the moment something becomes optional in the spec. (or “not recommended”) it begins causing problems for implementors. They’ll ask why it’s optional, and in what cases. That’s bad, confusion must be avoided at all costs.

So how do we resolve those conflicting goals? How do we keep FAA CAST-16 reliability, while potentially enabling low-context alternate transports for other domains (especially configuration and ‘debug’ monitoring) while preventing confusion?

Other specs resolve this by having ‘Profiles’. eg: the MPEG/H264 standards specify a set of features, but also define which features should be enabled in certain circumstances. eg: Limits on block sizes which allow ‘hardware’ ASIC decoders in Blu-ray players to guarantee they will always be able to decode a baseline stream (since once sold, those players last for decades in people’s homes and cannot be upgraded) but which are more flexible and advanced in other profiles intended for modern professional-grade camera gear which only needs to communicate with equally modern editing software.

So they have a “Baseline Profile” used in videoconferencing hardware, a “High Profile” for high-definition television broadcasts, and “High 4:4:4 Predictive Profile” for pro camera gear. (as well as others) In effect the same algorithms are used in each profile, but with changes to datatype sizes and limits on the amount of processing power and memory available to the codec.

I would advise thinking along the same lines, and defining which features of UAVCAN form a “High Reliability Profile” (basically the entire current feature set) which gives the safety-critical guarantees you’ve worked so hard for, but also a “Low State Profile” which lists what parts of the protocol become optional when optimizing for that domain.

Any place in the spec where you say must or shall or should potentially gets marked with the profile that it’s for.

Ideally the profiles are interoperable in at least one direction… the same way an “Extended Profile” H.264 decoder can parse “Constrained Baseline Profile” streams but not vice-versa. But it’s clear to implementors that if they don’t conform to some optional part of the spec, then their implementation sits in a different class, even if they can exchange compatible frames.

This also gives you options down the track if you wish to extend the spec further, such as adding extra timing constraints that might differentiate a “Hard Real-Time” profile (perhaps implemented in an FPGA) from libraries such as pyuavcan which will always be limited by OS delays.

The issue I’m having is that I have to keep a large amount of state to satisfy a Profile that I’m never going to reach regardless. There’s no way a UDP transport over WiFi is ever going to meet FAA CAST-16 reliability standards.

So if I can’t reach that bar anyway, but still see huge advantages in using UAVCAN for its app-level features like SI datatypes and decentralized node discovery, then what other parts of the spec. also become optional? Making my own choices on a feature-by-feature basis seems… unwise. And has the potential to fragment the standard into confetti.

I am currently working on a UAVCAN bootloader for deeply embedded systems that supports UAVCAN/CAN alongside with UAVCAN/serial over UART or USB CDC ACM. The bootloader exposes an independent logical node instance per transport interface, so they do not form a redundant group. Hence, a request received over USB is responded to using USB only, on the assumption that the interfaces interconnect completely different networks (e.g., the CAN may be connected to the vehicular bus while the USB may interconnect only the local node and the technician’s laptop).

Yes, that is exactly the kind of use case I’m also looking at! (The TCP serial case is mostly a wireless version of the same) That means you’ve also got a situation where you need to allocate lots of state to provide the alternate transport, so much it might potentially interfere with the CAN functions.

If I understand the bootloader concept, you may have a situation where the USB interface is connected to a host network with an unknown set of logical nodes, have to emit a broad range of subjects, and invoke services (if you’re using uavcan.file to retrieve the new firmware) from arbitrary remote units. Possibly while sharing some state/services between the ‘logical interface nodes’ within the unit. (such as statistics and registers) I have this problem multiplied by a potentially arbitrary number of TCP/IP connections (practically limited to a dozen or so) if I choose to implement that transport.

Anything that can be done to reduce the overhead of the serial transport implicitly improves the reliability of the CAN transport. Paradoxically, degrading the reliability of one can improve the other. In this case, “The Perfect is the enemy of the Good” quite literally.

This is specified explicitly but the wording may be suboptimal. Observe, section 4.1.4 Transfer reception:
For a given session specifier, a successfully reassembled transfer that is temporally separated from any other successfully reassembled transfer under the same session specifier by more than the transfer-ID timeout is considered unique regardless of its transfer-ID value.

I’ll admit I didn’t get the full implication of that at first, but I did read the bit which said:
Transfer-ID timeout is a time interval that is 2 (two) seconds long. The semantics of this entity are explained below. Implementations are allowed to redefine this value provided that such redefinition is explicitly documented.

And said to myself “Well, in that case I’ll simply set it to Zero for my implementation and I’ll document that and then I won’t have to worry about it anymore.” - again, not nice and probably not what you wanted, but if you’re going to make things optional then some of us are going to take the easy way out. :crazy_face:

Although in my defense I was thinking about the WiFi UDP and TCP/IP serial protocols where the order is fairly strictly determined by the laws of physics (for the radio link) and TCP protocol. Plus I’m a big fan of idempotency.

The alternative is storing even more monotonic Transfer ID’s per session for up to 2 seconds on what could be high-traffic links (10Mbit/s over WiFi) and that could easily overflow my microcontrollers’ limited RAM.

hmm… re-reading it again I still don’t see if the protocol explicitly specifies what should happen if the transfers (not frames, but complete transfers) arrive out-of-sequence. The Transfer-ID timeout seems to de-duplicate identical transfer ID’s within that interval, but non-sequential transfers seem to be totally allowed.

I suppose that implicitly means that out-of-order transfers are allowed and should be responded to in the order they arrive? And if they’re service requests it’s up to the requester to sort it out when the replies get back?

I guess that’s unavoidable. If a unit starts rebooting several times a second (eg from power brownouts) you have to accept transfers where the counter’s gone back to zero, although if it’s within the 2 second window they will be discarded.

Oh, on the topic of reception timestamps:
Transport frame reception timestamp specifies the moment of time when the frame is received by a node. Transfer reception timestamp is the reception timestamp of the earliest received frame of the transfer.

In cases where a frame arrives in small fragments (say over serial links) Would you prefer that timestamp to be sampled at the start of the frame or the end? I could set the timestamp to the start frame delimiter, the first actual frame byte, last byte, or the end delimiter in cases where they’re measurably different. I’d probably pick the first frame byte, since frame delimiters can be ambiguous as to which one you’re getting.

ps: when I’m re/quoting you, what’s the bbcode to include your name header thingy? That’s quite nice.

The idea of profiles was poked by Scott a few months ago: Future Proposal: Featherweight Profile. Overall I think it’s probably a sensible direction to move towards but we have not yet accumulated the critical mass of alterations to justify breaking off a profile. Scott’s proposal is actually on the opposite side of the determinism/flexibility spectrum but the idea is the same.

That’s not quite true. The way you described it sounds like the resource utilization of the protocol stack is a function of the network configuration which is not something that can be robustly controlled or predicted by a given node. The protocol is intentionally designed to ensure that a properly constructed implementation (stack) can demonstrate predictable behaviors regardless of the network configuration. It is also a featured property of Libcanard (which is optimized for real-time systems).

The bootloader allocates a well-known set of transfer-ID counters statically and its memory footprint is not dependent on which interfaces are active or which of the tasks are being performed. Now, your case is different because certain base assumptions that the Specification makes about the underlying communication system are not met in your design (it’s too dynamic).

But setting the transfer-ID timeout to zero (which is related to transfer reception) does not automatically relieve you from the requirement to keep transfer-ID states for outgoing transfers. If you just used zero TID for outgoing transfers you would run into compatibility issues with third-party software and hardware. But the following approach is viable from the protocol design standpoint (the RAM issue notwithstanding):

I think you might consider removing least-recently-used TID counters automatically when the RAM resources are exhausted. It’s probably the solution that minimally departs from the Specification.
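One way such least-recently-used eviction could look is sketched below. The `TransferIdTable` class and its `capacity` parameter are hypothetical, and the capacity of two is chosen only to keep the example short.

```python
from collections import OrderedDict


class TransferIdTable:
    """Bounded table of outgoing transfer-ID counters with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = OrderedDict()   # session specifier -> next transfer-ID

    def next_id(self, session):
        tid = self.counters.pop(session, 0)     # unknown or evicted sessions restart at zero
        if len(self.counters) >= self.capacity:
            self.counters.popitem(last=False)   # drop the least-recently-used counter
        self.counters[session] = tid + 1        # re-insert as most recently used
        return tid


table = TransferIdTable(capacity=2)
ids = [table.next_id(s) for s in ["a", "b", "a", "c", "b"]]
```

An evicted counter restarts at zero when its session is used again, so this only behaves well if the session has been idle for longer than the transfer-ID timeout configured on the receiving nodes, which is the assumption about remote nodes mentioned earlier in the thread.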

In the longer term, we should consider extending the Specification or introducing a profile (the latter is much more convoluted) based on your experiments here. We are watching you, Jeremy.

Out-of-order transfers shall be dropped. This is explicitly required by “Reassembled transfers shall form an ordered transfer sequence.” If you accept a transfer with an out-of-order TID, the set of transfers will not form an ordered sequence, hence a violation. Would you like to volunteer to add a footnote to clarify this, perhaps?

You don’t have to. Normally, in our applications, a node works non-stop until the system is shut down (if ever). See, the transfer-ID timeout has to be sized properly to suit the trade-off (section Behaviors, non-normative blue box):

Low transfer-ID timeout values increase the risk of undetected transfer duplication when such transfers are significantly delayed due to network congestion, which is possible with very low-priority transfers when the network load is high.
High transfer-ID timeout values increase the risk of an undetected transfer loss when a remote node suffers a loss of state (e.g., due to a software reset).
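
For a transport with monotonic transfer-IDs, the reception rule and the role of the timeout can be sketched roughly as follows. This is a simplified Python model written for this discussion, not code from Libcanard or PyUAVCAN; `ReceiveSession` and its method names are made up, and the timeout value used below is only illustrative.

```python
class ReceiveSession:
    """Simplified per-session transfer-ID check (hypothetical, monotonic TIDs only)."""

    def __init__(self, transfer_id_timeout):
        self._timeout = transfer_id_timeout
        self._last_tid = None
        self._last_timestamp = None

    def accept(self, transfer_id, timestamp):
        """Return True if the transfer should be passed to the application."""
        first = self._last_tid is None
        # If nothing was accepted for longer than the timeout, assume the
        # sender may have lost its state and accept whatever arrives next.
        timed_out = (not first) and (timestamp - self._last_timestamp > self._timeout)
        # Ordered does not mean contiguous: any greater TID is acceptable.
        if first or timed_out or transfer_id > self._last_tid:
            self._last_tid = transfer_id
            self._last_timestamp = timestamp
            return True
        return False  # duplicate or out-of-order transfer is dropped


session = ReceiveSession(transfer_id_timeout=2.0)
print(session.accept(5, timestamp=0.0))  # first transfer of the session
print(session.accept(9, timestamp=0.2))  # gap in the sequence, still ordered
print(session.accept(7, timestamp=0.3))  # older than the last accepted one
print(session.accept(0, timestamp=3.0))  # sender restarted, timeout has expired
```

A shorter timeout lets the restarted sender back in sooner, at the cost of possibly accepting a badly delayed duplicate; a longer one does the opposite.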

I expect this to be specified once UAVCAN/serial makes it into the spec doc. The existing experimental implementation in PyUAVCAN timestamps by the first delimiter of the frame (the ambiguity is resolved retroactively) and I think it’s probably optimal because the delimiter is the first element of the frame. Timestamping at the end is undesirable because the payload transmission time (and its escaping, if any) would skew the timestamp.

You can select the quoted text and then click “Quote”:

Yup, I totally get that. My intention is to implement the spec as fully as I can, and I would never intentionally violate the monotonic order by going backwards or to zero. The issue is maintaining strict sequence per session… that’s what potentially has 64K maximum ID’s per port (one for each 16-bit serial protocol node ID)

At the moment I’m implementing the spec as-is, but with only 80K of RAM on the ESP8266 I know that there are edge cases (eg: a remote node - real or virtual - keeps rebooting and getting a new node ID over a long enough time, or an ‘intentional DDoS’ attack over TCP/IP serial) where my code will have to either throw away state intentionally, or crash. (and reboot, potentially adding to the problem!)

I do also plan to use UAVCAN (with libcanard) ‘as intended’ for some of my robots, which may have several ESP32 motor controller / sensor nodes within a single device connected by a CAN bus. That’s a big part of the attraction… using the same protocol for both the ‘internal facing’ and ‘external facing’ interfaces.

If you’re not familiar with the ESP32 I recommend having a look… it’s a very capable and popular chip. It will probably become my main platform, although I’d also like the stack to run on the older ESP8266. There’s a lot of those still out there.

I’ll probably never get the stack to fit in an Atmel/Arduino without some serious compromises.

That’s why I was asking about violating the strict sequential order… I figured that skipping ID’s (by using a more ‘global’ counter) for outgoing service requests would at least prevent counters going backwards, and is an expected case. I could keep a ‘maximum ID I’ve ever used’ and restart from that.

Outgoing subject ID’s are at least a static size table, and while it would be nice to shrink that for very memory constrained devices, I can always constrain that by limiting the virtual node interfaces I create. (eg: by rejecting TCP/IP connection requests)

Subscribing to subjects requires a session table, but I can always limit the number of my own subscriptions if memory is tight, or dump old state on a timer (and ‘resync’ later) if I have to listen to an arbitrary number of nodes. I lose reliability, but that’s my problem. If the network is small and stable everything will be fine, and if not it’s my choice what’s most important to preserve.

It’s the service request session index that is the worry. Throwing away outgoing service request transfer ID’s on a timed basis (or having a fixed-limit table and discarding the overflow) is certainly possible, but would seem to be a greater violation of the spec than skipping ahead? I can (mostly) predict the consequences of ID skipping, but how remote nodes will respond to completely lost state (ID resets to zero) seems more unknown. Especially if things get stressed and I have to do it repeatedly.

Throwing away incoming service request transfer ID state at least only affects the local node. I still think it’s way easier to simply respond to any monotonic request that arrives, regardless of apparent session transfer order or duplication. I get a request, respond, and then forget about it. Totally stateless. It makes little difference for “read” operations, and “write” conflicts can be handled at the app level through idempotency, which solves other conflict modes too. (like multiple node access)

Are there any ports/services which allow nodes to force the dumping of session state on remote nodes? Aka “I’m rebooting/shutting down, forget everything about me?” Is that behavior implied by the uavcan.node.Heartbeat MODE_INITIALIZATION and MODE_OFFLINE messages? Can I dump state if I don’t see a heartbeat for a while, or should I be assuming the node might come back?

I suppose that once I subscribe to a node for any subject (or make a request) I could also create a subscription to its heartbeat and use that to decide when to dump the session state. That’s not too hard.
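
Something like the following Python sketch is what I have in mind (the real thing would be C++ on the ESP). All the names here are invented for the example, and the 3 second timeout in the usage lines is just a placeholder, not a value I am claiming from the spec.

```python
class HeartbeatWatch:
    """Drops per-node session state when a node's heartbeat goes quiet (sketch)."""

    def __init__(self, offline_timeout):
        self._timeout = offline_timeout
        self._last_seen = {}  # node_id -> time of last heartbeat (or first contact)
        self._state = {}      # node_id -> {port_id: last transfer-ID}

    def remember(self, node_id, port_id, transfer_id, now):
        """Store session state and start watching the node if it is new."""
        self._state.setdefault(node_id, {})[port_id] = transfer_id
        self._last_seen.setdefault(node_id, now)

    def on_heartbeat(self, node_id, now):
        """Refresh the timer, but only for nodes we hold state for."""
        if node_id in self._state:
            self._last_seen[node_id] = now

    def purge(self, now):
        """Forget every node whose heartbeat is overdue; return their IDs."""
        stale = [n for n, t in self._last_seen.items() if now - t > self._timeout]
        for node_id in stale:
            del self._last_seen[node_id]
            del self._state[node_id]
        return stale

    def known_nodes(self):
        return sorted(self._state)


watch = HeartbeatWatch(offline_timeout=3.0)
watch.remember(node_id=10, port_id=100, transfer_id=5, now=0.0)
watch.remember(node_id=11, port_id=100, transfer_id=7, now=0.0)
watch.on_heartbeat(10, now=2.5)
print(watch.purge(now=4.0))   # node 11 never sent a heartbeat
print(watch.known_nodes())
```

The memory cost is one timestamp per remote node rather than per session, which is a much smaller table.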

Ah! I get it now. I was reading that as “Reassembled (frames for a) transfer shall form an ordered transfer sequence (of frames)”, basically making explicit that all frames must be received for a transfer buffer to be complete. Sorry, my mistake.

Oh not yet. Let me at least implement the spec. first before I go changing anything? At least if my confusions on the way are blogged, I’ll be able to remember where I had troubles.

But yeah, some clarification of what implementors should do if it doesn’t arrive in strict sequence (because of transmission errors, weird packet re-ordering by routers, or node reboots) would help. It’s easy to specify you should transmit in order, but reception can’t be controlled.

A literal interpretation of that would imply that a lost transfer could leave the receiving node in a state that would reject all future transfers (a broken sequence) or if a node rebooted then all transfers would be rejected until the counter reached the previous known value. Both of which seem undesirable.

It could also be taken to imply that if the transfer order skips an ID, then the receiver should wait some time period in the hopes that the missing transfer will eventually turn up (by redundant transport or repetition or because the router re-ordered it) and that would mean delaying all the later frames in some buffer until that is resolved. Which also seems undesirable.

With Cyclic transfer ID’s everything would resolve quickly (once it cycles around to the same ID) but monotonic ID’s never would.

That would certainly be ideal. I also build and fly drones (mostly tricopters) and I know that bits of the drone are very susceptible to reboots (especially when the battery is getting low and you punch the motors hard) and a fast motor controller resync time is the difference between crashing and recovery! :fearful:

The CAN case has this covered, but if an arm takes a hit and you’re limping home with a broken wire and a rebooting speed controller and have a serial-over-bluetooth redundant link, two seconds is a long, long time.

I’m thinking of the rare case when the frame start delimiter gets corrupted in transit… the frame should parse anyway (since one separator delimiter is enough, according to the implementation docs) but if the previous frame ended a while ago the timing could appear to be very skewed if the ‘most recent delimiter’ timestamp is used. (could actually appear to be ‘backwards in time’ long before the frame was sent!) That’s why I suggested the first header byte - it can’t go missing.
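
To make the failure mode concrete, here is a toy Python parser that timestamps each frame by the delimiter that opened it. It is only a model of the idea: it is not the PyUAVCAN parser, it ignores escaping and CRC entirely, and the `DELIMITER` value is a placeholder I picked for the example.

```python
DELIMITER = 0x9E  # placeholder value, chosen only for this example


class FrameTimestamper:
    """Toy stream parser: a frame's timestamp is that of its opening delimiter."""

    def __init__(self):
        self._buffer = bytearray()
        self._start_timestamp = None  # None until the first delimiter is seen

    def feed(self, byte, now):
        """Feed one byte; return (timestamp, payload) when a frame completes."""
        if byte != DELIMITER:
            if self._start_timestamp is not None:
                self._buffer.append(byte)
            return None
        # A delimiter closes the current frame (if any) and opens the next one.
        result = None
        if self._start_timestamp is not None and self._buffer:
            result = (self._start_timestamp, bytes(self._buffer))
        self._buffer.clear()
        self._start_timestamp = now
        return result


parser = FrameTimestamper()
stream = [
    (DELIMITER, 1.0), (0x01, 1.1), (0x02, 1.2), (DELIMITER, 1.3),  # normal frame
    (0x03, 9.0), (DELIMITER, 9.1),  # start delimiter lost, frame sent at t=9.0
]
for byte, now in stream:
    frame = parser.feed(byte, now)
    if frame is not None:
        print(frame)
```

The second frame comes out stamped 1.3 even though its first byte arrived at 9.0, because the only delimiter available to open it was the one that closed the previous frame.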

Under the model adopted by the spec, skipping ahead occurs in the event of a transient failure in the transport network (e.g., frame loss), whereas a reset to zero occurs if the remote node is restarted. In both cases, the behavior of both parties is well-specified.

Assuming that the loss of state (i.e., transfer-ID reset) is unlikely to occur often (from your description I infer that it’s so), removal of the least-recently-used transfer-ID on memory exhaustion is a lesser departure from the spec than the alternative because the alternative involves continuous violation of the specification per transfer (reuse of transfer-ID across different ports) whereas the LRU TID removal occurs only under special circumstances such as network reconfiguration.

The problem in this reasoning is that it introduces a leaky abstraction into the transport layer, requiring the application level to resolve issues pertaining to ordering and idempotency. If the transfer ordering constraints are upheld at the transport layer, the relevant context at the application layer is reduced.

There are no dedicated interfaces for state manipulation, and that’s not exactly the purpose of MODE_OFFLINE (it is intended as a way to let a node signal its departure explicitly instead of timing out).

That seems sensible, but does it add value if you have the LRU TID removal policy in place?

No no no. See, if a sequence is ordered that does not imply that it is also contiguous.

Indeed, this could be a valid interpretation, although it’s borderline malicious. I suppose we should add clarifications around this.

Good catch. This failure case is not considered and it’s a bug. Will file a ticket against PyUAVCAN.

Yes indeed, it’s the duty of the implementer to identify the optimal trade-off between:

In v0 we had an explicit provision for auto-tuned TID timeout where the optimal value is computed by the implementation at runtime based on the messaging frequency, and it is implemented in libuavcan v0. In v1 this possibility is not explicitly mentioned but it’s not prohibited either. I am not yet sure if there are any hidden complications or interesting edge cases arising out of such behavior so endorsing such approaches in the spec would be probably unwise, but the possibility is nevertheless still there.
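
Purely as an illustration of what such auto-tuning could look like, here is a small Python sketch. It is not the algorithm used in libuavcan v0 and it is not endorsed by the spec; the class name, the smoothing factor, the multiplier and the bounds are all assumptions chosen for the example.

```python
class AutoTransferIdTimeout:
    """Derives a transfer-ID timeout from the observed transfer interval (sketch)."""

    def __init__(self, initial=2.0, multiplier=2.0, smoothing=0.25,
                 floor=0.01, ceiling=2.0):
        self.timeout = initial
        self._multiplier = multiplier
        self._smoothing = smoothing
        self._floor = floor
        self._ceiling = ceiling
        self._mean_interval = None
        self._last_arrival = None

    def update(self, arrival_time):
        """Record a transfer arrival and return the current timeout."""
        if self._last_arrival is not None:
            interval = arrival_time - self._last_arrival
            if self._mean_interval is None:
                self._mean_interval = interval
            else:
                # Exponential moving average of the inter-transfer interval.
                self._mean_interval += self._smoothing * (interval - self._mean_interval)
            candidate = self._multiplier * self._mean_interval
            # Clamp so that a burst or a long pause cannot produce absurd values.
            self.timeout = min(self._ceiling, max(self._floor, candidate))
        self._last_arrival = arrival_time
        return self.timeout
```

With a subject published every 100 ms, this settles at a timeout of roughly 200 ms instead of the fixed initial value, which shortens the resynchronization time after a sender restart.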

Well over a year has passed since the beginning of this thread. What does the specification roadmap look like today? Like @JediJeremy, I am interested in using UAVCAN over UDP/IP and would like to push the specification process forwards.

Would there be a way to accelerate the specification (or at least a formal draft) of UAVCAN/UDP?

To recap what’s been said at the dev call today: it will happen on its own but it will take time until we, the core maintainers, get to that. It might be possible to accelerate the process by funding this work directly and by dedicating additional engineering resources (not necessarily full-time).

As we agreed on the call, we will return to this question at the next dev call. Those who are also interested are welcome to report here or to PM me directly.

I have a lot of interest in UAVCAN/ethernet and hope to participate in its development, but my current priorities make that something I wouldn’t be able to drive anytime soon. One thing to consider is why UDP? I haven’t looked into the different ways to utilize ethernet deeply enough to have an informed opinion yet, but IEEE 802.1 TSN is of great interest: the industry support for it allows the use of COTS switches, which does suggest UDP is a useful transport. Raw ethernet might also be interesting given the ability to reduce protocol overhead when using UAVCAN in isolation, but I’m unsure how this affects portability for platforms like Linux.

Same reason why AFDX is also UDP-based: it’s a zero-cost protocol. UDP merely offers a particular layout of metadata attached to the packet with no additional constraints. Relying on that particular metadata format allows us to stay compatible with COTS products.

As I wrote in the PyUAVCAN UDP transport docs (consider that document to be a sort of RFC until UAVCAN/UDP is formally specified) the UAVCAN session specifier maps well onto the UDP port numbers and IP addresses, allowing us to delegate the packet routing and demultiplexing work to the underlying networking stack, if such is available. Low-level implementations that lack a networking stack (deeply embedded devices) would have to implement virtually similar logic anyway even if we defined our custom packet metadata formats.
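
To show what “maps well” means in practice, here is a hedged Python sketch of such a mapping. The port base values, the ID ranges and the rule of placing the node-ID in the host bits of the address are placeholders for illustration only; the actual values are defined in the PyUAVCAN UDP transport docs and may differ.

```python
import ipaddress

# Hypothetical constants, not taken from the PyUAVCAN documentation.
SUBJECT_PORT_BASE = 16384
SERVICE_PORT_BASE = 15360
MAX_SUBJECT_ID = 32767
MAX_SERVICE_ID = 511


def subject_port(subject_id):
    """UDP port for a subject: one port per subject-ID."""
    if not 0 <= subject_id <= MAX_SUBJECT_ID:
        raise ValueError("subject-ID out of range")
    return SUBJECT_PORT_BASE + subject_id


def service_port(service_id, is_request):
    """UDP port for a service: adjacent ports for request and response."""
    if not 0 <= service_id <= MAX_SERVICE_ID:
        raise ValueError("service-ID out of range")
    return SERVICE_PORT_BASE + service_id * 2 + (0 if is_request else 1)


def node_address(subnet, node_id):
    """IP address of a node: the node-ID occupies the host bits of the subnet."""
    network = ipaddress.ip_network(subnet)
    if not 0 <= node_id < network.num_addresses:
        raise ValueError("node-ID does not fit into the subnet")
    return network.network_address + node_id


print(node_address("192.168.1.0/24", 42), subject_port(7509))
```

With a mapping of this kind, an ordinary `bind()` on the computed port is all a subscriber needs; the operating system’s networking stack does the demultiplexing.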

Sounds right. UDP then.

I oppose the choice of utilizing multiple different UDP ports. I see two issues:

  • If one desires to create a UAVCAN switch or just passive logging software, this requires listening on many (>30000) UDP ports. This is not impossible, but certainly not nice either.
  • If one desires to use UAVCAN over the internet, having to forward only one port is much simpler/more robust. As @scottdixon already mentioned in the dev call, security needs to be addressed if we want to come close to the interweb. This also works in a much simpler and more efficient way if only a single port is used (say hello TLS/SSL tunnel?).

However, I see certain things which are nice with using multiple/many ports as well, for example that services can run as independent processes, as they listen on different sockets. Although:

  • In an embedded device this barely makes a difference
  • In a non-embedded device this brings up a new question: If multiple services are multiple processes, then one of them can crash, while the others still work. The way I understand UAVCAN so far, this breaks some assumptions that we can derive from the existence of a heartbeat originating from a node (as a node can now crash partially).
  • From my point of view, if a node offers such vast functionality that having multiple processes for various services is sufficient, it would make more sense to split this node up into multiple nodes.
  • Ports allow service identification, but IP addresses do so as well. For the former case we could also just assign a new IP address to a device which then hosts multiple nodes, so that we do not need to use multiple ports for a single node.

I understand that both of you @pavel.kirienko & @scottdixon do have limited time for this. I hope we can keep the discussion running a bit anyways! Maybe @finwood has some more to say about our current stance regarding ports as service identifiers.


I don’t think the argument of the ease of forwarding is admissible. The protocol is designed for optimal communication at the application level in embedded systems. Deeply embedded applications can access the network traffic at the data link layer directly (below UDP/IP) and implement the necessary switching logic there. Non-embedded applications can build the same logic using raw sockets.

The many-port design is superior because as explained above it makes heavy use of the existing technology instead of building custom abstractions, and it does so at zero cost for conventional applications (meaning that the forwarding may get slightly more complicated but it’s not a first-class use case).

Regarding UAVCAN over the Internet: at the dev call, there was a bit of miscommunication. We weren’t talking about UAVCAN/UDP over the Internet, that doesn’t make any sense. We were talking about a completely different feature described in section 5.3.13 Internet/LAN forwarding interface. That feature enables UAVCAN nodes to send and receive arbitrary datagrams over public or local computer networks; you can find more info at uavcan.internet.udp. It has absolutely nothing to do with UAVCAN/UDP or any other UAVCAN transport.

Using IP addresses for port identification breaks the hard layering model built into the IP. It’s possible to go this way but it’s hard to justify.

I understand, mostly. A few notes:

Non-embedded applications can build the same logic using raw sockets.

Defeats the

makes heavy use of the existing technology

argument in a way. As one needs to implement UDP on top of the raw sockets then again. Sure, that’s not a lot of work, but not zero effort either.

Using IP addresses for port identification breaks the hard layering model built into the IP

Can you elaborate on this further?

Well, yes, but this is not the primary use of the protocol. In the first place, it is intended for building applications, not bridges.



The conventional network protocols such as UDP/IP we’re dealing with usually follow the ISO/OSI model of abstraction layers (with newer IP-based protocols such as QUIC this may no longer be the case but that would be a separate story).

IP is at layer 3. Layer 3 is involved with routing packets between computers. It is not involved with sessions or multiplexing – that would be layer 4/5 (the border is occasionally blurry), where the UDP is with its port numbers.


Hey there… just wanted to give you a quick update since I’ve been quiet for a bit.

Basically, I’ve written about 3500 lines of code and have a very pre-alpha library which implements UAVCAN/Serial over USB and WiFi TCP/IP for the ESP chips using the Arduino SDK. It handles multiple concurrent connections, and implements most of the current alpha spec including the datatype hash ID’s. (which I actually like in their current form, btw.) It cuts corners all over the place, but those corners will get filled in once there’s enough working to act as a proper testing framework. Most of the missing corners are in the transport state machine, eg: I’m de-duplicating frames on a short timer (as we discussed before) but I’m not really enforcing the strict sequential order yet, not sending redundant packets, no multiple frame reassembly, that kind of thing.

The repo is here:

I’ve got it working in “loopback” mode (Heartbeat and NodeInfo) and one of my next tasks is getting pyuavcan to connect to the device over TCP/IP and validate that the two libraries can talk.

Alas I’ve been having some trouble with that (Python’s not my first or favorite language) so I’m wondering if there’s an example somewhere? I’ve tried writing one based on the code snippets in the documentation, but can’t seem to get it to make a connection. (as in, it doesn’t even seem to open a TCP/IP connection to my device)

The other big question I have is if there are any WireShark extensions/extras that make debugging UAVCAN packets easier? I’ve only just installed WireShark so I don’t know much about it, but it seems like something that might have been done?

I’m currently working on the UDP transport but I’m having some issues with that… won’t bore you with the details but let’s just say the ESP’s networking API isn’t that well documented and everything works fine so long as I don’t actually try to send the packet. :unamused: Assembling the packet, opening and closing the port is all fine. sigh Been banging my head on that one for a couple of weeks now, digging ever deeper into the APIs.

What I’m finding is that there is a big overhead in the “simple” APIs which assumes I’m going to create an object (with packet buffers) for each UDP port I intend to send OR receive on. (which would be dozens to hundreds of ports for a complex UAVCAN node) And trying to use the “low level” API to bypass that overhead is causing hard crashes. I’ll get it soon, one way or the other.

The idea of using different UDP ports for each service/subject makes sense if you have lots of memory and hardware-level packet filtering (which the ESP does not) but it also means constructing a parallel set of objects, callbacks, and ports inside the networking API that mirrors what the library keeps for the other transports. I am seriously thinking of putting the ESP into “promiscuous” mode and filtering/decoding each WiFi packet myself, since there may actually be a hard limit on how many UDP ports I’m allowed to have open - possibly only a few dozen. Still working that out.

Once I can make my crash problems go away, I’ll finally have multiple independent devices on the WiFi network exchanging messages without a central point of failure… I mean, central server. :grin: (if you don’t count the WiFi AP)

Once that is functioning, I’ll get the transport state machine fully up to spec. while also looking at creating an entirely new transport using the WiFi P2P/Mesh networking features of the ESP, which means I can even do without the WiFi access point. That’s the Holy Grail for me.

I’m also developing an even deeper hatred of C++ than I already had, but I suspect you know all about that.


You should be aware that recently a design flaw was identified in the UAVCAN/serial packet framing (credits to @VadimZ); see the thread UAVCAN/serial: issues with DMA friendliness and bandwidth overhead. We are going to be replacing the current byte stuffing logic with COBS. That will allow us to gain a near-constant overhead of ~0.4% instead of the variable 0~100%. Vadim doesn’t seem to be available at the moment to replace the framing logic of the UAVCAN/serial implementation in PyUAVCAN, so we could use help here.
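
For anyone unfamiliar with COBS, here is a minimal Python sketch of the generic algorithm, to show where the near-constant overhead comes from. It is a plain textbook-style encoder/decoder written for this post, not the code that will go into PyUAVCAN, and it says nothing about how UAVCAN/serial will lay out the frame around it.

```python
def cobs_encode(data):
    """Encode bytes so that the output contains no zero bytes."""
    out = bytearray([0])  # placeholder for the first code byte
    code_index = 0
    code = 1
    for byte in data:
        if byte == 0:
            # Close the current block; the code byte stands in for the zero.
            out[code_index] = code
            code_index = len(out)
            out.append(0)
            code = 1
        else:
            out.append(byte)
            code += 1
            if code == 0xFF:
                # Block is full (254 data bytes): start a new one.
                out[code_index] = code
                code_index = len(out)
                out.append(0)
                code = 1
    out[code_index] = code
    return bytes(out)


def cobs_decode(encoded):
    """Inverse of cobs_encode; raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        block = encoded[i + 1:i + code]
        if code == 0 or len(block) != code - 1 or 0 in block:
            raise ValueError("malformed COBS data")
        out += block
        i += code
        # A full block (0xFF) carries no implicit zero; neither does the last one.
        if code != 0xFF and i < len(encoded):
            out.append(0)
    return bytes(out)


payload = bytes([0x11, 0x22, 0x00, 0x33])
encoded = cobs_encode(payload)
print(encoded.hex(), cobs_decode(encoded) == payload)
```

The encoder adds one code byte up front plus at most one more per 254 payload bytes, which is where the ~0.4% figure (1/254) comes from, and since the encoded data never contains a zero byte, zero can serve as an unambiguous frame delimiter.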