Alternative transport protocols in UAVCAN

To recap what was said at the dev call today: this will happen eventually, but it will take time before we, the core maintainers, get to it. It might be possible to accelerate the process by funding this work directly and by dedicating additional engineering resources (not necessarily full-time).

As we agreed on the call, we will return to this question at the next dev call. Those who are also interested are welcome to report here or to PM me directly.

I have a lot of interest in UAVCAN/Ethernet and hope to participate in its development, but my current priorities mean I won’t be able to drive it anytime soon. One thing to consider is: why UDP? I haven’t looked into the different ways of utilizing Ethernet deeply enough to have an informed opinion yet, but IEEE 802.1 TSN is of great interest given the industry support for it and the resulting ability to use COTS switches, which does suggest UDP is a useful transport. Raw Ethernet might also be interesting given the ability to reduce protocol overhead when using UAVCAN in isolation, but I’m unsure how this affects portability for platforms like Linux.

Same reason why AFDX is also UDP-based: it’s a zero-cost protocol. UDP merely offers a particular layout of metadata attached to the packet with no additional constraints. Relying on that particular metadata format allows us to stay compatible with COTS products.

As I wrote in the PyUAVCAN UDP transport docs (consider that document to be a sort of RFC until UAVCAN/UDP is formally specified), the UAVCAN session specifier maps well onto the UDP port numbers and IP addresses, allowing us to delegate the packet routing and demultiplexing work to the underlying networking stack, if one is available. Low-level implementations that lack a networking stack (deeply embedded devices) would have to implement essentially the same logic anyway even if we defined our own custom packet metadata formats.
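To illustrate the general idea, here is a purely hypothetical mapping for the sake of the example – the real one is defined in the PyUAVCAN UDP transport docs, and both SUBNET_PREFIX and BASE_PORT below are made-up values:

import socket

# Purely illustrative sketch, not the normative mapping. Suppose the node-ID
# selects the host part of the IP address and the subject-ID is added to some
# arbitrary base UDP port.
SUBNET_PREFIX = "127.0.0"
BASE_PORT = 16384

def session_to_udp(node_id: int, subject_id: int) -> tuple:
    """Map a (node-ID, subject-ID) session specifier to (IP address, UDP port)."""
    return f"{SUBNET_PREFIX}.{node_id}", BASE_PORT + subject_id

# A subscriber then delegates demultiplexing to the OS networking stack simply
# by binding a UDP socket to the port derived from the subject-ID of interest:
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", BASE_PORT + 1234))  # receive everything published on subject 1234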

Sounds right. UDP then.

I oppose the choice of utilizing multiple different UDP ports. I see two issues:

  • If one desires to create a UAVCAN switch or just passive logging software, this requires listening on many (>30000) UDP ports. This is not impossible, but certainly not nice either.
  • If one desires to use UAVCAN over the Internet, having to forward only one port is much simpler and more robust. As @scottdixon already mentioned in the dev call, security needs to be addressed if we want to come close to the interweb. This also becomes much simpler and more efficient if only a single port is used (say hello, TLS/SSL tunnel?).

However, I also see certain things that are nice about using multiple/many ports, for example that services can run as independent processes, as they listen on different sockets. Although:

  • In an embedded device this barely makes a difference
  • In a non-embedded device this brings up a new question: if multiple services are multiple processes, then one of them can crash while the others keep working. The way I understand UAVCAN so far, this breaks some assumptions that we can derive from the existence of a heartbeat originating from a node (as a node can now crash partially).
  • From my point of view, if a node offers such vast functionality that having multiple processes for various services is warranted, it would make more sense to split this node up into multiple nodes.
  • Ports allow service identification, but IP addresses do so as well. In the latter case we could simply assign an additional IP address to a device, which then hosts multiple nodes, so that we do not need to use multiple ports for a single node.

I understand that both of you @pavel.kirienko & @scottdixon do have limited time for this. I hope we can keep the discussion running a bit anyways! Maybe @finwood has some more to say about our current stance regarding ports as service identifiers.

Cheers!

I don’t think the argument of the ease of forwarding is admissible. The protocol is designed for optimal communication at the application level in embedded systems. Deeply embedded applications can access the network traffic at the data link layer directly (below UDP/IP) and implement the necessary switching logic there. Non-embedded applications can build the same logic using raw sockets.
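For instance, on GNU/Linux such a monitoring or bridging tool could be sketched roughly as follows (AF_PACKET is Linux-specific, and the interface name here is just an assumption):

import socket

# Rough Linux-only sketch of the point above: instead of opening thousands of
# UDP sockets, a bridge/monitor captures every Ethernet frame with a single
# AF_PACKET raw socket and demultiplexes in user space.
# (Running this requires CAP_NET_RAW / root privileges.)
ETH_P_ALL = 0x0003  # ask the kernel for frames of all protocols

sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
sock.bind(("eth0", 0))  # assumption: the relevant interface is named eth0

while True:
    frame, _ = sock.recvfrom(65535)
    # The application parses the Ethernet/IP/UDP headers out of the frame itself
    # and recovers the UAVCAN session metadata from the addresses and ports.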

The many-port design is superior because, as explained above, it makes heavy use of the existing technology instead of building custom abstractions, and it does so at zero cost for conventional applications (meaning that the forwarding may get slightly more complicated, but it’s not a first-class use case).

Regarding UAVCAN over the Internet: at the dev call, there was a bit of miscommunication. We weren’t talking about UAVCAN/UDP over the Internet, that doesn’t make any sense. We were talking about a completely different feature described in section 5.3.13 Internet/LAN forwarding interface. That feature enables UAVCAN nodes to send and receive arbitrary datagrams over public or local computer networks; you can find more info at uavcan.internet.udp. It has absolutely nothing to do with UAVCAN/UDP or any other UAVCAN transport.

Using IP addresses for port identification breaks the hard layering model built into the IP. It’s possible to go this way but it’s hard to justify.

I understand, mostly. A few notes:

Non-embedded applications can build the same logic using raw sockets.

Defeats the “makes heavy use of the existing technology” argument in a way, as one then needs to reimplement UDP on top of the raw sockets. Sure, that’s not a lot of work, but it’s not zero effort either.

Using IP addresses for port identification breaks the hard layering model built into the IP

Can you elaborate on this further?

Well, yes, but this is not the primary use of the protocol. In the first place, it is intended for building applications, not bridges.

Sure:


The conventional network protocols we’re dealing with, such as UDP/IP, usually follow the ISO/OSI model of abstraction layers (with newer IP-based protocols such as QUIC this may no longer be the case, but that would be a separate story).

IP is at layer 3. Layer 3 is concerned with routing packets between computers. It is not concerned with sessions or multiplexing – that would be layers 4/5 (the border is occasionally blurry), which is where UDP sits with its port numbers.


Hey there… just wanted to give you a quick update since I’ve been quiet for a bit.

Basically, I’ve written about 3500 lines of code and have a very pre-alpha library which implements UAVCAN/Serial over USB and WiFi TCP/IP for the ESP chips using the Arduino SDK. It handles multiple concurrent connections and implements most of the current alpha spec, including the datatype hash IDs (which I actually like in their current form, btw). It cuts corners all over the place, but those corners will get filled in once there’s enough working to act as a proper testing framework. Most of the missing corners are in the transport state machine, e.g. I’m de-duplicating frames on a short timer (as we discussed before), but I’m not really enforcing the strict sequential order yet, not sending redundant packets, no multi-frame reassembly, that kind of thing.

The repo is here: https://github.com/JediJeremy/libuavesp

I’ve got it working in “loopback” mode (Heartbeat and NodeInfo) and one of my next tasks is getting pyuavcan to connect to the device over TCP/IP and validate that the two libraries can talk.

Alas I’ve been having some trouble with that (Python’s not my first or favorite language) so I’m wondering if there’s an example somewhere? I’ve tried writing one based on the code snippets in the documentation, but can’t seem to get it to make a connection. (as in, it doesn’t even seem to open a TCP/IP connection to my device)

The other big question I have is whether there are any Wireshark extensions/extras that make debugging UAVCAN packets easier. I’ve only just installed Wireshark so I don’t know much about it, but it seems like something that might have been done already?

I’m currently working on the UDP transport but I’m having some issues with that… won’t bore you with the details but let’s just say the ESP’s networking API isn’t that well documented and everything works fine so long as I don’t actually try to send the packet. :unamused: Assembling the packet, opening and closing the port is all fine. sigh Been banging my head on that one for a couple of weeks now, digging ever deeper into the APIs.

What I’m finding is that there is a big overhead in the “simple” APIs, which assume I’m going to create an object (with packet buffers) for each UDP port I intend to send OR receive on (which would be dozens to hundreds of ports for a complex UAVCAN node). And trying to use the “low level” API to bypass that overhead is causing hard crashes. I’ll get it soon, one way or the other.

The idea of using different UDP ports for each service/subject makes sense if you have lots of memory and hardware-level packet filtering (which the ESP does not) but it also means constructing a parallel set of objects, callbacks, and ports inside the networking API that mirrors what the library keeps for the other transports. I am seriously thinking of putting the ESP into “promiscuous” mode and filtering/decoding each WiFi packet myself, since there may actually be a hard limit on how many UDP ports I’m allowed to have open - possibly only a few dozen. Still working that out.

Once I can make my crash problems go away, I’ll finally have multiple independent devices on the WiFi network exchanging messages without a central point of failure… I mean, central server. :grin: (if you don’t count the WiFi AP)

Once that is functioning, I’ll get the transport state machine fully up to spec, while also looking at creating an entirely new transport using the WiFi P2P/Mesh networking features of the ESP, which means I can even do without the WiFi access point. That’s the Holy Grail for me.

I’m also developing an even deeper hatred of C++ than I already had, but I suspect you know all about that.


You should be aware that recently a design flaw was identified in the UAVCAN/serial packet framing (credits to @VadimZ); see the thread UAVCAN/serial: issues with DMA friendliness and bandwidth overhead. We are going to be replacing the current byte stuffing logic with COBS. That will allow us to gain a near-constant overhead of ~0.4% instead of the variable 0~100%. Vadim doesn’t seem to be available at the moment to replace the framing logic of the UAVCAN/serial implementation in PyUAVCAN, so we could use help here.
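For anyone unfamiliar with COBS, here is a minimal Python sketch of the encoder (purely illustrative, not the PyUAVCAN implementation):

def cobs_encode(payload: bytes) -> bytes:
    """Consistent Overhead Byte Stuffing: the output contains no zero bytes,
    so 0x00 can serve as an unambiguous frame delimiter on the wire.
    Worst-case overhead is one byte per 254 payload bytes, i.e. ~0.4%."""
    out = bytearray()
    block = bytearray()  # current run of non-zero bytes, at most 254 long
    for byte in payload:
        if byte == 0:
            out.append(len(block) + 1)  # code byte: offset to the next zero
            out += block
            block.clear()
        else:
            block.append(byte)
            if len(block) == 254:
                out.append(0xFF)  # 0xFF: full 254-byte block, no zero follows
                out += block
                block.clear()
    # Note: a payload ending exactly on a 254-byte run gets a redundant trailing
    # 0x01 code byte here; a stricter encoder would omit it.
    out.append(len(block) + 1)  # final (possibly empty) block
    out += block
    return bytes(out)

assert cobs_encode(b"\x11\x22\x00\x33") == b"\x03\x11\x22\x02\x33"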

Well, that’s a whole hour of writing incremental encoder/decoder state-machines down the drain. :stuck_out_tongue_winking_eye:

I’m fine with the COBS proposal in general although I’ve got a comment or two that I might append to that discussion, which hopefully doesn’t blow up the consensus. See you over in that thread…

Stabilizing uavcan.node.port.List.0.1, introspection, and switched networks

In the interest of advancing the DS-015 effort, I was working on the set of optional UAVCAN application-level capabilities that are to be made mandatory by DS-015. That forced me to return to the long-postponed issue of stabilizing the port discovery services under uavcan.node.port (which, as of today, bear the version number 0.1), particularly the listing service uavcan.node.port.List. Currently, the service is implemented using the pull model, where the inspected node is to respond with a list of the subject- or service-identifiers when requested by the caller.

My recent work on the Interface Design Guidelines brought it to my attention that the design of this particular service (as in architecture, not UAVCAN-service) is not in perfect alignment with our guidelines. Specifically, the pull model does not provide a means of notifying inspectors of a change in the pub/sub configuration on the inspected node, and it introduces a certain degree of statefulness into the service that is easily avoidable. Additionally, leaving the decision of when to invoke a potentially time-consuming service up to external agents complicates the implementation of robust and predictable scheduling on the local node. These observations prompted me to consider replacing the port list service with a subject (still optional, of course) that is to be published periodically and on-change to announce the current subscription configuration of the local node. The publisher configuration can be subjected to the same treatment, but it is not immediately required by the application-level objectives at hand, and as such, this part of the problem can be postponed indefinitely – after all, the publication set is always trivially observable on the network by means of mere packet monitoring, which is not the case for the subscription configuration.

The introspection I am speaking about here is vital for advanced diagnostic tools such as Yukon with its Canvas (shown below), which would be unable to display any meaningful interconnection information without being able to detect not only outputs but also inputs.

(image: yukon-canvas-early-demo)

Beyond introspection, this capability is important in switched network transports for automatic configuration of the switching logic. As I briefly touched upon in the OP, specialized AFDX switches implement routing and network policy enforcement that are to be configured statically at system definition time (practical installations of AFDX may employ rigid time schedules generated with the help of automatic theorem provers); acting as the traffic policy enforcer, an AFDX switch is able to confine fault propagation should one of its ports be affected by a non-conformant emitter (the so-called babbling idiot failure). The output ports for incoming frames are selected based on the static configuration of virtual links; to avoid delving into the details of that technology, this can be thought of as a static switching table.

In the UAVCAN/UDP broadcast model discussed in the OP post, the output port contention and the resulting latency issues are proposed to be managed by statically configuring L2 filters at the switch such that the data that is not relevant at the given port is to be dropped by the switch, thus reducing the latency bound of the relevant data. The following synthetic example illustrates the point – suppose that the camera node generates significant traffic that is not desired beyond the left switch, and the perception node generates low traffic destined towards the mission computer node where the latency is critical:

If the output ports were left unfiltered, the traffic from the camera would propagate towards the right-side subnet, increasing the output port contention and the latency envelope throughout.

The static L2-filter configuration is efficient at managing the port contention issues if we consider its technical merits only, but it creates issues for quickly evolving applications with lower DAL/ASIL levels where the need to reconfigure the switch (or the excessive rebroadcasting that would result if no filtering is configured) may create undue adoption obstacles.

Drawing upon the theory explained in Safety and Certification Approaches for Ethernet-Based Aviation Databuses [Yann-Hang Lee et al, 2005], section 5.2 Deterministic message transmission in switched network, it is apparently trivial to define a parallel output-queued (POQ) specialized switch that is able to derive the output port filtering policy automatically by merely observing the subscription information arriving from said port regardless of the number of hops and the branching beyond the port. I will leave this as a note to my future self to expand upon this idea later because these technicalities are not needed in this discussion.

The described ideas rely on a compact representation of the entirety of the subscription state of a given node in one message. Let the type be named uavcan.node.port.Subscription. There are a few obvious approaches here.

  1. Bitmask-based. Given the subject-ID space of 2^15 elements, the required memory footprint is exactly 4096 bytes.
# Node subscription information.
# This message announces the interest of the publishing node in a particular set of subjects.
# The objective of this message is to facilitate automatic filtering in switched networks and network introspection.
# Nodes should publish this message periodically at the recommended rate of ~1 Hz at the priority level ~SLOW.
# Additionally, nodes are recommended to publish this message whenever the subscription set is modified.

uint8 SUBJECT_ID_BIT_LENGTH = 15

bool[2 ** SUBJECT_ID_BIT_LENGTH] subject_id_mask
# The bit at index X is set if the node is interested in receiving messages of subject-ID X.
# Otherwise, the bit is cleared.
  2. List-based with inversion. The worst-case size is substantially higher – over 32 KiB. The advantage of this approach is that low-complexity nodes will not be burdened unnecessarily with managing large outgoing transfers since their subscription lists are likely to be very short. The disadvantage is the comparatively high network throughput (albeit at a low priority) being utilized for mere ancillary functions.
# Node subscription information.
# This message announces the interest of the publishing node in a particular set of subjects.
# The objective of this message is to facilitate automatic filtering in switched networks and network introspection.
# Nodes should publish this message periodically at the recommended rate of ~1 Hz at the priority level ~SLOW.
# Additionally, nodes are recommended to publish this message whenever the subscription set is modified.

@union

uint16 CAPACITY = 2 ** 15 / 2

uavcan.node.port.SubjectID.1.0[<=CAPACITY] subscribed_subject_ids
# If this option is chosen, the message contains the actual subject-IDs the node is interested in.

uavcan.node.port.SubjectID.1.0[<=CAPACITY] not_subscribed_subject_ids
# If this option is chosen, the message contains the inverted list: subject-IDs that the node is NOT interested in.

The service ports can be efficiently represented like #1 in the same message because of their limited ID space, should that be shown to be necessary:

bool[512] server_service_id_mask
bool[512] client_service_id_mask
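For illustration, a node could populate such masks in application code roughly like this (the wire-level bit ordering is ultimately dictated by the DSDL serialization rules; LSB-first within each byte is merely assumed here):

def make_port_id_mask(ids, capacity):
    """Pack a set of port-IDs into a bitmask of `capacity` bits, one bit per ID.
    Illustrative only; DSDL serialization defines the actual wire layout."""
    mask = bytearray((capacity + 7) // 8)
    for i in ids:
        mask[i // 8] |= 1 << (i % 8)  # assumption: LSB-first bit ordering
    return bytes(mask)

subject_mask = make_port_id_mask({100, 101, 8191}, 2 ** 15)  # 4096 bytes
server_mask = make_port_id_mask({384}, 512)                  # 64 bytes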

Is it practical to consider the further restriction of the subject-ID space, assuming that it is to be done without affecting the compatibility of any existing UAVCAN/CAN v1 systems out there?


As usual, a well-researched article here, Pavel; thanks.

I don’t think constraining the subject-ID space is wise. I wouldn’t support that.

There are two refinements I can imagine for your improved subscription reporting scheme:

  1. Use compression to allow the transmission of bitfields where small lists only take up a byte or two and the worst case of transmitting a full 4096 bytes only occurs for a node that actually subscribes to 2^15 subjects*. One can have much fun researching and implementing compressed bitfields for datasets that are normally sparse, but some kind of RLE would seem adequate, if boring (a naive sketch is appended at the end of this post).

  2. Use a “sonar” protocol rather than a periodic publication. This is different from a service call (i.e. our current “pull” model) since the request is a broadcast and the responses are not strongly correlated (I’ll otherwise let the sonar analogy describe the implementation). This has the benefit of avoiding the additional bus load for systems that do not consume the subscription data and avoids publication of the data at rates a subscriber cannot consume. Of course, this protocol is easy to abuse since it’s the perfect vector to DoS a bus, and it does qualify as control coupling where a system must utilize CPU resources to respond to the ping. Mitigations could include ping-response rate limiting and tolerating mute nodes. This also obviates the need for a “notify on changed” message since any listener to these messages can simply maintain a table of values and detect changes.

* do we need a special “subscribes to all subjects” indicator for nodes like bus loggers?
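To make the RLE suggestion in item 1 concrete, here is a naive sketch (purely illustrative; no particular encoding is being proposed):

def rle_encode_mask(bits):
    """Run-length encode a subscription bitmask as a list of run lengths of
    alternating values, starting with the initial run of zeros (possibly 0).
    A sparse mask collapses to a handful of integers."""
    runs = []
    current = False  # runs alternate, starting with "not subscribed"
    count = 0
    for bit in bits:
        if bool(bit) == current:
            count += 1
        else:
            runs.append(count)
            current = bool(bit)
            count = 1
    runs.append(count)
    return runs

mask = [False] * 2 ** 15
mask[100] = mask[101] = True  # subscribed to subjects 100 and 101 only
print(rle_encode_mask(mask))  # [100, 2, 32666]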

Let me be more specific about my idea to make sure we are on the same page. If we removed the two most-significant bits of the subject-ID, thereby limiting the range from [0, 32768) to [0, 8192), the size of the bitmask message would be 1024 bytes, which is guaranteed to fit into one Ethernet frame (or CAN XL frame) assuming a typical MTU. Only now I realize that we neglected to define the required size of the subject-ID space, choosing it based on the available means of implementation (the CAN ID layout) rather than objective application requirements.
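A quick back-of-the-envelope check of the framing claim, assuming IPv4 without options and the standard 1500-byte Ethernet MTU:

MTU = 1500                       # typical Ethernet MTU
IPV4_HEADER, UDP_HEADER = 20, 8  # IPv4 without options
max_udp_payload = MTU - IPV4_HEADER - UDP_HEADER  # 1472 bytes

mask_15_bit = 2 ** 15 // 8  # 4096 bytes: requires multi-frame transfers
mask_13_bit = 2 ** 13 // 8  # 1024 bytes: fits into a single frame
CAN_XL_MAX_PAYLOAD = 2048   # 1024 bytes also fits a single CAN XL frame
assert mask_13_bit <= max_udp_payload < mask_15_bit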

Should we approach this problem now?

A sensible combination of the above two methods plus your special case is possible:

@union

uint16 CAPACITY = 2 ** 15  # Or 2 ** 13, see above

bool[CAPACITY]                                mask
uavcan.node.port.SubjectID.1.0[<CAPACITY/8/2] list  # Same maximum size as "mask"
uavcan.primitive.Empty.1.0                    total
# Option "mask" can be used always unconditionally.
# In the interest of bandwidth optimization, option "list" may be used if the number
# of subjects is less than (CAPACITY/16).
# Option "total" is equivalent to "mask" with all bits set (useful for loggers/analyzers).

@assert _offset_.max / 8 == 1 + CAPACITY / 8

Compression is indeed possible, but are we not concerned about the resulting complex relationship between the subject-ID distribution and the message size, along with the variable computational workload necessary for its generation and parsing?

The control coupling (scheduling) problem is part of the reason why I would like to replace services with a subject; the sonar method does not seem helpful here. Another problem is that in a switched network, a switch is assumed to be a purely reactive device that does not initiate traffic on its own; otherwise, the latency analysis gets complicated. In an operational network, the network switching hardware may be the only consumer that is interested in the subscription information, and if nobody is driving the sonar protocol, the switches will be unable to auto-configure themselves.

The traffic we create with the subscription information can be pushed down to the SLOW or even OPTIONAL priority level because in a statically configured system the loss of any of the messages is tolerable.

I missed that you were hoping to configure switches with this. That would require specialized hardware would it not?

Yes, as is the case with AFDX. But it does not affect compatibility with conventional networking hardware, obviously.

We have defined it. [0, 32768). Perhaps Derrida would be happier with our reasons than Plato, but it does seem the decision was made. I think what you mean is that we have new information you want to inject into this decision that might modify it. If so, we would need more concrete inputs. A general idea that such a reduction in subject-ID scope “might make it easier to build a switch for UAVCAN” isn’t an adequate input.

I think I need to stew on this more given that I was not reading it properly. We seem to be solving an immediate problem while guessing at possible hardware requirements in the future?

I think I am making that mistake again where I just spill out some ideas without explaining the larger context first. The missing bit here is that, having released the v1.0-alpha spec half a year ago, we are on track to release v1.0-beta very soon, once the last remaining little issue is taken care of. Once the beta is out and the first users from the PX4 community have adopted it (which will happen fast, and I am regularly being reminded about the urgency of this epic), we will be stuck with our design decisions, and the ability to change anything that is not a fatal design flaw yet has a chance of affecting compatibility with the fielded systems will be essentially gone for a long time.

Now, with that in mind, I am somewhat paranoidly looking for any loose screws we may have left in the foundations and the chain of conclusions that led us to where we are. While doing so, I stumble upon the [0, 32768) hanging in free air supported by nothing, and panic.

I am going to tattoo this line on my face as the best description of developing long-term technical specifications.

I hear you, but I’m specifically dubious because of the speculation around hardware support. We have, to date, defined a specification that has no (known) requirements outside of what commonly available hardware can support. Now we are proposing a constraint that materially limits the protocol’s immediate capabilities based on speculation about non-existent hardware. If we are going to build an ASIC switch that supports UAVCAN, for example, then why would a compressed bitfield present difficulty? The decompression algorithm would be in hardware. Yes, the nodes would need to implement the compression in software, but we could easily prototype and characterize the complexity of such an algorithm to evaluate the weight it should place on our design.

Perhaps you are thinking that some COTS part designed for a similar protocol might be configurable to support our use case? If so then, again, we need a prototype to evaluate this.

We can safely omit the switching implications from this discussion for now and focus on the more immediate objectives, of which there are two:

  1. The introspection service is a required feature demanded by the upcoming DS-015 standard, so the question is how we should address it.

  2. Do we leverage the last chance to review the range, or do we keep it as is? This choice has implications for the above.