V0: tail data optimization and time slot optimization

Maxim · June 4, 2021, 10:26am

Hello.

I am currently writing a program manager for a UAVCAN network, for internal use.
While testing sending and receiving uavcan.protocol.debug.LogMessage messages under low network load (up to 100 messages per second), I quite often (several times a minute) receive invalid text messages, which are mostly truncated or incorrectly concatenated. I am using the libcanard library (Copyright © 2016-2017 UAVCAN Team).
While analyzing the receive-side transfer reassembly mechanism, I found a problem that, combined with the “optimization trimming” of the ends of packets, reveals a weakness in the reassembly information carried in the last byte of CAN frames (the 8th byte, for a full frame).

Assembly (tail) byte, at the end of the frame:
1 bit - start of transfer (multi-frame or single-frame)
1 bit - end of transfer
1 bit - toggle (packet loss control)
5 bits - transfer ID (Seq / Req ID)
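For reference, decoding this byte takes only a few shifts. A minimal sketch of the v0 layout listed above (start of transfer in bit 7, end in bit 6, toggle in bit 5, transfer ID in bits 4..0); the type and function names are hypothetical and not libcanard's API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the UAVCAN v0 tail byte (hypothetical type, for illustration). */
typedef struct {
    bool start_of_transfer; /* bit 7 */
    bool end_of_transfer;   /* bit 6 */
    bool toggle;            /* bit 5 */
    uint8_t transfer_id;    /* bits 4..0, wraps modulo 32 */
} TailByte;

static TailByte decode_tail_byte(uint8_t tail)
{
    TailByte t;
    t.start_of_transfer = ((tail >> 7) & 1U) != 0U;
    t.end_of_transfer   = ((tail >> 6) & 1U) != 0U;
    t.toggle            = ((tail >> 5) & 1U) != 0U;
    t.transfer_id       = (uint8_t)(tail & 0x1FU);
    return t;
}

/* The tail byte is always the LAST byte of the frame payload, so in a
   trimmed frame (DLC < 8) it is not at index 7. */
static TailByte tail_of_frame(const uint8_t *payload, size_t len)
{
    assert(payload != NULL && len >= 1 && len <= 8);
    return decode_tail_byte(payload[len - 1]);
}
```

For example, 0xC5 decodes as a single-frame transfer (start and end both set, toggle 0) with transfer ID 5.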

Question 1: what happens when two consecutive frames of a multi-frame transfer of this message are simply lost on the network, so that the toggle bit check still passes?
Question 2: are there any fixes for this issue in the V1 specification?
Question 3: is a low-level specification of the V1 frame format available? I want to add V1 message processing to the program now and implement the receive-side reassembly myself.
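The situation in Question 1 can be reproduced with the tail bytes alone: the toggle bit only has parity, so losing an even number of consecutive frames leaves the alternation intact. A minimal sketch, assuming a simplified receiver that checks only the tail bytes (start flag, same transfer ID, toggle flipping from 0 on the first frame as in v0, end flag on the last frame) and deliberately ignores the transfer CRC; the function name is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* v0 tail byte fields: SOT = bit 7, EOT = bit 6, toggle = bit 5, TID = bits 4..0. */
#define TAIL_SOT(t)    (((unsigned)(t) >> 7) & 1U)
#define TAIL_EOT(t)    (((unsigned)(t) >> 6) & 1U)
#define TAIL_TOGGLE(t) (((unsigned)(t) >> 5) & 1U)
#define TAIL_TID(t)    ((unsigned)(t) & 0x1FU)

/* Hypothetical, simplified check of a multi-frame transfer using ONLY the
   tail bytes of the frames that actually arrived, in arrival order.
   The transfer CRC is deliberately not checked here. */
bool tail_sequence_looks_valid(const uint8_t *tails, size_t n)
{
    assert(tails != NULL || n == 0);
    /* First frame: start of transfer, not end, toggle = 0 (v0). */
    if (n < 2 || !TAIL_SOT(tails[0]) || TAIL_EOT(tails[0]) || TAIL_TOGGLE(tails[0]) != 0U) {
        return false;
    }
    for (size_t i = 1; i < n; i++) {
        if (TAIL_SOT(tails[i])) return false;                                 /* unexpected restart */
        if (TAIL_TID(tails[i]) != TAIL_TID(tails[0])) return false;           /* different transfer */
        if (TAIL_TOGGLE(tails[i]) == TAIL_TOGGLE(tails[i - 1])) return false; /* toggle did not flip */
        if (TAIL_EOT(tails[i])) return i == n - 1;                            /* must be the last frame */
    }
    return false; /* end of transfer never seen */
}
```

With five frames of transfer ID 7 (toggles 0, 1, 0, 1, 0), dropping frames 2 and 3 leaves toggles 0, 1, 0, which this check accepts, while dropping only frame 2 is rejected. Catching the two-frame loss is therefore up to the transfer CRC, not the toggle bit.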

Do V0 or V1 provide “mandatory” synchronization commands to create transaction time windows for devices on the network?
Is there any general rule for node synchronization (for example, defined behavior on certain commands / messages) that would make it possible to implement time slots in V0 and V1?
Are there commands for a possible “transaction” coordinator?

I understand the mechanism and advantages of CAN bus hardware collision arbitration, but this is probably only good in theory …
As network load increases, sooner or later all physical transmissions become synchronized by collisions into successive waves of priority-ordered transmissions (seen with an oscilloscope).
The more devices take part in a collision, the longer the wave.
Tuning the transmission intervals of equipment from different manufacturers is quite feasible for two or three devices on the bus, but with several dozen devices the tuning becomes a task worthy of a Nobel laureate.
A device's internal transmission intervals are local to that device only. Delays in physical transmission caused by hardware arbitration of collisions force the device either to enlarge its transmit buffer (which does not solve the problem),
or to discard the current transmission in favor of the next higher-priority one (which does not solve the problem, but creates one on the receiving side of the network).
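The priority-ordered “waves” follow from CAN's bitwise arbitration: when several nodes start transmitting at once, the one with the numerically lowest identifier wins, and the others stop and retry after that frame ends. A minimal simulation, assuming all contenders use unique 29-bit extended identifiers (as UAVCAN does) and modeling only the identifier bits; the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated CAN arbitration on a wired-AND bus (dominant 0 beats recessive 1).
   Each contender sends its 29-bit identifier MSB first; a node that sends
   recessive but reads dominant loses. Returns the index of the winner. */
size_t can_arbitrate(const uint32_t *ids, size_t n)
{
    assert(ids != NULL && n > 0 && n <= 64);
    uint64_t active = (n == 64) ? UINT64_MAX : ((1ULL << n) - 1ULL);
    for (int bit = 28; bit >= 0; bit--) {
        /* The bus is recessive (1) unless some active node drives dominant (0). */
        uint32_t bus = 1U;
        for (size_t i = 0; i < n; i++) {
            if (((active >> i) & 1ULL) && ((ids[i] >> bit) & 1U) == 0U) bus = 0U;
        }
        /* Nodes whose bit differs from the bus level drop out and retry later. */
        for (size_t i = 0; i < n; i++) {
            if (((active >> i) & 1ULL) && ((ids[i] >> bit) & 1U) != bus) active &= ~(1ULL << i);
        }
    }
    /* With unique identifiers exactly one node is left: the lowest ID. */
    for (size_t i = 0; i < n; i++) {
        if ((active >> i) & 1ULL) return i;
    }
    return n; /* not reached: at least one node always survives a round */
}
```

Because every loser immediately queues up behind the winner, sustained load tends to produce exactly the back-to-back, priority-ordered bursts described above.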

Because the device logic runs asynchronously with respect to the physical completion of transfers, the time spent “passively waiting for data to be sent to the network” grows,
which increases the controller's active time before hibernation (or makes hibernation impossible).
Poor tuning of message periods (and on a large network any tuning is poor), or their erratic behavior in devices from different manufacturers, can drastically reduce the peak bus bandwidth.
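One way to budget message periods is to estimate worst-case bus time per frame. A minimal sketch using the worst-case frame length formula from the CAN schedulability literature (Davis et al., 2007), C = g + 8s + 13 + floor((g + 8s - 1) / 4) bits with g = 54 for 29-bit identifiers, where the 13 bits not subject to stuffing include the 3-bit interframe space; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case length in bits of a CAN 2.0B extended data frame with
   `data_bytes` bytes of data, including worst-case bit stuffing and the
   interframe space. */
uint32_t can_ext_frame_bits_worst(uint32_t data_bytes)
{
    assert(data_bytes <= 8);
    const uint32_t g = 54U; /* bits exposed to stuffing for 29-bit identifiers */
    return g + 8U * data_bytes + 13U + (g + 8U * data_bytes - 1U) / 4U;
}

/* Worst-case fraction of bus time used by `frames_per_s` frames of `data_bytes` each. */
double can_bus_utilization(uint32_t data_bytes, double frames_per_s, double bitrate_bps)
{
    return frames_per_s * (double)can_ext_frame_bits_worst(data_bytes) / bitrate_bps;
}
```

For example, if each of the 100 LogMessage transfers per second took three full frames (an assumed figure; the real count depends on the text length) on an assumed 1 Mbit/s bus, the worst case would be 300 × 160 / 10^6 = 4.8% of the bus.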

Mandatory support for some synchronization commands could improve the situation radically, or at least make collisions predictable.

Thank you.

pavel.kirienko · June 4, 2021, 6:57pm

Hi Maxim!

Regarding two consecutive lost frames: the transfer is discarded due to a CRC mismatch. However, frame loss is an exceedingly unlikely event on the CAN bus thanks to the built-in retransmission logic. A double frame loss (assuming that losses are statistically uncorrelated) is, naturally, much less likely still.
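In v0, each multi-frame transfer carries a 16-bit transfer CRC (CRC-16-CCITT-FALSE) in the first two bytes of the first frame's payload, so a transfer reassembled with missing frames will almost always fail the check. A minimal bitwise sketch of that CRC (polynomial 0x1021, initial value 0xFFFF, no reflection, no final XOR); the function name is hypothetical, and the v0 detail of feeding the 64-bit data type signature before the payload is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16-CCITT-FALSE, bitwise (table-free) form. Start with crc = 0xFFFF;
   the result can be fed back in to continue over more data. */
uint16_t crc16_ccitt_false(uint16_t crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000U) ? (uint16_t)((crc << 1) ^ 0x1021U)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

The standard check value for the ASCII string "123456789" is 0x29B1; since there is no final XOR, computing the CRC frame by frame gives the same result as computing it over the whole reassembled payload.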

It is highly unlikely that this issue is caused by the protocol design, because that would imply you are observing multiple CRC collisions per 6000 transfers (100 transfers per second over one minute). Your observations are better explained by a faulty implementation or a malfunctioning CAN layer (are you really losing several frames per minute?). Maybe you should log your CAN traffic for careful offline investigation.
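A back-of-the-envelope version of this argument, under deliberately pessimistic assumptions introduced here for illustration: every one of the 6000 transfers per minute is multi-frame and corrupted, and a corrupted transfer slips through the 16-bit CRC with probability about 2^{-16} (a common approximation for random errors, not a guarantee):

$$
\begin{aligned}
N &= 100\ \mathrm{s}^{-1} \times 60\ \mathrm{s} = 6000 \ \text{transfers per minute} \\
E[\text{undetected}] &\lesssim N \cdot 2^{-16} = \frac{6000}{65536} \approx 0.092 \ \text{per minute}
\end{aligned}
$$

Even in that worst case, this is roughly one undetected error every 11 minutes, not several per minute.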

If you are diagnosing the network using the old UAVCAN GUI Tool, then you are likely to see false-positive failures because its built-in protocol decoder is imperfect at best.

As for the low-level V1 specification: sure, it’s here: uavcan.org/specification

Regarding the service discipline: UAVCAN does not define a time-triggered mode because it is generally incompatible with full decentralization. It doesn’t prevent one from implementing time-triggered behaviors, though; such behaviors are simply not endorsed by the protocol itself and are unlikely to be.

If I read you correctly, you seem to be implying that non-work-conserving service disciplines (time-triggered mode) are superior to work-conserving disciplines (pure CSMA/CA) in terms of throughput and latency. As far as I understand network calculus, this is not true in general. The benefit of TT mode (or non-WC disciplines in general) is that it may simplify the analysis compared to WC disciplines, thereby, perhaps, reducing the cost of design and verification in some aspects, while introducing (usually) centralized time management and possibly limiting the overall network resource utilization factor (which follows from the definition of non-WC policies).

I certainly agree with your point that allocating port priorities so that they satisfy the timing constraints can be a daunting task. If you would like to approach the task formally, I can recommend two helpful resources to start with (you can find them online):

“Network Calculus: A Comprehensive Guide”, Bemten & Kellerer
“Safety and Certification Approaches for Ethernet-Based Aviation Databuses”, chapter 5 “Analysis of deterministic data communication”, Lee et al.

Most of the theory is focused on point-to-point switched networks, which is not directly applicable to the CAN bus. However, at least to some extent, you can model the CAN bus as an IQ (input-queued) switch, with the bus itself viewed as the switch’s fabric with no speed-up, to apply the theory developed for switched networks to CAN.

Deterministic scheduling is an interesting topic that is critically important for many UAVCAN applications. It is true that it is mostly overlooked by the existing tools out there, making life unnecessarily difficult for adopters like you, but I was hoping to approach it one day as part of our work on the Yukon project. Maybe you, or whoever is reading this, could lend us a hand here; that would be appreciated.