[Cyphal/UDP] Architectural issues caused by the dependency between the node's IP address and its identity

scottdixon · October 26, 2022, 5:42am

A brokerage service is a non-starter. It violates one of the core design tenets of Cyphal (from Section 1.3 of the specification):

Democratic network — There will be no master node. All nodes in the network will have the same communication rights; there should be no single point of failure

Multiple ports are a fire-walling nightmare (FTP anyone?). I’m not interested in making it hard to secure a system by using dynamic port assignments.

scottdixon · October 26, 2022, 5:47am

This is an assumption I’m not willing to concede.

scottdixon · October 26, 2022, 5:52am

Please define. Do you mean “dynamic, multi-cast-preconfiguration” or “dynamic-multi-cast preconfiguration”? Dynamic and preconfiguration seem to be opposing concepts so I cannot parse your argument.

scottdixon · October 26, 2022, 6:11am

This thread stopped being a useful discussion at item 7 and I TLDR the rest of it all.

We will reset this thread and pretend it didn’t descend into a flame war.

Here are some immutable facts we don’t have to argue about:

1. @Sergey_Andreev thinks the proposal on the table is over-complicated.
2. @pavel.kirienko thinks the proposal on the table is simpler than the current spec.
3. Both @Sergey_Andreev and @pavel.kirienko think that simplicity is a key design tenet (as do I).
4. Multi-tenant nodes are first-class use cases for production systems since Cyphal is relevant for application processors running operating systems as well as bare-metal firmware. This means either each node on an Ethernet end-system has to be individually addressable via Ethernet or we have to build a Cyphal router specification to handle this use-case.
5. A brokerage service will never be considered by Cyphal. We are not reinventing DDS. We are certainly not reinventing CORBA nor DCOM. We will not require a broker. Ever.
6. Dynamic IP port assignment is a nightmare to firewall. It reduces security because people normally just disable the firewall altogether rather than figure it out.

Okay, now we can stop talking about 1-6 and can start this discussion from the top.

I think Pavel’s idea is a good proposal but I am concerned about the number of multicast groups switches would have to manage. That concern is based on no data so we should go get some. Is anyone interested in prototyping this proposal and seeing if it destroys COTS Ethernet switches and/or Linux kernels when load tested?

pavel.kirienko · October 26, 2022, 5:06pm

@maksim.drachov is working on a PoC based on PyCyphal. Maksim, do we have anything to share?

maksim.drachov · October 27, 2022, 6:50am

Sorry, last weekend I had to take a break. Will continue this one.

Sergey_Andreev · October 27, 2022, 3:46pm

Yes, my proposal of dynamic UDP ports allows both options: either to statically pre-configure all the nodes (i.e. manually), or to automatically set up nodes in real time from the range of pre-allocated UDP ports.

If you don’t want a centralized broker for the latter option, then a distributed dynamic configuration is also possible - for example, similar to how ARP works. The child nodes inside a big host on a single IP can advertise their presence and suggest specific UDP ports to talk on. And nodes outside that host could also ask on the network, like - “Hey, what child nodes does that specific host/IP address have?” We can even add negotiation and acknowledgement on specific UDP ports when needed for a multi-node host use case.

I mean, the proposal of dynamic ports is very flexible (yet standard) and brings additional features/value to Cyphal networking. I.e. it expands opportunities and looks worthwhile to consider and evaluate.

scottdixon · November 17, 2022, 5:16am

This is my proposal. It’s close but adheres to RFC 2365 a “bit” better (pun intended) and fleshes out the full Cyphal Header to be a lot more robust:

1. This is an optimization for UDP/IP on Ethernet. By limiting the multicast group ID to the least significant 23 bits, Ethernet hosts can avoid additional filtering responsibilities above layer 2.
2. RFC 2365, Section 6.2.1 reserves 239.0.0.0/10 and 239.64.0.0/10 for future use (because of footnote 1, Cyphal/UDP does not have access to the 239.128.0.0/10 scope). Cyphal/UDP uses this bit to isolate version 1 traffic to the 239.64.0.0/10 scope.
3. If not set then all 13 bits of the port ID are a Subject-ID. If set then the first 9 bits of this Port-ID are a Service-ID, the 11th and 12th bits are reserved and set to 0, and the 13th bit becomes the “Request, not response” bit.
4. I’ve omitted the subnet concept for now. I think we should introduce that in a later change once the Cyphal/UDP specification is more mature.
5. We’ll register UDP ports later. These are just examples.
6. I’m not sure this makes sense. Cyphal relies on CAN for priorities so why wouldn’t we also require that prioritization has to be handled by a lower layer for UDP?
7. We want to propose using bit 23 of the Cyphal/CAN spec as the same flag, which is used to mark a given transfer as containing valid data. I’m not going to dive into this proposal right now but the TLDR is this bit must be the same for all transfers composing the same message.
8. If the “has synchronized time value” bit is set then Cyphal routers can interpret the first 56 bits after the Cyphal header as a uavcan.time.SynchronizedTimestamp-1.0 field. This packet inspection is to enable custom routing rules based on time but the specific rules are not controlled by the specification.
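
The 23-bit limit in the first note follows from how IPv4 multicast groups map onto Ethernet multicast MAC addresses (RFC 1112): only the low 23 bits of the group address are copied into the MAC, so groups that differ only in the upper bits collide at layer 2. A minimal Python sketch of that mapping follows; the function name `multicast_mac` is made up for illustration and is not part of PyCyphal or the proposal.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC (RFC 1112 mapping)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF           # only the 23 LSbs survive the mapping
    mac = (0x01005E << 24) | low23         # fixed 01:00:5e prefix, then a 0 bit, then 23 bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


# Two groups that differ only above bit 22 share one MAC, so a host subscribed
# to one of them would have to filter out the other above layer 2.
print(multicast_mac("239.64.0.1"))   # 01:00:5e:40:00:01
print(multicast_mac("239.192.0.1"))  # 01:00:5e:40:00:01
```

Keeping every Cyphal-relevant bit inside the low 23 bits means distinct groups within the chosen scope always get distinct MAC addresses.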

pavel.kirienko · November 17, 2022, 9:47am

The same is expected of Cyphal/UDP: traffic prioritization is to be handled by the lower layers. The priority field is duplicated in the header to simplify transfer forwarding as the QoS field is not always easily reachable (depending on the API used).

I don’t understand why in this proposal the responsibility of traffic routing is moved from the IP layer to the Cyphal layer. Here’s the transport model as a visual aid:

In most IP-based stacks, the responsibility of delivering datagrams to the intended recipient is delegated to the IP layer, whether it is unicast or multicast being irrelevant. My original proposal is in line with this principle; specifically:

Cyphal message delivery is based on the subject-ID; hence, the subject-ID is manifested in the multicast group address.
Cyphal RPC service transfer delivery (request or response) is based on the destination node-ID (the service-ID is irrelevant here); hence, the node-ID is manifested in the multicast group address.

The new proposal is identical for subjects but different for services and the difference creates a problem: the IP layer can no longer solve the problem of delivering service transfers to the intended recipient because the required information is not available at this layer: instead of the destination node-ID there is the service-ID, which is irrelevant for this task. This leads to the obvious problem: all nodes that provide/invoke a given service will have to receive and process (discard) service transfers that neither originate nor terminate at the local node, which creates an obvious scalability and latency problem.

I don’t think this is an acceptable proposal.

scottdixon · November 17, 2022, 8:22pm

You are, as you normally are, quite right. Here’s my updated proposal (there may be errors as I’m trying to work quickly here):

1. This is an optimization for UDP/IP on Ethernet. By limiting the multicast group ID to the least significant 23 bits, Ethernet hosts can avoid additional filtering responsibilities above layer 2.
2. RFC 2365, Section 6.2.1 reserves 239.0.0.0/10 and 239.64.0.0/10 for future use (because of footnote 1, Cyphal/UDP does not have access to the 239.128.0.0/10 scope). Cyphal/UDP uses this bit to isolate version 1 traffic to the 239.64.0.0/10 scope.
3. If set then this is an RPC request or response and the 16 LSbs of the destination IP address are the full-range destination node identifier. If not set then the 13 LSbs of the destination IP address are a subject identifier for a pub/sub message and the 14th, 15th, and 16th LSbs are 0.
4. I’ve omitted the subnet concept for now. I think we should introduce that in a later change once the Cyphal/UDP specification is more mature. As such this is 0 on transmit and ignored on receipt.
5. We’ll register UDP ports later. These are just examples.
6. Like in CAN: 0 – highest priority, 7 – lowest priority. This data is duplicated from lower-layer QoS fields but provided in the Extended Cyphal header to simplify transfer forwarding where the QoS data is not readily available above the transport layer.
7. We want to propose using bit 23 of the Cyphal/CAN spec as this same flag. It would mark a given transfer as containing valid data. I’m not going to dive into this proposal right now but the TLDR is this bit must be the same for all transfers composing the same message.
8. If the “has synchronized time value” bit is set then Cyphal routers can interpret the first 56 bits after the Cyphal header as a uavcan.time.SynchronizedTimestamp-1.0 field. This deep packet inspection enables custom routing rules based on time but the specific rules are not controlled by the specification.
9. Per RFC 1112, the default TTL is 1, which is unacceptable. Therefore, publishers should use the TTL value of 16 by default, which is chosen as a sensible default suitable for any intravehicular network.
10. The Cyphal Header Length is modeled on the IPv4 IHL and is a number of 32-bit words that points to the start of the Cyphal payload offset from the start of the Cyphal Header. By default it is 5 (and cannot be < 5) but values > 5 open up an optional extended header area. See note 11 for our proposed Extended Header structure. Note that the checksum must include the data in the extended header if it is present.
11. The first version of the extended Cyphal header (see Extended Header Version field) is designed to surface lower-layer data not strictly required to allow implementations to optimize, analyze, and monitor Cyphal traffic.
12. This is a bit of a wart; we have 20 bits reserved which is quite a lot. It’d be nice to get the header down to 4 32-bit words so I wonder if we can find 12 bits somewhere to squeeze it all in? Between the Version and the CHL we have a lot of flexibility to change this later without reserved bits.
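
For the TTL note above, a publisher would have to override the operating system's multicast TTL default of 1 explicitly. A minimal sketch using the standard Python `socket` module is below; the constant `CYPHAL_UDP_TTL` and the helper `make_publisher_socket` are hypothetical names for illustration and do not come from PyCyphal.

```python
import socket

# Default suggested in the TTL note above; the name is hypothetical.
CYPHAL_UDP_TTL = 16


def make_publisher_socket(ttl: int = CYPHAL_UDP_TTL) -> socket.socket:
    """Create a UDP socket whose outgoing multicast datagrams carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Without this option most stacks send multicast with TTL 1, which
    # prevents the datagram from crossing any router.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock
```

The option applies only to multicast transmissions from this socket; unicast traffic keeps using the regular `IP_TTL` setting.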

pavel.kirienko · November 18, 2022, 10:18am

This is much better, the IP address structure seems sensible. I have no objections against the temporary removal of the subnet-ID (domain-ID).

I couldn’t find where the service-ID is specified. In the existing PoC, it is encoded in the destination port number; my current proposal does not change this. I understand you want to use some fixed UDP port (or a small set of ports) to simplify traffic policing, so the service-ID should be located in the Cyphal header somewhere.

What is the objective of segregating anonymous traffic from regular traffic by destination port?

Both UDP and Ethernet frames contain checksums for the payload. While the UDP checksum is quite weak, the Ethernet checksum provides the Hamming distance of 4 at the maximum MTU (up to three random bit errors). Is this considered insufficient? If yes, what is the required Hamming distance, and at what MTU? Koopman CRC32 provides the Hamming distance of 6 for up to 32K of data, which I think is the state-of-the-art at the moment. The cost of these extra two bits is, however, high because the Cyphal CRC calculation cannot be delegated to the hardware (unlike Ethernet/UDP checksums). Unless there are strong arguments, I would suggest that we avoid the introduction of additional checksums here. This remark does not apply to the overall transfer CRC, which has to stay regardless.

The CHL field (header length) looks a little troubling as it means that the header would not be possible to decode by means of direct memory aliasing, and hardware traffic policing might get either complicated or impossible due to the variable offsets. I understand the appeal of the gained flexibility but are we sure we are not defying a core design goal of keeping things simple? I would rather suggest that we provide more fixed-size reserved space and keep all offsets fixed.
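
To make the variable-offset concern concrete: with a CHL field, a receiver cannot locate the payload until it has read and validated the header length. The sketch below assumes only what the proposal states (CHL counts 32-bit words and cannot be below 5); the names `MIN_CHL_WORDS`, `payload_offset`, and `split_datagram` are hypothetical, and where the CHL value sits inside the header is deliberately left out because the thread does not give it.

```python
MIN_CHL_WORDS = 5   # minimum header length in 32-bit words, per the proposal
WORD_SIZE = 4       # bytes per 32-bit word


def payload_offset(chl_words: int) -> int:
    """Return the byte offset of the payload for a given CHL value."""
    if chl_words < MIN_CHL_WORDS:
        raise ValueError(f"CHL must be at least {MIN_CHL_WORDS}, got {chl_words}")
    return chl_words * WORD_SIZE


def split_datagram(datagram: bytes, chl_words: int) -> tuple[bytes, bytes]:
    """Split a datagram into (header, payload) once the CHL value is known."""
    offset = payload_offset(chl_words)
    if len(datagram) < offset:
        raise ValueError("datagram is shorter than the declared header length")
    return datagram[:offset], datagram[offset:]
```

With a fixed-size header the offset is a compile-time constant, which is what makes direct memory aliasing and simple hardware filters possible.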

We want to propose using bit 23 of the Cyphal/CAN spec as this same flag. It would mark a given transfer as containing valid data. I’m not going to dive into this proposal right now but the TLDR is this bit must be the same for all transfers composing the same message.

We should either discuss the specifics or simply mark this bit as “transmit zero/ignore” for now.

scottdixon · November 18, 2022, 7:34pm

My thought was, given that anonymous traffic is a rare special-case, dedicating a UDP port to it would provide a nice way to route anonymous traffic to services that can handle it like PnP. Additionally, this allows administrators to disable or route anonymous traffic using routing or firewall rules.

pavel.kirienko · November 18, 2022, 7:37pm

I see that there is indeed some advantage, but then an implementation that does not need to differentiate between the two kinds of traffic would have to keep two ports open per subject, which is a disadvantage. Are we sure the advantage outweighs the disadvantage?

scottdixon · November 18, 2022, 7:38pm

Would it not be that, practically speaking, only a PnP allocator would be subscribed to messages on the anonymous port?

pavel.kirienko · November 18, 2022, 7:46pm

Maybe. But if we adopt this assumption, implementations would have to expose some option at the top-level API allowing the user to opt into anonymous traffic.

scottdixon · November 18, 2022, 7:52pm

What’s your counterproposal?

pavel.kirienko · November 18, 2022, 7:54pm

I propose to direct both anonymous and regular traffic to the same destination port; to optimize for the general case.

scottdixon · November 18, 2022, 8:10pm

I’m having to think carefully about this. On the one hand, most high-assurance systems require “end-to-end integrity checking” which often leads to every layer in a network stack adding a checksum before handing off to the next layer. This ensures that errors introduced after serialization but before transmission are caught. That said, if you are on an embedded system with lockstep CPUs and ECC RAM the expectation is your program and network stack are inherently robust against data corruption. The one weakness would be if the message was moved into non-ECC hardware buffers before the Ethernet checksum was calculated (e.g. perhaps a stack copies a message into DMA before the Ethernet header is added? Not sure). Perhaps the strategy should be to reserve enough space for such a checksum but not call out the checksum just yet?

Thoughts?

scottdixon · November 18, 2022, 8:11pm

I’m fine with this but I need to think about the overall size of these headers and protocol efficiency. At what point have we bloated Cyphal/UDP so much that it hinders simpler use cases?

scottdixon · November 19, 2022, 12:58am

Updating the proposal based on the conversation so far:

1. This is an optimization for UDP/IP on Ethernet. By limiting the multicast group ID to the least significant 23 bits, Ethernet hosts can avoid additional filtering responsibilities above layer 2.
2. RFC 2365, Section 6.2.1 reserves 239.0.0.0/10 and 239.64.0.0/10 for future use (because of footnote 1, Cyphal/UDP does not have access to the 239.128.0.0/10 scope). Cyphal/UDP uses this bit to isolate version 1 traffic to the 239.64.0.0/10 scope.
3. SNM (Service, Not Message): If set then this is an RPC request or response and the 16 LSbs of the destination IP address are the full-range destination node identifier. If not set then the 13 LSbs of the destination IP address are a subject identifier for a pub/sub message and the 14th, 15th, and 16th LSbs are 0.
4. I’ve omitted the subnet concept for now. I think we should introduce that in a later change once the Cyphal/UDP specification is more mature. As such this is 0 on transmit and ignored on receipt.
5. We’ll register UDP ports later. These are just examples.
6. Per RFC 1112, the default TTL is 1, which is unacceptable. Therefore, publishers should use the TTL value of 16 by default, which is chosen as a sensible default suitable for any intravehicular network.
7. Reserved bits we can use for a future version of the header that supports a variable size or we can decide to do other stuff with these bits. In anticipation of the former this shall always be 8 for version 1 of the header.
8. I want to propose using bit 23 of the Cyphal/CAN spec as this same flag. It would mark a given transfer as containing valid data. I’m not going to dive into this proposal right now but the TLDR is this bit must be the same for all transfers composing the same message.
9. If the SNM bit (note 10) is set then this is a 10-bit service identifier with a 1-bit IRNR flag (note 11), otherwise it is a 13-bit subject identifier.
10. SNM (Service, Not Message). Same value as found in the destination IP header (SNM, note 3).
11. IRNR (Is Request Not Response) if SNM (note 10) is set.
12. SwST (Starts with Synchronized Time): If set then Cyphal routers can interpret the first 56 bits after the Cyphal header as a uavcan.time.SynchronizedTimestamp-1.0 field. This deep packet inspection enables custom routing rules based on time but the specific rules are not controlled by the specification.
13. Like in CAN: 0 – highest priority, 7 – lowest priority. This data is duplicated from lower-layer QoS fields but provided in the Extended Cyphal header to simplify transfer forwarding where the QoS data is not readily available above the transport layer.
14. Opaque data a system can use for diagnostics and message tracing.
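
The destination address rule in the SNM note can be sketched in Python as follows. The 239.64.0.0 base and the 13-bit/16-bit field widths come from the notes above; the position of the SNM flag (`SNM_BIT`, placed here at bit 16, directly above the node-ID field) is purely an assumption for illustration, since the thread text does not state it. The function names are hypothetical as well, and the subnet bits are left at 0 as the notes require on transmit.

```python
import ipaddress

# Version 1 scope named in the notes above.
SCOPE_BASE = int(ipaddress.IPv4Address("239.64.0.0"))
# ASSUMPTION: the thread does not say where the SNM flag sits; bit 16 is a guess.
SNM_BIT = 1 << 16


def subject_group(subject_id: int) -> ipaddress.IPv4Address:
    """Multicast group for a pub/sub message: SNM clear, 13-bit subject-ID in the LSbs."""
    if not 0 <= subject_id < 2**13:
        raise ValueError("subject-ID must fit in 13 bits")
    return ipaddress.IPv4Address(SCOPE_BASE | subject_id)


def service_group(destination_node_id: int) -> ipaddress.IPv4Address:
    """Multicast group for an RPC transfer: SNM set, 16-bit destination node-ID in the LSbs."""
    if not 0 <= destination_node_id < 2**16:
        raise ValueError("node-ID must fit in 16 bits")
    return ipaddress.IPv4Address(SCOPE_BASE | SNM_BIT | destination_node_id)


print(subject_group(1234))   # 239.64.4.210
print(service_group(42))     # 239.65.0.42
```

Because service transfers are addressed by destination node-ID, the IP layer can deliver them to the intended recipient only, which is the property requested earlier in the thread.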