Cyphal/UDP Routing over Multiple Networks

What are your thoughts on routing Cyphal service requests and responses over Multiple Networks? What questions and scenarios should we be considering when developing the Cyphal/UDP specification?


I’m working with Scott Dixon (@scottdixon) and just started looking at the development of the Cyphal/UDP specification.

I see there is some current development and proposals in forum posts by Pavel (see 1, 2, 3 below), but I’m unsure whether more work has already been completed or researched, specifically around how transfers will be routed through a network. From what I have read, it looks like we can use multicast as a specialized broadcast to cover subjects and messaging, and we can use ports for services. But how will we address a destination server for requests and responses?

For example, if we have two local networks A and B connected over UDP, where each network contains multiple nodes, how will Node2 on network A (node-ID = 2) address Node2 on network B? See the image below. This is a simple case; we could conceivably construct a more complicated set of networks and nodes, but it should be enough to get into some discussions.

Within each network, simply utilizing the last 2 octets of the IP address should suffice for addressing and routing (as described in source 2 by Pavel), but outside of the network we need some way to also encode the network ID.

My naive approach to solving this problem assumes that we are in a controlled embedded system and that we can use statically pre-configured routing tables and some port forwarding to address nodes across networks (NetworkA:Node2 → NetworkB:Node2). However, that might interfere with how ports would be used locally for subjects and services.

  1. Alternative transport protocols in UAVCAN - #45 by pavel.kirienko
  2. Alternative transport protocols in UAVCAN
  3. https://github.com/OpenCyphal/pycyphal

I think the best and the most up-to-date reference on the experimental Cyphal/UDP transport is here:

https://pycyphal.readthedocs.io/en/stable/api/pycyphal.transport.udp.html

Did you have a chance to play with it yet? There is an illustrative demo provided in PyCyphal docs.

This problem appears to be out of the scope of Cyphal/UDP if we delegate the routing to the underlying IP layer entirely. In your example specifically, one could do either:

  1. Use a virtual network spanning both A and B.

  2. Join networks A and B by changing the second octet in one of them from 0 to anything else. E.g., Node2 in Network B would become 192.168.1.2, rendering it available to all nodes in Network A as Node-ID 258 ((1<<8)+(2<<0)). This should make the nodes reachable by means of the conventional IP layer without the need to customize the routing rules.
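The second option amounts to a direct node-ID ↔ IPv4 mapping. A minimal sketch, assuming the 192.168.0.0/16 subnet where the 16 least significant bits of the address carry the node-ID (the subnet and function names are illustrative, not part of any specification):

```python
# Illustrative node-ID <-> IPv4 mapping where the 16 least significant
# bits of the address carry the node-ID. The 192.168.0.0/16 subnet and
# the function names are assumptions of this sketch.
import ipaddress

SUBNET = ipaddress.IPv4Network("192.168.0.0/16")

def node_id_to_ip(node_id: int) -> ipaddress.IPv4Address:
    return SUBNET.network_address + node_id

def ip_to_node_id(ip: ipaddress.IPv4Address) -> int:
    return int(ip) - int(SUBNET.network_address)

# Node2 in Network B after its second-to-last octet is changed to 1:
assert str(node_id_to_ip((1 << 8) + (2 << 0))) == "192.168.1.2"
assert ip_to_node_id(ipaddress.IPv4Address("192.168.1.2")) == 258
```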

But, if we simply delegate to the IP layer then we do not expand our address space beyond the 127 limit for NodeIDs. Do we not think that 127 could be a crippling limit on a vehicle system with enough complexity to have both CAN and Ethernet networks?

I was wondering if we should explore using NAT?

Why 127? If we use the 16 least significant bits, like we do in the existing PoC, the limit is 65534 (considering the limitations of the IP layer).

In a configuration like this:

we need to specify how NodeID (NID) 1 on CAN bus 0 (for example) maps to an IP address that can be proxied on a receiving CAN bus. Let’s say NID 1 on CAN bus 0 sends a message on port 1 and an attached CAN/Ethernet gateway node (NID 3 in the diagrams) has a routing rule to re-publish port 1 messages on Ethernet…

When this message arrives on our CAN bus 1 we need a valid Node ID to indicate the source (and to respond if this is a service call). One possibility that comes to mind is a static mapping of IP addresses to node identifiers reserved for remote proxies. In our example we could say that NID 3 on bus 1 has a map where multicast messages from a given IP address would be repeated as NID 4 on bus 1. This does mean that, for very large vehicles, one cannot expect every node on every network to be proxied on every other network. The integrator must be careful to proxy remote nodes only as needed to avoid exhausting the CAN node address space.

It also means that the view of the vehicle differs between nodes on different network segments. To rebuild an unambiguous full log of all messages across all buses, the log must track both the network and the node identifier, and the mapping tables used by the CAN/Ethernet gateways must be available.

Going through this idea I can’t help but think this is similar to NAT which is why I’m wondering if that’s the technology we want to use on the Ethernet side.

Followup

Does (or will) the specification support multiple mixed networks?

For example we could conceivably create a network like the following:

The problems or situations we might run into here are:

  1. What if CAN Bus 1 Node_1 wants to broadcast a message and CAN Bus 2 Node_n wants to subscribe? Do we support that? Would there be a conflict between CAN Bus 2 Node_1 and CAN Bus 1 Node_1?
  2. How will we handle a broadcast from CAN Bus 1 Node_n to any UDP-only node? For example, the MTU of the CAN bus will be smaller than that of UDP. Should we specify that Node_m rebuilds the full message before transmitting via UDP, or should it transmit in chunks?
  3. What if CAN Bus 2 Node_n wants to communicate with CAN Bus 2 Node_258? We need to make that UDP/CAN node compatible with both UDP and CAN (192.168.1.1 and node-ID 258). In this case, Node_258 would be outside of the acceptable CAN node-ID range required to be compatible with an X.Y.1.Z IP address.
  4. This one might be unsupported, but what if we want to use a service message from CAN Bus 1 Node_n to Node_258, or to CAN Bus 2 Node_n?

Edit: Looks like Scott got to this before me

Okay, I now see what Scott meant by pointing out the limit of 128 (sic!) nodes. I assumed we were talking about Cyphal/UDP exclusively, not a bridged architecture. I will have to return to his question now then: is it acceptable to limit the total number of CAN nodes across all CAN segments in a given logical network to 128 (extra nodes are possible if certain conditions are satisfied, see below)? If yes, then the resulting network topology will be entirely flat across all segments, with the CAN-addressable node-IDs being limited to [0, 128), while the presumably more complex Cyphal/UDP nodes will be able to address the entire range of node-IDs.

I don’t immediately see how an automatic NAT-like/masquerading solution can be implemented because the node-ID addressing used in Cyphal is flat, unlike IP addressing which is hierarchical. If I understand the proposed idea correctly, masquerading requires that a node on a given segment is able to express its intent to address a peer outside of its segment explicitly. With a hierarchical address, this is done by directing traffic towards an endpoint whose address lies outside of the local subnet (defined by its mask). There is no similar mechanism in Cyphal. We could perhaps define an extension here but it comes with drawbacks such as a further increase in complexity of the transport layer.

A mapping between CAN frames and UDP frames does not exist (mostly because of the different transfer-ID management: it is cyclic in CAN and monotonic in UDP) so it does not appear to be possible to forward traffic between different networks without full transfer reassembly/segmentation at the bridge node. This will create complications when forwarding service calls because a service request and its response are matched with the help of the transfer-ID yet this information is lost when the transfer crosses a bridge. Perhaps some local state will have to be kept at the bridge to address this case (it is one aspect that seems similar to NAT).

If we were to accept the limit of 128 nodes across all CAN segments (but note once again that this limit does not affect non-CAN nodes, so the network itself may be far larger), then we could explore one obvious solution: a CAN-UDP bridge would simply absorb all traffic on one side and re-emit it on the other side. Downlink traffic (UDP->CAN) will have to be filtered by source node-ID since anything above 127 is not representable in Cyphal/CAN. The bridge will also have to rely on the uavcan.node.port.List announcements emitted by the CAN nodes to configure its own subscriptions. This is necessary because the UDP node needs to subscribe to the relevant multicast groups (send IGMP membership announcements) such that the network router would forward the required multicast traffic to it.

The special condition I mentioned earlier is that if the bridge detects that a certain node-ID is present on both the downlink (CAN side) and uplink (the other side) segments, then such traffic need not be forwarded because forwarding it would constitute a collision. This should enable the architect to expand the set of CAN nodes beyond 128, assuming that there is a subset of them that need not communicate across IP.
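The filtering and collision rules described above can be modeled in a few lines. This is a sketch only; the function and its arguments are hypothetical, not PyCyphal API:

```python
# Model of the bridge's downlink (UDP -> CAN) filtering rules described
# above. CAN_MAX_NODE_ID reflects the Cyphal/CAN node-ID ceiling; the
# function signature is illustrative only.
CAN_MAX_NODE_ID = 127

def should_forward_downlink(src_node_id: int, can_side_node_ids: set) -> bool:
    if src_node_id > CAN_MAX_NODE_ID:
        return False  # Not representable in Cyphal/CAN: drop.
    if src_node_id in can_side_node_ids:
        return False  # Same node-ID exists on the CAN side: collision.
    return True

assert not should_forward_downlink(300, set())   # Above the CAN limit.
assert not should_forward_downlink(5, {5, 6})    # Would collide.
assert should_forward_downlink(5, {6, 7})        # Safe to forward.
```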

There is also an option of static routing but it does come with undesirable side effects as Scott has highlighted so should we focus our attention on the (virtually) zero-configuration solutions for now?

I just implemented this and played a little with it, please find instructions in the README:

Interestingly, about one-quarter of the script is dealing with the transfer-ID linearization problem.
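For illustration, the core of the linearization problem can be sketched like this. This is a simplified model assuming the Cyphal/CAN transfer-ID wraps modulo 32 and transfers arrive in order; it is not the actual bridge code, which must also handle per-session state and losses:

```python
# Simplified cyclic -> monotonic transfer-ID mapping. Assumes the CAN
# transfer-ID wraps modulo 32 and transfers arrive in order.
CAN_TRANSFER_ID_MODULO = 32

class TransferIDLinearizer:
    def __init__(self):
        self._last_cyclic = None  # Last cyclic transfer-ID seen.
        self._base = 0            # Accumulated monotonic offset.

    def linearize(self, cyclic_tid: int) -> int:
        if self._last_cyclic is not None and cyclic_tid < self._last_cyclic:
            self._base += CAN_TRANSFER_ID_MODULO  # Counter wrapped around.
        self._last_cyclic = cyclic_tid
        return self._base + cyclic_tid

lin = TransferIDLinearizer()
assert [lin.linearize(t) for t in (30, 31, 0, 1)] == [30, 31, 32, 33]
```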

Notice that the bridge is not a node of its own, it does not exist above the transport layer.

A thought occurred to me that this is actually possible if we are willing to sacrifice service transfers between different network segments.

The demo script I shared above operates directly at the transport layer – it is not present at the application layer, does not have a node-ID of its own, hence we call it a bridge.

If we were to climb one level up and set up a dedicated node for the task – let’s call it a router node as opposed to a bridge – we could implement different behaviors that might suit some applications better (and some less so) compared to the bridge solution.

The router node will be able to forward messages from one segment to the other acting on its own behalf instead of relying on spoofed transfers as the bridge does. This makes it essentially non-transparent, which calls for a NAT analogy, except that there is no address masquerading happening due to the lack of hierarchical addressing in Cyphal I already mentioned. The lack of masquerading means that service transfers cannot cross between network segments, and this is the key limitation of this approach. Message transfers can be routed automatically in much the same way the bridge solution does it: the router would subscribe to uavcan.node.port.List announcements on both sides and use this information to establish appropriate subscriptions on the opposite side. Transfer-ID linearization becomes unnecessary as there are no service requests/responses to match with each other. The rest of the forwarding logic should be virtually identical.

I guess we should perhaps create a router demo script for completeness, would that be helpful?


Would the router node be a statically designated node? Or would any node that is “bridging” networks be designated as a router? Would this cause any issues if there are multiple UDP/CAN nodes in a CAN network (edit: see image below)?

In this example Node A or B could be a routing node or act as a bridge.

edit reformatting:
Consider the following scenario:

  • Node B is a designated router
  • Node C and A subscribe to Subject 123
  • Node E publishes to Subject 123
  1. Would Node A see the publication from Node E and Node B?
  2. Would Node C see the publication from Node B only?

edit:

I think having that demo would be helpful to compare with the bridge demo

I assume that we are only discussing fully automatic solutions that are able to function without manual configuration. This is partly because the case where a node is manually set up to forward specified messages is actually trivial and there is not much to discuss.

Unless we introduced some mechanism for router nodes to identify each other’s traffic, the configuration shown in your post is not admissible because it creates a routing loop. Observe:

  1. Node C is a publisher on subject X, and nodes D and E are subscribers on subject X.
  2. Node C publishes a message to X.
  3. Nodes A and B forward it to the UDP segment.
  4. Nodes A and B see the message on subject X sent by their counterparts to the UDP segment, and since they are aware that there is a subscription for this subject on the CAN segment (node D), they each forward the message back to the CAN segment.
  5. Nodes A and B see the message on subject X sent by their counterparts to the CAN segment, and since they are aware that there is a subscription for this subject on the UDP segment (node E), they each forward the message back to the UDP segment.
  6. goto 4

There is more than one way to work around this:

  • Make router nodes recognize each other; e.g., by querying a specific register like uavcan.router which would only be available and non-zero/non-empty if the node is a router. The register should provide sufficient information for routers that are connected between the same pair of network segments to ignore each other’s traffic while accepting traffic from other router nodes (that originates from other network segments). Perhaps the simplest way to do it is to populate this register with a large random number (if you see the same number on different segments, you are dealing with a router linking these segments, thus its traffic has to be ignored). EDIT: never mind the register, we already have uavcan.node.GetInfo.unique_id that can be used for this purpose as-is, no extra entities needed.

  • Borrow ideas from conventional dynamic routing protocols. For instance, introduce a standard probing message (probably on a fixed subject-ID), like:

    uavcan.node.ID.1.0[<=31] trace
    

    Any node can publish this message with the trace containing its own node-ID initially. All routers would forward messages of this type adding their own node-ID to the trace. If the trace is full or already contains the node-ID of the current router, the message is dropped. Thus routers can automatically identify their peers connected on the same network boundary and respond to changes in the network configuration (e.g., in the event of failure of one of the routers another one can take its place).

  • Statically assign a list of node-IDs per router whose traffic should be ignored. In your case, we would configure node A to ignore traffic from node B, and vice versa (on both interfaces, so four configurations in total).

  • I didn’t explore this in detail but I am confident that more options are available.
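The probing idea can be modeled in a few lines. A minimal sketch: the capacity matches the `uavcan.node.ID.1.0[<=31]` array bound, everything else is hypothetical:

```python
# Illustrative trace-based loop avoidance: a router appends its own
# node-ID to the probe's trace and drops the probe if the trace is full
# or already contains that node-ID.
TRACE_CAPACITY = 31  # Matches the uavcan.node.ID.1.0[<=31] array bound.

def forward_probe(trace, router_node_id):
    """Return the updated trace, or None if the probe must be dropped."""
    if len(trace) >= TRACE_CAPACITY or router_node_id in trace:
        return None  # Loop detected or trace exhausted: drop.
    return trace + [router_node_id]

# A probe originated by node 7 passes router 3 once but not twice:
t = forward_probe([7], 3)
assert t == [7, 3]
assert forward_probe(t, 3) is None
```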

The main question is: why does your network require more than one router per segment boundary? Is this a form of modular redundancy? Is this an unintentional side effect of your network configuration?

Whatever the solution is, redundant router nodes should not forward traffic simultaneously because this would lead to its duplication. This is not a problem in the bridge case because a bridge retains the transfer-ID and the source node-ID, allowing recipients of the traffic to deduplicate it automatically (see Specification, section 4.1.1.7 Transfer-ID).
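The automatic deduplication that the bridge case relies on can be sketched as follows. This is a simplified model; a real implementation would also handle transfer-ID timeouts and wrap-around:

```python
# Minimal receive-side deduplication keyed on (source node-ID,
# transfer-ID), as enabled by bridges that preserve both fields.
class Deduplicator:
    def __init__(self):
        self._last_tid = {}  # Source node-ID -> last accepted transfer-ID.

    def accept(self, src_node_id: int, transfer_id: int) -> bool:
        if self._last_tid.get(src_node_id) == transfer_id:
            return False  # Duplicate copy, e.g., from a redundant bridge.
        self._last_tid[src_node_id] = transfer_id
        return True

d = Deduplicator()
assert d.accept(9, 0)       # First copy accepted.
assert not d.accept(9, 0)   # Redundant copy dropped.
assert d.accept(9, 1)       # Next transfer accepted.
```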

I added a router demo script here:

It is far simpler than the bridge script. No capture/spoofing, no transfer-ID mapping. We simply receive messages from one segment and publish them to the other one (and vice versa). The set of subjects to forward is determined dynamically based on the uavcan.node.port.List announcements.
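The dynamic subject selection can be modeled simply (the data structures here are illustrative, not the PyCyphal API):

```python
# The set of subjects to forward into a segment is the union of the
# subject-IDs its nodes announce subscriptions to via
# uavcan.node.port.List. The dict layout is illustrative.
def subjects_to_forward(port_lists: dict) -> set:
    """port_lists maps remote node-ID -> set of subscribed subject-IDs."""
    wanted = set()
    for subject_ids in port_lists.values():
        wanted |= subject_ids
    return wanted

assert subjects_to_forward({4: {123, 100}, 5: {123}}) == {100, 123}
```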

To clarify, these are all hypothetical networks of Cyphal nodes.

With regards to defining a specification for UDP transport in OpenCyphal, it seems like routers and bridges are more of an application (correct me if I’m using the wrong terminology here) of Cyphal to enable communication across boundaries. The specification for transports would not natively (or automatically?) support communication across boundaries. Does that sound correct to you or am I misunderstanding?

Should a Node bridging CAN and UDP have a single NodeID shared for both transports?
If we decide to support service transfers across boundaries, could we have separate NodeIDs for each transport? The hypothetical case here is that you have 256 UDP and CAN nodes sharing a single Ethernet network, but two CAN networks where each CAN network uses every available NodeID.

This is simple!

This is my understanding also. Considering our discussion so far there is nothing that would have to be manifested in the transport layer design.

Per my proposal above both options are acceptable.

I think you mean service transfers (there is no such thing as a service request/response message). I don’t quite understand how a service transfer could cross a router if its destination node-ID is not valid on the other side of the router.

The router idea is super-interesting. This simple approach using IGMP is a great start and it provides a clear path to 802.1Qcc (SRP), 802.1CB (FRER), and other TSN protocols. Generally, this is what I’m looking for: a clear path to mapping Cyphal/CAN ports to TSN streams. This requires translation rather than encapsulation in order to maintain hardware acceleration for anything that isn’t a boundary node (i.e. we have to accept Cyphal specialization somewhere between CAN and Ethernet but once we route onto an Ethernet segment we should expect no further specialization unless/until the packet reaches another Cyphal/Ethernet<->Cyphal/CAN router).

The lack of RPC seems inelegant but, perhaps, acceptable. This is something I’ll need to consider.

My impulse is that a router is a special Cyphal node and that it is acceptable to require a valid network architecture without loops as a simplifying factor for production systems; however, such a requirement may make experimental configurations more problematic. As such, we should look to Ethernet for a solution. Don’t the multicast protocols provide mechanisms for detecting and avoiding cyclic paths?

Not to be overly pedantic but did you mean “mapping Cyphal ports to TSN streams”? My point is that there is nothing special to Cyphal/CAN ports as opposed to Cyphal/anything ports. One of the core design objectives should be to ensure a clear boundary between transport-specific features and abstract Cyphal concepts like ports.

Indeed. I want Cyphal/(TSN|UDP) to be a first-class transport that has value on its own, beyond merely tunneling Cyphal/CAN through Ethernet networks.


Absolutely. You are correct sir. Thanks.

Using Scott’s diagram from a couple of posts ago:


In regards to the problem of the “router node” approach not handling service transfers, one possibly naive solution might be to encapsulate the transfer in cases of cross-network communication, including the extra bits needed by the transport layer. So, let’s say we have NID 1 on bus 0 that wants to send a transfer to NID 1 on bus 1; we can use a special routing service-ID X. Each node should have a routing table. So in the graph we will have:

Bus 0 NID 1:
  Bus 0 → local
  * → Bus 0 NID 3 (the star means all other destinations)

Bus 0 NID 3:
  Bus 0 → local
  Eth → local
  Bus 1 → Bus 1 NID 3

Bus 1 NID 1:
  Bus 0 → Bus 1 NID 3
  Eth → Bus 1 NID 3
  Bus 1 → local

To send a service transfer, Bus 0 NID 1 will craft the following request:

service(type X, src:NID1, dst:NID3 [srcBus: bus0, dstBus: bus1, actual service(type Y, src:NID1, dst:NID1)])

where [ ] is the data field.

Bus 0 NID 3 will receive this transfer on the local bus (bus 0) and see that it is routing service type X; it will then look at the data and see that the destination bus is bus 1. Bus 0 NID 3 will consult its routing table and determine that it needs to send the transfer over UDP to Bus 1 NID 3, with the received data placed in the data section of the UDP transfer. When Bus 1 NID 3 receives the transfer, it will look inside the data and see that it needs to be delivered to the local bus 1 NID 1. Bus 1 NID 1 will have enough information to send a response to bus 0 NID 1.
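The encapsulation step can be sketched as follows. All names, the routing service-ID value, and the field layout are hypothetical illustrations of the idea above, not a proposed wire format:

```python
# Illustrative model of the routing-service encapsulation: the data
# field of "service type X" wraps the actual transfer together with the
# source and destination bus identifiers. All values are hypothetical.
from dataclasses import dataclass

ROUTING_SERVICE_ID = 500  # The "service type X" from the text (assumed).

@dataclass
class ServiceTransfer:
    service_id: int
    src_nid: int
    dst_nid: int
    data: object  # Opaque payload, or a RoutedTransfer for service X.

@dataclass
class RoutedTransfer:
    src_bus: int
    dst_bus: int
    inner: ServiceTransfer

def encapsulate(inner, src_bus, dst_bus, own_nid, gateway_nid):
    """Wrap a cross-bus transfer for delivery to the local gateway."""
    return ServiceTransfer(ROUTING_SERVICE_ID, own_nid, gateway_nid,
                           RoutedTransfer(src_bus, dst_bus, inner))

# Bus 0 NID 1 sends the actual service-Y request to Bus 1 NID 1 via its
# local gateway, Bus 0 NID 3:
actual = ServiceTransfer(service_id=42, src_nid=1, dst_nid=1, data=b"req")
wrapped = encapsulate(actual, src_bus=0, dst_bus=1, own_nid=1, gateway_nid=3)
assert wrapped.dst_nid == 3 and wrapped.data.dst_bus == 1
```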

The way to think of this approach is that we are recreating the IP address of each node by using some extra bits in the data field for the network mask. So, if we operate in 192.168.x.x, then we can say that bus 0 is 192.168.0.x, bus 1 is 192.168.1.x, and the Ethernet network is 192.168.2.x, and every IP can be directly mapped from the node-ID to an IP and back (simplified by keeping a 24-bit mask, though a 25-bit mask fits the 128 NID/IP limit better).

If we set the number of bits for the src/dst bus in the above service transfer to 25 bits, we could talk to the entire internet(!) with the only limitation that a local network on a bus cannot have more than 128 IPs. As an optimization, we could choose a number smaller than 25 bits by fixing the first Y bits (e.g. fixing 192.168.0 as the prefix would allow us to talk to two networks and have a single bit overhead)

A nice feature of this approach is that it adds no overhead for intra-bus communication. You only pay for the cross network communication if you request it.

The uavcan.metatransport namespace appears somewhat related to what you described but it approaches the problem differently, by simply tunneling transport frames on dedicated subjects:

https://nunaweb.opencyphal.org/api/storage/docs/docs/uavcan/index.html#uavcan_metatransport

These matters might be out of the scope of the Cyphal/UDP transport design though, as @schoberm pointed out above. If we want to tunnel things at the application layer then it matters little whether the underlying transport is CAN, UDP, or pigeons.

What are the specific use cases for RPC-service forwarding through the router nodes? In my understanding, they are to act as logical isolators between network segments, emitting/consuming data to/from topics on their own behalf. That is, in the view of a subscriber consuming data from a topic published by a router node, it is the router node itself that is the data provider on this topic and not some hidden agent on a different network segment. Is this not compatible with your requirements? Is, by any chance, talking to the entire Internet a hard requirement? (in that case, perhaps, it should be addressed differently)