[Cyphal/UDP] Architectural issues caused by the dependency between the node's IP address and its identity

I will try to implement the proposed changes on this branch.

After reading the pycyphal.transport.udp docs, I have a small question.

There’s this table:

| Supported transfers | Unicast | Broadcast       |
|---------------------|---------|-----------------|
| Message             | No      | Yes             |
| Service             | Yes     | Banned by Spec. |

“Broadcast” should have been “Multicast”, right?


Yes. It’s Cyphal parlance for multicast. Maybe we should change the wording to make this less confusing.

This is a big win and should improve both the simplicity and the readability of the specification and the implementations!


This is an acceptable proposal worthy of a prototype.


The proposal looks unnecessarily overcomplicated. Why not use the standard approach of dynamic ports for a multi-node configuration on the same IP? Can we just make Cyphal/UDP a multi-port service by design? You can define one fixed UDP port and a range of dynamic ports for testing/development purposes; it will be easy to implement yet robust and reliable.

We can even make it an optional production feature as follows:

  1. Single node per IP: no changes needed; the node just uses the fixed, known UDP port.
  2. Multiple nodes on the same IP: a brokerage service has to start first on the fixed UDP port and then manage allocation/deallocation of dynamic UDP ports for regular Cyphal nodes on the same IP.

The proposal actually simplifies the current draft, as I attempted to illustrate, since it reduces the design from two modes of communication down to one. The introduction of the local brokerage service would add further complexity to the original design with unclear benefits. Could you maybe elaborate on what advantages you see in the multiport design based on the local brokerage service?

Main reasons:

  1. There are already two use cases: a single node on a single IP, and multiple nodes on a single IP. The former is more for actual operation of the system in production, while the latter is more for testing/development/simulation. It is important to stay focused and not mix additional/optional features (i.e. nice-to-haves) into the main workflow. That’s why I would suggest clearly separating the two use cases from the beginning. The proposed dynamic ports concept will be fully optional and not a dependency for single-node operation. I.e. Cyphal/UDP development and production usage for nodes will not be affected by that additional feature at all: quality, schedules, spec requirements, etc. all stay as they are. Adding multi-node support will be flexible and replaceable, i.e. we can implement various brokerage APIs and port allocation schemes when (and if) needed. In the simplest case, it can be just a hardcoded port number for each service on the multi-node computer, with no need for a brokerage service at all. More advanced implementations will need nodes to pull a port number dynamically (from a file, shared memory, some API, maybe a networked call, etc.), but this concept still stays extremely simple and natural.
  2. Not reinventing the wheel. Dynamic ports and multi-port services are quite standard ways to do this. Moreover, the concept of ports was created exactly to support multiple services on a single IP, so why on earth should we avoid it in Cyphal?
  3. Not overcomplicating the Cyphal spec, and removing very custom dependencies and prerequisites on dynamic multicast pre-configuration. It looks really fragile to rely on such advanced configuration just to let a basic system work. That multicast configuration concept is really complicated by design, and who knows what it would look like in the implementation and in real life?
  4. Again, simplicity: keep the main production workflow simple, straightforward, and not dependent on the quite artificial use case of testing/dev/simulation with multiple nodes…
  5. Extensibility: hard-coding the initial concept freezes the spec and the low-level implementation, but what if we decide to extend, change, or even remove that multi-node feature? The previous version of Cyphal will be broken, and it will again be a big-bang, backwards-incompatible change in v.2 of Cyphal. But if we decide on just the very generic concept of dynamic ports, then it becomes compatible with a vast variety of implementations (from a no-op, to hard-coded port IDs, to port IDs read from a file, to an API call, etc.). And the changes won’t be breaking; different implementations can even coexist in the same system on different computers!
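
To make the simpler options from point 1 concrete, here is a minimal sketch of a node picking its UDP port from a static config file and falling back to a hardcoded value. The JSON file format, the port numbers, and the names `DEFAULT_PORTS` and `resolve_port` are all hypothetical illustrations; nothing here comes from the Cyphal specification or any existing implementation.

```python
import json
from pathlib import Path

# Hypothetical hardcoded defaults: one UDP port per node on the shared host.
DEFAULT_PORTS = {"node_a": 10001, "node_b": 10002}


def resolve_port(node_name, config_path=None, defaults=DEFAULT_PORTS):
    """Return the UDP port for a node.

    Lookup order: the JSON config file (if given and present), then the
    hardcoded defaults. Raises KeyError if the node is unknown to both.
    """
    if config_path is not None:
        path = Path(config_path)
        if path.is_file():
            # Assumed file format: {"node name": port number, ...}
            table = json.loads(path.read_text())
            if node_name in table:
                return int(table[node_name])
    return defaults[node_name]
```

A broker-based variant would replace the file read with a call to whatever allocation API the broker exposes; that part is left out because no such API has been defined in this thread.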

I am not convinced so far, but it could be that I am missing something.

Per my proposal, multi-tenant nodes are intended as a key capability for production use rather than as a development-only extension, as there are valid use cases that require collaboration between low-level (nearly) baremetal nodes and their higher-level counterparts running on a higher-level POSIX OS.

Both the original draft and my new proposal require multicasting for pub/sub. Are you proposing to use some other transport for pub/sub that is not IP multicast? Unless you do, the dependency on multicasting is going to stay regardless of the design chosen.
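
As a rough illustration of what multicast-based pub/sub addressing means in practice, the sketch below derives one IPv4 multicast group per subject-ID. The bit layout, the `239.0.0.0` prefix, and the function name are hypothetical and are not the mapping defined in the draft or in the proposal; the only fact assumed about Cyphal is the 13-bit subject-ID range (0 to 8191).

```python
import ipaddress

# Hypothetical layout: administratively scoped prefix 239.0.0.0 with the
# subject-ID placed in the least significant bits of the group address.
GROUP_PREFIX = int(ipaddress.IPv4Address("239.0.0.0"))
SUBJECT_ID_BITS = 13  # subject-IDs are assumed to span 0..8191


def subject_multicast_group(subject_id):
    """Map a subject-ID to an IPv4 multicast group (illustrative layout only)."""
    if not 0 <= subject_id < (1 << SUBJECT_ID_BITS):
        raise ValueError(f"subject-ID out of range: {subject_id}")
    return ipaddress.IPv4Address(GROUP_PREFIX | subject_id)


print(subject_multicast_group(1234))  # 239.0.4.210
```

With a mapping of this kind, a publisher sends to the group derived from the subject and every subscriber joins the same group, so neither side needs to know the other's IP address or port.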


Can you please clarify that “Uber” use case of an enormous single computer with numerous virtual nodes in it, requiring several separate multicast groups talking to different publishers (or subscribers) over Ethernet outside and inside the computer, all on just a single IP address? I don’t see any real production use case for this example; it looks truly artificial to me. Why is there such a deficit of IPs and Ethernet cards for such a huge (and, I believe, already very expensive) computer/server? If it were a real monstrous super-compute node in the system, then installing multiple network cards to support the real networking needs looks like the most appropriate solution. Or else, run software routing and make a virtual subnet (private network) inside that machine. But again - it is very low-level networking already available in Ethernet standards, so why should we reinvent a wheel here and go to the low level modifying how standard IP multicast work?

Yes, this huge system will most likely be needed for simulation and testing, i.e. placing the entire system with all the nodes on a single computer and simulating communications between them. But all the real h/w nodes look like pretty modular devices, decoupling functionality to one device each. So, one IP per h/w node looks very reasonable and is the most common production use case. We can easily create a workaround for testing/simulation: software routing, a virtual private network, or just multiple Ethernet cards physically connected together.

My proposal/example with different port numbers on a single IP was for rather realistic use cases when a few nodes are within the same multicast (or broadcast, unicast) group on the same IP. For example, a multi-motor control board, some aggregated battery management system, etc. Having simple small groups of the same devices is much better for reliability, redundancy, and reducing complexity than building spaghetti-like big centralized systems with huge functionality and numerous tricky, complicated communication paths. Such big systems just do not scale and require huge development and maintenance effort. They are almost impossible to extend, improve, and evolve in the future. You can see a lot of such examples in web apps and enterprise systems: everybody prefers to move to microservices, simple small IoT devices, decentralized systems, simple RESTful interfaces, NoSQL databases without huge central storage, and other lightweight protocols.

So, my proposal is simple: let’s follow standards and decades of Ethernet industry knowledge - that community has already verified a lot of concepts. If it looks we want something too complicated and breaking existing low-level Ethernet standards, then most likely we are wrong, not the entire Ethernet community and the experts who have already addressed the vast majority of use cases. Probably, we should Invent and Simplify here? I would highly recommend limiting the scope of Cyphal and starting with something small, but robust, reliable, standard, and extendable by design from the beginning.

The case of multi-tenant nodes is based on the core design principles (the relevant part is “Cyphal targets a wide variety of embedded systems, from high-performance on-board computers to extremely resource-constrained microcontrollers”). The design principles are not going to be reviewed, hence, this requirement is going to stay. The existing draft addresses the multi-tenant case poorly, therefore, it is inferior compared to the new proposal; the same holds for the alternative solution based on the local brokerage service. The problem is not that the local IP addresses or NICs are scarce resources; it’s that they complicate the node configuration process, and this extra complexity is avoidable as illustrated in my proposal.

I don’t understand a few things that you are saying. Could you please elaborate on these so that we are on the same page:


why should we reinvent a wheel here and go to the low level modifying how standard IP multicast work?
<…>
If it looks we want something too complicated and breaking existing low-level Ethernet standards

I don’t understand what you are referring to. My proposal is based entirely on the existing standards and does not call for modification of the standard IP multicast mechanism (or any other standard). It is implementable using existing off-the-shelf technologies and standard APIs without the need for customization (unlike, say, AFDX).


My proposal/example with different port numbers on a single IP was for rather realistic use cases when a few nodes are within the same multicast (or broadcast, unicast) group on the same IP

What are unicast groups?


We have node A and node B, each running on a single-tenant hardware unit. Node A publishes a message on subject X; node B is a subscriber and hence should receive the message. Please explain in detail how the said message is to be transferred per your proposal. Same for the case if A and B share the same multi-tenant hardware unit.


let’s follow standards and decades of Ethernet industry knowledge

We are on the same page here.

Can we make a data-driven decision with proof from real products/technologies, based on real needs from the industry?
It would be great to collect feedback from several experts from various industries and consider pros/cons data points.

  1. If you are saying that the proposed hack with hidden bits in multicast group addresses, and fully eliminating unicast and broadcast communications, are very standard, please provide proof: working products, systems, or specifications of standards that use such an approach. No such evidence has been provided so far. My feedback: it is overcomplicated, on top of the already complex concept of multicast groups. The proposed communication scheme is very non-standard, and thus potentially a source of bugs, human errors, and misconfiguration.

  2. If the use case of a highly powerful embedded device with numerous unrelated nodes in it, but having one and only one IP address, is advertised as a must, then again, please provide proof that such a product exists. From the discussion above, it is clear that it is difficult even to imagine or artificially build up such a device. My feedback: such a use case does not exist in production and is not needed in my industry (moreover, it smells of bad embedded design: a huge single embedded device with multiple unrelated responsibilities and a single point of failure).

  3. For your concerns about testing and lab research: can you please provide proof/evidence that the problem cannot be solved with multiple network cards?
    My feedback: there are cheap multiport Ethernet cards that easily address the need to simulate multiple embedded devices on a single server:
    Amazon.com - 2 ports
    https://www.amazon.com/dp/B01HH6WETO - 4 ports

  4. And can you provide any data points showing that the already existing standard approach of using multiple/dynamic UDP ports did not work? My feedback: the concept of ports on a single IP has been specifically designed for, and proven over decades in, exactly this situation of simultaneous communications of different sorts on a single IP address. As the inventor of a new/custom idea, you need to provide data points showing that existing standards do not work. I am strongly against reinventing the wheel and in favor of relying on already available, reliable, and proven approaches. My data points collected so far on standard multi-port networking services for multiple nodes on a single IP address:

  • Simple hardcoded port numbers can work just fine
  • File based configuration of different ports is more advanced
  • Reading dynamically changed config files with port numbers - even more powerful
  • A brokerage API - covers all the needs and beyond

I have to restate a few things that might have been lost in this discussion:

  • The core design goals are not going to be reviewed. Any discussion around them is off-topic.

  • The scarcity of local NICs or IP addresses is a non-issue. The issue is the complexity of commissioning a node due to the coupling between the local node-ID and its IP address. This is covered in my proposal.

I understand that you dislike IP multicasting, but I don’t understand what alternative you are proposing. Could you please answer the questions I asked in the previous post so we can move the discussion forward?


  1. I do not share the perception that IP multicasting as technology is inherently complex. The underlying principles are simple and well-understood. It is also unclear how one could avoid the reliance on multicasting without a centralized broker or a dedicated peer discovery mechanism (in the style of RTPS perhaps).

  2. Off-topic.

  3. The problem can be solved with multiple NICs or multiple local IPs per NIC, but this is beside the point. See above.

  4. I don’t understand what you are proposing here. Please provide a detailed description of your intended design as I requested in my previous post so that we can discuss it.

Again, opinions, data points, and proof from different industry experts are still needed on this topic. Unfortunately, only a subjective matter of opinion (and now a sort of executive order :wink:) has been provided in support of this spec change request so far… But real/strong data points, industry feedback, and proof from the existing network standards are not being heard; they are just being rejected without any reason and without any opposing evidence/proof…
Is there any data, proof, evidence, standards, etc. to ground this change request, apart from saying that it is the right thing to do because it is the correct approach?

Answering your questions that I might have missed above:

What are unicast groups?
A 1-to-1 group of two nodes talking directly to each other. It would be too bad if we did not support that basic use case in Cyphal at all…

Regarding my proposed alternative designs:
Option 1. Do nothing in s/w.
The problem does not exist; we can just add several h/w Ethernet NIC ports to that “uber” embedded device as needed.

Option 2. Use the concept of dynamically allocated multiple ports on a single IP.
Instead of hiding child node info in the multicast group addresses with the same single IP, we can explicitly group nodes by different port numbers if they overlap on the same IP.
I.e. if there are nodes A, B, and C on the same IP address and they belong to different pub/sub groups, then assign specific port numbers to them, say port 10001 → node A, 10002 → node B, 10003 → node C. The external traffic from other IP addresses to the nodes A, B, C will be routed by the local Ethernet driver by port number. Such an assignment is crystal clear and is a standard way to separate different unrelated consumers of network traffic on the same IP. There are a lot of options for configuring port numbers per node (hardcoding, static config file, dynamic config file, shared memory, API call to a local broker service, network call to a remote broker service, etc.).
In the case when communication between nodes is needed locally (i.e. pubs and subs are on the same IP inside that “uber” device), use the localhost IP address (i.e. 127.0.0.1). That’s the standard concept of loopback networking.
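
A minimal sketch of Option 2 on the loopback interface, assuming plain UDP sockets with one port per node. The helper names are hypothetical, and port 0 is used in place of 10001..10003 so that the OS picks free ports and the sketch runs anywhere. It covers only unicast delivery to one node; how a message on subject X reaches all of its subscribers under this scheme is the part still being asked about in this thread.

```python
import socket


def open_node_sockets(port_by_node, host="127.0.0.1"):
    """Bind one UDP socket per node; port 0 lets the OS choose a free port."""
    sockets = {}
    for name, port in port_by_node.items():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind((host, port))
        sock.settimeout(1.0)  # avoid blocking forever if nothing arrives
        sockets[name] = sock
    return sockets


def deliver(sockets, target, payload, host="127.0.0.1"):
    """Send a datagram to the port owned by `target`; return what it received."""
    port = sockets[target].getsockname()[1]
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        tx.sendto(payload, (host, port))
    data, _sender = sockets[target].recvfrom(1024)
    return data


def close_all(sockets):
    """Release all node sockets."""
    for sock in sockets.values():
        sock.close()


# The post suggests 10001, 10002, 10003; 0 is used here to avoid collisions.
nodes = open_node_sockets({"A": 0, "B": 0, "C": 0})
print(deliver(nodes, "B", b"hello B"))
close_all(nodes)
```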

The rationale is given in my proposal. I did not mean to imply that the proposed approach is the only correct way; if my write-up reads this way, I apologize as it is not intentional. I am open to consider competing proposals but so far there have been none. I asked you repeatedly to answer my specific questions so that the discussion could progress. Allow me to be redundant and restate the question once more:


Multicast IP packets are transferred point-to-point (switching notwithstanding) as long as the source and destination share the same L2 domain, so in this sense, multicast is equivalent to unicast.

The concept of dynamic port assignment is trivial. What is not trivial is how one could avoid IP multicasting without centralized brokering or active network discovery. Please see my question above.

Probably, I did not mention that in your example with two nodes on different IPs, no changes would be needed. Just use multicast for messages and unicast for service requests, as in the current Cyphal spec.
For the case of two nodes on the same IP, my feedback is as follows.

  1. This use case is not based on data, i.e. there are no real requests from the industry to support it. Just do not do it, and provide a separate IP address for each node.
  2. If the need arises in the future, then that unusual use case can be supported by standard features, i.e. a static definition of the network and adding a few ports to the unicast and multicast UDP requests. Different port numbers are a standard discriminator for network traffic routing.
  3. If someone says we need auto-discovery and automatic network configuration, the answer is again that this use case does not exist and has never been required nor justified by requests from industry experts. But anyway, if someone wants it, a bunch of standard tools is already available, from a shared file, shared memory, or other custom communication between nodes on that single IP, to a more sophisticated broker service. For example, what’s wrong with a lightweight SSDP?
  4. And finally, if someone asks to use neither static network config nor brokerage services at all, but some magic hacks with a few bits inside a multicast group address, then they need to prove that the request is valid and explain why nobody has asked for/implemented that so far, after decades of networking protocol evolution…

I think you are overemphasizing the importance of peer-to-peer (unicast) communication for a DCPS system, where p2p is considered irrelevant or an anti-pattern, depending on how you squint. In a typical Cyphal deployment, the service traffic may account for less than 1% of the total traffic, plus it is possible to implement a valid Cyphal network without reliance on the service traffic at all (more on this in the Guide), so optimizing the solution for the benefit of this rather marginal case is unwise.

The advantage of my proposal over the original draft is that it eliminates the extra entities that are currently used to support this 1% of use cases (i.e., unicast exchanges) by unifying their implementation with the other part of the protocol that supports the remaining 99% of the usage (i.e., multicast exchanges). The problem with your intended proposal, which I still have not seen but it seems like a safe guess, is that it brings extra complexity to support the part of the protocol that is of low importance to most applications.

The part where you speak about port-based discrimination still seems hand-wavy; if you could provide a comprehensive description of the scenario that I asked about earlier (nodes A&B, subject X), that would be appreciated.

I recommend that we avoid emotionally loaded language like “magic hacks” because it communicates no useful information and does not bring us closer to consensus. The multicast-based solution I proposed does not use any “hacks” but is based on the correct utilization of a well-known and simple technology. I am tempted to describe your proposed alternative based on local brokering the same way, but you might see perhaps how that would be counter-productive to our discussion.

Thanks a lot for clarifying that the main intent of this request was the removal of unicast communications from Cyphal.
I am really glad that we are close to the deal-breaking point. At Amazon, we treat data, industrial proof, and research/docs as the main basis for technical decisions.
Some emotions are actually fine, as we are humans and not robots (yet :slight_smile:).

Can you please defend those really bold statements with data, proof links, opinions from a few industry experts, any other similar industrial systems? You stated:

  1. Unicast/p2p is unimportant, irrelevant, and an anti-pattern for Cyphal
  2. It is possible to implement a Cyphal network without service traffic (i.e. p2p) at all
  3. Unicast/p2p is 1% of use cases

The exact data points against those statements (i.e. proofs and exact use cases requiring unicast p2p communications) are as follows:

Security
Authentication and authorization require p2p communication. Otherwise, it is an immediate security threat that invites man-in-the-middle attacks.
Moreover, multicast groups do not even allow one to determine the COUNT of participants in the group. So, it is like allowing a group of random people to stand around you at the ATM without your knowing the size of that group, or even whether it is present at all!
Can you please get feedback from industry experts regarding a multicast scenario for authentication/authorization in a highly secure and sensitive system?
My feedback: it makes no sense by definition; a single authority approves a specific device via direct and unaltered communications.

Encryption
Maintaining keys and security certificates requires p2p communication. We do not want key leakage or, even worse, key replacement(!) by a man-in-the-middle on a pub/sub group.

Firmware update
This is a sensitive part of the system and has to be delivered directly to the specific device. Moreover, the host needs to control the process of updating each node, which implies p2p communication.
Security is paramount here as well, and we do not want the firmware update process to be intercepted or altered.

Configuration Management
Also highly sensitive and very device specific. This is p2p communication. There is nothing about pub/sub here, but direct communication between a single authority and a specific device being configured.

Device Calibration
This is specific 1-to-1 communication following a special procedure (i.e. a state machine) to calibrate an individual device. I.e. a bunch of service messages (commands, responses). It also needs to be highly secure, protected and not altered.

Retrieving logs
A simple use case: collecting/persisting all the network communications within the embedded system and then uploading them for offline analysis, auditing, other online analytics, etc.
That’s essentially the entire 100% of production traffic to be delivered via the unicast p2p log retrieval process. I really don’t see any grounds for your statement that unicast communications are 1%…
And that process can be done at any time: either constantly in real time, or at specific moments (like after completing some major part of operations).

File transfers
It is close to the logs use case above, but more generic. This is exactly p2p communication of sensitive information.

"Safety device"
Most aerospace systems (and some others) have the concept of a redundant safety device used for post-mortem analysis after crashes of the main system. Recovering data from this device is highly secure p2p communication, and we also do not want that communication to be altered or intercepted.

External Systems integration: cloud, other networks
Connecting to some bridge, cloud, or other external network requires unicast p2p access in general. I’ve provided some specific integration examples above and they can be extrapolated to any external cloud, network, orchestration, etc. system.

This is not correct, and I said nothing that could be interpreted as such. The main intent is the improvement of the Cyphal/UDP transport definition. Please actually do read my proposal.

Please read The Cyphal Guide (in the Applications & Usage category of the OpenCyphal Forum), where you will find detailed elaboration and references. Pay specific attention to the fact that Cyphal is, in fact, a DCPS solution, and point-to-point communication is not its focus. If your application requires strictly p2p, Cyphal is not for you and there is nothing more to discuss.

Security and encryption are out of the scope at this stage of the project. This is something that we have on the longer-term roadmap, though. I don’t see how these objectives can be used to argue against the DCPS architecture unless you are willing to claim that all pub/sub architectures are inherently flawed by design.

Do we really need to continue this conversation? It is tiring.

I am fine with continuing the discussion and seeing feedback from various industries and experts, because no real data, proof, feedback from other people, etc. has been provided to support this change so far, apart from a personal opinion…

If my requests to provide data, proof, and real needs from industry for the proposed MAJOR change in Cyphal make for a “tiring discussion”, I am very sorry about that. But that is how successful commercial projects work. Successful industrial-grade products are not abstract theories or math exercises; they are relevant to actual customers and their real needs.

I hope the Cyphal project is not turning into abstract academic research or a personal toy project, though someone could have found some red flags pointing in that direction.
If Cyphal is vulnerable to voluntary major changes made without reaching a consensus with the industry, while rejecting any and all valid concerns and data points, then it looks not like open source but more like proprietary software with vendor lock-in on decisions from a single party.

Is there any voting process and a way to bring in more people from various industries to provide feedback on this request?
Can you please consider keeping the original specification, especially as this request was originally proposed as “minor”?

A brokerage service is a non-starter. It violates one of the core design tenets of Cyphal (from Section 1.3 of the specification):

Democratic network — There will be no master node. All nodes in the network will have the same communication rights; there should be no single point of failure

Multiple ports are a fire-walling nightmare (FTP anyone?). I’m not interested in making it hard to secure a system by using dynamic port assignments.