Pattern-matchable topic name hashing -- an idea that might be useful in the future

While working on v1.1 spec draft proposal, I encountered in my notes an idea that didn’t make it into my proposed v1.1 design but it seems interesting enough to write it down in case we need to revisit it later.

The current design carries the topic name directly in gossip messages:

uint8  type
void16
int8   topic_log_age    # floor(log2(topic_age)) if topic_age>0 else -1.
uint32 incompatibility
uint64 topic_hash
uint32 topic_evictions  # For pinned topics, this is 0xFFFFFFFF-subject_id.
void24                  # May be used to extend the evictions counter and/or some other purpose.
utf8[<=200] topic_name  # Has 1 byte length prefix. The name is normalized.
# Total size is 24 bytes + topic name length.

This works but the topic name may be large (we rather arbitrarily support up to 200 bytes) and pattern subscriptions require name pattern matching, which is somewhere around O(n log k) in the pattern length n and topic count k. We have our 64-bit topic hash but it alone doesn’t carry any information about the name structure so it cannot be used for pattern matching. We could avoid the need to exchange and pattern-match string names if we applied a different hash function that preserves the name structure.

The simplest idea I could come up with so far is to replace (or amend) the 64-bit hash with a wider hash that is still much narrower than the longest practical topic name. That wide hash, say w bits wide, is split into n equal-size bit chunks, with n being the number of segments in the name (e.g., foo/bar/baz has 3 segments), such that each segment gets s = clamp(floor(w/n), 40, 64) bits, with the lower limit chosen to avoid the birthday problem and the upper limit chosen to keep it easily implementable.

The construction of the hash is then simply: split the name into segments, hash each segment independently, bit-shift each segment appropriately and bitwise-or them. If s is clamped, segments that didn’t fit can be coalesced with the last one, such that we still support long names but lose the pattern matching selectivity in their last segment.

Pattern matching then becomes virtually constant-time and very cheap.

The major downside is that we lose introspection, which is perhaps the main reason why I decided to shelve this idea. The current implementation that exchanges verbatim normalized names tells you exactly how a pattern matched a topic name; if one were to subscribe to sensor/*/temperature and the subscription were to match sensor/left/temperature and sensor/green/temperature, the subscriber will be provided with pattern substitutions left and green — an essential part of service discovery. This hashing idea loses that feature which is a big deal. One could argue that we could modify the hash function such that small integers are hashed down to their own values, meaning that sensor/1234/temperature yields a meaningful 1234 (integer) as a substitution, which is exactly why I didn’t discard this idea outright but decided to write it down here.

This may be revised at some point moving forward.