Draft of a communication protocol (c11np)
In this article I draft a protocol for the emergence of communication in a collective of Bayesian agents $A$. A fundamental assumption is that agents can associate random binary strings (messages) with rewards from events in the environment and by chance eventually converge on a jargon^{1} under which those messages convey information. The objective is to (i) increase the convergence probability and to (ii) provide tools to analyze the system.
I shall refer to this protocol as c11np. The protocol consists of several conditioned procedures called reflexes over which the agents have no control. An agent $a_n$ communicates via channel $C^n$ with two handles. $C_{out}^n$ samples messages (e.g. output of NN) and relays them to recipients. $C_{in}^n$ allows the agent to observe incoming messages (e.g. input to NN). Let’s keep the term “message” somewhat loose and instead define what specifically can be transmitted over $C^n$ using different term.
Words
A binary string $s$ of length $n$ with $s^1$ on bits (i.e. count of 1’s) is a word if and only if $s^1 \in [w_{min}, w_{max}]$. During sampling $C_{out}$ discards any $s$ which isn’t a word. The cardinality of the set of all words is given by^{2}:
$$ W = \sum_{i=w_{min}}^{w_{max}} {n \choose i} $$
The probability that $s \in W$ is given by:
$$ p=\frac{W}{2^n} $$
For example if $n = 32$, $w_{min} = 7$ and $w_{max} = 9$ then $W = 41932956$ and $p \approx 0.01$.
Larger $n$ demands that the $w_{min}$ and $w_{max}$ are increased as well in order to keep $p$ practical. Although in the following sections I define other incentives for communication, random search of possible messages is still the fundamental assumption. Nevertheless, a word is the smallest constituent and agents can compose them. An evaluation which compares different word lengths might not be illuminating as the results would be task and algorithm specific. I suspect that onesizefitall value such as in the example above is sufficient.
A common trait of MAS research is a continuous transmission of random messages during training. The sparsity interval defined above controls the initial churn.
Immediate feedback
$C_{out}^n$ transmits $w$ to all handles in the set $C_{in}^A$. Since $a_n \in A$, $w$ is effectively immediately fed back into $C_{in}^n$. The immediate feedback reflex is biologically inspired and intuitively well suited for inference. Its role in c11np is to support convergence to common vocabulary by making the sender also a recipient. On the other end of the wire a related reflex is employed to reinforce patterns.
Word repetition
Word repetition is a reflex which bypasses the agent and hijacks $C_{out}^n$ sampling procedure. We transmit received $w$ from $C_{in}^n$ immediately back to $C_{in}^A$. This reflex reinforces collective familiarity with specific $w$ and contributes to the channel activity. Channel active in situations where information exchange means more efficient reward collection will more likely nudge agents to opportunistically form vocabulary.
We can tune this reflex with respect to the task at hand with a binary decision function $\delta_r(\cdot)$. When $C_{in}^n$ receives $w$ it decides according to $\delta_r(\cdot)$ whether to repeat it. If $w$ was repeated unconditionally then the channel would saturate because of an exponential growth of total replicas $Z_w$. This is in fact the limiting factor which helps us evaluate suitable $\delta_r(\cdot)$. We model this system as a branching process^{3}’^{4}.
$A$ is the number of recipients of a single transmitted replica because of immediate feedback. Let $p_w^n$ be the probability that a word $w$ is repeated by agent $a_n$. Let $X$ be a random variable of number of replicas (offspring) of any $w$ in one generation, i.e. how many replicas of $w$ are produced by all agents together. In one generation, $w$ can be repeated up to $A$ times or not repeated at all, there are therefore $A + 1$ possible outcomes of $X$.
If $X$ is i.i.d, i.e. $\forall (i, j) : p_w^i = p_w^j$, then $X$ follows the binomial distribution $X \sim \operatorname{B}(A, p_w)$. Since $p_w \in ({0}, {1})$ and $A \ge 2$, the distribution is well approximated^{5} by Poisson distribution with $\lambda = A p_w$. Let $e^{\star}$ be the ultimate extinction probability of $w$ in the branching process. To avoid saturation we need $e^* = 1$. It has been shown^{6} that $e^* = 1$ for $\lambda < 1$.
\begin{align} A p_w &< 1 \notag \\ p_w &< A^{1} \label{hom_decision_bounds} \end{align}
\eqref{hom_decision_bounds} gives us an upper bound on $p_w$. $Z_w$ can be estimated with the Borel’s distribution^{7} where $\mu = \lambda$. Therefore the homogeneous system can be modeled using known properties of the Borel’s distribution.
If $X$ is independent but nonidentically distributed, i.e. $\exists (i, j) : p_w^i \neq p_w^j$, then $X$ follows the Poisson binomial distribution^{8} $X \sim \operatorname{PoisB}(A, p_w)$. Let’s model this with p.g.f for Galtonâ€“Watson process $\{e_t\}_{t=0}^\infty$^{9}:
\begin{align} e_t &= G(e_{t  1}) = P(z_t = 0  z_0 = 1) \notag \\ \lim_{t \to \infty} e_t &= e^* \notag \\ G(e_{t + 1}) &= \sum_{i=0}^{A} p(x_i) e_t^i \notag \end{align}
We find^{10} the mean of $X$ (also known as the basic reproduction number $R_0$). The larger $E(X)$ is the longer $w$ lingers on the channel.
$$ E(X) = G’(1) = \sum_{i=1}^{A} i p(x_i) $$
\begin{align}\label{het_decision_bounds} \begin{cases} e^{*} = 1, &\text{if } E(X) < 1 \\ e^{*} < 1, &\text{otherwise} \end{cases} \end{align}
We show that bound \eqref{het_decision_bounds} is a generalized form of \eqref{hom_decision_bounds} by leveraging the fact that $\operatorname{B}(\cdot)$ is a special case of $\operatorname{PoisB}(\cdot)$^{11}:
\begin{align*} \lambda &= \sum*{i=1}^{A} i {A \choose i} (p_w)^i (1  p_w)^{A  i} \\ A p_w &= E*{\operatorname{B}(A, p_w)}(Z_1) \\ A p_w &= A p_w \end{align*}
The channel won’t become permanently saturated with $w$ as long as $\delta_r(\cdot)$ makes decisions which satisfy \eqref{het_decision_bounds}. Branching processes are often useful models for the word repetition reflex. However, while in principle one could always use $\delta_r(\cdot)$ which makes decision stochastically approximated before the system evolves, there are interesting $\delta_r(\cdot)$ for which some of the assumptions don’t hold (e.g. replication independence or knowing $A$ in advance.)
There are other practical ways of avoiding saturation such as runtime probabilistic counting with countmin sketch and similar algorithms.
Sensory deprivation
It seems selfevident that saturating with noise or withdrawing an agent’s access to observations degrades its performance. Sensory deprivation affects reward without the need to account for c11np in the Bellman equations.
Seeding
With the tools above we approximate how many new words are discovered and then model their lifetime in the system. We can adjust the rate of new words entering the system by publishing them artificially. This can be tuned with respect to the events in the environment.
To be continued…
After I completely describe c11np it remains to be experimentally tested. The objective is to bootstrap communication and fade out the reflexes as agents converge to a jargon. Therefore the reflexes focus on keeping channel active and giving the designer control.

In the academic research jargons are mainly dictionaries, no grammar beyond simple word order emerged to the best of my knowledge. See my previous article MAS communication: The State of the Art for more information. ↩︎

S. Ahmad, J. Hawkins: Properties of Sparse Distributed Representations and their Application to Hierarchical Temporal Memory ↩︎

A. Agresti: On the Extinction Times of Varying and Random Environment Branching Processes ↩︎

J. Pitman: Enumeration of Trees and Forests Related to Branching Processes and Random Walks ↩︎

Y. H. Wang: On the Number of Successes in Independent Trials ↩︎

Wikipedia: GaltonWatson survival curves for Poisson branching ↩︎

Wikipedia: Borel distribution: Derivation and branching process interpretation ↩︎

S. X. Chen, J. S. Liu: Statistical Applications of the PoissonBinomial and Conditional Bernoulli Distributions ↩︎

jbstatistics: Proof that the Binomial Distribution tends to the Poisson Distribution ↩︎