MAS communication: The State of the Art
In this article I present references to existing literature which is relevant to my endeavour. I sourced primarily from
Let’s examine some mechanisms by which messages are relayed between agents. When defining a channel, authors tend to use the dichotomy of continuous (ℝ) versus discrete (ℕ) messages. A single message tends to be low-dimensional. However, authors rarely limit the emission frequency, and agents tend to be “forced” to emit one message per iteration.
A shared ℝ channel is presented in 3. Agents modulate their “yelling”, that is, how far their signal propagates through an ℝ² environment. When presented with several independent channels with variable levels of noise, the agents pick the cleanest one. Similarly, the authors of 4 restrict access to the channel based on the distance between actors.
An ℕ² message board (one-to-all) channel is presented in 5. A message is a string from a binary alphabet with 4 unique code words. The agents play a predator-prey pursuit game in a grid-world. In the same game, the more recent 6 shows that ℝ⁴ messages improve performance over ℝ².
In research from the past decade, messages are typically outputs of DNNs. There are some established methods to discretize real-valued outputs: DRU 7, which is a logistic function over a normal distribution; the Gibbs distribution 8; and the Gumbel-Softmax estimator 9, 10.
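To make two of these discretization tricks concrete, here is a minimal sketch in plain Python. It assumes that the DRU adds Gaussian noise to the real-valued message and squashes it with a logistic function during training, then thresholds at zero during execution, and that the Gumbel-Softmax estimator perturbs logits with Gumbel noise before a temperature-scaled softmax. The function names, the default `sigma` and the `rng` parameter are my own illustrative choices and are not taken from 7, 9 or 10.

```python
import math
import random


def logistic(x):
    """Numerically stable logistic (sigmoid) function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dru(message, sigma=2.0, training=True, rng=random):
    """Assumed DRU: noisy logistic while training, hard threshold at execution."""
    if training:
        # Gaussian noise pushes the sender towards saturated, near-binary outputs.
        return logistic(message + rng.gauss(0.0, sigma))
    return 1.0 if message > 0 else 0.0


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logit + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so that both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the maximum for a stable softmax
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lowering `temperature` moves the Gumbel-Softmax output closer to a one-hot vector, which is what lets a continuous network output stand in for a discrete symbol.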
Some authors report unsatisfactory results with discrete messages:
[..] discrete communication does not yield consistent improvements when the complexity of the environment grows, and it only manages to marginally improve on the baselines when the agents are constrained to share the same weight parameters, a rather unrealistic assumption. 2
10 finds that a small vocabulary size limit conflates concepts. Instead of limiting the vocabulary size, the authors, inspired by 11, put a negative reward on low-frequency symbols. At first glance this result conflicts with 7, which considers a limited channel bandwidth a helpful constraint.
12 increases the chances of emergent communication by rewarding positive signaling and positive listening. As per the definitions in 13, positive signaling rewards the speaker for correlating its messages with its observations or actions, while positive listening rewards the listener for correlating its actions with the messages it receives.
14 studies emergent communication via negotiation. The authors present two types of channel: one with inherent semantics and one unconstrained, i.e. “cheap talk”. They find that the cheap talk channel works better for cooperative agents. However, their messages are rather short, as the agents have to decide on a list of ~4 items based on their hidden rewards for different item kinds.
Self-interested agents do not appear to ground cheap talk: […] This result suggests that the self-interested agents we implement find it difficult to cheap talk to exchange meaningful information about how to divide up the items, […] once agent interests diverge by a finite amount, no communication is to be expected. 14
14 tests performance with and without agent IDs:
Our results show that providing agent ID information does help the fixed agent achieve its aims. This effect is particularly pronounced for a selfish fixed agent; here, providing ID information uniformly improves performance. 14
15 shows that larger messages and more symbols improve performance only up to a certain point (which is probably determined by the complexity of the task).
16 considers how overhearing the opponent’s utterances affects the setting of competing teams:
Sharing messages via overhearing dialog improves generalization performance: Dialog overhearing contributes the most towards improvement in test accuracy, from below 60% in the baselines without to 75.8% with dialog overhearing. We believe this is because dialog overhearing transmits the most amount of information to the other team, as compared to a single scalar in reward sharing or a single image in task sharing. 16
Let’s examine some tools researchers use to understand and evaluate their systems.
[..] just ablating the language channel and showing a drop in task success does not prove much, as the extra capacity afforded by the channel architecture might have helped the agents’ learning process without being used to establish communication. 2
In other words, the control has resources at its disposal which are otherwise dedicated to making sense of the communication protocol. Therefore, what is compared is whether dedicating resources to communication performs better than dedicating them towards e.g. making sense of the task. This argument is disputed by 13, which identifies ∆r as a useful metric to tell whether communication is happening. In my opinion, ∆r is useful to confirm communication in RL settings if it’s positive, i.e. having a communication channel leads to a greater reward.
In the ℝ² channel of 6, the Pearson correlation coefficient between the message and the agent’s state and action reveals correlation between a message and an action, but not between a message and a state.
In 8 the authors build a matrix whose row i corresponds to a sampled state i, column j corresponds to a sampled message j, and entry ij tells how many times message j was sent in state i. Then they perform SVD to find correlations.
In 17 the authors reduce the number of agents to 2 and then interpret some messages quite convincingly. However, others 5, 2 conjecture that understanding the role of a randomly sampled message quickly becomes hard.
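The state-message matrix can be sketched in a few lines. The code below builds the count matrix from (state, message) index pairs and then approximates only the leading singular triplet by power iteration, so that the example has no dependencies. The function names and the toy data layout are hypothetical, and 8 may well use a full SVD from a numerical library instead.

```python
import math


def count_matrix(observations, n_states, n_messages):
    """Entry [i][j] counts how often message j was sent in state i."""
    counts = [[0] * n_messages for _ in range(n_states)]
    for state, message in observations:
        counts[state][message] += 1
    return counts


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform start vector
    for _ in range(iterations):
        # u = A v, then w = A^T u; normalizing w gives the next estimate of v.
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all-zero matrix: nothing to decompose
            return 0.0, [0.0] * n_rows, v
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v
```

The left vector `u` weights states and the right vector `v` weights messages, so large entries in both point at a state-message pairing that dominates the counts. In practice a library routine such as NumPy’s `numpy.linalg.svd` would return all singular vectors at once.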
Zipf’s law is an empirical law which applies to natural languages. The most frequent word appears to occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. 11 bases a mathematical model on this law and 14 argues that their results have Zipf’s law in common with natural languages.
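As a rough illustration of how one might check an emergent vocabulary against this law, the sketch below ranks symbols by frequency and compares the counts with the ideal curve in which the count at rank r is the top count divided by r. The function names are hypothetical, and this is not the analysis performed in 11 or 14.

```python
from collections import Counter


def rank_frequency(symbols):
    """Symbol counts sorted from most to least frequent."""
    return sorted(Counter(symbols).values(), reverse=True)


def zipf_expected(top_count, n_ranks):
    """Ideal Zipfian counts: the count at rank r is top_count / r."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


def zipf_relative_errors(symbols):
    """Relative deviation of each observed count from the ideal Zipf curve."""
    observed = rank_frequency(symbols)
    if not observed:
        return []
    expected = zipf_expected(observed[0], len(observed))
    return [abs(o - e) / e for o, e in zip(observed, expected)]
```

For example, a message log in which the symbols occur 6, 3 and 2 times matches the ideal curve exactly, so all relative errors are zero.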
11 suggests that there is a critical number of events that peers desire to exchange information about that favours syntax evolution. The idea is plausible, but they make some assumptions that limit the usefulness of the exact equations they derive:
[…] assumptions: first, in any one interaction between two individuals only a single new word can be learned; second, words are memorized independently of each other. 11
Making the (somewhat rough) assumption that all nouns and all verbs, respectively, occur on average at the same frequency 11
In 14 the authors collect messages and the state over many episodes and then train a classifier to predict the state given a message. The outputs of the trained classifier are the basis for the following claim:
The symbol unigram and bigram distributions of messages exchanged by prosocial agents show that agent A is not transmitting any information using the linguistic channel. On the other hand, agent B uses a diversity of symbols resulting in a long-tailed bigram symbol distribution, reminiscent of the Zipfian distribution found in natural languages. This suggests that, even though the task is symmetric, the agents differentiate into speaker and listener roles […] Thus, they have adopted a simple strategy for solving the game: an agent shares its utilities (the speaker), while the other (the listener) compares the shared utilities with their own and generates the optimal proposal. 14
18 investigates what happens when two groups which converge on their own “languages” are exposed to each other. The authors vary population ratios to explore their results. They define a desirable property of an agent’s communication ability:
We formally define mutual intelligibility in the communication protocol as the ability for each agent to play the game against itself. This is an elegant solution: if a shared communication protocol has emerged, the agent would not have any trouble playing a game with itself during test time (she “understands” what she “says” and “says” what she “understands”). 18
19 proposes a visualization technique for language evolution by sampling “meanings” and associating them with signals produced by an agent. Eventually they observe a structure in this mapping. This technique is related to 15:
Topographic similarity is the correlation of the distances between all the possible pairs of meanings and the corresponding pairs of signals 15
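The definition above leaves the distance functions and the correlation coefficient open. The sketch below is one possible instantiation under my own assumptions: Hamming distance between meanings given as attribute tuples, Levenshtein edit distance between signals, and the Pearson coefficient (rank correlation is another common choice). None of these choices or names come from 15.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def topographic_similarity(meanings, signals):
    """Correlate pairwise meaning distances with pairwise signal distances."""
    pairs = list(combinations(range(len(meanings)), 2))
    meaning_dist = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    signal_dist = [levenshtein(signals[i], signals[j]) for i, j in pairs]
    return pearson(meaning_dist, signal_dist)
```

With meanings `(0, 0)`, `(0, 1)`, `(1, 0)`, `(1, 1)`, a compositional code such as `"aa"`, `"ab"`, `"ba"`, `"bb"` scores higher than a code that assigns the same four strings in a scrambled order, which is the structure the visualization is meant to reveal.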
The speaker consistency is a measure of positive signaling, as it indicates a statistical relationship between the messages and actions of an agent, and does not tell us anything about how these messages are interpreted. On the surface, SC is a useful measure of communication because it tells us how much an agent’s message reduces the uncertainty about its subsequent action. 13
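One way to quantify how much a message reduces the uncertainty about the subsequent action is the mutual information between the two, estimated from logged (message, action) pairs. The sketch below does exactly that. Whether it matches the exact normalization used for SC in 13 is an assumption on my part, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between messages and actions."""
    n = len(pairs)
    joint = Counter(pairs)
    messages = Counter(message for message, _ in pairs)
    actions = Counter(action for _, action in pairs)
    total = 0.0
    for (message, action), count in joint.items():
        p_joint = count / n
        p_independent = (messages[message] / n) * (actions[action] / n)
        total += p_joint * math.log(p_joint / p_independent)
    return total
```

If each message fully determines one of two equally likely actions, the estimate is log 2 nats. If messages and actions are independent it is zero. As the quote points out, a high value says nothing about whether the listener actually uses the message.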
13 also reviews another metric:
Context Independence is designed to measure the degree of alignment between an agent’s messages and task concepts (corresponding to a particular part of the input space). […] This quantity relies on having a well-defined notion of “concept” (e.g. the number and colour of objects) […] Context independence captures the same intuition as speaker consistency: if a speaker is consistently using a specific word to refer to a specific concept, then communication has most likely emerged. Thus, it is also a measure of positive signaling. The difference is that CI emphasizes that a single symbol should represent a certain concept or input, whereas a high speaker consistency can be obtained using a set of symbols for each input, so long as this set of symbols is (roughly) disjoint for different inputs. 13
One surprising result of 13 is that positive signaling does not imply positive listening. The authors create scenarios in which they show that some metrics give false positives about ongoing communication, when in fact random messages lead to the same results. They further define Causal Influence of Communication (CIC), which measures the causal effect that one agent’s message has on another agent’s behaviour. This metric correctly predicts that no communication is happening in their pitfall scenarios.
Algorithms and concepts
In 25, value decomposition for shared rewards improves performance because agents no longer confuse the reward earned by their own actions with the reward earned by the actions of their peers.
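A minimal sketch of the idea, assuming the additive form in which the team value is the sum of per-agent values. The Q-values here are plain lists indexed by action, and the function names are hypothetical rather than taken from 25.

```python
def joint_q(per_agent_q, joint_action):
    """Team value as the sum of each agent's value for its own action."""
    return sum(q[action] for q, action in zip(per_agent_q, joint_action))


def greedy_joint_action(per_agent_q):
    """Under an additive decomposition, each agent can maximize independently."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)
```

Because the sum is maximized term by term, the greedy joint action can be found without enumerating all action combinations, and each agent’s value only has to explain its own contribution to the shared reward.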
Experience replay is a significant challenge in MAS because other agents are non-stationary, which destroys convergence guarantees. Some work opts for no experience replay 7, while other work empirically shows an increased chance of convergence with architecture changes 27.
28 gives an information-theoretic argument for why some languages emerge as hierarchical.
To train the agents to communicate, we augment our initial network with an additional A3C output head, that learns a communication policy π_c over which symbol to emit, and a communication value function. 29
On creole languages:
They report that, if two agents will develop different idiolects, it suffices for the community to include a third agent for a shared code to emerge. One of their most interesting findings is that, when agent communities of similar size are put in contact, the agents develop a mixed code that is simpler than either original language. 2
On task optimality problem:
Essentially, agents will develop a code that is sufficient to solve the task at hand, and hoping that such code will possess further desirable characteristics is wishful thinking. 2
On modality transition problem:
It is a natural question whether non-verbal communication could act as a stepping stone towards more complex forms of language. 2
The authors of 7 touch on how improbable it is that agents, through random search alone, come to associate a reward with sending and interpreting a single message. However, their assertion that the sender is discouraged from resending a message is not general:
Since protocols are mappings from action-observation histories to sequences of messages, the space of protocols is extremely high-dimensional. Automatically discovering effective protocols in this space remains an elusive challenge. In particular, the difficulty of exploring this space of protocols is exacerbated by the need for agents to coordinate the sending and interpreting of messages. For example, if one agent sends a useful message to another agent, it will only receive a positive reward if the receiving agent correctly interprets and acts upon that message. If it does not, the sender will be discouraged from sending that message again. Hence, positive rewards are sparse, arising only when sending and interpreting are properly coordinated, which is hard to discover via random exploration. 7
On the cry for attention:
Actions that lead to relatively higher change in the other agent are considered to be highly influential and are rewarded. We show how this reward is related to maximizing the mutual information between agents’ actions, and is thus a form of social empowerment. We hypothesize that rewarding influence may therefore encourage cooperation between agents. It may also have correlates in human cognition; experiments show that newborn infants are sensitive to correspondences between their own actions and the actions of other people, and use this to coordinate their behavior with others. 29
The learning bottleneck is, unsurprisingly, why compositional languages prevail:
While interaction and learning bias play a role in this process, much of this work emphasises the role of the learning bottleneck in driving the evolution of structure: language learners must attempt to learn a large or infinitely expressive linguistic system on the basis of a relatively small set of linguistic data. A key finding is that compositional languages (in which the meaning of a complex expression is composed of the meanings of parts of that expression) emerge from holistic (i.e. unstructured) languages as a result of repeated transmission through the learning bottleneck. 24
P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, T. Graepel: Value-Decomposition Networks For Cooperative Multi-Agent Learning