I’ll restrict myself to the cooperative multi-agent RL (MARL) setting, where multiple agents are tasked with maximizing a joint team reward shared by everyone. This setting has broad practical applicability, such as robot warehouse automation, and has garnered a significant amount of interest.
There is a lot of ground to cover with such MARL systems. Some of the research areas in this field include:
Independent Q-Learning (IQL) is probably the simplest idea in MARL algorithms: simply perform Q-learning on each agent as if it were completely unaware of the other agents. There is no inherent sharing of experience, as each agent conditions its actions only on its own history.
The problem of non-stationarity induced by having multiple agents is exemplified here. From the perspective of a single agent, the actions of the other agents are simply part of the environment, making the environment non-stationary: as the other agents learn, the transition and reward dynamics that each agent observes keep shifting. IQL typically acts as a baseline in most papers due to the ease with which it can be implemented.
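To make this concrete, here is a minimal sketch of IQL on a toy cooperative matrix game (both agents receive reward 1 only when their actions match). The `IndependentQLearner` class and the game itself are illustrative assumptions, not from any particular paper; each learner runs standard tabular Q-learning over its own observations and is oblivious to its teammate.

```python
import random
from collections import defaultdict

class IndependentQLearner:
    """Tabular Q-learner that treats other agents as part of the environment."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # maps (obs, action) -> estimated value
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, obs):
        # epsilon-greedy over this agent's own Q-values only
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, reward, next_obs):
        # standard single-agent Q-learning update on the shared team reward
        best_next = max(self.q[(next_obs, a)] for a in range(self.n_actions))
        td_target = reward + self.gamma * best_next
        self.q[(obs, action)] += self.alpha * (td_target - self.q[(obs, action)])

# Toy cooperative game: the team is rewarded only when actions coincide.
def team_reward(a1, a2):
    return 1.0 if a1 == a2 else 0.0

random.seed(0)
agents = [IndependentQLearner(n_actions=2) for _ in range(2)]
obs = 0  # single-state (stateless) game
for _ in range(2000):
    acts = [agent.act(obs) for agent in agents]
    r = team_reward(*acts)           # everyone receives the same reward
    for agent, a in zip(agents, acts):
        agent.update(obs, a, r, obs)  # each agent learns independently
```

Note that from agent 1's point of view, the reward it sees for an action depends on agent 2's current (and still-changing) policy; that dependence is exactly the non-stationarity described above.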
Value Decomposition Networks