New Zealand


Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting, and proposed the theoretically grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs, in order to control the variance of the trained policy's performance. However, in many real-world applications such as dialogue systems, pharmaceutical tests or crop management, data is collected under human supervision and the baseline remains unknown. In this paper, we apply SPIBB algorithms with a baseline estimate built from the data. We formally show safe policy improvement guarantees over the true baseline even without direct access to it. Our empirical experiments on finite- and continuous-state tasks support the theoretical findings. The approach shows little loss of performance in comparison with SPIBB when the baseline policy is given and, more importantly, drastically and significantly outperforms competing algorithms both in safe policy improvement and in average performance.
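The idea in this abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a tabular setting, a dataset of (state, action) pairs, a maximum-likelihood baseline estimate, and a hypothetical count threshold `n_min` that decides which state-action pairs count as uncertain. The function names and the Q-value table are illustrative assumptions.

```python
from collections import Counter


def estimate_baseline(dataset, actions):
    """Maximum-likelihood estimate of the behavior policy from (state, action) pairs."""
    counts = Counter(dataset)
    state_counts = Counter(s for s, _ in dataset)
    policy = {
        s: {a: counts[(s, a)] / n_s for a in actions}
        for s, n_s in state_counts.items()
    }
    return policy, counts


def spibb_greedy_step(state, baseline, counts, q_values, actions, n_min):
    """One baseline-bootstrapping improvement step in a single state.

    Pairs seen fewer than n_min times keep the (estimated) baseline
    probability; the remaining probability mass moves to the best-valued
    action among the sufficiently observed ones.
    """
    uncertain = [a for a in actions if counts[(state, a)] < n_min]
    trusted = [a for a in actions if counts[(state, a)] >= n_min]
    new_policy = {a: 0.0 for a in actions}
    for a in uncertain:
        new_policy[a] = baseline[state][a]
    if trusted:
        free_mass = sum(baseline[state][a] for a in trusted)
        best = max(trusted, key=lambda a: q_values[(state, a)])
        new_policy[best] += free_mass
    return new_policy
```

In this sketch, an action with a high estimated value but too few observations is not trusted: it keeps exactly the probability the estimated baseline gave it, which is how the bootstrapping limits the risk of acting on unreliable estimates.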


AAMAS 2020

Safe Policy Improvement with an Estimated Baseline Policy

safe policy improvement

learning

reinforcement learning

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-23 is the Thirty-Seventh AAAI Conference on Artificial Intelligence. The theme of this conference is to create collaborative bridges within and beyond AI. Like previous AAAI conferences, AAAI-23 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and two new activities: a Bridge Program and a Lab Program. Many of these activities are tailored to the theme of bridges and all are selected according to the highest standards, with additional programs for students and young researchers. 
AAAI is providing you with a conference planner, which you can use to help organize your itinerary of activities. This includes talks to attend in person, talks to attend remotely, breaks with colleagues and your sightseeing activities. To access this conference planner, please go to [https://aaai-2023.takemobi.io](https://aaai-2023.takemobi.io).

In order to access this site, you need to register. If you haven't already, please register [here](https://aaai.org/Conferences/AAAI-23/registration/).


AAAI 2023


technical paper

AAMAS is the leading scientific conference for research in autonomous agents and multi-agent systems. The AAMAS conference series was initiated in 2002 as the merging of three respected scientific meetings: the International Conference on Multi-Agent Systems (ICMAS), the International Workshop on Agent Theories, Architectures, and Languages (ATAL), and the International Conference on Autonomous Agents (AA). The aim of the joint conference is to provide a single, high-profile, internationally-respected archival forum for scientific research in the theory and practice of autonomous agents and multi-agent systems.

Browse keynotes, discussions, panels and over 300 presentations.



The computational problem of Influence maximization concerns
the selection of an initial set of nodes in a social network such that,
by sending this set a certain message, its exposure through the
network will be the highest. We propose to study this problem from
a utilitarian point of view. That is, we study a model where there are
two types of messages; one that is more likely to be propagated but
gives a lower utility per user obtaining this message, and another
that is less likely to be propagated but gives a higher utility. In our
model the utility from a user that receives both messages is not
necessarily the sum of the two utilities. The goal is to maximize
the overall utility.
Using an analysis based on bisubmodular functions, we show a
greedy algorithm with a tight approximation ratio of ½. We develop
a dynamic programming based algorithm that is more suitable to
our setting and show through extensive simulations that it outperforms
the greedy algorithm.
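
A greedy selection over (node, message type) pairs can be sketched as follows. This is only an illustration under strong assumptions, not the paper's algorithm or model: spread is limited to one hop so that expected utility can be computed exactly, and the names `expected_utility`, `greedy_seeding`, `p`, `u` and `u_both` are hypothetical. The ½ approximation ratio stated in the abstract refers to the authors' setting and is not claimed for this toy model.

```python
def expected_utility(graph, seeds, p, u, u_both):
    """Expected total utility under a one-hop spread model (an assumption).

    graph maps each node to its neighbors; seeds maps a seeded node to its
    message type ("viral" or "effective"). A seed always has its own message
    and passes it to each neighbor independently with probability p[type].
    """
    total = 0.0
    for v in graph:
        # Probability that v does NOT receive each message type.
        miss = {"viral": 1.0, "effective": 1.0}
        if v in seeds:
            miss[seeds[v]] = 0.0
        for w in graph[v]:
            if w in seeds:
                miss[seeds[w]] *= 1.0 - p[seeds[w]]
        got_v, got_e = 1.0 - miss["viral"], 1.0 - miss["effective"]
        # Utility for both messages need not equal the sum of the two.
        total += (got_v * (1.0 - got_e) * u["viral"]
                  + got_e * (1.0 - got_v) * u["effective"]
                  + got_v * got_e * u_both)
    return total


def greedy_seeding(graph, budget, p, u, u_both):
    """Repeatedly add the (node, message) pair with the largest marginal gain."""
    seeds = {}
    for _ in range(budget):
        base = expected_utility(graph, seeds, p, u, u_both)
        best_gain, best_choice = 0.0, None
        for v in graph:
            if v in seeds:
                continue
            for m in ("viral", "effective"):
                trial = dict(seeds)
                trial[v] = m
                gain = expected_utility(graph, trial, p, u, u_both) - base
                if gain > best_gain:
                    best_gain, best_choice = gain, (v, m)
        if best_choice is None:
            break  # no remaining pair improves the objective
        seeds[best_choice[0]] = best_choice[1]
    return seeds
```

The sketch makes the trade-off in the abstract concrete: the greedy step compares a message that spreads more easily against one that is worth more per recipient, and picks whichever yields the larger gain given the current seeds.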

Viral Vs. Effective: Utility Based Influence Maximization

We propose a novel technique for Active Malware Analysis (AMA) formalized as a Bayesian game between an analyzer agent and a malware agent, focusing on the decision-making strategy for the analyzer. In our model, the analyzer performs an action on the system to trigger the malware into showing malicious behavior, i.e., into activating its payload. The formalization is built upon the link between malware families and the notion of types in Bayesian games. A key point is the design of the utility function, which reflects the amount of uncertainty about the type of the adversary after the execution of an analyzer action. This allows us to devise an algorithm to play the game with the aim of minimizing the entropy of the analyzer's belief at every stage of the game in a myopic fashion. Empirical evaluation indicates that our approach results in a significant improvement both in terms of learning speed and classification score when compared to other state-of-the-art AMA techniques.
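
The myopic entropy-minimizing choice described above can be sketched as a one-step lookahead over a belief on malware types. This is an illustrative reading of the abstract, not the authors' code: the observation model `likelihood[type][action][obs]` and all function names are assumptions introduced for the example.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a belief over types."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def posterior(belief, likelihood, action, obs):
    """Bayes update, where likelihood[t][action][obs] = P(obs | type t, action)."""
    unnorm = {t: belief[t] * likelihood[t][action].get(obs, 0.0) for t in belief}
    z = sum(unnorm.values())
    if z == 0:
        return dict(belief)  # observation impossible under the model; keep belief
    return {t: w / z for t, w in unnorm.items()}


def expected_posterior_entropy(belief, likelihood, action):
    """Entropy of the updated belief, averaged over possible observations."""
    observations = {o for t in belief for o in likelihood[t][action]}
    total = 0.0
    for obs in observations:
        p_obs = sum(belief[t] * likelihood[t][action].get(obs, 0.0) for t in belief)
        if p_obs > 0:
            total += p_obs * entropy(posterior(belief, likelihood, action, obs))
    return total


def myopic_action(belief, likelihood, actions):
    """Pick the analyzer action that minimizes expected entropy one step ahead."""
    return min(actions, key=lambda a: expected_posterior_entropy(belief, likelihood, a))
```

An action that produces the same observation for every type leaves the belief unchanged, so the analyzer in this sketch prefers actions whose outcomes differ across malware families.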

Bayesian Active Malware Analysis

First-price auctions are widely used in government contracts and ad auctions. In this paper, we consider the Bayesian Nash Equilibrium (BNE) in first-price auctions with discrete value distributions. We characterize the BNE in the first-price auction and, at the same time, provide an algorithm to compute it. Moreover, we prove the existence and uniqueness of the BNE. Some previous results for continuous value distributions do not apply to discrete value distributions. Meanwhile, the uniqueness result in the discrete case is not implied by the uniqueness property in the continuous case. Unlike in the continuous case, we do not need to solve ordinary differential equations and thus do not suffer from the solution errors therein. Compared to the method of using continuous distributions to approximate discrete ones, our experiments show that our algorithm is both faster and more accurate.


Bayesian Nash Equilibrium in First-Price Auction with Discrete Value Distributions

Predicting Persuasive Effectiveness for Multimodal Behavior Adaptation using Bipolar Weighted Argument Graphs

We study the synthesis of policies for multi-agent systems to implement spatial-temporal tasks. We formalize the problem as a factored Markov decision process subject to so-called graph temporal logic specifications. The transition function and the spatial-temporal task of each agent depend on the agent itself and its neighboring agents. The structure in the model and the specifications makes it possible to develop a distributed algorithm that, given a factored Markov decision process and a graph temporal logic formula, decomposes the synthesis problem into a set of smaller synthesis problems, one for each agent. We prove that the algorithm runs in time linear in the total number of agents. The size of the synthesis problem for each agent is exponential only in the number of neighboring agents, which is typically much smaller than the number of agents. We demonstrate the algorithm in case studies on disease control and urban security. The numerical examples show that the algorithm can scale to hundreds of agents.


Policy Synthesis for Factored MDPs with Graph Temporal Logic Specifications

This work presents the concept of an adaptive safe padding that forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process. Policies are synthesised to satisfy a goal, expressed as a temporal logic formula, with maximal probability. Forcing the RL agent to stay safe during learning might limit exploration; however, we show that the proposed architecture is able to automatically handle the trade-off between efficient progress in exploration (towards goal satisfaction) and ensuring safety. Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm. Experimental results are provided to showcase the performance of the proposed method.

Cautious Reinforcement Learning with Logical Constraints

Law codes and regulations have helped organise societies for centuries, and as AI systems gain more autonomy, we question how human-agent systems can operate as peers under the same norms, especially when resources are contended. We posit that agents must be accountable and explainable by referring to which rules justify their decisions. The need for explanations is associated with user acceptance and trust. This paper's contribution is twofold: i) we propose an argumentation-based human-agent architecture to map human regulations into a culture for artificial agents with explainable behaviour. Our architecture leans on the notion of argumentative dialogues and generates explanations from the history of such dialogues, and ii) we validate our architecture with a user study in the context of human-agent path deconfliction. Our results show that explanations provide a significantly higher improvement in human performance when systems are more complex. Consequently, we argue that the criteria defining the need for explanations should also consider the complexity of a system. Qualitative findings show that when rules are more complex, explanations significantly reduce the perception of challenge for humans.

Culture-Based Explainable Human-Agent Deconfliction

Ontology alignments enable agents to communicate while preserving heterogeneity in their information. Alignments may not be provided as input and should be able to evolve when communication fails or when new information contradicting the alignment is acquired. In the Alignment Repair Game (ARG) this evolution is achieved via adaptation operators. ARG was evaluated experimentally and the experiments showed that agents converge towards successful communication and improve their alignments. However, whether the adaptation operators are formally correct, complete or redundant is still an open question. In this presentation, we introduce a formal framework based on Dynamic Epistemic Logic that allows us to answer this question. This framework allows us (1) to express the ontologies and alignments used, (2) to model the ARG adaptation operators through announcements and conservative upgrades and (3) to formally establish the correctness, partial redundancy and incompleteness of the adaptation operators in ARG.


Agent Ontology Alignment Repair through Dynamic Epistemic Logic

Swarms can be applied in many relevant domains, such as patrolling or rescue. They usually follow simple local rules, leading to complex emergent behavior. Given their wide applicability, an agent may need to take decisions in an environment containing a swarm that is not under its control, and that may even be an antagonist. Predicting the behavior of each swarm member is a great challenge, and must be done under real time constraints, since they usually move constantly following quick reactive algorithms. We propose the first two solutions for this novel problem, showing integrated on-line learning and planning for decision-making with unknown swarms: (i) we learn an ellipse abstraction of the swarm based on statistical models, and predict its future parameters using time-series; (ii) we learn algorithm parameters followed by each swarm member, in order to directly simulate them. We find in our experiments that we are significantly faster to reach an objective than local repulsive forces, at the cost of success rate in some situations. Additionally, we show that this is a challenging problem for reinforcement learning.


Real-time Learning and Planning in Environments with Swarms: A Hierarchical and a Parameter-based Simulation Approach

This paper investigates how to efficiently transition and update policies, trained initially with demonstrations, using off-policy actor-critic reinforcement learning. It is well known that techniques based on Learning from Demonstrations, for example behavior cloning, can lead to proficient policies given limited data. However, it is currently unclear how to efficiently update that policy using reinforcement learning, as these approaches inherently optimize different objective functions. Previous work has used loss functions that combine behavior cloning losses with reinforcement learning losses to enable this update. However, the components of these loss functions are often set anecdotally, and their individual contributions are not well understood. In this work, we propose the Cycle-of-Learning (CoL) framework that uses an actor-critic architecture with a loss function that combines behavior cloning and 1-step Q-learning losses with an off-policy pre-training step from human demonstrations. This enables transition from behavior cloning to reinforcement learning without performance degradation and improves reinforcement learning in terms of overall performance and training time. Additionally, we carefully study the composition of these combined losses and their impact on overall policy learning. We show that our approach outperforms state-of-the-art techniques for combining behavior cloning and reinforcement learning for both dense and sparse reward scenarios. Our results also suggest that directly including the behavior cloning loss on demonstration data helps to ensure stable learning and ground future policy updates.
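
A combined loss of the kind described above can be sketched numerically. This is a simplified illustration, not the CoL implementation: it assumes scalar states and actions, a deterministic `policy(s)`, a critic `q_fn(s, a)`, and hypothetical weights `w_bc`, `w_actor` and `w_q`; target networks and regularization terms are left out.

```python
def combined_loss(batch, policy, q_fn, gamma=0.99, w_bc=1.0, w_actor=1.0, w_q=1.0):
    """Weighted sum of behavior-cloning, actor, and 1-step Q-learning losses.

    batch is a list of dicts with keys s, a, r, s_next, done, is_demo.
    """
    n = len(batch)
    demos = [t for t in batch if t["is_demo"]]

    # Behavior cloning: squared error to the demonstrated action,
    # computed on demonstration transitions only.
    bc = sum((policy(t["s"]) - t["a"]) ** 2 for t in demos) / max(len(demos), 1)

    # Actor loss: prefer actions that the critic rates highly.
    actor = -sum(q_fn(t["s"], policy(t["s"])) for t in batch) / n

    # 1-step Q-learning loss: squared temporal-difference error.
    td = 0.0
    for t in batch:
        target = t["r"]
        if not t["done"]:
            target += gamma * q_fn(t["s_next"], policy(t["s_next"]))
        td += (target - q_fn(t["s"], t["a"])) ** 2
    q1 = td / n

    return w_bc * bc + w_actor * actor + w_q * q1
```

Keeping the behavior-cloning term active on demonstration data while the other terms train on all transitions reflects the abstract's observation that this term helps stabilize learning; the relative weights are the quantities the abstract says are often set anecdotally.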


Downloads

Next from AAMAS 2020

Viral Vs. Effective: Utility Based Influence Maximization

Similar lecture

OPT-GAN: A Broad-Spectrum Global Optimizer for Black-box Problems by Learning Distribution
