In the previous blog post we discussed some theory of how to select optimal and possibly optimal interventions in a causal framework. For those interested in decision science, this blog post may be more inspiring. This next task involves applying counterfactual quantities to boost learning performance. This is clearly very important for an RL agent, whose entire learning mechanism is based on interventions in a system. What if intervention isn’t possible? Let’s begin!

## This Series

- Causal Reinforcement Learning
- Preliminaries for CRL
- CRL Task 1: Generalised Policy Learning
- CRL Task 2: Interventions – When and Where?
- CRL Task 3: Counterfactual Decision Making
- CRL Task 4: Generalisability and Robustness
- Task 5: Learning Causal Models
- *(Coming soon)* Task 6: Causal Imitation Learning
- *(Coming soon)* Wrapping Up: Where To From Here?

## Counterfactual Decision Making

A key feature of causal inference is its ability to deal with counterfactual queries. Reinforcement learning, by its nature, deals with interventional quantities in a trial-and-error style of learning. Perhaps the most obvious question is: how can we implement RL in such a way as to deal with counterfactual and other causal factors in uncertain environments? In the preliminaries section we discussed the notion of a multi-armed bandit (MAB) and its associated policy regret. This is a natural starting point for merging reinforcement learning and causal inference theory to solve counterfactual decision problems. We now discuss some interesting work done in this area, especially in the context of unobserved confounders in decision frameworks.

Obviously, the goal of optimising a policy is to minimise the regret an agent experiences. In addition, it should achieve this minimum regret as soon as possible in the learning process. Auer et al. [25] show that policies can achieve logarithmic regret uniformly over time. Applying a policy they call *UCB2* with an input parameter $0 < \alpha < 1$, the authors show that the expected regret after $n$ plays is bounded by a term that grows logarithmically in $n$.
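To make the idea of an index policy with logarithmic regret concrete, here is a minimal sketch of UCB1, the simpler policy analysed in the same paper [25]. It is not UCB2, and the two-armed Bernoulli bandit, its payout rates and the helper `pull` are assumptions made up for illustration.

```python
import math
import random


def ucb1(pull, n_arms, n_rounds):
    """Run UCB1 and return (pull counts, empirical means) per arm."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise
        else:
            # optimism in the face of uncertainty: mean + confidence radius
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


rng = random.Random(0)
payout = [0.2, 0.8]  # hypothetical Bernoulli success rates


def pull(arm):
    """Simulated slot machine: pays 1 with the arm's success rate."""
    return 1.0 if rng.random() < payout[arm] else 0.0


counts, means = ucb1(pull, n_arms=2, n_rounds=2000)
# expected regret: gap to the best arm times the number of suboptimal pulls
regret = sum((max(payout) - payout[i]) * counts[i] for i in range(2))
```

The regret here only accumulates when the worse arm is pulled, and the shrinking confidence radius makes those pulls increasingly rare.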

All this says is that we should consider the intent of the agent and consider how the causal system is trying to exploit this intent. The agent should acknowledge its own intent and modify its policy accordingly. The authors propose incorporating this into Thompson Sampling [27] to form Causal Thompson Sampling ($TS^C$), shown in the algorithm below. This algorithm leverages observational data to seed the algorithm and adds a rule to improve the arm exploration in the multi-armed bandit with unobserved confounders (MABUC) case. The algorithm is shown to dramatically improve convergence and regret against traditional methods under certain scenarios. Comments have been added to make the algorithm self-contained. The reader is encouraged to work through it.
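The core idea, conditioning the bandit on the agent's own intent, can be sketched in a few lines. This is a simplified illustration rather than the authors' algorithm: it omits the observational seeding and the arm-weighting rule, and the payout table, the `d ^ b` intent rule and the function `run` are assumptions made up in the spirit of the drunk-gambler example of [21].

```python
import random

# Hypothetical payout table (illustrative numbers):
# PAYOUT[(d, b)][arm] = P(win | drunk=d, blinking=b, arm pulled)
PAYOUT = {
    (0, 0): (0.10, 0.50),
    (0, 1): (0.50, 0.10),
    (1, 0): (0.40, 0.20),
    (1, 1): (0.20, 0.40),
}


def run(intent_aware, n_rounds=5000, seed=0):
    """Thompson Sampling over two arms; returns the average reward."""
    rng = random.Random(seed)
    # Beta(1, 1) priors, indexed by [context][arm]
    n_ctx = 2 if intent_aware else 1
    wins = [[1, 1] for _ in range(n_ctx)]
    losses = [[1, 1] for _ in range(n_ctx)]
    total = 0
    for _ in range(n_rounds):
        d, b = rng.randint(0, 1), rng.randint(0, 1)  # unobserved confounders
        intent = d ^ b  # arm the gambler would naturally pull (assumed rule)
        ctx = intent if intent_aware else 0
        # sample a payout belief for each arm and play the best sample
        samples = [rng.betavariate(wins[ctx][a], losses[ctx][a]) for a in (0, 1)]
        arm = 0 if samples[0] >= samples[1] else 1
        reward = 1 if rng.random() < PAYOUT[(d, b)][arm] else 0
        wins[ctx][arm] += reward
        losses[ctx][arm] += 1 - reward
        total += reward
    return total / n_rounds


blind = run(intent_aware=False)  # standard Thompson Sampling
aware = run(intent_aware=True)   # one posterior per (intent, arm) pair
```

In this toy table both arms look identical to the intent-blind agent, while the intent-aware agent can learn that the arm it did not intend to pull pays better.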

Causal Thompson Sampling is empirically shown to outperform conventional methods by converging upon an optimal policy faster and with less regret than the non-causally empowered approach.

Ultimately we are interested in sequential decision making in the context of planning – the cases reinforcement learning addresses. Zhang and Bareinboim [28] generalise the previous work on MABs to take a causal approach to Markov decision processes (MDPs) with unobserved confounders. In a similar fashion to [21], the authors construct a motivating example and show that MDP algorithms perform sub-optimally when applying conventional (non-causal) methods.

First, we note why confounding in MDPs is different to confounding in MABs. Confounding in MDPs affects state and outcome variables in addition to action and outcome, and thus requires special treatment. Unlike in MABs, the MDP setting requires maximisation of reward over a long-term horizon (planning). Maximising with respect to the immediate future (greedy behaviour) thus fails to account for potentially superior long-term trajectories in state-action space (long-term strategies). The authors proceed by showing that conventional MDP algorithms are not guaranteed to learn an optimal policy in the presence of unobserved confounders. Reformulating MDPs in terms of causal inference, we can show that counterfactual-aware policies outperform purely experimental algorithms.

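**MDP with Unobserved Confounders (MDPUC) [28]:** A Markov Decision Process with Unobserved Confounders is an SCM $M$ with actions $X$, states $S$, and a binary reward $Y$:

- $\gamma \in [0, 1)$ is the discount factor.
- $U^{(t)}$ is the unobserved confounder at time-step $t$.
- $V^{(t)} = X^{(t)} \cup Y^{(t)} \cup S^{(t)}$ is the set of observed variables at time-step $t$, where $X^{(t)} \in \mathcal{X}$, $Y^{(t)} \in \mathcal{Y}$, and $S^{(t)} \in \mathcal{S}$.
- $F = \{f_x, f_y, f_s\}$ is the set of structural equations relative to $V$ such that $X^{(t)} \leftarrow f_x(s^{(t)}, u^{(t)})$, $Y^{(t)} \leftarrow f_y(x^{(t)}, s^{(t)}, u^{(t)})$, and $S^{(t)} \leftarrow f_s(x^{(t-1)}, s^{(t-1)}, u^{(t-1)})$. In other words, they determine the next state and associated reward.
- $P(u)$ encodes the probability distribution over the unobserved (exogenous) variables $U$.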

A key difference to the MABUC case is that different sets of variables can be confounded over the time horizon. The following figure shows the four different ways in which MDPUC variables can be confounded.

The figure above shows (a) MDPUC with the action (decided by the policy) to reward path confounded. (b) MDP without confounders. This is the traditional RL instance. (c) MDPUC with both the action to reward path, $X \to Y$, and the action to state path, $X \to S$, confounded. (d) MDPUC with only the action to state path, $X \to S$, confounded. Extracted from [28].

Most literature discussing reinforcement learning will develop the theory in terms of value functions. We develop these notions for the MDPUC case here. This will allow us (in theory) to exploit existing RL literature with relative ease.

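**Value Functions:** Given an MDPUC model $M$ and an arbitrary deterministic policy $\pi$, we can define the value function $V^{\pi}(s)$ as the expected discounted reward obtained by starting from state $s$ and following policy $\pi$ thereafter. The state-action value function $Q^{\pi}(s, x)$ is similarly defined, except that the agent first takes action $x$ in state $s$ and only then follows $\pi$.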
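In standard RL notation these two definitions can be sketched as follows. The symbols are assumptions for illustration ($\gamma$ for the discount factor, $Y^{(t+k)}$ for the reward $k$ steps after time-step $t$, and $\mathbb{E}_{\pi}$ for the expectation when all later actions are set by $\pi$) and may differ from the exact form used in [28]:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, Y^{(t+k)} \,\middle|\, S^{(t)} = s\right]$$

$$Q^{\pi}(s, x) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, Y^{(t+k)} \,\middle|\, S^{(t)} = s,\ do(X^{(t)} = x)\right]$$

Under this sketch, a deterministic policy links the two through $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$.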

The interpretation of these functions is the same as in the usual RL literature. They simply associate a state or state-action with an expected reward over all future time steps. Using these definitions, we can derive expressions for the well known Bellman equation and recursive definitions for the state and state-action value functions [12] for the situation where unobserved confounders are present. The authors rest their analysis of these results on the axioms of counterfactuals and the Markov property presented in the following theorem. Proofs for the following theorems are available in the source paper [28] and are not included here for brevity.

**Markovian Property in MDPUCs [28]:** For an MDPUC model $M$, a policy $\pi$ (state to action map), and a starting state $s^{(t)}$, if the agent performs action $x^{(t)}$ at round $t$ and follows $\pi$ afterwards, the following statement holds:

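We can further extend MDPUCs to counterfactual policies by considering $F_{ctf}$, the set of functions mapping the current state $s$ and the intuition of the agent $x'$ to the action $x$. By contrast, standard (experimental) policies, collected in $F_{exp}$, map the state $s$ alone to the action. With this we can derive a remarkable result encoded as a theorem below.

**Theorem:** Given an MDPUC instance $M$, let $\pi^*_{exp} = \arg\max_{\pi \in F_{exp}} V^{\pi}$ and $\pi^*_{ctf} = \arg\max_{\pi \in F_{ctf}} V^{\pi}$. For any state $s$, the following statement holds:

$$V^{\pi^*_{exp}}(s) \leq V^{\pi^*_{ctf}}(s)$$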

In other words, we can never do *worse* by considering counterfactual quantities (intent).
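A one-step toy example makes the intuition visible: every intent-blind policy is also available to an intent-aware agent, so the best intent-aware policy can only match or beat it. The sketch below is a hypothetical illustration; the reward table and intent probabilities are made-up numbers, not taken from [28].

```python
from itertools import product

# Hypothetical one-step problem: two equally likely intents, two arms.
# REWARD[intent][arm] = expected reward of pulling `arm` when the agent's
# natural inclination was `intent` (illustrative numbers only).
REWARD = [[0.15, 0.45], [0.45, 0.15]]
P_INTENT = [0.5, 0.5]


def value(policy):
    """Expected reward of a policy given as a tuple: policy[intent] -> arm."""
    return sum(P_INTENT[i] * REWARD[i][policy[i]] for i in range(2))


# Experimental (intent-blind) policies pick the same arm for every intent.
best_exp = max(value((arm, arm)) for arm in range(2))
# Counterfactual policies may map each intent to a different arm.
best_ctf = max(value(policy) for policy in product(range(2), repeat=2))
```

Here the intent-blind optimum is 0.30 while the intent-aware optimum is 0.45, achieved by always playing the arm the agent did not intend to play.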

Zhang and Bareinboim [28] continue to implement a counterfactual-aware MORMAX MDP algorithm and empirically show it superior to conventional MORMAX [29] approaches in intent-sensitive scenarios. Forney and Bareinboim [11] extend this counterfactual awareness to the design of experiments.

Forney, Pearl and Bareinboim [30] expand upon the ideas presented above by showing that counterfactual-based decision-making circumvents problems of naive randomisation when UCs are present. The formalism presented coherently fuses observational and experimental data to make well-informed decisions by estimating counterfactual quantities empirically. More concretely, they study conditions under which data collected from distinct sources and conditions can be combined to improve learning performance by an RL agent. The key insight here is that *seeing* does not equate to *doing* in the world of data. Applying a developed heuristic, a variant of Thompson Sampling [27] is introduced and empirically shown to outperform previous state-of-the-art agents. The motivating example of (potentially) drunk gamblers from [21] is extended to consider an extra dependence on whether or not all the machines have blinking lights. This yields four combinations of the states of sobriety and blinking machines.

We start by noticing that the interventional quantity $E[Y \mid do(X = x)]$ can be written in terms of counterfactuals as $E[Y_{X=x}]$, that is, the expected value of $Y$ had $X$ been $x$. Applying the law of total probability, we arrive at the useful representation

$$E[Y_{x}] = \sum_{x'} E[Y_{x} \mid x']\, P(x')$$

$E[Y_x]$ is interventional by definition. $E[Y_x \mid x']$ is either observational or counterfactual depending on whether $x = x'$ or $x \neq x'$ respectively. That is, whether or not $X = x$ has occurred. As we noted earlier, counterfactual quantities are, by their nature, not empirically available in general. Interestingly, however, intents of the agent contain information about the agent’s decision process and can reveal encoded information about unobserved confounders, as we noted in the MAB case earlier. Applying randomisation to intent conditions (with RDC) can allow computation of counterfactual quantities. In this way, observational data actually *adds* information to the seemingly more informative interventional data. This counterfactual quantity – that is not naturally realisable – is often referred to as the *effect of the treatment on the treated* [31], by way of the fact that doctors cannot retroactively observe how changing the treatment would have affected a specific patient – well, they shouldn’t!
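The difference between seeing, doing and imagining can be checked exactly on a small model. The sketch below enumerates a hypothetical version of the drunk-gambler casino: the payout table, the `d ^ b` intent rule and the helper names are assumptions made up for illustration, in the spirit of [21] and [30].

```python
from itertools import product

# Hypothetical payout table (illustrative numbers):
# PAYOUT[(d, b)][arm] = P(win | drunk=d, blinking=b, arm pulled)
PAYOUT = {
    (0, 0): (0.10, 0.50),
    (0, 1): (0.50, 0.10),
    (1, 0): (0.40, 0.20),
    (1, 1): (0.20, 0.40),
}
STATES = list(product((0, 1), repeat=2))  # the four (d, b) pairs, equally likely


def intent(d, b):
    """Arm the gambler would pull if left alone (assumed to be d XOR b)."""
    return d ^ b


def expected_payout(arm, given_intent=None):
    """E[Y_arm | intent], or E[Y | do(arm)] when given_intent is None."""
    rows = [s for s in STATES if given_intent is None or intent(*s) == given_intent]
    return sum(PAYOUT[s][arm] for s in rows) / len(rows)


seeing = expected_payout(arm=0, given_intent=0)          # observational E[Y | X=0]
doing = expected_payout(arm=0)                           # interventional E[Y | do(X=0)]
counterfactual = expected_payout(arm=0, given_intent=1)  # E[Y_{X=0} | X=1]
p_intent = 0.5  # each intent occurs in two of the four states

# law of total probability: doing = seeing * P(X=0) + counterfactual * P(X=1)
total = seeing * p_intent + counterfactual * (1 - p_intent)
```

The three quantities come out at roughly 0.15, 0.30 and 0.45, so the interventional value is exactly the intent-weighted mix of the observational and counterfactual ones.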

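**Estimation of ETT [30]:** The counterfactual quantity referred to as the effect of the treatment on the treated (ETT) is empirically estimable for any number of action choices when agents condition on their intent $I = i$ and estimate the response $Y$ to their final action choice $X = x$.

**Proof:** The ETT counterfactual quantity can be written as $E[Y_{X=x} \mid X = x']$. Applying the law of total probability and the conditional independence relation $Y_x \perp X \mid I$, we have:

$$E[Y_x \mid x'] = \sum_i E[Y_x \mid x', i]\, P(i \mid x') = \sum_i E[Y_x \mid i]\, P(i \mid x')$$

Now, we notice that $I_x = I$ because in $G_{\overline{X}}$ (graph with edges into $X$ removed) we have $(I \perp X)$. Thus,

$$\sum_i E[Y_x \mid i]\, P(i \mid x') = \sum_i E[Y_x \mid i_x]\, P(i \mid x') = \sum_i E[Y \mid do(x), i]\, P(i \mid x')$$

where the last line follows since all quantities are in relation to the same variable $x$ and so the counterfactual quantity can be written as an interventional quantity. We now notice that since $P(i \mid x')$ is observational, the intent will always match the outcome. We can thus rewrite this as an indicator function, $P(i \mid x') = \mathbb{1}_{i = x'}$, which leaves $E[Y_x \mid x'] = E[Y \mid do(x), I = x']$. The result follows.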

[30] makes use of this result to suggest heuristics for learning counterfactual quantities from (possibly noisy) experimental and observational data.

This figure shows the counterfactual possibilities for different actions and action-intents. The diagonal indicates the counterfactual quantities that have occurred. We can apply knowledge of other counterfactual quantities to learn about other possible counterfactuals. (B) indicates cross-intent learning. (C) indicates cross-arm learning. Extracted from [30].

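1. **Cross-Intent Learning:** Thinking about the total-probability decomposition $E[Y_{x_r}] = \sum_i E[Y_{x_r} \mid x_i] P(x_i)$ carefully, we notice that since it holds for every arm, we have a system of equations for outcomes conditioned on different intents. Thus to find $E[Y_{x_r} \mid x_w]$ we can learn about other intent conditions:

$$E[Y_{x_r} \mid x_w] = \frac{E[Y_{x_r}] - \sum_{i \neq w} E[Y_{x_r} \mid x_i]\, P(x_i)}{P(x_w)}$$

2. **Cross-Arm Learning:** Similarly to (1), we can observe that given information about two different arms under the same intent, we can learn about a third arm under the same intent. We have

$$P(x_w) = \frac{E[Y_{x_r}] - \sum_{i \neq w} E[Y_{x_r} \mid x_i]\, P(x_i)}{E[Y_{x_r} \mid x_w]}$$

Which is the same as writing, for a second arm $x_s$,

$$P(x_w) = \frac{E[Y_{x_s}] - \sum_{i \neq w} E[Y_{x_s} \mid x_i]\, P(x_i)}{E[Y_{x_s} \mid x_w]}$$

Combining these results and solving for $E[Y_{x_r} \mid x_w]$ we find:

$$E[Y_{x_r} \mid x_w] = \frac{\left(E[Y_{x_r}] - \sum_{i \neq w} E[Y_{x_r} \mid x_i]\, P(x_i)\right) E[Y_{x_s} \mid x_w]}{E[Y_{x_s}] - \sum_{i \neq w} E[Y_{x_s} \mid x_i]\, P(x_i)}$$

This estimate is not robust to noise in the samples. We can take a pooling approach and account for variance of reward payouts by applying an inverse-variance-weighted average:

$$E_{XArm}[Y_{x_r} \mid x_w] = \frac{\sum_{i \neq r} h_{XArm}(x_r, x_w, x_i) / \sigma^2_{x_i, x_w}}{\sum_{i \neq r} 1 / \sigma^2_{x_i, x_w}}$$

where $h_{XArm}(x_r, x_w, x_s)$ evaluates the cross-arm learning equation above, and $\sigma^2_{x, i}$ is the reward variance for arm $x$ under intent $i$.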

3. **Combined Approach:** Of course, we can combine the previous two approaches by sampling (collecting) estimates during execution and applying *cross-intent* and *cross-arm* learning strategies. A fairly straightforward derivation yields a combined approach formula:
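The three strategies can be sketched in code, assuming the decomposition $E[Y_{x_r}] = \sum_i E[Y_{x_r} \mid x_i] P(x_i)$ that underlies them. The function names, the table layout `e_ctf[arm][intent]` and all numbers are hypothetical, and the final pooling step is a generic inverse-variance average rather than the exact combined formula of [30].

```python
def cross_intent(e_do, e_ctf, p_intent, r, w):
    """Solve E[Y_{x_r}] = sum_i E[Y_{x_r} | x_i] P(x_i) for the (r, w) cell."""
    n = len(p_intent)
    known = sum(e_ctf[r][i] * p_intent[i] for i in range(n) if i != w)
    return (e_do[r] - known) / p_intent[w]


def cross_arm(e_do, e_ctf, p_intent, r, w, s):
    """Estimate the (r, w) cell from a second arm s under the same intent w."""
    n = len(p_intent)
    num = e_do[r] - sum(e_ctf[r][i] * p_intent[i] for i in range(n) if i != w)
    den = e_do[s] - sum(e_ctf[s][i] * p_intent[i] for i in range(n) if i != w)
    return num * e_ctf[s][w] / den


def inverse_variance_pool(estimates, variances):
    """Weight each estimate by 1 / variance, so noisier estimates count less."""
    weights = [1.0 / v for v in variances]
    return sum(wt * e for wt, e in zip(weights, estimates)) / sum(weights)


# Hypothetical two-arm, two-intent example; e_ctf[arm][intent] = E[Y_arm | intent].
e_do = [0.30, 0.30]                     # experimental averages E[Y_arm]
p_intent = [0.5, 0.5]                   # observed intent frequencies
e_ctf = [[0.15, None], [0.45, 0.15]]    # None marks the cell we want to learn

est_intent = cross_intent(e_do, e_ctf, p_intent, r=0, w=1)
est_arm = cross_arm(e_do, e_ctf, p_intent, r=0, w=1, s=1)
# pool the two derived estimates with a noisier (made-up) direct sample of 0.40
pooled = inverse_variance_pool([est_intent, est_arm, 0.40], [0.01, 0.01, 0.04])
```

Both derived estimates recover 0.45 for the unknown cell, and the pooled value stays close to them because the noisier direct sample receives a quarter of their weight.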

This figure shows the flow and process of fusing data using counterfactual reasoning as outlined in this section. The agent employs both the history of interventional and observational data to compute counterfactual quantities. Along with its intended action, the agent makes a counterfactual and intent aware decision to account for unobserved confounders and make use of available information. Figure extracted from [30].

The authors proceed to experimentally verify that this data-fusion approach, applied to Thompson Sampling, results in significantly less regret than competitive MABUC algorithms. The (conventional) gold standard for dealing with unobserved confounders involves randomised control trials (RCTs) [32], especially useful in medical drug testing, for example. As we noted in the data-fusion and earlier MABUC processes, randomisation of treatments may yield population-level treatment outcomes but can fail to account for individual-level characteristics. The authors provide a motivating example in the domain of personalised medicine, or the effect of the treatment on the treated, motivated by Stead et al.’s [33] observation that

“Despite their attention to evidence, studies repeatedly show marked variability in what healthcare providers actually do in a given situation.” – Stead et al. [33]

They proceed to formalise the existence of different treatment policies of actors in confounded decision-making scenarios. This new theory is applied to generalise RCT procedures to allow recovery of individualised treatment effects from existing RCT results. Further, they present an online algorithm which can recommend actions in the face of multiple treatment opinions in the context of unobserved confounders. For the sake of clarity, the motivating example is now briefly presented. The reader is encouraged to refer to the source material [26] for further details.

Consider two drugs which appear to be equally effective under an FDA-approved RCT. In practice, however, one physician finds agreement with the RCT results while another does not. Let us consider two potential unobserved confounders – socioeconomic status (SES) and the patient’s treatment request (R). Crucially, we note that juxtaposing observational and experimental data fails to reveal these invisible confounders. Key to this confounded decision making (CDM) scenario is that the deciding agent (physician) does not possess a fully specified causal model (in the form of an SCM).

This formalism was used to define a regret decision criterion (RDC) for optimising actions in the face of unobserved confounders. In the physician motivating example we just discussed, the intent-specific recovery rates of the first physician do not appear to differ from the observational or interventional recovery rates. The results of RDC for the second physician are only slightly off the data, at 72.5%. What is happening here? The key here is the heterogeneous intents of the agents. We now develop theory to account for and exploit information implicitly contained in the multiple intents of agents.

**Heterogeneous Intents [26]:** Let $A_1$ and $A_2$ be two actors within a CDM instance, and let $M^{A_1}$ be the SCM associated with the choice of policies of $A_1$, and likewise $M^{A_2}$ for $A_2$. For any decision variable $X$ and its associated intent $I = f_X$, the actors are said to have heterogeneous intent if $f_X^{A_1} \in M^{A_1}$ and $f_X^{A_2} \in M^{A_2}$ are distinct.

Acknowledging the possibility of heterogeneous intents in deciding actors (such as second opinions), we can extend the notion of SCMs. The figure below shows a model for combining intents of different agents.

**HI Structural Causal Model (HI-SCM) [26]:** A heterogeneous intent structural causal model is an SCM that combines the individual SCMs of actors such that each decision variable in is a function of each actors’ individual intents.

This figure shows an example of HI-SCM. Here multiple intents of different actors contribute to the decision variable . Unobserved confounder affects both the intents and the outcome. The agent history is encoded as . Recreated based on figure 1 in [26].

Of course, it would be naive to assume every actor adds valuable information to the total knowledge of the system. With this in mind, we develop the notion of an intent equivalence class.

**Intent Equivalence Class (IEC) [26]:** In a HI-SCM , we say that any two actors belong to separate intent equivalence classes of intent functions if .

With this definition in place, we can cluster equivalent actors – and their associated intents – by their IEC. For example, given then

It turns out that IEC-specific optimisation is always at least equivalent to each actor’s individual optimal action.

**IEC-Specific Reward Superiority [26]:** Let be a decision variable in a HI-SCM with measured outcome , and let and be the heterogeneous intents of two distinct IECs in the set of all IECs in the system, . Then

**Proof:**

WLOG, consider the case of a binary intents . Let Then

Letting we can rewrite the above inequality as (corrected mistake from proof in appendix of source). We can write it as such since we are considering the binary case. Thus if we have one case occurs, necessarily the other doesn’t. We can then exhaust the cases:

- .
- .
- .

Thus, in each case we have that the HI-specific rewards are greater than or equal to the homogeneous-intent-specific rewards.

With this important result in place, we can develop new criteria for decision making in a CDM with heterogeneous intents.

**HI Regret Decision Criteria (HI-RDC) [26]:** In a CDM scenario modelled by a HI-SCM with treatment , outcome , actor intended treatments , and a set of actor IECs , the optimal treatment maximises the IEC-specific treatment outcome. Formally:
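In code, the criterion amounts to an argmax over treatments within each combination of IEC intents. The following is a minimal sketch under stated assumptions: the table `RECOVERY`, its numbers and the function `hi_rdc` are hypothetical stand-ins for IEC-specific treatment outcomes that would in practice have to be estimated from data.

```python
# Hypothetical IEC-specific recovery rates (illustrative numbers only):
# RECOVERY[(intent of IEC 1, intent of IEC 2)][treatment] = P(recovery)
RECOVERY = {
    (0, 0): [0.70, 0.60],
    (0, 1): [0.55, 0.80],
    (1, 0): [0.85, 0.50],
    (1, 1): [0.60, 0.75],
}


def hi_rdc(iec_intents):
    """Return the treatment with the best outcome for this intent combination."""
    rates = RECOVERY[tuple(iec_intents)]
    return max(range(len(rates)), key=rates.__getitem__)


# The two actor classes disagree: class 1 intends treatment 0, class 2 intends 1.
choice = hi_rdc((0, 1))
```

Note that the recommended treatment depends on the joint pattern of intents, not on any single actor's preference.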

The authors point out that HI-RDC requires knowledge of IECs, which are not always obvious. This motivates the need for empirical means of clustering the actors into equivalence classes. Since these intents are observational (they are indicated by what naturally occurs), sampling and grouping by the following criteria suffices.

**Empirical IEC Clustering Criteria [26]:** Let , be two agents modelled by a HI-SCM, and let their associated intents be , for some decision. We cluster agents into the same IEC, , whenever their intended actions over the same units correlate. In other words, if their intent-specific treatment outcomes will agree. Correlation indicated by , as is common in statistics literature. Formally:

The authors argue that this condition can be too strict in practice and should be softened to allow for some actor-specific noise. Applying HI-RDC and empirical IEC clustering directly to the online recommendation system presented in [21] is possible but not necessarily always practical or recommended because (1) the ethics of exploring different treatment options is not always clear and (2) if UCs are present in the system and the treatment has passed experimental testing, this implies that the UCs have gone unnoticed and we – the data analysts – wouldn’t necessarily know to look for them there. This motivates the need for an extension to RCTs to involve heterogeneous intents.
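The softened clustering idea can be sketched as grouping actors whose intended treatments agree on most of the same units. This is an illustrative stand-in, not the procedure of [26]: the agreement measure, the greedy grouping, the `threshold` parameter and the physician data are all assumptions.

```python
def agreement(a, b):
    """Fraction of units on which two actors intend the same treatment."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_iecs(intents, threshold=0.8):
    """Greedily group actors whose intents agree with a cluster's first member."""
    clusters = []  # each cluster is a list of actor names
    for name, vec in intents.items():
        for cluster in clusters:
            if agreement(vec, intents[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no close match: start a new class
    return clusters


# Hypothetical intended treatments of three physicians over the same ten units.
intents = {
    "physician_1": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "physician_2": [0, 1, 1, 0, 1, 0, 0, 1, 1, 1],  # differs on one unit
    "physician_3": [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
}
clusters = cluster_iecs(intents)
```

Setting `threshold=1.0` would recover the strict version of the criterion, in which a single disagreement separates two actors.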

**HI Randomised Control Trial (HI-RCT) [26]:** Let $X$ be the treatment of the RCT in which all participants of the trial are randomly assigned to some experimental condition. In other words, they have been intervened on via $do(X = x)$ with some measured outcome $Y$. Let $\Phi$ be the set of all IECs for agents in the HI-SCM for which the RCT is meant to apply. A Heterogeneous Intent RCT (HI-RCT) is an RCT wherein treatments are randomly assigned to each participant but, in addition, the intended treatments of the sampled agents are collected for each participant.

This figure shows the HI-RCT procedure. The traditional RCT procedure is found by following the orange path. HI-RCT adds an additional layer of actor-intent collection over and above the traditional RCT procedure. Figure extracted from [26].

This procedure yields actor IECs as well as experimental, observational, counterfactual and HI-specific treatment effects. Finally, we can implement a criterion which reveals confounding beyond what a simple mix of observational and interventional data can expose.

**HI-RDC Confounding Criteria [26]:** Consider a CDM scenario modelled by a HI-SCM with treatment , outcome , actor intended treatments , and a set of actor IECs . Whenever there exists some in , then there exists some unobserved such that .

In addition to the above theory, [26] provides a procedure for an agent to attempt to repair harmful influences of unobserved confounders in an online procedure. Once again we come across the Multi-Armed Bandit with Unobserved Confounders (MABUC) in the attempt to maximise the reward (recovery in the physician example) – or minimise the regret – of the decision making process. One problem that arises in the online case is that the IECs the agent learns are not necessarily exhaustive. Let us consider whether we can find a mapping from the online IEC set to the offline IEC set. If we can, the HI-RDC reveals the optimal treatment. If not, HI-RCT data can be used to accelerate learning by gathering a *calibration unit set* – a small sample questionnaire in which actors are asked to provide intended treatments. In the case where HI-RCT IECs correspond to the sampled offline IECs, a mapping can be made. This acts as a sort of ‘bootstrap’ to the HI-RCT procedure by serving as a procedure to collect initial conditions for the learning process.

**Actor Calibration-Set Heuristic [26]:** A collection of some calibration units from an offline HI-RCT dataset can be used to learn the IECs of agents in an online domain. Three heuristic scores can be used to guide the selection procedure:

- **Consistency:** how consistent agents in the same IEC are with their intended treatment.
- **Diversity:** how often a configuration of IEC intents has been chosen, favouring a diverse set of IEC intent combinations. This acts by favouring ‘exploration’ in terms of the IEC space.
- **Optimism:** a bias towards choosing units in which the randomly assigned treatment was optimal and succeeded or suboptimal and failed.

The calibration set is then selected using these scores.

The authors experimentally show that agents maximising HI-specific rewards, clustering by IEC, and using calibration sets with and without the Actor Calibration-Set Heuristic each outperform the previous version. In other words, each step improves upon the regret the agent experiences. When compared against an *oracle* that treats UCs as observed (which is unrealisable in practice), the agent performs well.

[34] asks some interesting questions and presents intriguing results. They tackle the problem of an agent and humans having different sensory capabilities, and thereby “disagreeing” on their observations. The authors find that leaving human intuition out of the loop – even when the agent’s sensory abilities are superior – results in worse performance. The theory presented in this section is sufficient to understand the presentation of the results in [34], and the reader is encouraged to engage with the source.

Now that we have considered counterfactual decision making within a causal system, we should consider how we can transfer and generalise causal results across domains. This is what the next section tackles.

### References

- [Header Image](https://www.businesssystemsuk.co.uk/i/uploads/gallery/what%20if%20scenarios%20wfm.jpg)
- [11] Elias Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345–7352, 2016.
- [12] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018.
- [21] Elias Bareinboim, Andrew Forney, and J. Pearl. Bandits with unobserved confounders: A causal approach. In NIPS, 2015.
- [25] P. Auer, Nicolò Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2004.
- [26] Andrew Forney and Elias Bareinboim. Counterfactual randomization: Rescuing experimental studies from obscured confounding. In AAAI, 2019.
- [27] W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
- [28] J. Zhang and Elias Bareinboim. Markov decision processes with unobserved confounders: A causal approach. 2016.
- [29] I. Szita and Csaba Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML, 2010.
- [30] Andrew Forney, J. Pearl, and Elias Bareinboim. Counterfactual data-fusion for online reinforcement learners. In ICML, 2017.
- [31] J. Heckman. Randomization and social policy evaluation. 1991.
- [32] R.A. Fisher. The Design of Experiments. Oliver and Boyd, 1935.
- [33] W. W. Stead, J. M. Starmer, and M. McClellan. Beyond expert based practice. Evidence-Based Medicine and the Changing Nature of Healthcare: 2007 IOM Annual Meeting Summary, 2008.
- [34] J. Zhang and Elias Bareinboim. Can humans be out of the loop? 2020.

**Peter Z** (August 7, 2022 at 11:04 am): Hi, for CRL, I want to know that is “state” in MDP a Confounder? Because that state influence action and reward, which a common cause for action and reward.

**St John Grimbly** (August 11, 2022 at 10:22 am): Hi Peter. Your intuition is certainly correct. In an MDP, the environment transitions from one state to another based on the action selected. Like you say, the structure of the MDP technically means that “state” is a confounder if we think of our input as action and output as reward. We can technically call any common cause structure like this a confounder. See https://en.wikipedia.org/wiki/Confounding#/media/File:Simple_Confounding_Case.svg

The problem is we usually use this language when a variable that is not of interest is influencing your input and output. In the case of an MDP, the state and action are of central interest. i.e. Your input is both state and action, output being reward. In this sense, it is a bit meaningless to talk about state confounding an action – this is the behaviour one is looking for. If it wasn’t a “confounder” (e.g. If we removed it from the decision making) then an agent would just choose its “best” action regardless of the state it finds itself in.

With that said, the MDP was not formulated in the same way that causal models are set up. If you want to consider a formulation of the MDP in terms of causal models (confounding, counterfactuals), take a look at these papers (non-exhaustive):

https://causalai.net/mdp-causal.pdf

https://arxiv.org/pdf/1811.06272.pdf

Let me know if you need any clarification.