## Reinforcement Learning in Knowledge Graph Reasoning

### List of sections

• Markov decision process

• Policy evaluation

• (Deep) Reinforcement Learning on KGR

• Overview of Reinforcement Learning Solutions

### 1. Markov Decision Process (MDP)

A markov decision process is a tuple $$\langle S, A, T, R, \gamma\rangle$$, Where

• $$S$$ is a finite set of states.

• $$A$$ is a finite set of actions.

• $$T$$ is a state transition probability. $$T\left(s,a,s^{\prime}\right)=\mathbb{P}\left[S_{t+1}=s^{\prime} \mid S_{t}=s, A_{t}=a\right]$$, this specifies the probability of ending up in state $$s^{\prime}$$ if taken action $$a$$ in state $$s$$.

• $$R$$ is a reward function. $$R\left(s,a,s^{\prime}\right)=\left[R_{t+1} \mid S_{t}=s, A_{t}=a, S_{t+1}=s^{\prime}\right]$$

• $$\gamma$$ is a discount factor. $$\gamma \in[0,1]$$

The reason why there is a $$T$$ is that when you take an action, it's unsure which state you will get next. This is scenario setting.

### 2. Policy Evaluation

Now given a MDP, we want to carry out a sequence of actions, to make final reward maximun. Such sequential actions is called policy.

Definition of Policy: A policy $$\pi$$ is a mapping from each state $$s \in S$$ to an action $$a \in A$$.

If there is a specific policy $$\pi$$, we want to evaluate if this policy is good, this is called policy evaluation. This can be calculated by Bellman Equation.

#### State-value function

The state-value function $$v_{\pi(s)}$$ of an MDP is the expected return starting from state $$s$$, and then following policy $$\pi$$.

$$v_{\pi}(s) =\sum_{a \in A} \pi(a \mid s) \sum_{s^{\prime} \in S} T(s, a, s^{\prime}) \left(R(s, a, s^{\prime})+ \gamma v_{\pi}\left(s^{\prime}\right)\right)$$

#### Action-value function

The action-value function $$q_{\pi(s,a)}$$ is the expected return starting from state $$s$$, taking action $$a$$, and then following policy $$\pi$$.

$$q_{\pi(s,a)} = \sum_{s^{\prime} \in S} T(s, a, s^{\prime}) \left(R(s, a, s^{\prime})+ \gamma v_{\pi}\left(s^{\prime}\right)\right)$$

Obviously, $$v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s) q_{\pi(s,a)}$$

#### Optimal value function

In all of the above equations we are using a given policy to follow, which may not be the optimal actions to take. The key goal in reinforcement learning is to find the optimal policy which will maximise our return.

The optimal state-value function $$v_{*}(s)$$ is the maximum value function over all policies. It tells us the maximum possible reward you can extract from the system.

$$v_{*}(s)=\max _{\pi} v_{\pi}(s)$$

The optimal action-value function $$q_{*}(s,a)$$ is the maximum action-value function over all policies. It tells us what is the maximum possible reward you can extract from the system starting at state $$s$$ and taking action $$a$$.

$$q_{*}(s, a)=\max _{\pi} q_{\pi}(s, a)$$

If you know $$q_{∗}$$ then you know the right action to take and behave optimally in the MDP and therefore solving the MDP.

$$\pi_{*}(a \mid s) = 1 if a =\operatorname{argmax}_{a \in A} q_{*}(s, a)$$, otherwise, $$\pi_{*}(a \mid s) = 0$$

### 3. Reinforcement Learning on KGR

#### Problem Definition

Given a tuple with a missing tail entity (s, r, ?), find the correct tail entity and giving the reasoning path at the same time.


#### General Procedure

• define MDP: $$S$$, $$A$$, $$T$$, $$R$$. Usually the transition is deterministic.

• define policy network: $$P$$

• define training function

#### Paper 1: DeepPath (EMNLP 2017)

DeepPath divides training triples accroding to their relation types. Under a relation $$r$$, for a given entity pair $$\left(\mathrm{e}_{1}, \mathrm{e}_{2}\right)$$,

• $$S$$: $$\mathbf{s}_{t}=\left(\mathbf{e}_{t}, \mathbf{e}_{t a r g e t}-\mathbf{e}_{t}\right)$$

• $$A$$: Action space is defined as all the relations in the KG, not only outgoing relations related to $$\mathbf{e}_{t}$$.

• $$T$$: Deterministic.

• $$R$$: Consists of 3 different rewards: global accuracy, path efficiency and path diversity. Global accuracy refers to if the path inference is correct, if it is, then $$r_{\mathrm{GLOBAL}}$$ is 1, othwise is -1. Path efficiency measures the path length. Shorter a path length is, better the path is. $$r_{\text {EFFICIENCY }}=\frac{1}{\operatorname{length}(p)}$$. Path diversity encourages agent to find diverse paths.

• $$P$$: A fully-connected neural network to parameterize the policy function $$\pi(s ; \theta)$$ that maps the state vector s to a proba- bility distribution over all possible actions.

• $$Tranining$$: First using supervised policy learning strategy to train the policy network. Detailedly, they use bi-BFS to dig correct paths between given entity pair, and plugging in these paths to train. Then the reward is used for re-train.

#### Paper 2: MINERVA (ICLR 2018)

For a query $$\left(\mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)$$,

• $$S$$: $$S=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)$$

• $$A$$: $$\mathcal{A}_{S}=\left\{\left(\mathrm{e}_{\mathrm{t}}, r, v\right) \in E: S=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right), r \in \mathcal{R}, v \in V\right\} \cup\{(s, \varnothing, s)\}$$

• $$T$$: $$\delta: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$$

• $$R$$: if $$S_{T}=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)$$ is the final state, then receive a reward of +1, otherwise +0.

• $$O$$: Observation. $$\mathcal{O}: \mathcal{S} \rightarrow \dot{\mathcal{E}} \times \mathcal{E} \times \mathcal{R}$$ is defined as $$\mathcal{O}\left(s=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)\right)=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 q}, \mathrm{r}_{\mathrm{q}}\right)$$.

• $$P$$: a randomized non-stationary history-dependent policy $$\pi=\left(\mathbf{d}_{\mathbf{1}}, \mathbf{d}_{\mathbf{2}}, \ldots, \mathbf{d}_{\mathbf{T}-\mathbf{1}}\right)$$, where $$\mathbf{d}_{\mathbf{t}}: H_{t} \rightarrow \mathcal{P}\left(\mathcal{A}_{S_{t}}\right)$$ and $$H_{t}=\left(H_{t-1}, A_{t-1}, O_{t}\right)$$. $$\mathbf{T}$$ is a fixed step num. Based on the history embedding $$\mathbf{h}_{\mathbf{t}}$$, the policy network makes the decision to choose an action from all available actions.

$$\mathbf{h}_{\mathbf{t}}=\operatorname{LSTM}\left(\mathbf{h}_{\mathbf{t}-\mathbf{1}},\left[\mathbf{a}_{\mathbf{t}-\mathbf{1}} ; \mathbf{o}_{\mathbf{t}}\right]\right)$$

$$\mathbf{d}_{\mathbf{t}}=\operatorname{softmax}\left(\mathbf{A}_{\mathbf{t}}\left(\mathbf{W}_{\mathbf{2}} \operatorname{ReLU}\left(\mathbf{W}_{\mathbf{1}}\left[\mathbf{h}_{\mathbf{t}} ; \mathbf{o}_{\mathbf{t}} ; \mathbf{r}_{\mathbf{q}}\right]\right)\right)\right)$$

$$A_{t} \sim$$ Categorical $$\left(\mathbf{d}_{\mathbf{t}}\right)$$

• $$Tranining$$: $$J(\theta)=\mathbb{E}_{\left(e_{1}, r, e_{2}\right) \sim D} \mathbb{E}_{A_{1}, \ldots, A_{T-1} \sim \pi_{\theta}}\left[R\left(S_{T}\right) \mid S_{1}=\left(e_{1}, e_{1}, r, e_{2}\right)\right]$$ maximize the expected rewards.

#### Paper 3: Meta-KGR (EMNLP 2019)

This paper combines reinforcement learning and meta-learning. Here only presenting reinforcement learning parts. For a given query $$\left(e_{s}, r_{q}, e_{o}\right)$$,

• $$S$$: $$s_{t}=\left(r_{q}, e_{s}, \hat{e}_{t}\right)$$

• $$A$$: $$\mathcal{A}_{t}=\left\{\left(r_{t+1}, \hat{e}_{t+1}\right) \mid \left(\hat{e}_{t}, r_{t+1}, \hat{e}_{t+1}\right) \in \mathcal{G}\right\}$$. All outgoing edges and next entities.

• $$T$$: $$\delta\left(s_{t}, \mathcal{A}_{t}\right)=\left(r_{q}, e_{s}, \hat{e}_{t+1}\right)$$. Same as MINERVA.

• $$R$$: $$R\left(s_{T} \mid r_{q}, e_{s}\right)$$ will be 1 if the agent finally stops at the right entity, otherwise it will get a embedding-based reward $$f\left(e_{s}, r_{q}, \hat{e}_{T}\right)$$.

• $$P$$: Also considering history. Each action is represented as $$\mathbf{a}_{t}=\left[\mathbf{r}_{t+1} ; \hat{\mathbf{e}}_{t+1}\right]$$. Using an LSTM to encode the search path.

$$\mathbf{h}_{t}=\operatorname{LSTM}\left(\mathbf{h}_{t-1}, \mathbf{a}_{t-1}\right)$$

$$\pi_{\theta}\left(a_{t} \mid s_{t}\right)=\operatorname{softmax}\left(\mathbf{A}_{t}\left(\mathbf{W}_{2} \operatorname{ReLU}\left(\mathbf{W}_{1}\left[\hat{\mathbf{e}}_{t} ; \mathbf{h}_{t} ; \mathbf{r}_{q}\right]\right)\right)\right)$$

• $$Training$$: $$\mathcal{L}_{r}^{\mathcal{D}}(\theta)=-\mathbb{E}_{\left(e_{s}, r, e_{o}\right) \in \mathcal{D}} \mathbb{E}_{a_{1}, \ldots, a_{T-1} \in \pi_{\theta}}\left[R\left(s_{T} \mid e_{s}, r\right)\right]$$.

#### Paper 4: RuleGuider (EMNLP 2020)

This paper leverages rules in the reasoning process. And split agent into two sub-agents: relation agent and entity agent.

Fistly, it produces rules as format like $$r(X, Y) \leftarrow b_{1}\left(X, A_{2}\right) \wedge \ldots \wedge b_{n}\left(A_{n}, Y\right)$$, and each rule has its confidence score which is used for pre-train of RL. Then,

• Relation Agent: This process can be formulated as $$r_{t}=P^{R}\left(r_{q}, e_{t-1}, \mathbf{R}, \boldsymbol{h}_{t}^{R}\right)$$. The policy network is parameterized by embeddings of $$\boldsymbol{r}_{q}$$ and relation history $$\boldsymbol{h}_{t}^{R}$$, where $$\boldsymbol{h}_{t}^{R}=\operatorname{LSTM}\left(\boldsymbol{h}_{t-1}^{R}, \boldsymbol{r}_{t-1}\right)$$. $$\boldsymbol{R}_{t} \in \mathbb{R}^{\left|R_{t}\right| \times d}$$ consist embeddings of all the relations in relation space $$R_{t}$$ at step $$t$$. Finally, relation agent outputs a probability distribution $$\boldsymbol{d}_{t}^{R}=\sigma\left(\boldsymbol{R}_{t} \times \boldsymbol{W}_{1} \operatorname{ReLU}\left(\boldsymbol{W}_{2}\left[\boldsymbol{h}_{t}^{R} ; \boldsymbol{r}_{q}\right]\right)\right)$$. Relation agent’s history-dependent policy is $$\boldsymbol{\pi}^{R}=\left(\boldsymbol{d}_{1}^{R}, \boldsymbol{d}_{2}^{R}, \ldots, \boldsymbol{d}_{T}^{R}\right)$$.

• Entity Agent: Similarily, this process can be formulated as $$e_{t}=P^{E}\left(e_{s}, r_{q}, r_{t}, \boldsymbol{h}_{t}^{E}\right)$$. $$\boldsymbol{h}_{t}^{E}=\operatorname{LSTM}\left(\boldsymbol{h}_{t-1}^{E}, \boldsymbol{e}_{t-1}\right)$$, and the probability distribution of entities $$\boldsymbol{d}_{t}^{E}=\sigma\left(\boldsymbol{E}_{t} \times \boldsymbol{W}_{3} \operatorname{ReLU}\left(\boldsymbol{W}_{4}\left[\boldsymbol{h}_{t}^{E} ; \boldsymbol{r}_{q}; \boldsymbol{e}_{s}; \boldsymbol{e}_{t}\right]\right)\right)$$.

• $$R$$: Reward is 1 if the predicted triple is correct, otherwise $$R_h = \mathbb{I}(\epsilon \in \mathcal{G})+(1-\mathbb{I}(\epsilon \in \mathcal{G}) f(\epsilon)$$

• Traning: Using REINFORCE (Williams, 1992) algorithm to maximize rewards.

### 4. Overview of Reinforcement Learning Solutions (Policy)

There are two main approaches to solve reinforcement learning problems (learn the best policy):

• value function-based methods

• policy search-based methods

#### Value function-based methods

1. Dynamic Programming

This is model-based (finite states, known rewards etc.). It uses iteratively calculates value funtion value and value-action function value for every state, and greedily chooses the optimal action for every step.

Temporal difference (TD) is a significant algorithm under dynamic programming method. Two variants of TD are SARSA (on-policy) and Q-learning (off-policy).

1. Monte Carlo

Not been iteratively carried out. Instead it generates one eposide each time by random walk.

Monte Carlo can be combined with TD.

#### Policy search-based methods

1. REINFORCE

Policy is determined by a neural network with parameters. After agent finishes one eposide, it will receive reward and doing back-propogation to update policy parameters.

1. REINFORCE with baseline

The enhanced REINFORCE algorithm aimed to reduce the noisy during training.

1. Actor-critic

A combination of value function methods and policy search-based methods.