Reinforcement Learning in Knowledge Graph Reasoning
List of sections

Markov decision process

Policy evaluation

(Deep) Reinforcement Learning on KGR

Overview of Reinforcement Learning Solutions
1. Markov Decision Process (MDP)
[A helpful introduction at here]
A markov decision process is a tuple \(\langle S, A, T, R, \gamma\rangle\), Where

\(S\) is a finite set of states.

\(A\) is a finite set of actions.

\(T\) is a state transition probability. \(T\left(s,a,s^{\prime}\right)=\mathbb{P}\left[S_{t+1}=s^{\prime} \mid S_{t}=s, A_{t}=a\right]\), this specifies the probability of ending up in state \(s^{\prime}\) if taken action \(a\) in state \(s\).

\(R\) is a reward function. \(R\left(s,a,s^{\prime}\right)=\left[R_{t+1} \mid S_{t}=s, A_{t}=a, S_{t+1}=s^{\prime}\right]\)

\(\gamma\) is a discount factor. \(\gamma \in[0,1]\)
The reason why there is a \(T\) is that when you take an action, it's unsure which state you will get next. This is scenario setting.
2. Policy Evaluation
Now given a MDP, we want to carry out a sequence of actions, to make final reward maximun. Such sequential actions is called policy.
Definition of Policy: A policy \(\pi\) is a mapping from each state \(s \in S\) to an action \(a \in A\).
If there is a specific policy \(\pi\), we want to evaluate if this policy is good, this is called policy evaluation. This can be calculated by Bellman Equation.
Statevalue function
The statevalue function \(v_{\pi(s)}\) of an MDP is the expected return starting from state \(s\), and then following policy \(\pi\).
\(v_{\pi}(s) =\sum_{a \in A} \pi(a \mid s) \sum_{s^{\prime} \in S} T(s, a, s^{\prime}) \left(R(s, a, s^{\prime})+ \gamma v_{\pi}\left(s^{\prime}\right)\right)\)
Actionvalue function
The actionvalue function \(q_{\pi(s,a)}\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\).
\(q_{\pi(s,a)} = \sum_{s^{\prime} \in S} T(s, a, s^{\prime}) \left(R(s, a, s^{\prime})+ \gamma v_{\pi}\left(s^{\prime}\right)\right)\)
Obviously, \(v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s) q_{\pi(s,a)}\)
Optimal value function
In all of the above equations we are using a given policy to follow, which may not be the optimal actions to take. The key goal in reinforcement learning is to find the optimal policy which will maximise our return.
The optimal statevalue function \(v_{*}(s)\) is the maximum value function over all policies. It tells us the maximum possible reward you can extract from the system.
\(v_{*}(s)=\max _{\pi} v_{\pi}(s)\)
The optimal actionvalue function \(q_{*}(s,a)\) is the maximum actionvalue function over all policies. It tells us what is the maximum possible reward you can extract from the system starting at state \(s\) and taking action \(a\).
\(q_{*}(s, a)=\max _{\pi} q_{\pi}(s, a)\)
If you know \(q_{∗}\) then you know the right action to take and behave optimally in the MDP and therefore solving the MDP.
\(\pi_{*}(a \mid s) = 1 if a =\operatorname{argmax}_{a \in A} q_{*}(s, a)\), otherwise, \(\pi_{*}(a \mid s) = 0\)
3. Reinforcement Learning on KGR
Problem Definition
Given a tuple with a missing tail entity (s, r, ?), find the correct tail entity and giving the reasoning path at the same time.
General Procedure

define MDP: \(S\), \(A\), \(T\), \(R\). Usually the transition is deterministic.

define policy network: \(P\)

define training function
Paper 1: DeepPath (EMNLP 2017)
DeepPath divides training triples accroding to their relation types. Under a relation \(r\), for a given entity pair \(\left(\mathrm{e}_{1}, \mathrm{e}_{2}\right)\),

\(S\): \(\mathbf{s}_{t}=\left(\mathbf{e}_{t}, \mathbf{e}_{t a r g e t}\mathbf{e}_{t}\right)\)

\(A\): Action space is defined as all the relations in the KG, not only outgoing relations related to \(\mathbf{e}_{t}\).

\(T\): Deterministic.

\(R\): Consists of 3 different rewards: global accuracy, path efficiency and path diversity. Global accuracy refers to if the path inference is correct, if it is, then \(r_{\mathrm{GLOBAL}}\) is 1, othwise is 1. Path efficiency measures the path length. Shorter a path length is, better the path is. \(r_{\text {EFFICIENCY }}=\frac{1}{\operatorname{length}(p)}\). Path diversity encourages agent to find diverse paths.

\(P\): A fullyconnected neural network to parameterize the policy function \(\pi(s ; \theta)\) that maps the state vector s to a proba bility distribution over all possible actions.

\(Tranining\): First using supervised policy learning strategy to train the policy network. Detailedly, they use biBFS to dig correct paths between given entity pair, and plugging in these paths to train. Then the reward is used for retrain.
Paper 2: MINERVA (ICLR 2018)
For a query \(\left(\mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)\),

\(S\): \(S=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)\)

\(A\): \(\mathcal{A}_{S}=\left\{\left(\mathrm{e}_{\mathrm{t}}, r, v\right) \in E: S=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right), r \in \mathcal{R}, v \in V\right\} \cup\{(s, \varnothing, s)\}\)

\(T\): \(\delta: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}\)

\(R\): if \(S_{T}=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)\) is the final state, then receive a reward of +1, otherwise +0.

\(O\): Observation. \(\mathcal{O}: \mathcal{S} \rightarrow \dot{\mathcal{E}} \times \mathcal{E} \times \mathcal{R}\) is defined as \(\mathcal{O}\left(s=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 \mathrm{q}}, \mathrm{r}_{\mathrm{q}}, \mathrm{e}_{2 \mathrm{q}}\right)\right)=\left(\mathrm{e}_{\mathrm{t}}, \mathrm{e}_{1 q}, \mathrm{r}_{\mathrm{q}}\right)\).

\(P\): a randomized nonstationary historydependent policy \(\pi=\left(\mathbf{d}_{\mathbf{1}}, \mathbf{d}_{\mathbf{2}}, \ldots, \mathbf{d}_{\mathbf{T}\mathbf{1}}\right)\), where \(\mathbf{d}_{\mathbf{t}}: H_{t} \rightarrow \mathcal{P}\left(\mathcal{A}_{S_{t}}\right)\) and \(H_{t}=\left(H_{t1}, A_{t1}, O_{t}\right)\). \(\mathbf{T}\) is a fixed step num. Based on the history embedding \(\mathbf{h}_{\mathbf{t}}\), the policy network makes the decision to choose an action from all available actions.
\(\mathbf{h}_{\mathbf{t}}=\operatorname{LSTM}\left(\mathbf{h}_{\mathbf{t}\mathbf{1}},\left[\mathbf{a}_{\mathbf{t}\mathbf{1}} ; \mathbf{o}_{\mathbf{t}}\right]\right)\)
\(\mathbf{d}_{\mathbf{t}}=\operatorname{softmax}\left(\mathbf{A}_{\mathbf{t}}\left(\mathbf{W}_{\mathbf{2}} \operatorname{ReLU}\left(\mathbf{W}_{\mathbf{1}}\left[\mathbf{h}_{\mathbf{t}} ; \mathbf{o}_{\mathbf{t}} ; \mathbf{r}_{\mathbf{q}}\right]\right)\right)\right)\)
\(A_{t} \sim\) Categorical \(\left(\mathbf{d}_{\mathbf{t}}\right)\)

\(Tranining\): \(J(\theta)=\mathbb{E}_{\left(e_{1}, r, e_{2}\right) \sim D} \mathbb{E}_{A_{1}, \ldots, A_{T1} \sim \pi_{\theta}}\left[R\left(S_{T}\right) \mid S_{1}=\left(e_{1}, e_{1}, r, e_{2}\right)\right]\) maximize the expected rewards.
Paper 3: MetaKGR (EMNLP 2019)
This paper combines reinforcement learning and metalearning. Here only presenting reinforcement learning parts. For a given query \(\left(e_{s}, r_{q}, e_{o}\right)\),

\(S\): \(s_{t}=\left(r_{q}, e_{s}, \hat{e}_{t}\right)\)

\(A\): \(\mathcal{A}_{t}=\left\{\left(r_{t+1}, \hat{e}_{t+1}\right) \mid \left(\hat{e}_{t}, r_{t+1}, \hat{e}_{t+1}\right) \in \mathcal{G}\right\}\). All outgoing edges and next entities.

\(T\): \(\delta\left(s_{t}, \mathcal{A}_{t}\right)=\left(r_{q}, e_{s}, \hat{e}_{t+1}\right)\). Same as MINERVA.

\(R\): \(R\left(s_{T} \mid r_{q}, e_{s}\right)\) will be 1 if the agent finally stops at the right entity, otherwise it will get a embeddingbased reward \(f\left(e_{s}, r_{q}, \hat{e}_{T}\right)\).

\(P\): Also considering history. Each action is represented as \(\mathbf{a}_{t}=\left[\mathbf{r}_{t+1} ; \hat{\mathbf{e}}_{t+1}\right]\). Using an LSTM to encode the search path.
\(\mathbf{h}_{t}=\operatorname{LSTM}\left(\mathbf{h}_{t1}, \mathbf{a}_{t1}\right)\)
\(\pi_{\theta}\left(a_{t} \mid s_{t}\right)=\operatorname{softmax}\left(\mathbf{A}_{t}\left(\mathbf{W}_{2} \operatorname{ReLU}\left(\mathbf{W}_{1}\left[\hat{\mathbf{e}}_{t} ; \mathbf{h}_{t} ; \mathbf{r}_{q}\right]\right)\right)\right)\)

\(Training\): \(\mathcal{L}_{r}^{\mathcal{D}}(\theta)=\mathbb{E}_{\left(e_{s}, r, e_{o}\right) \in \mathcal{D}} \mathbb{E}_{a_{1}, \ldots, a_{T1} \in \pi_{\theta}}\left[R\left(s_{T} \mid e_{s}, r\right)\right]\).
Paper 4: RuleGuider (EMNLP 2020)
This paper leverages rules in the reasoning process. And split agent into two subagents: relation agent and entity agent.
Fistly, it produces rules as format like \(r(X, Y) \leftarrow b_{1}\left(X, A_{2}\right) \wedge \ldots \wedge b_{n}\left(A_{n}, Y\right)\), and each rule has its confidence score which is used for pretrain of RL. Then,

Relation Agent: This process can be formulated as \(r_{t}=P^{R}\left(r_{q}, e_{t1}, \mathbf{R}, \boldsymbol{h}_{t}^{R}\right)\). The policy network is parameterized by embeddings of \(\boldsymbol{r}_{q}\) and relation history \(\boldsymbol{h}_{t}^{R}\), where \(\boldsymbol{h}_{t}^{R}=\operatorname{LSTM}\left(\boldsymbol{h}_{t1}^{R}, \boldsymbol{r}_{t1}\right)\). \(\boldsymbol{R}_{t} \in \mathbb{R}^{\leftR_{t}\right \times d}\) consist embeddings of all the relations in relation space \(R_{t}\) at step \(t\). Finally, relation agent outputs a probability distribution \(\boldsymbol{d}_{t}^{R}=\sigma\left(\boldsymbol{R}_{t} \times \boldsymbol{W}_{1} \operatorname{ReLU}\left(\boldsymbol{W}_{2}\left[\boldsymbol{h}_{t}^{R} ; \boldsymbol{r}_{q}\right]\right)\right)\). Relation agent’s historydependent policy is \(\boldsymbol{\pi}^{R}=\left(\boldsymbol{d}_{1}^{R}, \boldsymbol{d}_{2}^{R}, \ldots, \boldsymbol{d}_{T}^{R}\right)\).

Entity Agent: Similarily, this process can be formulated as \(e_{t}=P^{E}\left(e_{s}, r_{q}, r_{t}, \boldsymbol{h}_{t}^{E}\right)\). \(\boldsymbol{h}_{t}^{E}=\operatorname{LSTM}\left(\boldsymbol{h}_{t1}^{E}, \boldsymbol{e}_{t1}\right)\), and the probability distribution of entities \(\boldsymbol{d}_{t}^{E}=\sigma\left(\boldsymbol{E}_{t} \times \boldsymbol{W}_{3} \operatorname{ReLU}\left(\boldsymbol{W}_{4}\left[\boldsymbol{h}_{t}^{E} ; \boldsymbol{r}_{q}; \boldsymbol{e}_{s}; \boldsymbol{e}_{t}\right]\right)\right)\).

\(R\): Reward is 1 if the predicted triple is correct, otherwise \(R_h = \mathbb{I}(\epsilon \in \mathcal{G})+(1\mathbb{I}(\epsilon \in \mathcal{G}) f(\epsilon)\)

Traning: Using REINFORCE (Williams, 1992) algorithm to maximize rewards.
4. Overview of Reinforcement Learning Solutions (Policy)
There are two main approaches to solve reinforcement learning problems (learn the best policy):

value functionbased methods

policy searchbased methods
Value functionbased methods
 Dynamic Programming
This is modelbased (finite states, known rewards etc.). It uses iteratively calculates value funtion value and valueaction function value for every state, and greedily chooses the optimal action for every step.
Temporal difference (TD) is a significant algorithm under dynamic programming method. Two variants of TD are SARSA (onpolicy) and Qlearning (offpolicy).
 Monte Carlo
Not been iteratively carried out. Instead it generates one eposide each time by random walk.
Monte Carlo can be combined with TD.
Policy searchbased methods
 REINFORCE
Policy is determined by a neural network with parameters. After agent finishes one eposide, it will receive reward and doing backpropogation to update policy parameters.
 REINFORCE with baseline
The enhanced REINFORCE algorithm aimed to reduce the noisy during training.
 Actorcritic
A combination of value function methods and policy searchbased methods.