Chinese NER Based on Bi-LSTM and CRF

2017-11-02

1 Introduction

Named-Entity Recognition (NER) is the task of identifying names of people, organizations, locations, and other entities in text; it is a subtask of information extraction and a building block for question answering and machine translation. There are two main families of methods for NER. Rule-based systems are relatively simple to build, but the rules must be designed by hand, so their robustness and portability are poor. Statistical machine learning methods are now preferred, since they are more robust and flexible than hand-written rules. Representative traditional machine learning algorithms for NER include Hidden Markov Models, Conditional Random Fields, and Maximum Entropy models.

At present, NER for English has reached a high level of accuracy. Chinese is harder: there are no spaces to mark word boundaries, no capitalization cues, and word senses are often ambiguous, so NER for Chinese is more difficult than for English. In this paper, we introduce a hybrid architecture for Chinese NER that combines a recurrent neural network, the Long Short-term Memory (LSTM) network, with Conditional Random Fields (CRFs). The architecture consists of a word-embedding layer, a Bi-LSTM layer and a CRF layer. In the word-embedding layer, we use pre-trained word vectors (word2vec) to generate the inputs of the neural network. We then feed these inputs into a Bi-LSTM network to obtain more abstract contextual representations, which serve as the node features of the CRF model. Finally, we train the network to obtain the Chinese Named-Entity Recognition model.

Experiments show that this architecture can achieve results close to the current state of the art, even with relatively few training samples and coarse-grained word vectors.

2 Bi-LSTM-CRF Model

We provide a brief description of LSTM and CRF, and present a hybrid tagging architecture.

2.1 Long Short-term Memory Network

Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. For many tasks that is a poor assumption: if you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. Long Short-term Memory networks (LSTMs) were designed to combat this issue by incorporating a memory cell, and have been shown to capture long-term dependencies.

LSTMs were designed to combat vanishing gradients through a gating mechanism. To understand what this means, let’s look at how an LSTM calculates a hidden state $s_t$:
$$\begin{aligned} i &=\sigma(x_tU^i + s_{t-1} W^i) \\ f &=\sigma(x_t U^f +s_{t-1} W^f) \\ o &=\sigma(x_t U^o + s_{t-1} W^o) \\ g &=\tanh(x_t U^g + s_{t-1}W^g) \\ c_t &= c_{t-1} \circ f + g \circ i \\ s_t &=\tanh(c_t) \circ o \end{aligned} $$
where $\circ$ denotes the element-wise product and $\sigma$ the sigmoid function. In RNNs, $s_t$ is the hidden state at time step $t$; it is the "memory" of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step: $s_t=f(U x_t+W s_{t-1})$. An LSTM unit does the exact same thing, just in a different way. To summarize:

  • $i$, $f$, $o$ are called the input, forget and output gates, respectively. Note that they have exactly the same form of equation, just with different parameter matrices. They are called gates because the sigmoid function squashes the values of these vectors between $0$ and $1$, and by multiplying them element-wise with another vector you define how much of that other vector you want to “let through”.
  • $g$ is a “candidate” hidden state that is computed based on the current input and the previous hidden state. It is exactly the same equation we had in our vanilla RNN; we have just renamed the parameters $U$ and $W$ to $U^g$ and $W^g$. However, instead of taking $g$ as the new hidden state as we did in the RNN, we use the input gate from above to select some of it.
  • $c_t$ is the internal memory of the unit. It is a combination of the previous memory $c_{t-1}$, multiplied by the forget gate, and the newly computed hidden state $g$, multiplied by the input gate. Intuitively, it controls how the previous memory and the new input are combined.
  • Given the memory $c_t$, we finally compute the output hidden state $s_t$ by multiplying the memory with the output gate.
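
As a concrete illustration, here is a minimal NumPy sketch of a single LSTM step implementing the equations above; bias terms are omitted, as in the equations, and all variable names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, s_prev, c_prev, U, W):
    """One LSTM step following the equations above.

    x_t    : input vector at time t, shape (d,)
    s_prev : previous hidden state s_{t-1}, shape (h,)
    c_prev : previous internal memory c_{t-1}, shape (h,)
    U, W   : dicts of parameter matrices for 'i', 'f', 'o', 'g';
             U[k] has shape (d, h) and W[k] has shape (h, h)
    """
    i = sigmoid(x_t @ U['i'] + s_prev @ W['i'])  # input gate
    f = sigmoid(x_t @ U['f'] + s_prev @ W['f'])  # forget gate
    o = sigmoid(x_t @ U['o'] + s_prev @ W['o'])  # output gate
    g = np.tanh(x_t @ U['g'] + s_prev @ W['g'])  # candidate state
    c_t = c_prev * f + g * i                     # new internal memory
    s_t = np.tanh(c_t) * o                       # new hidden state
    return s_t, c_t
```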

For a given sentence $(x_1, x_2, . . . , x_n)$ containing $n$ words, each represented as a $d$-dimensional vector, an LSTM computes a representation $\overrightarrow{h_t}$ of the left context of the sentence at every word $t$. Naturally, generating a representation of the right context $\overleftarrow{h_t}$ as well should add useful information. This can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the forward LSTM and the latter as the backward LSTM. These are two distinct networks with different parameters. This forward and backward LSTM pair is referred to as a bidirectional LSTM.

The representation of a word using this model is obtained by concatenating its left and right context representations, $h_t = [\overrightarrow{h_t} ; \overleftarrow{h_t}]$. These representations effectively include a representation of a word in context, which is useful for numerous tagging applications.

Figure 1 shows the Bi-LSTM layer. Words, represented by their word embeddings, are fed into the Bi-LSTM network, and the outputs of the forward and backward LSTMs are combined to represent each word in its context.
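
In practice this layer is usually provided by a deep learning framework. A minimal sketch using PyTorch (the dimensions are illustrative, not the ones used in our experiments):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 100, 128   # illustrative sizes
bilstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

# x: a batch of sentences already mapped to word embeddings, shape (batch, n, emb_dim)
x = torch.randn(2, 10, emb_dim)
h, _ = bilstm(x)   # h: (batch, n, 2 * hidden_dim)
# h[:, t, :] is the concatenation of the forward and backward states for word t
```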

2.2 Conditional Random Fields

In the next step, we use the $h_t$ obtained from the Bi-LSTM as the input of the CRF. We can regard the representations $(h_1,h_2,...,h_n)$ as a new "sentence" and then follow the traditional approach of using a CRF for sequence tagging, as in part-of-speech tagging.

First, we have an input sentence
$$X=(x_1,x_2,...x_n)$$
we consider $P$ to be the matrix of scores output by the Bi-LSTM network. $P$ is of size $n \times k$, where $n$ is the number of words in the sentence, $k$ is the number of distinct tags, and $P_{i,j}$ is the score of the $j^{th}$ tag for the $i^{th}$ word.

For the tag set
$$T=(t_1,t_2,...t_k)$$
we consider $A$ to be the matrix of transition scores, where $A_{i,j}$ is the score of a transition from tag $i$ to tag $j$. Because we add two extra tags to represent the start and the end of a sentence, $A$ is a square matrix of size $(k+2) \times (k+2)$, where $k$ is the number of distinct tags.

For a sequence of predictions:
$$y=(y_1,y_2,...y_n)$$
we define its score to be
$$s(X,y)=\sum_{i=0}^{n} A_{y_i,y_{i+1}}+\sum_{i=1}^n P_{i,y_i}$$
$y_0$ and $y_{n+1}$ are the start and end tags of the sentence, which we add to the set of possible tags.
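
A minimal NumPy sketch of this score, assuming the start and end tags occupy two extra indices of the $(k+2) \times (k+2)$ matrix $A$ (all names are illustrative):

```python
import numpy as np

def sequence_score(P, A, y, start, end):
    """s(X, y): sum of transition scores plus sum of emission scores.

    P     : Bi-LSTM output scores, shape (n, k)
    A     : transition scores, shape (k + 2, k + 2), including start/end tags
    y     : list of n tag indices, each in 0..k-1
    start : index of the start tag in A
    end   : index of the end tag in A
    """
    n = len(y)
    score = A[start, y[0]] + A[y[-1], end]                 # transitions from start and to end
    score += sum(A[y[i], y[i + 1]] for i in range(n - 1))  # tag-to-tag transitions
    score += sum(P[i, y[i]] for i in range(n))             # emission scores
    return score
```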

A softmax over all possible tag sequences then yields the probability of the sequence $y$:
$$p(y|X)=\frac{e^{s(X,y)}}{\sum_{\tilde{y}\in{Y_X}} e^{s(X,\tilde{y})}}$$
where $Y_X$ represents all possible tag sequences for a sentence $X$. During training, we maximize the log-probability of the correct tag sequence:
$$\log(p(y|X))=s(X,y)-\log(\sum_{\tilde{y}\in{Y_X}} e^{s(X,\tilde{y})})$$
From the formulation above, it is evident that we encourage our network to produce a valid sequence of output labels. While decoding, we predict the output sequence that obtains the maximum score given by:
$$y^*=argmax _{\tilde{y}\in{Y_X}} s(X,\tilde{y})$$
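
The normalizer $\log\sum_{\tilde{y}\in Y_X} e^{s(X,\tilde{y})}$ can be computed in polynomial time with the forward algorithm rather than by enumerating all sequences. A minimal NumPy sketch, under the same assumptions as the scoring sketch above:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(P, A, start, end):
    """log of the sum over all tag sequences of exp(s(X, y)), by dynamic programming.

    P : emission scores, shape (n, k);  A : transitions, shape (k + 2, k + 2).
    """
    n, k = P.shape
    alpha = A[start, :k] + P[0]   # log-scores of all length-1 prefixes
    for i in range(1, n):
        # alpha[q] = logsumexp over previous tag p of (alpha[p] + A[p, q]), plus P[i, q]
        alpha = logsumexp(alpha[:, None] + A[:k, :k], axis=0) + P[i]
    return logsumexp(alpha + A[:k, end])
```

The per-sentence training loss is then simply this log-partition value minus the score $s(X,y)$ of the gold tag sequence.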

2.3 Hybrid Model

From the introduction above, we now know both the bidirectional Long Short-term Memory network and the Conditional Random Field model. Combining the two gives us a hybrid model for the NER task.


Figure 2 describes the whole procedure for Chinese NER by using our hybrid architecture model.

3 Model Details

3.1 Tagging Scheme

The task of named entity recognition is to assign a named entity label to every word in a sentence. A single named entity may span several tokens within a sentence. Sentences are usually represented in the IOB format (Inside, Outside, Beginning), where every token is labeled B-label if it is the beginning of a named entity, I-label if it is inside a named entity but not its first token, and O otherwise. However, we decided to use the IOBES tagging scheme, a variant of IOB commonly used for named entity recognition, which encodes information about singleton entities (S) and explicitly marks the end of named entities (E). Under this scheme, tagging a word as I-label with high confidence narrows down the choices for the subsequent word to I-label or E-label; the IOB scheme, in contrast, can only determine that the subsequent word cannot be the interior of another label. Previous work has shown that using a more expressive tagging scheme like IOBES improves model performance marginally; however, we did not observe a significant improvement over the IOB tagging scheme.
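
As a concrete illustration of the two schemes, here is a minimal sketch that converts an IOB-tagged sequence to IOBES (tag strings such as B-PER and I-LOC are the conventional ones):

```python
def iob_to_iobes(tags):
    """Convert a list of IOB tags to IOBES.

    A B- tag not followed by an I- tag of the same type becomes S-;
    the last I- tag of an entity becomes E-.
    """
    iobes = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            iobes.append(tag)
            continue
        prefix, etype = tag.split('-', 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
        continues = nxt.startswith('I-') and nxt.split('-', 1)[1] == etype
        if prefix == 'B':
            iobes.append(('B-' if continues else 'S-') + etype)
        else:  # prefix == 'I'
            iobes.append(('I-' if continues else 'E-') + etype)
    return iobes

# e.g. ['B-PER', 'I-PER', 'O', 'B-LOC']  ->  ['B-PER', 'E-PER', 'O', 'S-LOC']
```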

3.2 Training Model

The scores associated with each tagging decision for each token (i.e. $P_{i,j}$) are defined as the dot product between the embedding of a word-in-context computed with the bidirectional LSTM and a vector of linear feature weights for each tag. The parameters of this model are thus the matrix of bigram compatibility scores $A$ and the parameters that give rise to the matrix $P$, namely the parameters of the bidirectional LSTM, the linear feature weights, and the word embeddings.

The parameters are trained to maximize the log-probability of observed sequences of NER tags in an annotated corpus, given the observed words.
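
A hedged sketch of such a training loop, assuming PyTorch and a model object that exposes a per-sentence negative log-likelihood (the method name neg_log_likelihood and the data format are hypothetical):

```python
import torch

def train(model, training_data, n_epochs=10, lr=0.01):
    """Plain SGD, updating the parameters after every training example.

    training_data yields (sentence, gold_tags) pairs; model.neg_log_likelihood
    is assumed to return -log p(gold_tags | sentence) as described above.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        for sentence, gold_tags in training_data:
            optimizer.zero_grad()
            loss = model.neg_log_likelihood(sentence, gold_tags)
            loss.backward()
            optimizer.step()
```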

3.3 Viterbi Decoding

At inference time, once we have the word-in-context representations computed with the bidirectional LSTM, we use Viterbi decoding to find the optimal tagging sequence.
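
A minimal NumPy sketch of Viterbi decoding, under the same conventions as the scoring sketches in Section 2.2:

```python
import numpy as np

def viterbi_decode(P, A, start, end):
    """Highest-scoring tag sequence given emission scores P (n, k) and
    transition scores A ((k + 2) x (k + 2), including the start/end tags)."""
    n, k = P.shape
    score = A[start, :k] + P[0]              # best score of each length-1 prefix
    backpointers = []
    for i in range(1, n):
        total = score[:, None] + A[:k, :k]   # total[p, q]: best prefix ending in p, then tag q
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0) + P[i]
    best_last = int(np.argmax(score + A[:k, end]))
    path = [best_last]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))              # one tag index per word
```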

4 Experiments

To train this hybrid model, I used the back-propagation algorithm, updating the parameters on every training example, one at a time, with stochastic gradient descent (SGD) and a learning rate of $0.01$. To evaluate the quality of a NER system’s output, we look at precision, recall and the $F_1$ measure.

  • Precision is calculated as the ratio of correct non-null labels predicted to the total number of non-null labels predicted.
  • Recall is calculated as the ratio of correct non-null labels predicted to the total number of non-null labels in the gold annotation.
  • F1 is the harmonic mean of the two: $F_1=\frac{2pr}{p+r}$

Named-Entity   Precision   Recall    $F_1$
LOC            92.30%      92.16%    92.23
ORG            86.71%      87.50%    87.10
PER            92.78%      92.99%    92.88
All            90.96%      91.15%    91.05

Table 1 shows our results for Chinese Named-Entity Recognition. The hybrid model reaches an overall $F_1$ of 91.05, performing best on location and person names.
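
For reference, here is a minimal sketch of these token-level metrics over predicted and gold tag sequences, following the definitions above (per-type figures as in Table 1 would simply restrict the counts to one entity type):

```python
def precision_recall_f1(pred_tags, gold_tags):
    """Token-level precision, recall and F1 over non-null (non-'O') labels."""
    assert len(pred_tags) == len(gold_tags)
    pred_nonnull = sum(1 for p in pred_tags if p != 'O')
    gold_nonnull = sum(1 for g in gold_tags if g != 'O')
    correct = sum(1 for p, g in zip(pred_tags, gold_tags) if p != 'O' and p == g)
    precision = correct / pred_nonnull if pred_nonnull else 0.0
    recall = correct / gold_nonnull if gold_nonnull else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```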

5 Further Thoughts

While working on this paper, I kept thinking about the role of the CRF layer. We use the CRF because it can capture the transition relations between pairs of tags. So why not use a neural network to capture these relations instead? I designed a new architecture that uses a Bi-LSTM, instead of a CRF, to capture the tag dependencies. The architecture is described below.


Figure 3 describes the whole procedure for Chinese NER using a pure neural network architecture. In the Encoder step, we obtain the embedding of a word-in-context:
$$\begin{aligned} \overrightarrow{h_i} &=\overrightarrow{f}(x_i,\overrightarrow{h_{i-1}}) \\ \overleftarrow{h_i} &=\overleftarrow{f}(x_i,\overleftarrow{h_{i+1}}) \\ c_i&=\phi(\overrightarrow{h_i},\overleftarrow{h_i}) \end{aligned}$$
where $x_i$ is the word embedding of the current input, and $\overrightarrow{h_{i-1}}$ and $\overleftarrow{h_{i+1}}$ are the hidden states from the previous step of the forward and backward LSTMs respectively (the backward LSTM reads the sentence in reverse). $\phi$ is the function that combines the outputs of the forward and backward LSTMs, and $c_i$ is the context obtained from this combination.

In the Decoder step, we use the context obtained from the Encoder, combined with the previous tag, to predict the next tag:
$$\begin{aligned} z_i&=\tilde{f}(c_i,y_{i-1},z_{i-1}) \\ o_i &=g(y_{i-1},z_i,c_i) \end{aligned}$$
where $z_i$ is the hidden state of the LSTM in the Decoder step, $\tilde{f}$ is a simplified representation of its computation, and $o_i$ is a function that predicts the current tag from the previous tag $y_{i-1}$, the hidden state $z_i$, and the context $c_i$ obtained in the Encoder step.
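
A hedged PyTorch sketch of this decoder step (module names, dimensions and the teacher-forcing interface are all hypothetical; a unidirectional LSTM cell is used here for simplicity, since each step conditions on the previously predicted tag):

```python
import torch
import torch.nn as nn

class TagDecoder(nn.Module):
    """Predicts tag i from the context c_i, the previous tag y_{i-1} and the
    previous decoder state z_{i-1}, following the equations above."""

    def __init__(self, context_dim, tag_count, tag_emb_dim=25, hidden_dim=64):
        super().__init__()
        self.tag_emb = nn.Embedding(tag_count + 1, tag_emb_dim)                 # +1 for a start-of-sentence tag
        self.cell = nn.LSTMCell(context_dim + tag_emb_dim, hidden_dim)          # z_i = f~(c_i, y_{i-1}, z_{i-1})
        self.out = nn.Linear(hidden_dim + context_dim + tag_emb_dim, tag_count) # o_i = g(y_{i-1}, z_i, c_i)

    def forward(self, contexts, prev_tags):
        # contexts:  (n, context_dim) word-in-context vectors from the encoder
        # prev_tags: (n,) indices of the previous gold tags (teacher forcing)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        scores = []
        for c_i, y_prev in zip(contexts, prev_tags):
            e_prev = self.tag_emb(y_prev.unsqueeze(0))                          # (1, tag_emb_dim)
            h, c = self.cell(torch.cat([c_i.unsqueeze(0), e_prev], dim=-1), (h, c))
            scores.append(self.out(torch.cat([h, c_i.unsqueeze(0), e_prev], dim=-1)))
        return torch.cat(scores)                                                # (n, tag_count) tag scores
```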

I am currently experimenting with this new framework and plan to explain it systematically in a follow-up paper.

6 Conclusion

This paper presents two architectures for the Chinese NER task: a hybrid architecture and a pure neural network architecture. The key idea is that not only the relations between words in context, but also the relations between tags, are important. We therefore need a mechanism to capture the relations between tags, whether an HMM, a CRF, or simply an RNN. Word embeddings are also very important in our architecture; in this paper I used pre-trained word embeddings to obtain good results. To avoid overfitting to the training data, I used dropout on the input layer and the context layer.