
Chinese Person-Relation Analysis with Bi-GRU Semantic Parsing in Python (Source Code at the End)

2023-04-25 19:46 Author: 紙飛機get


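Introduction: Semantic analysis is a core part of natural language processing, and its task depends on the level of text being analysed. At the word level, the basic task is word sense disambiguation; at the sentence level, it is semantic role labeling; at the document level, the focus is coreference resolution and discourse-level semantic analysis.

Entity recognition and relation extraction are the foundation for higher-level NLP applications such as knowledge graph construction. Relation extraction can be understood simply as a classification problem: given two entities and a sentence in which both of them appear, determine the relation between the two entities.

Deep learning methods that combine a CNN or a bidirectional RNN with attention are regarded as the current state-of-the-art approach to relation extraction. Most existing papers and code target English corpora and are trained on word embeddings. With a practical goal in mind, this article introduces a Chinese relation extraction model that uses a bidirectional GRU with two levels of attention (character level and sentence level), takes character embeddings, which suit Chinese naturally, as input, and is trained on a corpus crawled from the web. The code is mainly based on the open-source project thunlp/TensorFlow-NRE from Tsinghua University.

[Figure: sample output of the model]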
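
To make the character-embedding input concrete, here is a minimal sketch of turning a Chinese sentence into a fixed-length sequence of character ids, with no word segmentation step. The tiny vocabulary, its ids, and the helper name `encode_chars` are illustrative assumptions; the BLANK and UNK entries mirror the ones used by the test script later in this article.

```python
# Toy vocabulary: the characters and ids are made up for illustration.
word2id = {"BLANK": 0, "UNK": 1, "李": 2, "明": 3, "的": 4, "妻": 5, "子": 6}


def encode_chars(sentence, word2id, fixlen=70):
    """Map every character to an id, then truncate or pad to fixlen."""
    ids = [word2id.get(ch, word2id["UNK"]) for ch in sentence[:fixlen]]
    ids += [word2id["BLANK"]] * (fixlen - len(ids))
    return ids


ids = encode_chars("李明的妻子是王芳", word2id, fixlen=10)
print(ids)
```

Each id then selects one row of the character embedding matrix, so characters missing from the vocabulary all share the UNK vector.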

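I. Preparation

We use Python 3.6.5 with the following modules:

tensorflow: builds the network and handles training, saving, and loading of the model. The code uses the TensorFlow 1.x API (tf.placeholder, tf.contrib).

numpy: handles array and matrix operations on the data.

sklearn: a collection of machine learning algorithms.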

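II. Building the network

[Figure: network diagram of the model]

The idea of a bidirectional GRU with character-level attention comes from the paper "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification" [Zhou et al., 2016]. Here the LSTM in the original architecture is replaced by a GRU, and every Chinese character in the sentence is fed in as a character embedding. This part of the model is trained on each sentence individually and adds character-level attention.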
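
As a rough illustration of character-level attention, the NumPy sketch below scores every time step with a learned vector, normalises the scores with a softmax, and returns the weighted sum of the hidden states. It is a simplified stand-in, not the article's TensorFlow code: the function names, shapes, and random inputs are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def word_attention(H, w):
    """H: (batch, steps, hidden) Bi-GRU outputs; w: (hidden,) attention vector.

    Returns one (hidden,) vector per sentence and the (batch, steps) weights.
    """
    scores = np.tanh(H) @ w                  # one score per time step
    alpha = softmax(scores, axis=1)          # weights over the time steps
    r = np.einsum("bs,bsh->bh", alpha, H)    # weighted sum of hidden states
    return r, alpha


rng = np.random.default_rng(0)
H = rng.normal(size=(2, 5, 4))   # 2 sentences, 5 characters, hidden size 4
w = rng.normal(size=4)
r, alpha = word_attention(H, w)
print(r.shape, alpha.sum(axis=1))
```

The weights in each row of `alpha` sum to one, so `r` is a convex combination of the per-character hidden states.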

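The idea of sentence-level attention comes from the paper "Neural Relation Extraction with Selective Attention over Instances" [Lin et al., 2016]. Here the CNN module that encodes each sentence in the original architecture is replaced by the bidirectional GRU model described above. This part of the model is trained jointly on all the sentences that belong to the same class and adds sentence-level attention.

[Figure: model architecture from the original paper]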
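
The sentence-level step can be sketched in a few lines of NumPy: every sentence vector in a bag (all sentences mentioning one entity pair) is scored against a relation query, and the bag is represented by the attention-weighted average. This sketch assumes a diagonal weighting matrix, which is what the shapes of `attention_A` and `query_r` in the network code suggest; the function name and the random inputs are illustrative.

```python
import numpy as np


def sentence_attention(S, a_diag, r):
    """S: (n_sentences, hidden) sentence vectors of one entity-pair bag.

    a_diag: (hidden,) diagonal of the weighting matrix A.
    r: (hidden,) query vector of the relation.
    """
    scores = (S * a_diag) @ r          # e_i = x_i A r with a diagonal A
    scores = scores - scores.max()     # stabilise the softmax
    weights = np.exp(scores)
    alpha = weights / weights.sum()
    return alpha @ S, alpha            # bag vector and attention weights


rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4))            # a bag with 3 sentences
a_diag = rng.normal(size=4)
r = rng.normal(size=4)
bag_vector, alpha = sentence_attention(S, a_diag, r)
print(alpha.round(3), bag_vector.shape)
```

With a single sentence in the bag the weight is 1 and the bag vector equals that sentence vector, so noisy sentences only matter when better evidence is available to outweigh them.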

Create the file network.py and define the vocabulary size, the number of time steps, the number of classes, and so on:

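class Settings(object):
    def __init__(self):
        self.vocab_size = 16691   # number of characters in the vocabulary
        self.num_steps = 70       # fixed sentence length, in characters
        self.num_epochs = 10
        self.num_classes = 12     # number of relation classes
        self.gru_size = 230       # hidden units of each GRU
        self.keep_prob = 0.5      # dropout keep probability
        self.num_layers = 1
        self.pos_size = 5         # dimension of a position embedding
        self.pos_num = 123        # number of distinct position ids
        # the number of entity pairs of each batch during training or testing
        self.big_num = 50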
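
The value `pos_num = 123` is easier to read next to a position-mapping helper. The test script in this article calls a function named `pos_embed` and mentions a range of -60 to +60, but its body is not shown, so the clipping scheme below is an assumption that simply yields 123 distinct ids: 121 in-range offsets plus one id for each out-of-range side.

```python
def pos_embed(offset, maxlen=60):
    """Map a character's offset from an entity to an embedding row id.

    Offsets in [-maxlen, maxlen] map to 1 .. 2 * maxlen + 1; anything
    further left maps to 0 and anything further right to 2 * maxlen + 2.
    """
    if offset < -maxlen:
        return 0
    if offset > maxlen:
        return 2 * maxlen + 2
    return offset + maxlen + 1


ids = {pos_embed(offset) for offset in range(-200, 201)}
print(len(ids), pos_embed(0))
```

Each character gets two such ids, one relative to each entity, and each id selects a `pos_size`-dimensional row that is concatenated with the character embedding.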

Next, build the GRU network. Following the network diagram, define the basic framework of the network, reading the concrete parameters from the settings object:

import tensorflow as tf  # TensorFlow 1.x API


class GRU:
    def __init__(self, is_training, word_embeddings, settings):
        self.num_steps = num_steps = settings.num_steps
        self.vocab_size = vocab_size = settings.vocab_size
        self.num_classes = num_classes = settings.num_classes
        self.gru_size = gru_size = settings.gru_size
        self.big_num = big_num = settings.big_num

        # inputs: all sentences of a batch are stacked along the first axis
        self.input_word = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_word')
        self.input_pos1 = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_pos1')
        self.input_pos2 = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_pos2')
        self.input_y = tf.placeholder(dtype=tf.float32, shape=[None, num_classes], name='input_y')
        # start offset of every entity pair, followed by the total sentence count
        self.total_shape = tf.placeholder(dtype=tf.int32, shape=[big_num + 1], name='total_shape')
        total_num = self.total_shape[-1]

        # trainable parameters
        word_embedding = tf.get_variable(initializer=word_embeddings, name='word_embedding')
        pos1_embedding = tf.get_variable('pos1_embedding', [settings.pos_num, settings.pos_size])
        pos2_embedding = tf.get_variable('pos2_embedding', [settings.pos_num, settings.pos_size])
        attention_w = tf.get_variable('attention_omega', [gru_size, 1])
        sen_a = tf.get_variable('attention_A', [gru_size])
        sen_r = tf.get_variable('query_r', [gru_size, 1])
        relation_embedding = tf.get_variable('relation_embedding', [self.num_classes, gru_size])
        sen_d = tf.get_variable('bias_d', [self.num_classes])

        gru_cell_forward = tf.contrib.rnn.GRUCell(gru_size)
        gru_cell_backward = tf.contrib.rnn.GRUCell(gru_size)
        if is_training and settings.keep_prob < 1:
            gru_cell_forward = tf.contrib.rnn.DropoutWrapper(gru_cell_forward, output_keep_prob=settings.keep_prob)
            gru_cell_backward = tf.contrib.rnn.DropoutWrapper(gru_cell_backward, output_keep_prob=settings.keep_prob)
        # note: the list repeats one cell object, which is only safe for num_layers == 1
        cell_forward = tf.contrib.rnn.MultiRNNCell([gru_cell_forward] * settings.num_layers)
        cell_backward = tf.contrib.rnn.MultiRNNCell([gru_cell_backward] * settings.num_layers)

        sen_repre = []
        sen_alpha = []
        sen_s = []
        sen_out = []
        self.prob = []
        self.predictions = []
        self.loss = []
        self.accuracy = []
        self.total_loss = 0.0
        self._initial_state_forward = cell_forward.zero_state(total_num, tf.float32)
        self._initial_state_backward = cell_backward.zero_state(total_num, tf.float32)

        # embedding layer: character embedding plus two position embeddings
        inputs_forward = tf.concat(axis=2, values=[
            tf.nn.embedding_lookup(word_embedding, self.input_word),
            tf.nn.embedding_lookup(pos1_embedding, self.input_pos1),
            tf.nn.embedding_lookup(pos2_embedding, self.input_pos2)])
        # the backward GRU reads the same sequence reversed along the time axis
        inputs_backward = tf.concat(axis=2, values=[
            tf.nn.embedding_lookup(word_embedding, tf.reverse(self.input_word, [1])),
            tf.nn.embedding_lookup(pos1_embedding, tf.reverse(self.input_pos1, [1])),
            tf.nn.embedding_lookup(pos2_embedding, tf.reverse(self.input_pos2, [1]))])

        # Bi-GRU layer
        outputs_forward = []
        state_forward = self._initial_state_forward
        with tf.variable_scope('GRU_FORWARD') as scope:
            for step in range(num_steps):
                if step > 0:
                    scope.reuse_variables()
                (cell_output_forward, state_forward) = cell_forward(inputs_forward[:, step, :], state_forward)
                outputs_forward.append(cell_output_forward)
        outputs_backward = []
        state_backward = self._initial_state_backward
        with tf.variable_scope('GRU_BACKWARD') as scope:
            for step in range(num_steps):
                if step > 0:
                    scope.reuse_variables()
                (cell_output_backward, state_backward) = cell_backward(inputs_backward[:, step, :], state_backward)
                outputs_backward.append(cell_output_backward)
        output_forward = tf.reshape(tf.concat(axis=1, values=outputs_forward), [total_num, num_steps, gru_size])
        # reverse the backward outputs again so both directions align per time step
        output_backward = tf.reverse(
            tf.reshape(tf.concat(axis=1, values=outputs_backward), [total_num, num_steps, gru_size]), [1])

        # word-level attention layer
        output_h = tf.add(output_forward, output_backward)
        # score of every time step: tanh(H) * omega -> [total_num, num_steps]
        scores = tf.reshape(
            tf.matmul(tf.reshape(tf.tanh(output_h), [total_num * num_steps, gru_size]), attention_w),
            [total_num, num_steps])
        # softmax over time steps, then the weighted sum of H -> [total_num, gru_size]
        alpha = tf.reshape(tf.nn.softmax(scores), [total_num, 1, num_steps])
        attention_r = tf.reshape(tf.matmul(alpha, output_h), [total_num, gru_size])

        # The excerpt ends here. The sentence-level attention, the losses and the
        # accuracy used by the training script (m.total_loss, m.l2_loss,
        # m.final_loss, m.accuracy) are built after this point in the full source.


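III. Training and using the model

Obtaining the training corpus: public corpora for Chinese relation extraction are scarce. Taking inspiration from distant supervision, we first find entity pairs with a known relation and then collect the sentences in which both entities appear as positive samples. For negative samples, we randomly draw unrelated entity pairs from the entity set and collect the sentences in which those pairs co-occur.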
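
The sampling procedure described above can be sketched as follows. Everything in the example is hypothetical: the entity names, the sentences, the "unknown" label for negative samples, and the helper name `build_samples` are assumptions for illustration, not the article's crawler.

```python
import random


def build_samples(known_pairs, entities, sentences, n_negative, seed=0, max_attempts=1000):
    """Return (entity1, entity2, relation, sentence) tuples.

    known_pairs maps (entity1, entity2) to a relation label.
    """
    def cooccurring(e1, e2):
        return [s for s in sentences if e1 in s and e2 in s]

    # Positive samples: every sentence that mentions both entities of a known pair.
    samples = [(e1, e2, rel, s)
               for (e1, e2), rel in known_pairs.items()
               for s in cooccurring(e1, e2)]

    # Negative samples: random pairs without a known relation that still co-occur.
    rng = random.Random(seed)
    tried, found_pairs = set(), 0
    for _ in range(max_attempts):
        if found_pairs >= n_negative:
            break
        e1, e2 = rng.sample(entities, 2)
        key = frozenset((e1, e2))
        if key in tried or (e1, e2) in known_pairs or (e2, e1) in known_pairs:
            continue
        tried.add(key)
        hits = cooccurring(e1, e2)
        if hits:
            samples.extend((e1, e2, "unknown", s) for s in hits)
            found_pairs += 1
    return samples


entities = ["Li Ming", "Wang Fang", "Zhang Wei"]
sentences = [
    "Li Ming and Wang Fang married in 2010.",
    "Zhang Wei interviewed Li Ming yesterday.",
    "Wang Fang lives in Beijing.",
]
known_pairs = {("Li Ming", "Wang Fang"): "husband-wife"}
samples = build_samples(known_pairs, entities, sentences, n_negative=1)
for sample in samples:
    print(sample)
```

Labels obtained this way are noisy, because a sentence can mention both entities without expressing their relation; that noise is what the sentence-level attention is meant to absorb.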

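The entity pairs with known relations come from Fudan University's Knowledge Works (復旦知識工廠); thanks to them for providing a free API! One small problem is that the same relation label can appear under different annotations in that data. For "夫妻" (husband and wife), for example, some of the fetched records say "丈夫" (husband), some say "妻子" (wife), some say "伉儷" (married couple), and so on, so the labels have to be aligned by hand.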
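
One simple way to do this alignment is a hand-written alias table that maps every raw annotation to a canonical label. The sketch below uses only the three aliases mentioned above; the table name and the helper `normalize_relation` are assumptions for illustration.

```python
# Raw annotation -> canonical relation label (extended by hand as needed).
ALIAS_TO_LABEL = {
    "丈夫": "夫妻",
    "妻子": "夫妻",
    "伉儷": "夫妻",
}


def normalize_relation(raw_label, alias_to_label=ALIAS_TO_LABEL):
    """Return the canonical label, or the raw label if no alias is known."""
    return alias_to_label.get(raw_label.strip(), raw_label.strip())


raw_labels = ["丈夫", " 妻子 ", "伉儷", "夫妻", "父親"]
normalized = [normalize_relation(label) for label in raw_labels]
print(normalized)
```

Labels that are not in the table pass through unchanged, which makes it easy to list them afterwards and decide how to map them.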

1. Training the model:

Create the train_GRU file, which trains the model on the preprocessed .npy files.

[Figure: sample of the training data]

The code is as follows:

import datetime

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

import network

FLAGS = tf.app.flags.FLAGS
# Directory for TensorBoard summaries; the default '.' is an assumption.
tf.app.flags.DEFINE_string('summary_dir', '.', 'path to store summary')


def main(_):
    # the path to save models
    save_path = './model/'
    print('reading wordembedding')
    wordembedding = np.load('./data/vec.npy')
    print('reading training data')
    train_y = np.load('./data/train_y.npy')
    # Each entry is a bag of sentences whose size may vary, so these can be
    # object arrays, which NumPy >= 1.16.3 only loads with allow_pickle=True.
    train_word = np.load('./data/train_word.npy', allow_pickle=True)
    train_pos1 = np.load('./data/train_pos1.npy', allow_pickle=True)
    train_pos2 = np.load('./data/train_pos2.npy', allow_pickle=True)
    settings = network.Settings()
    settings.vocab_size = len(wordembedding)
    settings.num_classes = len(train_y[0])
    big_num = settings.big_num

    with tf.Graph().as_default():
        sess = tf.Session()
        with sess.as_default():
            initializer = tf.contrib.layers.xavier_initializer()
            with tf.variable_scope("model", reuse=None, initializer=initializer):
                m = network.GRU(is_training=True, word_embeddings=wordembedding, settings=settings)
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(0.0005)
            train_op = optimizer.minimize(m.final_loss, global_step=global_step)
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver(max_to_keep=None)
            merged_summary = tf.summary.merge_all()
            summary_writer = tf.summary.FileWriter(FLAGS.summary_dir + '/train_loss', sess.graph)

            def train_step(word_batch, pos1_batch, pos2_batch, y_batch, big_num):
                # Flatten the bags into one list of sentences and record where
                # each bag starts; the final entry is the total sentence count.
                total_shape = []
                total_num = 0
                total_word = []
                total_pos1 = []
                total_pos2 = []
                for i in range(len(word_batch)):
                    total_shape.append(total_num)
                    total_num += len(word_batch[i])
                    total_word.extend(word_batch[i])
                    total_pos1.extend(pos1_batch[i])
                    total_pos2.extend(pos2_batch[i])
                total_shape.append(total_num)

                feed_dict = {
                    m.total_shape: np.array(total_shape),
                    m.input_word: np.array(total_word),
                    m.input_pos1: np.array(total_pos1),
                    m.input_pos2: np.array(total_pos2),
                    m.input_y: y_batch,
                }
                temp, step, loss, accuracy, summary, l2_loss, final_loss = sess.run(
                    [train_op, global_step, m.total_loss, m.accuracy, merged_summary, m.l2_loss, m.final_loss],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                accuracy = np.reshape(np.array(accuracy), (big_num))
                acc = np.mean(accuracy)
                summary_writer.add_summary(summary, step)
                if step % 50 == 0:
                    tempstr = "{}: step {}, softmax_loss {:g}, acc {:g}".format(time_str, step, loss, acc)
                    print(tempstr)

            for one_epoch in range(settings.num_epochs):
                # visit the entity pairs in a new random order every epoch
                temp_order = list(range(len(train_word)))
                np.random.shuffle(temp_order)
                for i in range(len(temp_order) // settings.big_num):
                    temp_word = []
                    temp_pos1 = []
                    temp_pos2 = []
                    temp_y = []
                    temp_input = temp_order[i * settings.big_num:(i + 1) * settings.big_num]
                    for k in temp_input:
                        temp_word.append(train_word[k])
                        temp_pos1.append(train_pos1[k])
                        temp_pos2.append(train_pos2[k])
                        temp_y.append(train_y[k])
                    # skip batches that contain more than 1500 sentences in total
                    num = 0
                    for single_word in temp_word:
                        num += len(single_word)
                    if num > 1500:
                        print('out of range')
                        continue
                    temp_word = np.array(temp_word)
                    temp_pos1 = np.array(temp_pos1)
                    temp_pos2 = np.array(temp_pos2)
                    temp_y = np.array(temp_y)
                    train_step(temp_word, temp_pos1, temp_pos2, temp_y, settings.big_num)
                    current_step = tf.train.global_step(sess, global_step)
                    # after step 8000, save a checkpoint every 100 steps
                    if current_step > 8000 and current_step % 100 == 0:
                        print('saving model')
                        path = saver.save(sess, save_path + 'ATT_GRU_model', global_step=current_step)
                        tempstr = 'have saved model to ' + path
                        print(tempstr)


if __name__ == "__main__":
    tf.app.run()
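
The bookkeeping inside `train_step` is easier to follow in isolation. The sketch below flattens a batch of bags (one bag per entity pair) into a single array and records the start offset of each bag, which is the role `total_shape` plays in the feed dictionary. The helper name `flatten_bags` and the toy ids are assumptions for illustration.

```python
import numpy as np


def flatten_bags(word_batch):
    """word_batch: list of bags; each bag is a list of equal-length id sequences.

    Returns the bag start offsets (with the total count appended) and the
    flattened (n_sentences, n_steps) array.
    """
    total_shape, total_word, total_num = [], [], 0
    for bag in word_batch:
        total_shape.append(total_num)
        total_num += len(bag)
        total_word.extend(bag)
    total_shape.append(total_num)
    return np.array(total_shape), np.array(total_word)


bags = [
    [[1, 2, 3], [4, 5, 6]],              # entity pair 0: two sentences
    [[7, 8, 9]],                         # entity pair 1: one sentence
    [[1, 1, 1], [2, 2, 2], [3, 3, 3]],   # entity pair 2: three sentences
]
offsets, flat = flatten_bags(bags)
print(offsets, flat.shape)

# The sentences of pair i are recovered by slicing with consecutive offsets.
pair_1 = flat[offsets[1]:offsets[2]]
```

This is why the placeholder has `big_num + 1` entries: `big_num` start offsets plus the total number of sentences.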

[Figure: training process]

2. Testing the model:

[Figure: the model files obtained after training]

# Interactive test loop. It relies on names defined earlier in the full test
# script: np, word2id, relation2id, id2relation, pos_embed, test_step and
# test_settings.
while True:
    # BUG: Encoding error if user input directly from command line.
    # (Alternatively, read the first line of a UTF-8 file such as test.txt.)
    line = input('Enter a Chinese sentence in the format "name1 name2 sentence": ')
    # split on the first two runs of whitespace only, so the sentence may contain spaces
    en1, en2, sentence = line.strip().split(None, 2)
    print("Entity 1: " + en1)
    print("Entity 2: " + en2)
    print(sentence)

    # position of each entity in the sentence; fall back to 0 if it is not found
    en1pos = sentence.find(en1)
    if en1pos == -1:
        en1pos = 0
    en2pos = sentence.find(en2)
    if en2pos == -1:
        en2pos = 0

    output = []
    # length of sentence is 70
    fixlen = 70
    # max length of position embedding is 60 (-60~+60)
    maxlen = 60

    # Encoding test x: start from BLANK padding with the relative positions ...
    for i in range(fixlen):
        word = word2id['BLANK']
        rel_e1 = pos_embed(i - en1pos)
        rel_e2 = pos_embed(i - en2pos)
        output.append([word, rel_e1, rel_e2])
    # ... then fill in the character ids, mapping unseen characters to UNK
    for i in range(min(fixlen, len(sentence))):
        if sentence[i] not in word2id:
            word = word2id['UNK']
        else:
            word = word2id[sentence[i]]
        output[i][0] = word
    test_x = []
    test_x.append([output])

    # Encoding test y: a placeholder one-hot label for the input_y feed
    label = [0 for i in range(len(relation2id))]
    label[0] = 1
    test_y = []
    test_y.append(label)
    test_x = np.array(test_x)
    test_y = np.array(test_y)

    # split every [word, pos1, pos2] triple into three parallel arrays
    test_word = []
    test_pos1 = []
    test_pos2 = []
    for i in range(len(test_x)):
        word = []
        pos1 = []
        pos2 = []
        for j in test_x[i]:
            temp_word = []
            temp_pos1 = []
            temp_pos2 = []
            for k in j:
                temp_word.append(k[0])
                temp_pos1.append(k[1])
                temp_pos2.append(k[2])
            word.append(temp_word)
            pos1.append(temp_pos1)
            pos2.append(temp_pos2)
        test_word.append(word)
        test_pos1.append(pos1)
        test_pos2.append(pos2)
    test_word = np.array(test_word)
    test_pos1 = np.array(test_pos1)
    test_pos2 = np.array(test_pos2)

    prob, accuracy = test_step(test_word, test_pos1, test_pos2, test_y)
    prob = np.reshape(np.array(prob), (1, test_settings.num_classes))[0]
    print("The relation is:")
    # indices of the three largest probabilities, in descending order
    top3_id = prob.argsort()[-3:][::-1]
    for n, rel_id in enumerate(top3_id):
        print("No." + str(n + 1) + ": " + id2relation[rel_id] + ", Probability is " + str(prob[rel_id]))

[Figure: test results]

Full source code: https://gitcode.net/qq_42279468/python-bi-gru.git
