Topic 6: Code in Practice - Spelling Correction Based on a Language Model

1. Overview of the Spelling Correction Task
When implementing a QA or retrieval system, the user provides the input, and we cannot expect that input to always be typed correctly; some words may be misspelled. A complete system therefore needs a back end that can catch spelling errors in time, correct them, and only then match the corrected result against the questions in the database. Here we implement a simple spelling correction module that automatically repairs misspelled words.

The correction module is based on the Noisy Channel Model:

$$\hat{c} = \arg\max_{c \in \text{candidates}} P(c \mid s) = \arg\max_{c \in \text{candidates}} \frac{P(s \mid c)\, P(c)}{P(s)} \propto \arg\max_{c \in \text{candidates}} P(s \mid c)\, P(c)$$

Here, candidates is the set of candidate corrections for the misspelled word, which can be obtained via edit_distance; a valid word is one that exists in the dictionary; c denotes the correct word and s denotes the word the user actually mistyped.

For more on the Noisy Channel Model, see https://blog.csdn.net/kunpen8944/article/details/83066460 and https://www.cnblogs.com/loubin/p/13704777.html.

The overall implementation goes as follows:

Use edit distance (1 to 3) to generate the candidates set;

Filter candidates with the vocabulary to keep only valid words, which gives the actual candidate set;

Goal: among the candidates, find the correct spelling c that maximizes the conditional probability above.

Here,

p(s|c): obtained from historical data

That is, for a correct word c, the probability that it gets typed as a particular wrong form; this is normally estimated from historical data. The historical spelling-error data spell-errors.txt does not come with probabilities, so a uniform probability is used instead; this term is also called the channel probability.

p(c): computed with a language model

That is, if the misspelled s is replaced by c, how fluent the sentence becomes after inserting c. Given two candidates c1 and c2, we compute a language-model probability for each, using the bigram probabilities of the current word with the words immediately before and after it. For example, given 'We are go to school tomorrow', we want to replace the middle word 'go' with its correct form. If the candidate set is {going, went}, we compute p(going|are)p(to|going) and p(went|are)p(to|went), treat these as p(c), and pick the candidate with the larger probability as the replacement.

To compute bigram probabilities we need to train a language model. Training requires text data; here we use the reuters corpus that ships with nltk. Other corpora can be tried as well.
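For example, once the bigram model from section 2 is trained, the comparison for the sentence above can be scored roughly as in the sketch below (it assumes the get_bigram helper defined in section 2; the candidate with the larger score wins):

# A minimal sketch: score the two candidates for "go" in
# "We are go to school tomorrow" with the bigram model of section 2
for candidate in ['going', 'went']:
    score = get_bigram('are', candidate) + get_bigram(candidate, 'to')
    print(candidate, score)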

2. Implementing Spelling Correction
(1) Read the dictionary file and build the vocabulary

# Read the dictionary file
vocab = set([line.rstrip() for line in open('data/vocab.txt')])
len(vocab)
Output:

48227
(2) Implement a function to build the candidate set

# 1. Build the candidate set

import string

# Get the candidate set of correct words for a misspelled word
def generate_candicates(word):
    # For simplicity, only generate words within edit distance 1 of the given word
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word)+1)] # all (left, right) splits of the word
    # insert one letter
    inserts = [L + c + R for L, R in splits for c in letters]
    # delete one letter
    deletes = [L + R[1:] for L, R in splits]
    # replace one letter
    replaces = [L + c + R[1:] for L, R in splits for c in letters]
    # take the union
    candidates = set(inserts + deletes + replaces)
    # filter out words that are not in the vocabulary
    candidates = list(filter(lambda w: w in vocab, candidates))
    return candidates

generate_candicates('word')
Output:

['wore',
 'cord',
 'word',
 'ward',
 'sword',
 'wod',
 'worm',
 'wordy',
 'words',
 'work',
 'wood',
 'world',
 'worn',
 'wohd',
 'lord',
 'wold']
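The plan in section 1 mentions edit distances from 1 to 3, but generate_candicates only covers distance 1. Below is a minimal sketch of one way to obtain distance-2 candidates by applying the single-edit step twice; edits1 and generate_candidates_2 are hypothetical helper names, not part of the original code.

def edits1(word):
    # same single-edit operations as above, but without the vocabulary filter,
    # so that edits can be chained
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [L + c + R for L, R in splits for c in letters]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    return set(inserts + deletes + replaces)

def generate_candidates_2(word):
    # apply the single-edit step twice, then keep only valid words
    edits2 = {e2 for e1 in edits1(word) for e2 in edits1(e1)}
    return [w for w in edits2 if w in vocab]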
(3) Compute p(c) with a bigram model

# 2. Compute p(c) using a bigram model

import nltk
from nltk.corpus import reuters # import the corpus

# download the corpus
nltk.download('reuters')
nltk.download('punkt')

# read the corpus
categories = reuters.categories()
corpus = reuters.sents(categories=categories)
print(len(corpus), len(corpus[0]))
Output:

54711 49
Implementing the bigram counts by hand:

# 2.1 Count bigrams by hand

from tqdm import tqdm

term_count = {}
bigram_count = {}

for doc in tqdm(corpus):
    doc = ['<s>'] + doc  # prepend a sentence-start symbol
    for i in range(len(doc) - 1):
        term = doc[i]
        bigram = ''.join(doc[i:i+2]) # bigram of tokens [i, i+1], joined without a separator
        if term in term_count:
            term_count[term] += 1
        else:
            term_count[term] = 1
        if bigram in bigram_count:
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1

bigram_count
Output:

{'<s>ASIAN': 4,
 'ASIANEXPORTERS': 1,
 'EXPORTERSFEAR': 1,
 'FEARDAMAGE': 1,
 'DAMAGEFROM': 2,
 'FROMU': 4,
 ...
 'estimated27': 1,
 '27pct': 33,
 'pcton': 121,
 'ona': 521,
 'akilowatt': 1,
 ...}
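As a quick sanity check (a sketch that is not in the original code), an add-one smoothed bigram probability can be computed directly from these hand-built counts; get_bigram_from_counts is a hypothetical name mirroring the nltk-based get_bigram defined below, and note that the bigram keys above were created by joining the two tokens without a separator.

import numpy as np

V = len(term_count)

def get_bigram_from_counts(word1, word2):
    # add-one smoothed log probability p(word2 | word1) from the raw counts
    joint = bigram_count.get(word1 + word2, 0)
    prior = term_count.get(word1, 0)
    return np.log((joint + 1) / (prior + V))

get_bigram_from_counts('we', 'are')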
Using nltk's built-in modules for bigrams:

# 2.2 Use nltk's modules to build the bigram counts

from nltk.util import ngrams
from nltk.lm import NgramCounter

unigrams = [list(ngrams(['<s>'] + doc, 1)) for doc in tqdm(corpus)]
bigrams = [list(ngrams(['<s>'] + doc, 2)) for doc in tqdm(corpus)]
ngram_counts = NgramCounter(unigrams + bigrams)
ngram_counts[['we']]['are']
Output:

115
Computing the bigram score of two given words:

# Compute the bigram score of two given words

import numpy as np

vocab_count = len(vocab)

def get_bigram(word1, word2):
    join_count = ngram_counts[[word1]][word2]   # count of the bigram (word1, word2)
    word1_count = ngram_counts[word1]           # unigram count of word1
    # add-one (Laplace) smoothing over the vocabulary size
    bigram_probability = (join_count + 1) / (word1_count + vocab_count)
    return np.log(bigram_probability)

get_bigram('we', 'are')
Output:

-6.04867564155262
For more on nltk's ngram utilities, see https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams and https://www.nltk.org/api/nltk.lm.html#nltk.lm.NgramCounter.

(4) Compute p(s|c) from historical data:

# 3. Compute p(s|c)
# Statistics of user typos, i.e. the channel probability
channel_prob = {} # channel_prob[c][s] is the probability that the correct word c is mistyped as s

for line in tqdm(open('data/spell-errors.txt')):
    items = line.split(':')
    correct = items[0].strip()
    mistakes = [item.strip() for item in items[1].strip().split(',')]
    channel_prob[correct] = {}
    prob = 1 / len(mistakes)  # uniform probability over this word's listed misspellings
    for mistake in mistakes:
        channel_prob[correct][mistake] = prob

channel_prob['apple']
Output:

{'alipple': 0.047619047619047616,
 'apoll': 0.047619047619047616,
 'alpper': 0.047619047619047616,
 'appy': 0.047619047619047616,
 'alpple': 0.047619047619047616,
 'ait': 0.047619047619047616,
 'appel': 0.047619047619047616,
 'appre': 0.047619047619047616,
 'abuol': 0.047619047619047616,
 'apelle': 0.047619047619047616,
 'appple': 0.047619047619047616,
 'alploo': 0.047619047619047616,
 'alppe': 0.047619047619047616,
 'apl': 0.047619047619047616,
 'apll': 0.047619047619047616,
 'apply': 0.047619047619047616,
 'alppel': 0.047619047619047616,
 'aplep': 0.047619047619047616,
 'apoler': 0.047619047619047616,
 'appe': 0.047619047619047616,
 'aple': 0.047619047619047616}
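A quick check of the channel model before using it in the main function; unseen (correct, mistake) pairs fall back to a small constant probability, just as in spell_correct below.

# Probability that "apple" was typed as "appel", with a small fallback
# for pairs that never appear in spell-errors.txt
channel_prob.get('apple', {}).get('appel', 0.0001)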
(5) Implement the main spelling-correction function:

# 4. Main spelling-correction function

def spell_correct(query):
    words = query.split()
    word_count = len(words)
    correct_query = []
    for idx, word in enumerate(words):
        if word not in vocab: # not in the vocabulary, so treat it as misspelled and replace it
            # 4.1 Generate the valid candidate set for the current word
            candidates = generate_candicates(word)
            # 4.2 Compute a score for each candidate
            probs = []
            for candidate in candidates:
                prob = 0
                # a. channel probability
                if candidate in channel_prob and word in channel_prob[candidate]:
                    prob += np.log(channel_prob[candidate][word])
                else:
                    prob += np.log(0.0001) # unseen (correct, mistake) pair: small fallback probability
                # b. language-model probability
                if idx > 0:
                    pre_score = get_bigram(words[idx-1], candidate) # [previous word, candidate]
                    prob += pre_score
                if idx < word_count - 1:
                    after_score = get_bigram(candidate, words[idx+1]) # [candidate, next word]
                    prob += after_score
                probs.append(prob)
            if len(probs) > 0:  # there are valid candidates
                max_idx = probs.index(max(probs))
                best_candidate = candidates[max_idx]
                correct_query.append(best_candidate)
            else: # no valid candidate, so keep the word unchanged
                correct_query.append(word)
        else: # the word exists in the vocabulary, so treat it as correct
            correct_query.append(word)
    return ' '.join(correct_query)

spell_correct('toke this away')
Output:

'take this away'
The idea of spelling correction is: tokenize the query (the sentence to check) and look each token up in the vocabulary. If a token cannot be found, it is treated as a spelling error, and the channel probability and the bigram model are then used to pick the most suitable candidate.

For the current word, a score is computed for each of its candidate corrections as follows:

score = p(correct) * p(mistake | correct)

log score = log p(correct) + log p(mistake | correct)

Once the scores are computed, the candidate with the largest score is taken as the correct word for the current misspelled word.
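As a concrete illustration (a sketch reusing the helpers defined above), the score of the candidate 'take' for the misspelled 'toke' in 'toke this away' would be computed roughly like this; since 'toke' is the first word of the query, only the following bigram contributes on the language-model side:

candidate, mistake = 'take', 'toke'
# channel probability, with the same 0.0001 fallback as in spell_correct
if candidate in channel_prob and mistake in channel_prob[candidate]:
    score = np.log(channel_prob[candidate][mistake])
else:
    score = np.log(0.0001)
# language-model contribution: the bigram (candidate, next word)
score += get_bigram(candidate, 'this')
score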

(6) Test with the test data

# Run the test
for line in open('data/testdata.txt'):
    items = line.rstrip().split('\t')
    print(spell_correct(items[2]))
Output:

They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products
But some exporters said that while the conflict would hurt them in the long-run in the short-term Tkyo's loss might be their gain
The U.S. Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on
...
FSIS inspects an estimates 127 mln head of cattle and 4.5 billion chicnek and turkeys every year
Houston said inspection programs have kept pace with change but conceded that the danger of chemical residues in the meat and uopltry supply has increased
He also said that although he was confident the bacterium salmonella eventually could be eradicated it would take time and much money to contain the growing problem
3. Applications of Error Correction
The language models used above are as follows:

Unigram

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)$$

Bigram

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

N-gram

When N = 3 the formula is analogous. But sometimes, for instance when estimating $P(w_5 \mid w_1, w_2, w_3, w_4)$, that phrase may never have appeared in the corpus.

As you can see, Bigram can be viewed as a trade-off between Unigram and N-gram:

Unigram only looks at a single word, which may not be accurate enough, so taking the words before and after the current word into account is more accurate;

When N is large, the corpus may not contain the current length-N phrase, so N should be kept small;

So in practice N = 2 or 3 is usually ideal, i.e. Bigram and Trigram.
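To illustrate N = 3, trigram counts can be collected with the same NgramCounter approach used for bigrams in section 2 (a sketch, not part of the original code; the looked-up words are arbitrary examples):

from nltk.util import ngrams
from nltk.lm import NgramCounter

# pad with two start symbols so the first real word also has a full trigram context
trigrams = [list(ngrams(['<s>', '<s>'] + doc, 3)) for doc in corpus]
trigram_counts = NgramCounter(trigrams)

# how often "we are" is followed by "ready" in the corpus (0 if never seen)
trigram_counts[['we', 'are']]['ready']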

The formula used for error detection is:

$$\hat{\text{text}} = \arg\max_{\text{text}} P(\text{text} \mid \text{signal}) \propto \arg\max_{\text{text}} P(\text{signal} \mid \text{text})\, P(\text{text})$$

This formula and the error-detection setting apply to a wide range of scenarios, including speech recognition, machine translation, spelling correction, OCR, and password cracking; the basic principle in all of them is converting a signal into text. For example:

Machine translation

Formula: $P(\text{Chinese} \mid \text{English}) \propto P(\text{English} \mid \text{Chinese}) \cdot P(\text{Chinese})$

where P(English|Chinese) is the translation model and P(Chinese) is the language model.

Spelling correction

Formula: $P(\text{correct spelling} \mid \text{misspelling}) \propto P(\text{misspelling} \mid \text{correct spelling}) \cdot P(\text{correct spelling})$

where P(misspelling|correct spelling) is given by the edit distance (the channel model) and P(correct spelling) is the language model.

Speech recognition

Formula: $P(\text{text} \mid \text{audio signal}) \propto P(\text{audio signal} \mid \text{text}) \cdot P(\text{text})$

where P(audio signal|text) is the translation model (recognition model) and P(text) is the language model.

Password cracking

Formula: $P(\text{plaintext} \mid \text{ciphertext}) \propto P(\text{ciphertext} \mid \text{plaintext}) \cdot P(\text{plaintext})$

where P(ciphertext|plaintext) is the translation model and P(plaintext) is the language model.





