Chinese Short-Text Classification with BERT

Published: 2020-07-15 08:40:21 | Source: Web | Author: nineteens | Category: Programming Languages

1. Environment Setup

This experiment was run on Ubuntu 18.04.3 LTS (kernel 4.15.0-29-generic, GNU/Linux).

1.1 Check the CUDA version

    cat /usr/local/cuda/version.txt

Output:

    CUDA Version 10.0.130

1.2 Check the cuDNN version

    cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Output:

    #define CUDNN_MINOR 6
    #define CUDNN_PATCHLEVEL 3
    #define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

If CUDA and cuDNN are not installed, download them from the official NVIDIA site and install the versions that match your GPU model.
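
If you would rather read the cuDNN version from a script than from grep output, the #define lines can be parsed directly. This is a minimal sketch, assuming the header uses the plain "#define CUDNN_MAJOR n" layout shown above; parse_cudnn_version is a hypothetical helper name, and the sample text reuses only the two numeric lines from the output above, so the major version comes back as None.

```python
import re

def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h-style #define lines.

    A field that is not present in the text is returned as None.
    """
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        pattern = r"^\s*#define\s+CUDNN_%s\s+(\d+)" % name
        match = re.search(pattern, header_text, re.MULTILINE)
        fields[name] = int(match.group(1)) if match else None
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]

# Sample text: only the numeric lines visible in the output above.
sample = "#define CUDNN_MINOR 6\n#define CUDNN_PATCHLEVEL 3\n"
print(parse_cudnn_version(sample))  # (None, 6, 3)
```

To use it on a real installation, pass the contents of /usr/local/cuda/include/cudnn.h read with open(...).read().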

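1.3 Install tensorflow-gpu

Create a virtual environment with Anaconda and install tensorflow-gpu inside it (installing Anaconda itself is not covered here).

Create the virtual environment

The environment is named tensorflow:

    conda create -n tensorflow python=3.7.1

Activate the virtual environment

The same command re-enters the environment in later sessions:

    source activate tensorflow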

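Install tensorflow-gpu

Running pip install tensorflow-gpu directly is not recommended because the download is slow. The Douban mirror is much faster: https://pypi.doubanio.com/simple/tensorflow-gpu/

Find the wheel that matches your setup (cp37 means Python 3.7), then install it with pip install:

    pip install https://pypi.doubanio.com/packages/15/21/17f941058556b67ce6d1e3f0e0932c9c2deaf457e3d45eecd93f2c20827d/tensorflow_gpu-1.14.0rc1-cp37-cp37m-manylinux1_x86_64.whl

I chose the Linux build of tensorflow-gpu 1.14.0 for Python 3.7 (the wheel above is the 1.14.0rc1 release candidate). To use BERT, tensorflow-gpu must be at least version 1.11.0. Version 2.0 is not recommended either: it appears to have changed a number of APIs, so you would have to modify the BERT code by hand.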
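
The version constraint above can be written down as a small check. This is a sketch under stated assumptions: is_supported_for_bert and parse_version are hypothetical helper names, 1.11.0 itself is treated as acceptable, and pre-release suffixes such as rc1 are ignored when comparing.

```python
import re

def parse_version(version):
    """Extract the leading (major, minor, patch) numbers, ignoring suffixes like 'rc1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.groups())

def is_supported_for_bert(version):
    """True if the version is at least 1.11.0 and below 2.0.0."""
    return (1, 11, 0) <= parse_version(version) < (2, 0, 0)

print(is_supported_for_bert("1.14.0rc1"))  # True
print(is_supported_for_bert("2.0.0"))      # False
print(is_supported_for_bert("1.10.1"))     # False
```

Comparing tuples of integers avoids the trap of comparing version strings, where "1.9.0" would sort after "1.11.0".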

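Test the environment

Inside the tensorflow virtual environment, run python to start the interpreter, then enter import tensorflow and confirm that it imports without errors.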
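
The same check can be kept as a script so that it is easy to rerun after changing the environment. This is a minimal sketch; check_tensorflow is a hypothetical helper name, and it only reports whether the import works and which version was found.

```python
import importlib

def check_tensorflow():
    """Return the installed TensorFlow version string, or None if it cannot be imported."""
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError as error:
        print("tensorflow could not be imported:", error)
        return None
    print("tensorflow version:", tf.__version__)
    return tf.__version__

check_tensorflow()
```

Once the import succeeds, tf.test.is_gpu_available() in TensorFlow 1.x can be called in the same session to confirm that the GPU is actually visible.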

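2. Preparation

2.1 Download a pretrained model

BERT-Base, Chinese

BERT-wwm: released by the Joint Laboratory of HIT and iFLYTEK Research; it performs somewhat better than BERT-Base, Chinese. (The download link is on iFLYTEK Cloud, password: mva8. Sadly, by the time I had finished training with wwm, the submission channel had already closed.)

The unpacked model contains the following files:

bert_model.ckpt: the checkpoint from which the model variables are loaded

vocab.txt: the vocabulary used for Chinese text during training

bert_config.json: the BERT configuration, with parameters that can optionally be adjusted for training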

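2.2 Prepare the data

1) Convert your dataset to the following format: the first column is the label and the second column is the text, separated by a tab (if the test set has no labels, keep only the text column). Name the training, validation, and test files train.tsv, val.tsv, and test.tsv respectively. The files must be encoded as UTF-8 without a BOM.

2) Create a data folder and put the three files in it.

3) Unpack the pretrained model into a new folder named chinese.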
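
The file format from step 1 can be produced with a few lines of Python. This is a minimal sketch: write_tsv and clean_field are hypothetical helper names, the sample sentences are invented for illustration, and the labels are borrowed from the label list used later in this article. Tabs and line breaks inside a field are collapsed to spaces, since they would otherwise break the column layout.

```python
import os
import tempfile

def clean_field(text):
    """Collapse tabs, line breaks, and runs of whitespace into single spaces."""
    return " ".join(text.split())

def write_tsv(path, rows):
    """Write each row as tab-separated fields, encoded as UTF-8 without a BOM."""
    # The "utf-8" codec never writes a byte-order mark ("utf-8-sig" would).
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            handle.write("\t".join(clean_field(field) for field in row) + "\n")

# Invented samples: label first, text second.
train_rows = [["Age", "年齡 18 至 65 歲"], ["Consent", "自願簽署\t知情同意書"]]
# Unlabeled test set: text column only.
test_rows = [["年齡大於 18 歲"]]

# A temporary directory stands in for the project folder here.
data_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(data_dir)
write_tsv(os.path.join(data_dir, "train.tsv"), train_rows)
write_tsv(os.path.join(data_dir, "test.tsv"), test_rows)
```

In a real project, data_dir would simply be the data folder from step 2, and val.tsv would be written the same way as train.tsv.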

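2.3 Modify the code

Two changes are needed in run_classifier.py from the BERT source code.

1) Add a task class to run_classifier.py

Use the existing Processor classes as a reference when writing your own: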

    # Custom Processor class
    class MyProcessor(DataProcessor):
        def __init__(self):
            self.labels = ['Addictive Behavior',
                           'Address',
                           'Age',
                           'Alcohol Consumer',
                           'Allergy Intolerance',
                           'Bedtime',
                           'Blood Donation',
                           'Capacity',
                           'Compliance with Protocol',
                           'Consent',
                           'Data Accessible',
                           'Device',
                           'Diagnostic',
                           'Diet',
                           'Disabilities',
                           'Disease',
                           'Education',
                           'Encounter',
                           'Enrollment in other studies',
                           'Ethical Audit',
                           'Ethnicity',
                           'Exercise',
                           'Gender',
                           'Healthy',
                           'Laboratory Examinations',
                           'Life Expectancy',
                           'Literacy',
                           'Multiple',
                           'Neoplasm Status',
                           'Non-Neoplasm Disease Stage',
                           'Nursing',
                           'Oral related',
                           'Organ or Tissue Status',
                           'Pharmaceutical Substance or Drug',
                           'Pregnancy-related Activity',
                           'Receptor Status',
                           'Researcher Decision',
                           'Risk Assessment',
                           'Sexual related',
                           'Sign',
                           'Smoking Status',
                           'Special Patient Characteristic',
                           'Symptom',
                           'Therapy or Surgery']

        def get_train_examples(self, data_dir):
            return self._create_examples(
                self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

        def get_dev_examples(self, data_dir):
            return self._create_examples(
                self._read_tsv(os.path.join(data_dir, "val.tsv")), "val")

        def get_test_examples(self, data_dir):
            return self._create_examples(
                self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

        def get_labels(self):
            return self.labels

        def _create_examples(self, lines, set_type):
            examples = []
            for (i, line) in enumerate(lines):
                guid = "%s-%s" % (set_type, i)
                if set_type == "test":
                    # My test set has no labels, so it is handled separately:
                    # the label is set to an arbitrary class label. It must be
                    # one that exists in the label list, otherwise predict
                    # raises a KeyError. If your test set does have labels,
                    # drop this branch and process all sets the same way.
                    text_a = tokenization.convert_to_unicode(line[0])
                    label = "Address"
                else:
                    text_a = tokenization.convert_to_unicode(line[1])
                    label = tokenization.convert_to_unicode(line[0])
                examples.append(
                    InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
            return examples
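
Because a label that is missing from the list leads to a KeyError, it can be worth checking train.tsv and val.tsv against the label list before starting a long run. This is a minimal sketch: find_unknown_labels and read_tsv are hypothetical helpers that are not part of run_classifier.py, and the sample data is invented. The reader uses csv.reader with a tab delimiter and quotechar=None, which to my understanding matches how the _read_tsv method in run_classifier.py parses the files.

```python
import csv
import io

def read_tsv(handle):
    """Read tab-separated rows without any quote handling."""
    return list(csv.reader(handle, delimiter="\t", quotechar=None))

def find_unknown_labels(rows, known_labels):
    """Return (line_number, label) for every row whose first column is not a known label."""
    known = set(known_labels)
    return [(number, row[0]) for number, row in enumerate(rows, start=1)
            if row and row[0] not in known]

# Invented data: the second row carries a misspelled label.
labels = ["Address", "Age", "Consent"]
sample = io.StringIO("Age\t年齡 18 至 65 歲\nConsnet\t自願簽署知情同意書\n")
print(find_unknown_labels(read_tsv(sample), labels))  # [(2, 'Consnet')]
```

For real files, open them with open(path, encoding="utf-8", newline="") and pass the handle to read_tsv together with the full label list from MyProcessor.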

2) Update the processors dictionary

    def main(_):
        tf.logging.set_verbosity(tf.logging.INFO)

        processors = {
            "cola": ColaProcessor,
            "mnli": MnliProcessor,
            "mrpc": MrpcProcessor,
            "xnli": XnliProcessor,
            "mytask": MyProcessor,  # add your own Processor to the dictionary
        }

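3. Getting to Work

3.1 Configure the training script

Create a file named run.sh with the following content and run it:

    python run_classifier.py \
      --data_dir=data \
      --task_name=mytask \
      --do_train=true \
      --do_eval=true \
      --vocab_file=chinese/vocab.txt \
      --bert_config_file=chinese/bert_config.json \
      --init_checkpoint=chinese/bert_model.ckpt \
      --max_seq_length=128 \
      --train_batch_size=8 \
      --learning_rate=2e-5 \
      --num_train_epochs=3.0 \
      --output_dir=out

Fine-tuning takes some time. With a training set of about 20,000 examples and a validation set of about 8,000, it took roughly 20 minutes on a 2080 Ti GPU. If your GPU does not have enough memory, reduce max_seq_length and train_batch_size accordingly.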
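
To see how these flags translate into training length, the step count can be worked out by hand. This sketch mirrors, as far as I understand it, the arithmetic run_classifier.py uses to derive the number of training and warmup steps; training_schedule is a hypothetical helper name, the warmup proportion of 0.1 is assumed to be the script's default, and 20,000 is the approximate training-set size mentioned above.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (num_train_steps, num_warmup_steps) for the given settings."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps

# About 20,000 training examples, train_batch_size=8, num_train_epochs=3.0
print(training_schedule(20000, 8, 3.0))  # (7500, 750)
```

Halving train_batch_size to fit a smaller GPU doubles the number of steps for the same number of epochs, so the run gets longer even though each step is cheaper.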

3.2 Predict

Create and run test.sh (note: init_checkpoint should point to the model you produced in the previous step):

    python run_classifier.py \
      --task_name=mytask \
      --do_predict=true \
      --data_dir=data \
      --vocab_file=chinese/vocab.txt \
      --bert_config_file=chinese/bert_config.json \
      --init_checkpoint=out \
      --max_seq_length=128 \
      --output_dir=out
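
If you want to point init_checkpoint at one specific checkpoint rather than the whole output directory, the name of the most recent one is recorded in the small text file called checkpoint that TensorFlow writes next to the model files. This is a minimal sketch, assuming that file uses the usual model_checkpoint_path: "..." line; latest_checkpoint_prefix is a hypothetical helper name, and the step numbers in the sample are invented.

```python
import re

def latest_checkpoint_prefix(checkpoint_file_text):
    """Return the model_checkpoint_path value from the text of a 'checkpoint' index file."""
    match = re.search(r'^model_checkpoint_path:\s*"([^"]+)"',
                      checkpoint_file_text, re.MULTILINE)
    return match.group(1) if match else None

# Invented file contents; the step numbers depend on your own run.
sample = (
    'model_checkpoint_path: "model.ckpt-7500"\n'
    'all_model_checkpoint_paths: "model.ckpt-7000"\n'
    'all_model_checkpoint_paths: "model.ckpt-7500"\n'
)
print(latest_checkpoint_prefix(sample))  # model.ckpt-7500
```

With that result, the flag would read --init_checkpoint=out/model.ckpt-7500 for this invented example.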

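When prediction finishes, test_results.tsv is written to the out directory. Each row corresponds to one sample in your test set, and each column holds the probability of one class (in the order of the custom label list defined earlier). For example, row 5, column 8 is the probability that the 5th sample belongs to the 8th class.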

3.3 Process the prediction results

Because the predictions are probabilities, we take the largest value in each row as the prediction and convert it to the corresponding label.

    import pandas as pd

    # Path to the prediction output; adjust it to where your file is
    results_path = "C:\\test_results.tsv"

    labels = ['Addictive Behavior',
              'Address',
              'Age',
              'Alcohol Consumer',
              'Allergy Intolerance',
              'Bedtime',
              'Blood Donation',
              'Capacity',
              'Compliance with Protocol',
              'Consent',
              'Data Accessible',
              'Device',
              'Diagnostic',
              'Diet',
              'Disabilities',
              'Disease',
              'Education',
              'Encounter',
              'Enrollment in other studies',
              'Ethical Audit',
              'Ethnicity',
              'Exercise',
              'Gender',
              'Healthy',
              'Laboratory Examinations',
              'Life Expectancy',
              'Literacy',
              'Multiple',
              'Neoplasm Status',
              'Non-Neoplasm Disease Stage',
              'Nursing',
              'Oral related',
              'Organ or Tissue Status',
              'Pharmaceutical Substance or Drug',
              'Pregnancy-related Activity',
              'Receptor Status',
              'Researcher Decision',
              'Risk Assessment',
              'Sexual related',
              'Sign',
              'Smoking Status',
              'Special Patient Characteristic',
              'Symptom',
              'Therapy or Surgery']

    # Read test_results.tsv with pandas, using the labels as column names
    data_df = pd.read_csv(results_path, sep="\t", names=labels, encoding="utf-8")

    # For each row, take the column name (label) with the highest probability
    label_test = data_df.idxmax(axis=1).tolist()
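
The same post-processing can be done with the standard library alone, which also makes it easy to check that the number of columns matches the label list. This is a minimal sketch: predict_labels is a hypothetical helper name, and the three-class sample rows are invented, using the first three labels from the list only to keep the example short.

```python
import csv
import io

def predict_labels(tsv_file, labels):
    """Map each row of class probabilities to the label with the highest probability."""
    predictions = []
    for row_number, row in enumerate(csv.reader(tsv_file, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(labels):
            raise ValueError("row %d has %d columns, expected %d"
                             % (row_number, len(row), len(labels)))
        probabilities = [float(value) for value in row]
        best_index = max(range(len(labels)), key=probabilities.__getitem__)
        predictions.append(labels[best_index])
    return predictions

# Invented three-class example; real rows have one column per label.
labels = ["Addictive Behavior", "Address", "Age"]
sample = io.StringIO("0.1\t0.7\t0.2\n0.05\t0.05\t0.9\n")
print(predict_labels(sample, labels))  # ['Address', 'Age']
```

The column-count check catches the case where the label list used here has drifted from the one in MyProcessor, which would otherwise silently shift every prediction.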
