Introduction

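Chinese EmoBank (中文情绪银行) is a manually annotated Chinese dimensional sentiment lexicon with two dimensions, valence and arousal.

  • Valence measures how positive or negative the sentiment expressed in a text is.
  • Arousal measures how calm or excited the text is.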

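The resource consists of

  • CVAW (Chinese valence-arousal words): 5512 words
  • CVAP (Chinese valence-arousal phrases): 2998 phrases
  • CVAS (Chinese valence-arousal sentences): a corpus of 2582 single sentences
  • CVAT (Chinese valence-arousal texts): a corpus of 2969 sentences

Note that the lexicon is written in Traditional Chinese. After conversion from Traditional to Simplified Chinese, CVAW has been built into the latest cntext package.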

pip3 install --upgrade cntext 

The sample tables in this article come from

http://nlp.innobic.yzu.edu.tw/resources/ChineseEmoBank.html


CVAW(Chinese valence-arousal words)

Word Valence_Mean Arousal_Mean Valence_SD Arousal_SD
乏味 3.4 3.0 0.800 1.414
放鬆 6.2 2.0 0.748 0.894
勝利 7.8 7.2 0.748 1.166
痛苦 2.4 6.8 0.490 0.748
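
As a rough illustration of how the two dimensions combine, the sketch below places the four sample words above into valence-arousal quadrants. It assumes the ratings lie on a 1-9 scale with 5 as the neutral midpoint; the sample table itself does not state the scale, so `MIDPOINT` and the helper `quadrant` are illustrative assumptions and not part of cntext.

```python
# Sample rows copied from the CVAW table above: word -> (valence mean, arousal mean).
CVAW_SAMPLE = {
    "乏味": (3.4, 3.0),
    "放鬆": (6.2, 2.0),
    "勝利": (7.8, 7.2),
    "痛苦": (2.4, 6.8),
}

# Assumption: ratings run from 1 to 9, so 5 is treated as the neutral point.
MIDPOINT = 5.0


def quadrant(valence, arousal, midpoint=MIDPOINT):
    """Return a coarse (polarity, activation) label for one rating pair.

    Values exactly at the midpoint fall on the negative/calm side in this sketch.
    """
    polarity = "positive" if valence > midpoint else "negative"
    activation = "excited" if arousal > midpoint else "calm"
    return polarity, activation


for word, (valence, arousal) in CVAW_SAMPLE.items():
    print(word, quadrant(valence, arousal))
```

Under this reading, 勝利 is positive and excited, 放鬆 is positive and calm, 痛苦 is negative and excited, and 乏味 is negative and calm.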

CVAP(Chinese valence-arousal phrases)

Modifier_Type Phrase Valence_Mean Arousal_Mean Valence_SD Arousal_SD
deg 十分有趣 8.222 7.063 0.533 0.390
mod 應該開心 5.986 5.350 0.242 0.456
neg 不喜歡 3.033 5.788 0.481 0.605
neg_deg 沒有太難過 4.478 4.675 0.413 0.538

CVAS(Chinese valence-arousal sentences)

Text Valence_Mean Arousal_Mean Valence_SD Arousal_SD
這是我觀賞過的最令人驚歎的演出。 7.000 7.750 0.000 0.433
簡直是人生惡夢的開端。 2.600 6.750 0.490 0.829
從小我經常覺得現實很無聊。 3.667 4.333 0.471 0.471
過去他們很輕鬆地賺錢。 5.667 4.000 1.247 0.816

CVAT(Chinese valence-arousal texts)

Text Valence_Mean Arousal_Mean Valence_SD Arousal_SD Category
很多車主抱怨新車怠速抖動嚴重---冷車時更嚴重。 3.250 5.667 1.090 1.054 Car
房間裏黴味,煙味撲鼻,沒有窗戶通風,骯髒的地毯上的斑斑點點的污蹟,令人觸目驚心。 1.889 6.875 0.737 0.927 Hotel
CPU顯卡也完全夠用,接口也非常齊全,總體來說很滿意! 7.143 5.000 0.350 0.816 Laptop
飛安帶來更多保障,也提供旅客更安心的服務品質。 7.000 4.222 0.535 1.133 News

References

If you use the Chinese EmoBank lexicon, please cite the source.

Lung-Hao Lee, Jian-Hong Li and Liang-Chih Yu, “Chinese EmoBank: Building Valence-Arousal Resources for Dimensional Sentiment Analysis,” ACM Trans. Asian and Low-Resource Language Information Processing, vol. 21, no. 4, article 65, 2022.

Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016. “Building Chinese affective resources in valence-arousal dimensions.” In Proceedings of NAACL/HLT-16, pages 540-545.


Code

import cntext as ct

ct.load_pkl_dict('ChineseEmoBank.pkl')

Run

{'Referer-1': 'Lee, Lung-Hao, Jian-Hong Li, and Liang-Chih Yu. "Chinese EmoBank: Building Valence-Arousal Resources for Dimensional Sentiment Analysis." Transactions on Asian and Low-Resource Language Information Processing 21, no. 4 (2022): 1-18.',
 
 'Referer-2': 'Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016. "Building Chinese affective resources in valence-arousal dimensions. In Proceedings of NAACL/HLT-16, pages 540-545.',
 
 'Desc': 'Chinese Sentiment Dictionary, includes 「valence」「arousal」. In cntext, we only take single word into account, ignore phrase.',
 
 'ChineseEmoBank':       word  valence  arousal
 0     不可思议      5.4      7.2
 1       不平      3.6      5.8
 2       不甘      3.2      6.4
 3       不安      3.8      5.4
 4       不利      3.6      5.6
 ...    ...      ...      ...
 5505    黏闷      2.8      5.6
 5506    黏腻      2.7      5.8
 5507    艳丽      5.8      4.5
 5508    苗条      6.7      3.8
 5509    修长      7.0      4.5

The CVAW lexicon (Chinese valence-arousal words) in Chinese EmoBank originally contains 5512 words; after conversion from Traditional to Simplified Chinese, 5510 words remain.
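
One plausible reason for the drop from 5512 to 5510 entries (an assumption, not something the source confirms) is that distinct Traditional spellings can collapse into a single Simplified form. The sketch below shows the effect on three hypothetical words that are not claimed to be CVAW entries; the one-character map `T2S` stands in for a real Traditional-to-Simplified converter.

```python
# Hypothetical toy map: a stand-in for a real Traditional-to-Simplified converter.
# It covers only the single character needed for this illustration.
T2S = {"瞭": "了"}


def to_simplified(word):
    """Convert a word character by character using the toy map."""
    return "".join(T2S.get(char, char) for char in word)


# Hypothetical entries, NOT taken from CVAW: two spellings of the same word plus one other.
traditional_words = ["瞭解", "了解", "不安"]

simplified_words = [to_simplified(word) for word in traditional_words]

# dict.fromkeys drops duplicates while keeping the first-seen order.
unique_simplified = list(dict.fromkeys(simplified_words))

print(len(traditional_words), "entries before conversion")
print(len(unique_simplified), "entries after conversion and de-duplication")
```

Here three Traditional entries become two Simplified ones, the same kind of shrinkage as the two-word difference reported above.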

diction_df = ct.load_pkl_dict('ChineseEmoBank.pkl')['ChineseEmoBank']
diction_df

Run


Measure the valence and arousal of a piece of text:

text = '很多车主抱怨新车怠速抖动严重---冷车时更严重。'

help(ct.sentiment_by_weight)

Run

Help on function sentiment_by_weight in module cntext.stats:

sentiment_by_weight(text, diction, params, lang='english')
    calculate the occurrences of each sentiment category words in text;
    the complex influence of intensity adverbs and negative words on emotion is not considered.
    :param text:  text sring
    :param diction:  sentiment dictionary dataframe with weight.;
    :param params:  set sentiment category weight, such as params=['valence', 'arousal']
    :param lang: "chinese" or "english"; default lang="english"
    
    :return:

Compute the summed scores, on both dimensions, of the Chinese EmoBank words found in text. The result contains valence, arousal, and word_num.

text = '很多车主抱怨新车怠速抖动严重---冷车时更严重。'

ct.sentiment_by_weight(text = text, 
                       diction = diction_df,
                       params = ['valence', 'arousal'],
                       lang = 'chinese')

Run

{'valence': 14.8, 
'arousal': 24.8, 
'word_num': 13}
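  • valence is the sum of the valence scores of all Chinese EmoBank words in the sentence.
  • arousal is the sum of the arousal scores of all Chinese EmoBank words in the sentence.
  • word_num is the number of tokens in the sentence, punctuation included. For short texts word_num is rather inaccurate; for long texts it comes very close to the true word count.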
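
The three quantities described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not cntext's actual implementation: the helper `sum_scores` is hypothetical, the input is assumed to be already segmented into tokens, and the two lexicon entries are copied from the lexicon preview shown earlier.

```python
# Lexicon entries copied from the cntext output shown earlier: word -> scores.
LEXICON = {
    "不安": {"valence": 3.8, "arousal": 5.4},
    "不利": {"valence": 3.6, "arousal": 5.6},
}


def sum_scores(tokens, lexicon, params=("valence", "arousal")):
    """Sum each requested score over the tokens found in the lexicon."""
    totals = {param: 0.0 for param in params}
    for token in tokens:
        entry = lexicon.get(token)
        if entry is None:
            continue  # tokens outside the lexicon add nothing to the sums
        for param in params:
            totals[param] += entry[param]
    # Round away floating-point noise from repeated addition.
    result = {param: round(total, 3) for param, total in totals.items()}
    # Every token counts toward word_num, punctuation included.
    result["word_num"] = len(tokens)
    return result


# Hypothetical, already-segmented sentence; real text needs a word segmenter first.
tokens = ["形势", "不利", ",", "令人", "不安", "。"]
print(sum_scores(tokens, LEXICON))
```

Only two of the six tokens are in the lexicon, so the sums come from 不利 and 不安 alone, while word_num still counts all six tokens, the two punctuation marks included.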

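Note that valence and arousal are sums, so they tend to grow as the text gets longer. When using these two indicators, divide by word_num to obtain means:

Valence = valence/word_num

Arousal = arousal/word_num

sentiment_by_weight does not apply this averaging itself, so that as much of the raw information in the text as possible is preserved.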
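
The averaging step can be applied to the returned dictionary afterwards. The sketch below uses the result values shown above; the helper name `per_word_means` and the choice to return None for an empty text are assumptions made for illustration.

```python
# Result dictionary copied from the sentiment_by_weight output above.
result = {"valence": 14.8, "arousal": 24.8, "word_num": 13}


def per_word_means(result):
    """Divide the summed scores by word_num; return None values for empty text."""
    word_num = result["word_num"]
    if word_num == 0:
        return {"Valence": None, "Arousal": None}
    return {
        "Valence": result["valence"] / word_num,
        "Arousal": result["arousal"] / word_num,
    }


means = per_word_means(result)
print({name: round(value, 3) for name, value in means.items()})
```

For this sentence the means are about 1.138 and 1.908. They sit well below the scores of individual lexicon words because word_num counts every token, including those that are not in the lexicon.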


