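Hugging Face (known in Chinese as 抱抱脸) is a startup headquartered in New York that focuses on natural language processing, artificial intelligence, and distributed systems. Its chatbot technology has long been popular, but the company is better known for its contributions to the open-source NLP community.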

Hugging Face aims to democratize NLP: everyone should be able to use state-of-the-art (SOTA) NLP technology instead of being held back by a lack of training resources.

All Hugging Face models are listed at

https://huggingface.co/models

There you can download the models you need, and you can also upload models that you have fine-tuned for a specific task.


The Hugging Face documentation is at

https://huggingface.co/transformers/master/index.html



Translating between Chinese and English

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Chinese -> English: load the model and its tokenizer, then wrap both in a pipeline
zh2en_model = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-zh-en')
zh2en_tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-zh-en')
zh2en_translation = pipeline('translation_zh_to_en',
                             model=zh2en_model,
                             tokenizer=zh2en_tokenizer)
zh2en_translation('Python是一门非常强大的编程语言!')
[{'translation_text': 'Python is a very powerful programming language!'}]

# English -> Chinese: same steps with the reverse-direction model
en2zh_model = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
en2zh_tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-zh')

en2zh_translation = pipeline('translation_en_to_zh',
                             model=en2zh_model,
                             tokenizer=en2zh_tokenizer)
en2zh_translation('Python is a very powerful programming language!')
[{'translation_text': 'Python是一个非常强大的编程语言!'}]
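Both pipelines above return a list of dicts keyed by 'translation_text'. The sketch below shows one way to unwrap that shape into plain strings and to chain the two directions into a round trip. The helper names translate_texts and round_trip are made up for this illustration and are not part of transformers; the code also assumes that a pipeline called with a list of strings returns one dict per input, which is worth checking against the transformers version you have installed.

```python
from typing import Callable, Dict, List

# A translator is anything that maps a list of strings to a list of
# {'translation_text': ...} dicts, i.e. the output shape shown above.
Translator = Callable[[List[str]], List[Dict[str, str]]]


def translate_texts(translator: Translator, texts: List[str]) -> List[str]:
    """Translate a batch and return plain strings instead of dicts."""
    results = translator(texts)
    return [item['translation_text'] for item in results]


def round_trip(forward: Translator, backward: Translator, text: str) -> str:
    """Translate `text` with `forward`, then translate the result back."""
    intermediate = translate_texts(forward, [text])[0]
    return translate_texts(backward, [intermediate])[0]
```

With the objects defined earlier, round_trip(zh2en_translation, en2zh_translation, 'Python是一门非常强大的编程语言!') would send the sentence to English and back, which is a quick way to eyeball how much meaning survives both models.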



Text classification

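The model uer/roberta-base-finetuned-chinanews-chinese belongs to a family of Chinese RoBERTa-Base classifiers from UER that were fine-tuned on five Chinese text classification datasets. As its name indicates, this checkpoint is the one fine-tuned on Chinanews, so it predicts news topics.

  • JD full, JD binary, and Dianping contain user reviews with different sentiment polarities.
  • Ifeng and Chinanews contain news texts from different topic classes.
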
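from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
# 'sentiment-analysis' is an alias of the 'text-classification' task;
# the labels returned here are news topics, not sentiments.
text_classification = pipeline('sentiment-analysis',
                               model=model,
                               tokenizer=tokenizer)
test_text = "上证指数大涨2%"

# return_all_scores=True returns the score of every class, not just the best one
# (newer transformers releases deprecate it in favor of top_k=None).
text_classification(test_text, return_all_scores=True)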
[[{'label': 'mainland China politics', 'score': 0.0002807585697155446},
  {'label': 'Hong Kong - Macau politics', 'score': 0.00015504546172451228},
  {'label': 'International news', 'score': 6.818029214628041e-05},
  {'label': 'financial news', 'score': 0.9991051554679871},
  {'label': 'culture', 'score': 0.00011297615128569305},
  {'label': 'entertainment', 'score': 0.00012184812658233568},
  {'label': 'sports', 'score': 0.0001558474759804085}]]
test_text = "Python是一门强大的编程语言"
text_classification(test_text, return_all_scores=True)
[[{'label': 'mainland China politics', 'score': 0.02050291746854782},
  {'label': 'Hong Kong - Macau politics', 'score': 0.0030984438490122557},
  {'label': 'International news', 'score': 0.005687597207725048},
  {'label': 'financial news', 'score': 0.03360358253121376},
  {'label': 'culture', 'score': 0.913349986076355},
  {'label': 'entertainment', 'score': 0.010810119099915028},
  {'label': 'sports', 'score': 0.012947351671755314}]]
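With return_all_scores=True the pipeline returns one inner list per input text, holding a label and score for every class. The sketch below picks the winning class out of that structure. The helper top_label and its min_score cutoff are hypothetical additions for illustration, not part of the pipeline API; the sample data is copied from the first output above.

```python
from typing import Dict, List, Optional, Tuple, Union

Score = Dict[str, Union[str, float]]


def top_label(scores: List[Score], min_score: float = 0.0) -> Optional[Tuple[str, float]]:
    """Return (label, score) for the highest-scoring class.

    Returns None when the best score is below `min_score`
    (a hypothetical cutoff, not something the pipeline provides).
    """
    best = max(scores, key=lambda item: item['score'])
    if best['score'] < min_score:
        return None
    return best['label'], best['score']


# Scores copied from the "上证指数大涨2%" example above (one inner list per input text).
sample_output = [[
    {'label': 'mainland China politics', 'score': 0.0002807585697155446},
    {'label': 'Hong Kong - Macau politics', 'score': 0.00015504546172451228},
    {'label': 'International news', 'score': 6.818029214628041e-05},
    {'label': 'financial news', 'score': 0.9991051554679871},
    {'label': 'culture', 'score': 0.00011297615128569305},
    {'label': 'entertainment', 'score': 0.00012184812658233568},
    {'label': 'sports', 'score': 0.0001558474759804085},
]]

label, score = top_label(sample_output[0])
print(label)  # financial news
```

A cutoff can be useful for inputs outside the model's domain: the second example above is a sentence about a programming language, yet the model still has to choose one of its seven news topics and lands on 'culture' with a score of about 0.91.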



Code download

https://github.com/hidadeng/DaDengAndHisPython/tree/master/20211108HuggingFace学习

