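textstat computes readability metrics for text and supports English, German, Spanish, Italian, Dutch, and other languages. It does not currently support Chinese; for Chinese text analysis, consider the cntext package.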

https://github.com/textstat/textstat


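Task

This article covers three points.

  1. Read a csv data file
  2. Select a text column from the csv and use the apply method on it to compute readability metrics (Fog and Flesch here; textstat also provides ARI, CLI, and others)
  3. Combine the two readability metrics into a mean value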


Installation

pip3 install textstat


Reading the data

Download the experiment data file data.csv.

import pandas as pd
import textstat
# Set the maximum column width used when displaying a DataFrame
pd.options.display.max_colwidth = 50

df = pd.read_csv('data.csv')
df
##                                                  doc
## 0  Playing games has always been thought to be im...
## 1  the development of well-balanced and creative ...
## 2  however, what part, if any, they should play i...
## 3  of adults has never been researched that deepl...
## 4  that playing games is every bit as important f...
## 5  as for children. Not only is taking time out t...
## 6  with our children and other adults valuable to...
## 7  interpersonal relationships but is also a wond...
## 8                       to release built up tension.
## 9  The language will be used for syllable calcula...
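
If data.csv is not at hand, the same reading step can be tried on an in-memory CSV. This is only a minimal sketch: the two rows are placeholders invented for illustration, and the only thing taken from the real file is the column name doc.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: a single text column named 'doc'.
# The rows are placeholders, not the contents of the real file.
csv_text = """doc
"Playing games is fun, and it is good for you."
The cat sat on the mat.
"""

# read_csv accepts a file-like object, so StringIO can replace a file path
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.shape)
print(demo_df['doc'].tolist())
```

The first row is quoted because it contains a comma; without the quotes read_csv would split it into two fields.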


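Batch operations on a Series

Use the apply method to run a batch operation over data of type pd.Series.

The textstat library offers many readability methods; here we pick two of them as the batch functions.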
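
As a minimal sketch of how apply works, the snippet below runs a small function over every element of a Series. The helper word_count is hypothetical and exists only for this illustration; any function that takes one string and returns one value can be passed in the same way.

```python
import pandas as pd

def word_count(text):
    # Hypothetical helper: count whitespace-separated tokens
    return len(text.split())

docs = pd.Series(["to release built up tension.", "Playing games"])

# apply calls word_count once per element and returns a new Series
counts = docs.apply(word_count)
print(counts.tolist())
```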

  • Fog: textstat.gunning_fog(text)
  • Flesch: textstat.flesch_reading_ease(text)

df['Fog'] = df['doc'].apply(textstat.gunning_fog)
df['Flesch'] = df['doc'].apply(textstat.flesch_reading_ease)

df.head()
##                                                  doc   Fog  Flesch
## 0  Playing games has always been thought to be im...  4.00   78.25
## 1  the development of well-balanced and creative ...  8.51   30.53
## 2  however, what part, if any, they should play i...  4.40   94.15
## 3  of adults has never been researched that deepl...  4.00   86.71
## 4  that playing games is every bit as important f...  4.00   78.25
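
For intuition about what these two scores measure, here is a sketch of the standard textbook formulas, written as functions of raw counts. The function names are hypothetical, and textstat's own rules for counting sentences, syllables, and complex words may differ, so these values are not expected to match the table above.

```python
def flesch_from_counts(words, sentences, syllables):
    # Flesch Reading Ease: higher scores mean easier text
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def fog_from_counts(words, sentences, complex_words):
    # Gunning Fog: roughly a school grade level, so higher means harder.
    # complex_words is the number of words with three or more syllables.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Made-up counts for illustration: 100 words in 5 sentences
print(round(flesch_from_counts(100, 5, 150), 3))
print(round(fog_from_counts(100, 5, 10), 3))
```

Note that the two scales point in opposite directions: a harder text gets a higher Fog score but a lower Flesch score.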


DataFrame mean

Select the Fog and Flesch columns.

# Check the data type of df[['Fog', 'Flesch']]
type(df[['Fog', 'Flesch']])
## <class 'pandas.core.frame.DataFrame'>
# Take the row-wise (horizontal) mean of the two metrics
df['Mean'] = df[['Fog', 'Flesch']].mean(axis=1)
df.head()
##                                                  doc   Fog  Flesch    Mean
## 0  Playing games has always been thought to be im...  4.00   78.25  41.125
## 1  the development of well-balanced and creative ...  8.51   30.53  19.520
## 2  however, what part, if any, they should play i...  4.40   94.15  49.275
## 3  of adults has never been researched that deepl...  4.00   86.71  45.355
## 4  that playing games is every bit as important f...  4.00   78.25  41.125
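
Fog and Flesch use different scales and point in opposite directions (higher Fog means harder text, higher Flesch means easier text), so a plain mean of the raw values is dominated by Flesch. One possible alternative, sketched here under that assumption, is to standardize each column and flip Flesch before averaging. The column name Difficulty is hypothetical, and the numbers are copied from the df.head() output above.

```python
import pandas as pd

# Fog and Flesch values copied from the df.head() output above
scores = pd.DataFrame({
    'Fog':    [4.00, 8.51, 4.40, 4.00, 4.00],
    'Flesch': [78.25, 30.53, 94.15, 86.71, 78.25],
})

# z-score each column so both are on a comparable scale
z = (scores - scores.mean()) / scores.std()

# Flip Flesch so that, like Fog, a larger value means harder text
z['Flesch'] = -z['Flesch']

# Hypothetical combined score: row-wise mean of the standardized columns
scores['Difficulty'] = z.mean(axis=1)
print(scores)
```

With only five rows the z-scores are rough, but row 1 still stands out as the hardest text on both metrics.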


Saving

Save the result to 可读性.csv.

df.to_csv('可读性.csv', index=False)
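
To check what index=False does, a small round trip can help. This is a sketch only: it writes a one-row table (the first row of the result above) to a temporary directory under a hypothetical file name and reads it back.

```python
import os
import tempfile
import pandas as pd

# First row of the result table above, used as a tiny example
out = pd.DataFrame({'Fog': [4.00], 'Flesch': [78.25], 'Mean': [41.125]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'readability_demo.csv')  # hypothetical file name
    # index=False keeps the row index out of the file
    out.to_csv(path, index=False)
    back = pd.read_csv(path)

print(list(back.columns))
```

Without index=False the file would gain an extra unnamed first column holding the row index.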