Original post: https://www.miriamheiss.com/posts/graphing-with-ggplot/

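The ggplot() function is essential for any data scientist, and it is a straightforward way to build plots. It can look intimidating at first, but don't worry: once you have learned the basics, everything becomes clear. Let's get started!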

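Setup

First, load the packages used in this post. The tidyverse bundles several packages, one of which is ggplot2. The primer.data package provides more datasets than those built into R, and showtext is used to render Chinese text.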

library(ggplot2)
library(primer.data) # provides the data
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb
showtext_auto()  # enable showtext so Chinese text renders

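The canvas: ggplot

Painting needs a canvas, and plotting data is no different. After loading the packages, call ggplot() to create one. No data has been supplied yet, so the canvas is empty.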

ggplot()

We will use the nhanes dataset. Pass it in with the data argument:

ggplot(data=nhanes)

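The canvas still looks blank, but don't worry. Think of image editors such as Photoshop, where an edited picture is a stack of layers. We are still on the bottom ggplot layer. Inside ggplot(), add the mapping=aes() argument, which will hold the x-axis, y-axis, and color settings.

Aesthetic mappings connect columns of the data to visual properties of the plot.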

ggplot(data=nhanes,
       mapping=aes())

Now the plot is about to change. We choose which columns map to the x-axis, the y-axis, and color.

  • x-axis: height
  • y-axis: weight
  • color: gender
ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))

From here on we add higher-level layers, and the plot will show more and more information.


Adding a geom

Now add a geom layer (geom is short for geometric). It is stacked on top of the ggplot layer with +. Here we use geom_point() to draw a scatter plot.

ggplot(data=nhanes,
       mapping = aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point()
## Warning: Removed 366 rows containing missing values (geom_point).
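
The warning above means that geom_point() dropped 366 rows in which height or weight is missing (NA). As a minimal sketch of what is going on, the example below uses a small made-up data frame (toy, which is not part of nhanes) to count such rows and to drop them silently with na.rm = TRUE.

```r
library(ggplot2)

# Hypothetical toy data standing in for nhanes; two rows have a missing value
toy <- data.frame(
  height = c(150, 160, NA, 175, 180),
  weight = c(50, 58, 70, NA, 82),
  gender = c("Female", "Female", "Male", "Male", "Male")
)

# Count the rows that geom_point() would drop (NA in height or weight)
n_dropped <- sum(!complete.cases(toy[, c("height", "weight")]))
n_dropped

# na.rm = TRUE drops the same rows without issuing the warning
p <- ggplot(toy, aes(x = height, y = weight, color = gender)) +
  geom_point(na.rm = TRUE)

# print(p) draws the plot
```

The plot itself does not change; only the warning goes away, so it is worth checking the count before silencing it.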

Wow! A good start, but the points overlap quite heavily. Set the point size (size) and transparency (alpha) to reduce the overplotting.

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.3, size=0.5)
## Warning: Removed 366 rows containing missing values (geom_point).

Much better! But can we draw separate scatter plots for men and women?


Faceting

Next, add the faceting function facet_wrap(). It produces one panel for men and one for women.

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.3, size=0.5)+
  facet_wrap(~gender)
## Warning: Removed 366 rows containing missing values (geom_point).

Now we have two panels, one per gender.
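
With two groups, facet_wrap() places the panels side by side by default. The sketch below, which again uses a made-up data frame (toy) rather than nhanes, shows how the ncol argument could stack the panels vertically instead.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as nhanes
toy <- data.frame(
  height = c(150, 160, 165, 170, 175, 180),
  weight = c(50, 58, 62, 70, 76, 82),
  gender = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# ncol = 1 puts every panel in a single column, one panel per gender
p <- ggplot(toy, aes(x = height, y = weight, color = gender)) +
  geom_point() +
  facet_wrap(~gender, ncol = 1)

# print(p) draws the plot
```

A stacked layout keeps the x-axes aligned, which can make it easier to compare heights between the two panels.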


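Adding a second geom

Now we want a trend line, which geom_smooth() provides. Since geom_smooth() and geom_point() are both geom layers, it makes sense to keep them next to each other, ahead of facet_wrap(). To make the trend line stand out, make the points more transparent, for example alpha=0.1.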

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth()+
  facet_wrap(~gender)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).

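Now we want a smoother trend line. In geom_smooth() we set method="loess" (local regression) in place of the gam method reported in the message above. formula=y~x says that y is modeled as a function of x.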

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  facet_wrap(~gender)
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).
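
To make the role of formula = y ~ x more concrete, here is a small sketch with method = "lm", which fits a straight line instead of a loess curve. The data frame toy is made up and deliberately noise-free (weight = 0.9 * height - 90), so the fitted slope is easy to check; none of this comes from nhanes.

```r
library(ggplot2)

# Hypothetical, noise-free toy data: weight = 0.9 * height - 90
toy <- data.frame(height = seq(150, 190, by = 5))
toy$weight <- 0.9 * toy$height - 90

# The straight line that method = "lm" fits for formula = y ~ x
fit <- lm(weight ~ height, data = toy)
slope <- unname(coef(fit)["height"])
slope

# The same model drawn as a layer; se = FALSE hides the confidence band
p <- ggplot(toy, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# print(p) draws the plot
```

Inside geom_smooth() the formula is always written in terms of y and x, the aesthetics, not the column names weight and height.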


Labels: labs

Now use labs() to add a layer of labels, such as title, subtitle, caption, the x and y axis labels, and the legend title.

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  facet_wrap(~gender)+
  labs(title = "Heights in the U.S.",
       subtitle = "On average, men weigh more and are taller than women")
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).

The plot now has a title and subtitle, but the axis labels lack units, which is not ideal. Change them to Height (cm) and Weight (kg).

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  facet_wrap(~gender)+
  labs(title = "Heights in the U.S.",
       subtitle = "On average, men weigh more and are taller than women",
       x="Height (cm)",
       y="Weight (kg)")
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).

Awesome! But gender in the legend is still lowercase, and I would like it capitalized. We know that x, y, and color map to height, weight, and gender, so to rename gender we set color.

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  facet_wrap(~gender)+
  labs(title = "Heights in the U.S.",
       subtitle = "On average, men weigh more and are taller than women",
       x="Height (cm)",
       y="Weight (kg)",
       color="Gender")
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).

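Anyone who sees this plot will wonder where the data came from. We should tell readers that the nhanes dataset comes from the National Health and Nutrition Examination Survey, which we do by setting the caption argument of labs().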

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  facet_wrap(~gender)+
  labs(title = "Heights in the U.S.",
       subtitle = "On average, men weigh more and are taller than women",
       x="Height (cm)",
       y="Weight (kg)",
       color="Gender",
       caption="Source: National Health and Nutrition Examination Survey")
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).


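Changing the colors

The plot is fairly complete, but the point colors in the geom layers may not be to your taste. How do we set them?

Because we are changing the colors of the geom layers, this layer goes directly after them. Use scale_color_manual(): its values argument accepts hexadecimal color strings as well as color names.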

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  scale_color_manual(values=c("magenta", "blue"))+
  facet_wrap(~gender)+
  labs(title = "Heights in the U.S.",
       subtitle = "On average, men weigh more and are taller than women",
       x="Height (cm)",
       y="Weight (kg)",
       color="Gender",
       caption="Source: National Health and Nutrition Examination Survey")
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).
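
As a sketch of the hexadecimal form, the example below passes a named vector to scale_color_manual() so that each group is tied to a color explicitly instead of by position. The data frame toy and the labels "Female" and "Male" are assumptions made for illustration; check the actual values of gender in nhanes before reusing the names.

```r
library(ggplot2)

# Hypothetical toy data; the gender labels are assumed, not taken from nhanes
toy <- data.frame(
  height = c(150, 160, 175, 180),
  weight = c(50, 58, 76, 82),
  gender = c("Female", "Female", "Male", "Male")
)

# Names are the group labels, values are hex colors
# "#FF00FF" is magenta and "#0000FF" is blue
gender_colors <- c(Female = "#FF00FF", Male = "#0000FF")

p <- ggplot(toy, aes(x = height, y = weight, color = gender)) +
  geom_point() +
  scale_color_manual(values = gender_colors)

# print(p) draws the plot
```

With an unnamed vector such as c("magenta", "blue"), the colors are matched to the groups in order, so a named vector is a little safer if the groups change.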


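Setting the theme

ggplot2 includes several built-in themes, which are added to a plot with +. For example: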

  • theme_bw()
  • theme_dark()
  • theme_gray()
  • theme_light()
  • theme_minimal()
g <- ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  scale_color_manual(values=c("magenta", "blue"))+
  facet_wrap(~gender)+
  labs(title = "Heights in the U.S.",
       subtitle = "On average, men weigh more and are taller than women",
       x="Height (cm)",
       y="Weight (kg)",
       color="Gender",
       caption="Source: National Health and Nutrition Examination Survey")
  
g + theme_minimal()
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).


Saving

ggsave(
  filename = "scatter.png",
  plot = g,
  width = 10,
  height = 8,
  dpi = 100,
  device = "png"
)
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).
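
In the ggsave() call above, width and height are in inches (the default unit) and dpi is dots per inch, so the pixel size of the PNG follows from a simple multiplication. A small worked sketch, with variable names chosen here for illustration:

```r
# Values copied from the ggsave() call above
width_in  <- 10   # inches (the default unit of ggsave)
height_in <- 8    # inches
dpi       <- 100  # dots per inch

# Pixel size of the saved PNG = inches * dpi
pixels <- c(width = width_in * dpi, height = height_in * dpi)
pixels
```

So scatter.png should come out at 1000 by 800 pixels; raising dpi to 300 with the same width and height would give 3000 by 2400.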


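Chinese text

By default, ggplot2 may not render Chinese characters correctly. To display them, use the showtext package.

library(ggplot2)
library(primer.data) # provides the data
library(showtext) # adds support for Chinese text
showtext_auto()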

ggplot(data=nhanes,
       mapping=aes(
         x=height,
         y=weight,
         color=gender))+
  geom_point(alpha=0.1, size=0.5)+
  geom_smooth(method="loess", formula=y~x)+
  scale_color_manual(values=c("magenta", "blue"))+
  facet_wrap(~gender)+
  labs(title = "美国身高",
       subtitle = "平均而言,男性群体的身高会高于女性群体",
       x="身高(cm)",
       y="体重(kg)",
       color="性别",
       caption="数据源: National Health and Nutrition Examination Survey")
## Warning: Removed 366 rows containing non-finite values (stat_smooth).
## Warning: Removed 366 rows containing missing values (geom_point).