本文无意引发论战,不管是
matplotlib
、ggplot2
,还是 base plot 我都没有偏见,我初接触 R 时,ggplot2
还不火,到处都是 base plot 的教程,后来 ggplot 火了,我也跟着去ggplot
了,见证了 ggplot 的大的升级,再后来tidyverse
特别火,我也很迷恋了一阵。我也经常使用 tidyverse 的包,只是近特别喜欢 base 的语法而已。语言是工具,哪个工具用好了,都能是大神,例如同样的太祖长拳,在乔峰手里就是不一样了。本文就是针对工具的学习。
看着容易实现起来就难了,有些图懒得做的和下面链接的完全一样(主要是觉得设定不符合我的审美,大部分细节能接受的我都改的一样了),有些是改起来费劲,我就不想浪费太多时间了,有些图用 base 的软件包太费劲了,不值得为无关紧要的细节再添加太多代码,我已经为像原作者的图大动干戈了,于是就用的基于 base 语法的软件包。
年初有篇很火的 matplotlib 做了 50 幅图的帖子:https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/ 被人翻译了中文,写的很棒,不知为什么,近又在微信里看到了推送,恰逢我近很迷恋 R base plot 的语法,于是准备不自量力,暂时中断假设检验,来写一篇 base plot 版本的 50 幅图。其中有的图用现成的 ggplot 会非常容易,我不反对使用,只是近在强制练习 base plot。
由于 github 连接不怎么稳定的原因,我把所有的数据都下载到本地读取。因为 RColorBrewer
在颜色的细节上做的确实不错,我们在一些图形中使用它的颜色来显示。
library(RColorBrewer)
library(data.table)
library(scales)
Correlation
Scatterplot
这幅图很简单,就是按照 category 这个分组数据来作图,使用不同的颜色。
midwest <- read.csv("data/midwest_filter.csv")
# the same no. will use same colors
# R 4.0 stringasfactors = false
cat_level <- unique(midwest$category)
col_n <- as.numeric(as.factor(midwest$category))
uni_col <- length(unique(midwest$category))
cat_col <- colorRampPalette(brewer.pal(8, "Set1"))(uni_col)
plot(
midwest$area,
midwest$poptotal,
pch = 20,
col = cat_col[col_n],
xlab = 'Area',
ylab = 'Population',
xlim = c(0.0, 0.1),
ylim = c(, 90000),
axes = FALSE,
frame.plot = TRUE,
main = "Scatterplot of Midwest Area vs Population"
)
axis(1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)
axis(2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)
legend("topright", legend = sort(cat_level), pch = 20, col = cat_col, bty = "n")
试了几个颜色,感觉使用这个颜色的划分,区分度还高一些。
Bubble plot with Encircling
这个图本身没什么特别的,只是多了一个边界,将某些重要的点圈起来,数据还是使用的上面的。看了一下, legend 中应该没有特别的只改图形大小,而不改变字符的控制,而且既然每个分组大小并不一致(dot_size并为分组),似乎大小不一致也不合理,看数据点有重叠,因此将使用的颜色的 alpha 通道设置了一下,略微透明,应该更清晰。
# use scale to make alpha transparency
# because of the overlap
col_alpha <- alpha(cat_col, 0.5)
cat_size <- midwest$dot_size / 100
# data for border
poly_data <- midwest[which(midwest$state == 'IN'),]
circle_data <- function(x, y){
index <- chull(x, y)
circle_frame <- data.frame(xpoint = x[index], ypoint = y[index])
}
# only get the convex hull data points
# polygonpath do the plot
encircle_data <- circle_data(poly_data$area, poly_data$poptotal)
plot(
midwest$area,
midwest$poptotal,
pch = 20,
col = col_alpha[col_n],
cex = cat_size,
xlab = 'Area',
ylab = 'Population',
xlim = c(0.0, 0.1),
ylim = c(, 90000),
axes = FALSE,
frame.plot = TRUE,
main = "Scatterplot of Midwest Area vs Population"
)
polypath(encircle_data$xpoint, encircle_data$ypoint, border = "red", lwd = 3)
axis(
1,
lwd = 1,
las = ,
col = 'gray',
tick = FALSE,
cex.axis = 0.8,
)
axis(
2,
lwd = 1,
las = 2,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
legend(
"topright",
legend = sort(cat_level),
pch = 20,
col = col_alpha,
bty = "n"
)
Scatter plot with linear regression line of best fit
这个图没什么特别的,只不过是使用 ggplot2
会秒杀所有的方法。一点一点的从头开始使用 base plot 作图,可以利用 predict.lm
来求置信区间,其实我觉得使用的 displ 范围应该是数据的范围,但是看到的所有的作图上,都是使用了更广的范围,作为参考的区间无可厚非,作为预测来讲,当然不能随便扩大模型的范围。
mpg <- data.table(ggplot2::mpg)
cyl4 <- mpg[cyl == 4]
cyl8 <- mpg[cyl == 8]
fit_cyl4 <- lm(data = cyl4, hwy ~ displ)
fit_cyl8 <- lm(data = cyl8, hwy ~ displ)
new_df <- data.frame(displ = seq(, 8, by = 0.1))
#newdata must be a data frame contain the name of displ
cy4_bound <-
predict(fit_cyl4, newdata = new_df, interval = "confidence")
cy8_bound <-
predict(fit_cyl8, newdata = new_df, interval = "confidence")
# make the alpha of confidence interval
bluea = alpha("blue", 0.3)
orangea = alpha("orange", 0.3)
plot.new()
box()
plot.window(
xlim = c(, 8),
ylim = c(, 50),
)
points(cyl4$displ,
cyl4$hwy,
col = "blue",
cex = 2.5,
pch = 20)
points(cyl8$displ,
cyl8$hwy,
col = "orange",
cex = 2.5,
pch = 20)
abline(fit_cyl4$coefficients[1],
fit_cyl4$coefficients[2],
col = "blue",
lwd = 3)
abline(fit_cyl8$coefficients[1],
fit_cyl8$coefficients[2],
col = "orange",
lwd = 3)
polygon(c(new_df$displ, rev(new_df$displ)),
c(cy4_bound[, 2], rev(cy4_bound[, 3])),
col = bluea, border = bluea)
polygon(c(new_df$displ, rev(new_df$displ)),
c(cy8_bound[, 2], rev(cy8_bound[, 3])),
col = orangea, border = orangea
)
axis(
1,
lwd = 1,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
mtext('dspl', side = 1, font = 2, line = 3)
axis(
2,
lwd = 1,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
mtext('hwy', side =2, font = 2, line = 3)
title('Scatterplot with line of best fit grouped by number of cylinders', cex.main = 1)
legend(
"topright",
legend = c('cyl = 4', 'cyl = 8'),
pch = 20,
col = c("blue", "orange"),
bty = 'n'
)
Each regression line in its own column
这个图是上面的翻版,其实比较简单,就是将一幅图分成两幅图来做,选择不同吧,作者原来就是想演示 sns 想做到这一点很容易。
oldpar <- par()
layout(matrix(c(1, 2), 1, 2, byrow = TRUE))
par(mar = c(5.1, 4.1, 4.1, 1.1))
plot(
cyl4$displ,
cyl4$hwy,
col = "blue",
cex = 2.5,
pch = 20,
xlim = c(, 7.5),
ylim = c(, 50),
xlab = 'displ',
ylab = 'hwy',
tck =
)
abline(fit_cyl4$coefficients[1],
fit_cyl4$coefficients[2],
col = "blue",
lwd = 3)
polygon(c(new_df$displ, rev(new_df$displ)),
c(cy4_bound[, 2], rev(cy4_bound[, 3])),
col = bluea, border = bluea)
legend(
"top",
legend = 'cyl = 4',
pch = 20,
col = "blue",
bty = 'n'
)
par(mar = c(5.1, 1.1, 4.1, 4.1))
plot(
cyl8$displ,
cyl8$hwy,
col = "blue",
cex = 2.5,
pch = 20,
xlim = c(, 7.5),
ylim = c(, 50),
xlab = 'displ',
ylab = NA,
tck =
)
abline(fit_cyl8$coefficients[1],
fit_cyl8$coefficients[2],
col = "blue",
lwd = 3)
polygon(c(new_df$displ, rev(new_df$displ)),
c(cy8_bound[, 2], rev(cy8_bound[, 3])),
col = bluea, border = bluea)
legend(
"top",
legend = 'cyl = 8',
pch = 20,
col = "blue",
bty = 'n'
)
par(oldpar)
Jittering with stripplot
这也是散点图的一种,只不过有的数据有严重的重合,人为的使这些数据点有偏离(加点噪音),这就叫 jitter。原文这幅图的数据点使用了不同的颜色,我不知道他使用的哪个分组,代码中没看到这一项,不过这不重要,我就以 cyl 为分组上色好了。其中,jitter
默认的添加噪音程度可能太保守,如果加大后数据点分离加重,那么意味着我们可以修改一下这个权重。在这里 ggplot2 更有优势。
cat_level <- unique(mpg$cyl)
col_n <- as.numeric(as.factor(mpg$cyl))
uni_col <- length(unique(mpg$cyl))
cat_col <- brewer.pal(4, "Set1")
plot(
mpg$cty,
jitter(mpg$hwy, factor = 5),
pch = 20,
col = cat_col[col_n],
xlab = 'cty',
ylab = 'hwy',
xlim = c(8, 36),
ylim = c(10, 50),
axes = FALSE,
frame.plot = TRUE,
main = "use jitter to avoid overlapping of points"
)
axis(
1,
lwd = 1,
las = ,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
axis(
2,
lwd = 1,
las = 2,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
legend(
"topleft",
legend = sort(cat_level),
title = 'cyl',
pch = 20,
col = cat_col,
bty = "n"
)
Counts Plot
上面为了展示给数据加噪音,尽管对理解可能有影响,但毕竟是改动了数据,另外一个不改动数据的方法就是使得相同的数值(重叠多)的数据计数,依据这个来改变该位置点的大小。在这里我们只需要先对重叠的点进行记数就好了,dplyr
或 data.table
都有该功能,这里使用 data.table
。
mpg_size <- mpg[, n := .N, by = c("hwy", 'cty')]
size_n <- (mpg_size$n+10)/8
plot(
mpg_size$cty,
mpg_size$hwy,
pch = 20,
cex = size_n,
col = cat_col[col_n],
xlab = 'cty',
ylab = 'hwy',
xlim = c(8, 36),
ylim = c(10, 50),
axes = FALSE,
frame.plot = TRUE,
main = "counts the number of overlap points"
)
axis(
1,
lwd = 1,
las = ,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
axis(
2,
lwd = 1,
las = 2,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
legend(
"topleft",
legend = sort(cat_level),
title = 'cyl',
pch = 20,
col = cat_col,
bty = "n"
)
Marginal Histogram
这个在图形的边缘显示其他图形的技巧,这里用的是频度直方图。我这里仍然不知道分组(我没找到 scatter 的 col 选项),我随便用一个 drv 进行颜色分组。
oldpar <- par()
cat_level <- unique(mpg$drv)
col_n <- as.numeric(as.factor(mpg$drv))
uni_col <- length(unique(mpg$drv))
cat_col <- brewer.pal(3, "Set1")
par(fig = c(, 0.75, , 0.75), new = TRUE)
plot(
mpg$displ,
mpg$hwy,
pch = 20,
col = cat_col[col_n],
xlab = 'dispal',
ylab = 'hwy',
axes = FALSE,
frame.plot = TRUE,
)
axis(
1,
lwd = 1,
las = ,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
axis(
2,
lwd = 1,
las = 2,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
par(fig = c(, 0.75, 0.45, 1), new = TRUE)
xhist <- hist(mpg$displ, breaks = 45, plot = FALSE)
barplot(xhist$count, axes = FALSE, col = '#FF0099')
par(fig = c(0.6, 1, , 0.75), new = TRUE)
yhist <- hist(mpg$hwy, breaks = 40, plot = FALSE)
barplot(yhist$count,
horiz = TRUE,
axes = FALSE,
col = '#FF0099')
mtext(
"scatter plot with histograms",
side = 3,
outer = TRUE,
line = -1
)
mtext("displ vs hwy",
side = 3,
outer = TRUE,
line = -2)
par(oldpar)
Marginal Boxplot
这个和上面的图没有差别,改个 boxplot 即可。另外,我略微改了一下这两幅图,没有将边缘的图像放在大图的底部,我觉得放在上部更符合审美。
oldpar <- par()
cat_level <- unique(mpg$drv)
col_n <- as.numeric(as.factor(mpg$drv))
uni_col <- length(unique(mpg$drv))
cat_col <- brewer.pal(3, "Set1")
par(fig = c(, 0.75, , 0.75), new = TRUE)
plot(
mpg$displ,
mpg$hwy,
pch = 20,
col = cat_col[col_n],
xlab = 'dispal',
ylab = 'hwy',
axes = FALSE,
frame.plot = TRUE,
)
axis(
1,
lwd = 1,
las = ,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
axis(
2,
lwd = 1,
las = 2,
col = 'gray',
tick = FALSE,
cex.axis = 0.8
)
par(fig = c(, 0.75, 0.45, 1), new = TRUE)
boxplot(mpg$displ, axes = FALSE, col = 'darkblue',
horizontal = TRUE)
par(fig = c(0.6, 1, , 0.75), new = TRUE)
boxplot(mpg$hwy,
axes = FALSE,
col = 'darkblue')
mtext(
"scatter plot with histograms",
side = 3,
outer = TRUE,
line = -1
)
mtext("displ vs hwy",
side = 3,
outer = TRUE,
line = -2)
par(oldpar)
Correllogram
这个图看似复杂,其实比较容易实现,有很多现成的包可以做这个事情,也很简单,但是基于 base 的似乎没有完美的简单的base package 的解决方案。尽管前面我尽量避免使用现成的包,但目前来看,我只能使用基于 base plot 的软件包来做这个图了。
library(corrplot)
mtc <- read.csv("data/mtcars.csv")
heat_mtc <- cor(mtc[,1:12])
col <- colorRampPalette(c("darkred", "white", "darkgreen"))(40)
corrplot(heat_mtc, method="color", col=col,
type="upper", order="hclust",
addCoef.col = "black",
tl.col="black",
tl.srt=45,
diag=FALSE
)
Pairwise Plot
这个图形的实现,相比上图来讲,实现起来容易,就是代码太多,还是使用现成的软件包,这个也有很多,例如 Corrr
, PerformanceAnalytics
,GGally
,以及我今天使用的 psych
。
library(psych)
pairs.panels(iris, scale=TRUE)
参考资料
Koncevicius., Karolis. 2020. “R Base Plotting Without Wrappers.” http://karolis.koncevicius.lt/posts/r_base_plotting_without_wrappers/.