R | base

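Base R is the foundation of the R language, yet it is easy to overlook. Revisiting the basics often turns up something new, so keep digging for its hidden gems while you work with data. This post is updated continuously.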

1. is.element( )

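is.element(x, y) # tests whether each element of x is in y; equivalent to x %in% y

Related set operations:

union(x, y) # union
intersect(x, y) # intersection
setdiff(x, y) # set difference: elements of x that are not in y
setequal(x, y) # tests whether the two sets are equal, ignoring order and duplicates (unlike all(x == y))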
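
A minimal sketch of the set functions on two small vectors. The values of x and y are made up for illustration; the point is that these functions treat their inputs as sets, so duplicates and order do not matter.

```r
x <- c(1, 2, 3, 3)   # illustrative values, not from the original post
y <- c(3, 2, 1)

in_y     <- is.element(2, y)       # same result as 2 %in% y
u        <- union(x, y)            # duplicates are dropped
i        <- intersect(x, y)        # elements present in both
d        <- setdiff(c(1, 2, 5), y) # elements of the first argument not in y
same_set <- setequal(x, y)         # TRUE: order and duplicates are ignored
same_vec <- identical(x, y)        # FALSE: the vectors themselves differ
```

An element-wise comparison such as all(x == y) would not be a safe substitute here, because x and y have different lengths and different orderings.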

2. strsplit()

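strsplit() splits strings. Its main arguments:
strsplit(x, split, fixed = FALSE)
x: the character vector to split
split: the delimiter to split on
fixed: whether to match split literally. The default, fixed = FALSE, treats split as a regular expression.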

> strsplit("a.b.c",".") # default fixed = FALSE: "." is a regex that matches any character
##[[1]]
## [1] "" "" "" "" ""
> strsplit("a.b.c",".",fixed = TRUE) # fixed = TRUE matches the dot literally
## [[1]]
## [1] "a" "b" "c"

strsplit() + unlist()

> unlist(strsplit("abcd","",fixed = T))
[1] "a" "b" "c" "d"

A similar function is stringr::str_split().
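
As a small sketch of the alternatives to fixed = TRUE, the dot can also be escaped or wrapped in a character class so the regex engine treats it literally. The input vector s is made up for illustration; it also shows that strsplit() is vectorised and returns one list element per input string.

```r
s <- c("a.b.c", "d.e")   # illustrative input

by_escape <- strsplit(s, "\\.")              # escaped dot in a regex
by_class  <- strsplit(s, "[.]")              # dot inside a character class
by_fixed  <- strsplit(s, ".", fixed = TRUE)  # literal matching, no regex

# All three give list(c("a", "b", "c"), c("d", "e"))
```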

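3. sapply() + "["

While working with data you often see the form sapply(x, "[", 1), and the odd-looking symbol can be confusing. Searching for it turned up an answer on Stack Overflow, so I was clearly not the only one puzzled by it.

Explanation

"[" is the subsetting operator used as a function, so sapply(x, "[", 1) is shorthand for sapply(x, `function(x) x[1]`): it extracts the first element of each component of x.

Example

The two calls below return the same result.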

> a <- strsplit("a.b.c",".",fixed = T)
> sapply(a, "[",1)
[1] "a"
> sapply(a, function(x) x[1])
[1] "a"

Elements at other positions can be extracted the same way:

> sapply(a, "[",2)
[1] "b"
> sapply(a, "[",3)
[1] "c"
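
The idiom is most useful when several strings are split at once. The sketch below assumes a few file names in the style of the CEL names used later in this post, and also shows vapply() as a stricter variant that declares the expected return type.

```r
ids   <- c("CLL11.CEL", "CLL12.CEL", "CLL2.CEL")   # illustrative names
parts <- strsplit(ids, ".", fixed = TRUE)          # list with one vector per name

first   <- sapply(parts, "[", 1)                   # first piece of each name
first_v <- vapply(parts, `[`, character(1), 1)     # same, but type-checked

# Both give "CLL11" "CLL12" "CLL2"
```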

4. trimws(): remove leading and trailing whitespace

trimws(x, which = c("both", "left", "right"))

> trimws(" abc ",which = "left") # remove leading whitespace
[1] "abc "
> trimws(" abc ",which = "right") # remove trailing whitespace
[1] " abc"
> trimws(" abc ",which = "both") # remove both leading and trailing whitespace
[1] "abc"
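
A short sketch on a vector, with made-up strings, to show two details: trimws() also strips tabs and newlines at the ends, and it leaves whitespace inside the string untouched.

```r
x <- c("  abc", "a b  ", "\tabc\n")   # illustrative input

trimmed <- trimws(x)   # which = "both" is the default

# Gives "abc" "a b" "abc": the space inside "a b" is kept
```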

5. sub(): find and replace

> colnames(eset)
 [1] "CLL11.CEL" "CLL12.CEL" "CLL14.CEL" "CLL15.CEL" "CLL16.CEL" "CLL17.CEL" "CLL18.CEL"
 [8] "CLL19.CEL" "CLL20.CEL" "CLL21.CEL" "CLL22.CEL" "CLL23.CEL" "CLL24.CEL" "CLL2.CEL" 
[15] "CLL3.CEL"  "CLL4.CEL"  "CLL5.CEL"  "CLL6.CEL"  "CLL7.CEL"  "CLL8.CEL"  "CLL9.CEL" 
> sub(pattern = "\\.CEL",replacement = "",colnames(eset)) # match ".CEL" and delete it
 [1] "CLL11" "CLL12" "CLL14" "CLL15" "CLL16" "CLL17" "CLL18" "CLL19" "CLL20" "CLL21" "CLL22"
[12] "CLL23" "CLL24" "CLL2"  "CLL3"  "CLL4"  "CLL5"  "CLL6"  "CLL7"  "CLL8"  "CLL9"
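
One detail worth keeping in mind: sub() replaces only the first match in each string, while gsub() replaces every match. The sketch below uses a deliberately awkward, hypothetical column name to show the difference, and an anchored pattern that removes the suffix only at the end of the name.

```r
cols <- c("CLL11.CEL", "CLL.CEL.2.CEL")   # second name is hypothetical

first_only <- sub("\\.CEL", "", cols)     # first match per string
all_hits   <- gsub("\\.CEL", "", cols)    # every match
suffix     <- sub("\\.CEL$", "", cols)    # only a trailing ".CEL"

# first_only: "CLL11" "CLL.2.CEL"
# all_hits:   "CLL11" "CLL.2"
# suffix:     "CLL11" "CLL.CEL.2"
```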

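6. model.matrix(): build a design matrix

Two applications:

Differential analysis with limma
Correlating clinical traits with modules in WGCNA

Three forms:

model.matrix(~group)
model.matrix(~0+group)
model.matrix(~-1+group)

model.matrix(~0+group) and model.matrix(~-1+group) produce the same output: both drop the intercept.
model.matrix(~group) fixes the reference group in advance (the first level of the factor),
whereas model.matrix(~0+group) and model.matrix(~-1+group) must be combined with makeContrasts() to spell out which groups are compared.
For details, see: 差异分析是否需要比较矩阵 (Does differential analysis need a contrast matrix?)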
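
To make the difference between the forms concrete, here is a minimal sketch with a made-up two-level factor. The group labels "normal" and "tumor" are assumptions for illustration; only base R is used, so the limma step with makeContrasts() is not shown.

```r
# Hypothetical grouping: two normal and two tumor samples
group <- factor(c("normal", "normal", "tumor", "tumor"),
                levels = c("normal", "tumor"))

with_ref <- model.matrix(~ group)       # intercept = reference level "normal"
no_int0  <- model.matrix(~ 0 + group)   # one indicator column per group
no_int1  <- model.matrix(~ -1 + group)  # same columns as ~ 0 + group

colnames(with_ref)  # "(Intercept)" "grouptumor"
colnames(no_int0)   # "groupnormal" "grouptumor"
```

With the intercept, the grouptumor column already encodes tumor versus normal. Without it, each column is a group mean, which is why the comparison has to be stated separately.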

7. diff(): differences between successive elements

> x <- cumsum(cumsum(1:10));x
 [1]   1   4  10  20  35  56  84 120 165 220
> diff(x)
[1]  3  6 10 15 21 28 36 45 55

As you can see, it computes 4-1, 10-4, 20-10, and so on.
A lag can also be specified:

> diff(x,lag=2)
[1]   9  16  25  36  49  64  81 100

This time it computes 10-1, 20-4, 35-10, and so on.
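
As a quick check of what lag does, the sketch below reproduces diff(x, lag = 2) by hand with index arithmetic. It also shows the differences argument, which applies diff() repeatedly and is easy to confuse with lag.

```r
x <- cumsum(cumsum(1:10))

lag2        <- diff(x, lag = 2)
manual_lag2 <- x[3:10] - x[1:8]          # x[i + 2] - x[i], the same values as lag2

d2        <- diff(x, differences = 2)    # diff applied twice
d2_manual <- diff(diff(x))               # gives 3 4 5 6 7 8 9 10
```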

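8. grep(): search strings

When working with TCGA data you often need to find paired clinical samples, and this is where grep() comes in handy.
For example:

norsam holds the barcodes of the normal samples and tumsam those of the tumor samples; the goal is to find the samples in tumsam that match norsam.
Characters 9-12 of a TCGA barcode identify the patient.

First, extract characters 9-12 of each barcode: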

toMatch <- substr(norsam, 9, 12) # characters 9-12 of each normal-sample barcode
head(toMatch)

Next, search tumsam for barcodes that match:

matches <- unique(grep(paste(toMatch,collapse = "|"),tumsam,value = T))
head(matches)

Finally, assemble the paired norsam and tumsam barcodes:

t.code <- substr(matches,1,12)
n.code <- substr(norsam,1,12)
com_code <- intersect(t.code,n.code)

pair.t.sam <- paste0(com_code,"-01A")
pair.n.sam <- paste0(com_code,"-11A")
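
The whole workflow can be tried on a toy example. The barcodes below are invented and shortened to 16 characters; real TCGA barcodes are longer, and the "-01A" and "-11A" suffixes are assumed here to mark tumor and normal samples as in the code above.

```r
# Hypothetical, shortened barcodes for illustration only
norsam <- c("TCGA-A1-0001-11A", "TCGA-A1-0002-11A")
tumsam <- c("TCGA-A1-0001-01A", "TCGA-A1-0003-01A", "TCGA-A1-0002-01A")

# Characters 9-12 of each normal barcode
toMatch <- substr(norsam, 9, 12)

# Tumor barcodes containing any of those codes ("0001|0002" as a regex)
matches <- unique(grep(paste(toMatch, collapse = "|"), tumsam, value = TRUE))

# Keep patients whose first 12 characters appear in both groups
t.code   <- substr(matches, 1, 12)
n.code   <- substr(norsam, 1, 12)
com_code <- intersect(t.code, n.code)

pair.t.sam <- paste0(com_code, "-01A")
pair.n.sam <- paste0(com_code, "-11A")
```

Because a four-character code could in principle also match another part of a barcode, the intersect() on the first 12 characters acts as the stricter check: only patients present in both groups survive.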