R dplyr包学习

文章目录

1. 1. 选取符合条件的数据行：filter
2. 2. 选取符合条件的数据列：select
3. 3. 对数据框进行排序：arrange
4. 4. 添加新变量：mutate
5. 5. 数据汇总：summarise

虽然R中的基础函数就可以满足对数据的各种操作，但当数据量大或筛选条件较多时，采用基础函数就显得效率低下。对此，大神Hadley Wickham 表示不满意，于是就开发出plyr包，并对其进一步优化，诞生了dplyr包。这个包采用C++编写，处理速度相当快，可以与python中的pandas库相媲美，关键是语法特别简单，以往冗长的代码用简单的一个函数就可以解决，简直就是数据处理的福音，好的不能更赞了！

在dplyr包中一般将data.frame转换为tbl_df,方便操作,这个转换可以通过tbl_df函数实现。

ds <- tbl_df(mtcars)
ds
Source: local data frame [32 x 11]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...

可以看到转换后的数据并不完整显示，这是因为tbl对象的展示同电脑屏幕相适应。

class(ds)
[1] "tbl_df"     "tbl"        "data.frame"

查看ds数据类型，发现有三种类型。

介绍完数据类型后，我们可以看一下dplyr包中数据基本操作函数，共有5个函数。

1. 选取符合条件的数据行：filter

filter(.data, …)，.data是data.frame或tbl对象，…是筛选条件

ds1 <- filter(ds,cyl == 8)

查看全部ds1可以用as.data.frame(ds1)或View(ds1)。

我们也可以直接对数据框mtcars操作：

filter(mtcars, cyl == 8)
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
2  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
3  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
4  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
......

2. 选取符合条件的数据列：select

select(.data, …)

参数解释同上，但是select函数中有7个内部条件函数可供使用，这些函数仅仅限于在select函数中使用。

条件函数：

•    starts_with(x, ignore.case = TRUE): names starts with x
•    ends_with(x, ignore.case = TRUE): names ends in x
•    contains(x, ignore.case = TRUE): selects all variables whose name contains x
•    matches(x, ignore.case = TRUE): selects all variables whose name matches the regular expression x
•    num_range("x", 1:5, width = 2): selects all variables (numerically) from x01 to x05.
•    one_of("x", "y", "z"): selects variables provided in a character vector.
•    everything(): selects all variables.

保留变量

iris <- tbl_df(iris)  
select(iris, starts_with("Petal"))
select(iris, ends_with("Width"))
select(iris, contains("etal"))
select(iris, matches(".t."))
select(iris, Petal.Length, Petal.Width)
vars <- c("Petal.Length", "Petal.Width")
select(iris, one_of(vars))

删除变量，加一个-即可。

select(iris, -starts_with("Petal"))
select(iris, -ends_with("Width"))
select(iris, -contains("etal"))
select(iris, -matches(".t."))
select(iris, -Petal.Length, -Petal.Width)

重命名变量

select(iris, petal_length = Petal.Length)

重命名变量也可以通过rename函数实现，两者区别在于：采用rename函数时保留所有变量，而select函数仅仅保存重命名变量。

3. 对数据框进行排序：arrange

arrange(.data,…)，第一个是进行排序的数据，可以是数据框也可以是tbl对象，之后是数据排序的条件，可以是多个条件，默认是升序排序，采用desc进行降序排序。输出结果为数据框。

arrange(mtcars, cyl, disp)
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
2  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
4  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
......

多重条件排序,cyl升序，disp降序。

arrange(mtcars, cyl,desc(disp))
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
3  21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
4  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
......

4. 添加新变量：mutate

mutate(.data,…)，第一个变量是数据框或tbl对象，之后是添加的变量，可以是多个变量。输出结果为数据框。

mutate(mtcars, displ_l = disp / 61.0237)
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb  displ_l
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 2.621932
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 2.621932
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 1.769804
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 4.227866
......

与mutate类似的一个函数是transmutate，但mutate函数保留所有变量，而transmutate仅仅保留添加的变量。

transmute(mtcars, displ_l = disp / 61.0237)
    displ_l
1  2.621932
2  2.621932
3  1.769804
4  4.227866
......

5. 数据汇总：summarise

summarise(.data, …)，第一个变量是数据框或tbl对象，第二个变量是汇总变量，可以是多个汇总变量，汇总函数可以是R中的基础汇总函数，如mean，sd，sum，也可以是dplyr包中的汇总函数，如求观测值数的n，不同观测值个数的n_distinct，第一个观测值first,最后一个观测值last。

summarise(mtcars, mean = mean(disp))
      mean
1 230.7219
summarise(mtcars, n = n(),dn = n_distinct(cyl),first = first(cyl),last = last(cyl))
   n dn first last
1 32  3     6    4

这5个基础函数涵盖了数据处理的筛选，排序，添加新变量以及变量汇总，相对于基础函数，它们处理数据的效率更高。

统苑