COUNTIF in R with multiple restrictions

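I have event file data from retrosheet.org. It is data on baseball games, formatted so that each observation describes one play in one game of a baseball season (including reference variables for the play, the players, and the game).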

> str(e.2015.1990)
'data.frame':   4813807 obs. of  42 variables:
 $ GAME.ID                              : Factor w/ 60464 levels "ANA201504100",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ INNING                               : num  1 1 1 1 1 1 1 1 1 2 ...
 $ BATTING.TEAM                         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
 $ OUTS                                 : int  0 1 2 2 2 2 0 1 2 0 ...
 $ BATTER                               : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
 $ BATTER.HAND                          : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
 $ RES.BATTER                           : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
 $ RES.BATTER.HAND                      : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
 $ PITCHER                              : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
 $ PITCHER.HAND                         : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
 $ RES.PITCHER                          : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
 $ RES.PITCHER.HAND                     : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
 $ FIRST.RUNNER                         : Factor w/ 4369 levels "","abrej003",..: 1 1 1 1 104 140 1 1 1 1 ...
 $ SECOND.RUNNER                        : Factor w/ 4048 levels "","abrej003",..: 1 1 1 26 1 90 1 1 1 1 ...
 $ THIRD.RUNNER                         : Factor w/ 3729 levels "","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EVENT.TEXT                           : chr  "63/G" "6/P" "D8/L+" "S9/G.2-H" ...
 $ EVENT.TYPE                           : Factor w/ 21 levels "2","3","4","5",..: 1 1 19 18 18 1 1 1 1 1 ...
 $ AB.FLAG                              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ HIT.VALUE                            : int  1 1 3 2 2 1 1 1 1 1 ...
 $ SH.FLAG                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SF.FLAG                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ DOUBLE.PLAY.FLAG                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ TRIPLE.PLAY.FLAG                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ RBI.ON.PLAY                          : num  0 0 0 1 0 0 0 0 0 0 ...
 $ BATTED.BALL.TYPE                     : Factor w/ 5 levels "","F","G","L",..: 3 5 4 3 4 5 3 3 5 4 ...
 $ BATTER.DEST                          : int  0 0 2 1 1 0 0 0 0 0 ...
 $ RUNNER.ON.1ST.DEST                   : int  0 0 0 0 2 1 0 0 0 0 ...
 $ RUNNER.ON.2ND.DEST                   : int  0 0 0 4 0 2 0 0 0 0 ...
 $ RUNNER.ON.3RD.DEST                   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SB.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SB.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SB.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.1ST: Factor w/ 3433 levels "","albua001",..: 1 1 1 1 161 161 1 1 1 1 ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.2ND: Factor w/ 3408 levels "","abadf001",..: 1 1 1 133 1 133 1 1 1 1 ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.3RD: Factor w/ 3337 levels "","abadf001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EVENT.NUM                            : Factor w/ 177 levels "1","10","100",..: 1 90 101 112 123 134 145 156 167 2 ...

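From this, I want to compute game totals for every player in every game. I want to build a data frame in which each observation describes one player's performance in one game of the season, so that every player in every game makes up the full set of observations.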

I created a new data frame with two columns, GAME.ID and PLAYER.ID, so that every STARTER in every game makes up the full set of observations.

> str(k.2015.1990)
'data.frame':   1146866 obs. of  2 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ PLAYER.ID: Factor w/ 4699 levels "altuj001","bettm001",..: 11 11 11 12 14 12 12 24 24 24 ...

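I think what I need to do next is create additional vectors (one for each statistic I want to compute), so that each observation of such a vector is built from its own unique subset of my event data, defined as follows: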

e.2015.1990$GAME.ID == k.2015.1990$GAME.ID
e.2015.1990$BATTER == k.2015.1990$PLAYER.ID

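and then compute the statistic from that subset. I know how to create vectors and subsets in R, but not how to create a vector in which each observation is computed from its own unique subset. I think I need to use

function(x)

to do this; however, I am new to R and have no experience with it.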

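For convenience, I will try to make a reproducible example. In this example, the goal is to compute the total number of hits for each player in the first two games of the Angels' 2015 regular season.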

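I made a subset of the event file data containing the 156 observations that correspond to these two games. For simplicity, I included only the variables GAME.ID, BATTER, and HIT.VALUE.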

         GAME.ID   BATTER HIT.VALUE
1   ANA201504100 escoa003         1
2   ANA201504100 mousm001         1
3   ANA201504100 cainl001         3
4   ANA201504100 hosme001         2
5   ANA201504100 morak001         2
6   ANA201504100 gorda001         1
7   ANA201504100 calhk001         1
8   ANA201504100 troum001         1
9   ANA201504100 pujoa001         1
10  ANA201504100 riosa002         1
11  ANA201504100 peres002         1
12  ANA201504100 infao001         1
13  ANA201504100 freed001         1
14  ANA201504100 cronc002         1
15  ANA201504100 aybae001         1
16  ANA201504100 escoa003         1
17  ANA201504100 mousm001         1
18  ANA201504100 cainl001         1
19  ANA201504100 hosme001         1
20  ANA201504100 morak001         1
21  ANA201504100 iannc001         1
22  ANA201504100 cowgc001         2
23  ANA201504100 giavj001         1
24  ANA201504100 calhk001         3
25  ANA201504100 troum001         1
26  ANA201504100 pujoa001         1
27  ANA201504100 gorda001         1
28  ANA201504100 riosa002         1
29  ANA201504100 peres002         1
30  ANA201504100 freed001         2
31  ANA201504100 cronc002         1
32  ANA201504100 aybae001         1
33  ANA201504100 iannc001         1
34  ANA201504100 infao001         1
35  ANA201504100 escoa003         2
36  ANA201504100 mousm001         1
37  ANA201504100 cainl001         2
38  ANA201504100 hosme001         1
39  ANA201504100 cowgc001         1
40  ANA201504100 giavj001         1
41  ANA201504100 calhk001         1
42  ANA201504100 morak001         5
43  ANA201504100 gorda001         1
44  ANA201504100 riosa002         1
45  ANA201504100 peres002         1
46  ANA201504100 troum001         2
47  ANA201504100 pujoa001         1
48  ANA201504100 freed001         5
49  ANA201504100 cronc002         1
50  ANA201504100 infao001         1
51  ANA201504100 escoa003         1
52  ANA201504100 mousm001         2
53  ANA201504100 cainl001         1
54  ANA201504100 cainl001         1
55  ANA201504100 aybae001         1
56  ANA201504100 iannc001         1
57  ANA201504100 joycm001         3
58  ANA201504100 giavj001         1
59  ANA201504100 hosme001         1
60  ANA201504100 morak001         1
61  ANA201504100 gorda001         1
62  ANA201504100 riosa002         1
63  ANA201504100 riosa002         1
64  ANA201504100 calhk001         1
65  ANA201504100 troum001         2
66  ANA201504100 pujoa001         1
67  ANA201504100 freed001         1
68  ANA201504100 peres002         2
69  ANA201504100 infao001         2
70  ANA201504100 escoa003         1
71  ANA201504100 mousm001         1
72  ANA201504100 cainl001         1
73  ANA201504100 hosme001         1
74  ANA201504100 morak001         1
75  ANA201504100 cronc002         1
76  ANA201504100 aybae001         1
77  ANA201504100 iannc001         1
78  ANA201504100 joycm001         1
79  ANA201504110 escoa003         1
80  ANA201504110 mousm001         1
81  ANA201504110 cainl001         1
82  ANA201504110 hosme001         1
83  ANA201504110 calhk001         5
84  ANA201504110 troum001         2
85  ANA201504110 pujoa001         1
86  ANA201504110 joycm001         1
87  ANA201504110 freed001         1
88  ANA201504110 morak001         1
89  ANA201504110 gorda001         1
90  ANA201504110 riosa002         1
91  ANA201504110 aybae001         2
92  ANA201504110 navae001         1
93  ANA201504110 buted001         1
94  ANA201504110 giavj001         1
95  ANA201504110 peres002         1
96  ANA201504110 infao001         1
97  ANA201504110 escoa003         1
98  ANA201504110 giavj001         1
99  ANA201504110 calhk001         1
100 ANA201504110 troum001         1
101 ANA201504110 mousm001         5
102 ANA201504110 cainl001         2
103 ANA201504110 hosme001         1
104 ANA201504110 hosme001         1
105 ANA201504110 morak001         3
106 ANA201504110 gorda001         1
107 ANA201504110 riosa002         2
108 ANA201504110 peres002         5
109 ANA201504110 infao001         2
110 ANA201504110 escoa003         1
111 ANA201504110 pujoa001         1
112 ANA201504110 joycm001         1
113 ANA201504110 freed001         1
114 ANA201504110 mousm001         1
115 ANA201504110 cainl001         1
116 ANA201504110 hosme001         2
117 ANA201504110 morak001         2
118 ANA201504110 gorda001         1
119 ANA201504110 riosa002         1
120 ANA201504110 aybae001         1
121 ANA201504110 navae001         1
122 ANA201504110 buted001         2
123 ANA201504110 giavj001         1
124 ANA201504110 calhk001         3
125 ANA201504110 troum001         2
126 ANA201504110 pujoa001         1
127 ANA201504110 riosa002         1
128 ANA201504110 peres002         2
129 ANA201504110 infao001         1
130 ANA201504110 escoa003         2
131 ANA201504110 mousm001         1
132 ANA201504110 joycm001         1
133 ANA201504110 freed001         1
134 ANA201504110 aybae001         1
135 ANA201504110 cainl001         1
136 ANA201504110 hosme001         1
137 ANA201504110 morak001         2
138 ANA201504110 gorda001         1
139 ANA201504110 riosa002         1
140 ANA201504110 navae001         1
141 ANA201504110 iannc001         1
142 ANA201504110 giavj001         1
143 ANA201504110 peres002         1
144 ANA201504110 infao001         1
145 ANA201504110 escoa003         1
146 ANA201504110 calhk001         1
147 ANA201504110 troum001         1
148 ANA201504110 pujoa001         1
149 ANA201504110 mousm001         2
150 ANA201504110 cainl001         1
151 ANA201504110 hosme001         1
152 ANA201504110 morak001         1
153 ANA201504110 gorda001         1
154 ANA201504110 joycm001         1
155 ANA201504110 freed001         1
156 ANA201504110 aybae001         1

I also made a subset of the new data frame, corresponding to the 40 starting players in these two games.

             GAME.ID PLAYER.ID
1       ANA201504100  escoa003
60465   ANA201504100  mousm001
120929  ANA201504100  cainl001
181393  ANA201504100  hosme001
241857  ANA201504100  morak001
302321  ANA201504100  gorda001
362785  ANA201504100  riosa002
423249  ANA201504100  peres002
483713  ANA201504100  infao001
1117610 ANA201504100  vargj001
573434  ANA201504100  calhk001
633898  ANA201504100  troum001
694362  ANA201504100  pujoa001
754826  ANA201504100  freed001
815290  ANA201504100  cronc002
875754  ANA201504100  aybae001
936218  ANA201504100  iannc001
996682  ANA201504100  cowgc001
1057146 ANA201504100  giavj001
1117613 ANA201504100  santh001
2       ANA201504110  escoa003
60466   ANA201504110  mousm001
120930  ANA201504110  cainl001
181394  ANA201504110  hosme001
241858  ANA201504110  morak001
302322  ANA201504110  gorda001
362786  ANA201504110  riosa002
423250  ANA201504110  peres002
483714  ANA201504110  infao001
2100000 ANA201504110  guthj001
573435  ANA201504110  calhk001
633899  ANA201504110  troum001
694363  ANA201504110  pujoa001
754827  ANA201504110  joycm001
815291  ANA201504110  freed001
875755  ANA201504110  aybae001
936219  ANA201504110  navae001
996683  ANA201504110  buted001
1057147 ANA201504110  giavj001
2100001 ANA201504110  weavj003

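I think there should be a way to add a column to the latter data frame so that each observation takes the GAME.ID and PLAYER.ID entries in its row, searches the former data frame to isolate the observations where GAME.ID = GAME.ID and PLAYER.ID = BATTER, counts those with HIT.VALUE > 1 (1 = default, 2 = single, 3 = double, 4 = triple, 5 = home run), and then returns the count to the observation. In Excel, this could be done with the CountIf() function, which I could easily copy down the length of the vector. However, I do not know how to do this in R.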

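I think this may be what you are looking for. It groups the second-to-last dataset (the event subset) by GAME.ID and BATTER, then counts the rows with HIT.VALUE > 1 in each group.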

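library(data.table)
# df is the event subset from the question; setDT() converts it to a data.table by reference
dt <- setDT(df)[, list(count_hits = sum(HIT.VALUE > 1)), by = c("GAME.ID", "BATTER")]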

head(dt)
        GAME.ID   BATTER count_hits
1: ANA201504100 escoa003          1
2: ANA201504100 mousm001          1
3: ANA201504100 cainl001          2
4: ANA201504100 hosme001          1
5: ANA201504100 morak001          2
6: ANA201504100 gorda001          0
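
The counting step works because a comparison such as HIT.VALUE > 1 returns a logical vector, and sum() treats TRUE as 1 and FALSE as 0. A minimal sketch in base R, using the five HIT.VALUE entries for escoa003 in game ANA201504100 from the sample data in the question (the names hit_value, is_hit and count_hits are only for illustration):

```r
# HIT.VALUE entries for escoa003 in ANA201504100, copied from the sample data
hit_value <- c(1L, 1L, 2L, 1L, 1L)

# The comparison yields one TRUE/FALSE per row
is_hit <- hit_value > 1

# sum() coerces TRUE to 1 and FALSE to 0, so this counts the TRUE entries
count_hits <- sum(is_hit)
print(count_hits)
```

This should agree with the first row of head(dt), where escoa003 has a count_hits of 1.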

Another option, in base R, is:

# Count rows with HIT.VALUE > 1 within each GAME.ID / BATTER combination
res <- aggregate(x = list(count_hits = df$HIT.VALUE), by = list(GAME.ID = df$GAME.ID, BATTER = df$BATTER), FUN = function(x) sum(x > 1))

head(res)
       GAME.ID   BATTER count_hits
1 ANA201504100 aybae001          0
2 ANA201504110 aybae001          1
3 ANA201504110 buted001          1
4 ANA201504100 cainl001          2
5 ANA201504110 cainl001          1
6 ANA201504100 calhk001          1
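
Neither grouped result is attached to the starters table yet, and a starter who never appears as BATTER (vargj001 in ANA201504100, for instance) has no row in the grouped output at all. One possible way to finish is sketched below in base R. The data frames events and starters are small hypothetical stand-ins for the two subsets in the question, with a few rows copied from them. merge() with all.x = TRUE keeps every starter and leaves NA where nothing matched, and the last step recomputes the count row by row with mapply(), which is closer in spirit to Excel's CountIf().

```r
# Hypothetical stand-ins for the two subsets in the question (a few rows each)
events <- data.frame(
  GAME.ID   = c("ANA201504100", "ANA201504100", "ANA201504100",
                "ANA201504100", "ANA201504100", "ANA201504100"),
  BATTER    = c("cainl001", "cainl001", "cainl001",
                "gorda001", "gorda001", "hosme001"),
  HIT.VALUE = c(3L, 1L, 2L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)
starters <- data.frame(
  GAME.ID   = "ANA201504100",
  PLAYER.ID = c("cainl001", "gorda001", "hosme001", "vargj001"),
  stringsAsFactors = FALSE
)

# Grouped count, as in the aggregate() call above
res <- aggregate(x = list(count_hits = events$HIT.VALUE),
                 by = list(GAME.ID = events$GAME.ID, BATTER = events$BATTER),
                 FUN = function(x) sum(x > 1))

# Left join: keep every starter, matching PLAYER.ID against BATTER
out <- merge(starters, res,
             by.x = c("GAME.ID", "PLAYER.ID"),
             by.y = c("GAME.ID", "BATTER"),
             all.x = TRUE)

# Starters without any event row come back as NA; treat that as zero hits
out$count_hits[is.na(out$count_hits)] <- 0

# Row-by-row CountIf-style check: one count per starter row
out$count_check <- mapply(
  function(g, p) sum(events$GAME.ID == g & events$BATTER == p &
                     events$HIT.VALUE > 1),
  out$GAME.ID, out$PLAYER.ID,
  USE.NAMES = FALSE
)
print(out)
```

On the full data the grouped count followed by a merge will likely be much faster than the mapply() version, since the latter rescans the whole event table once for every starter row.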