How can I use the filter function to increase sample homogeneity?

Question

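I'm looking for some data-wrangling advice. My dataset contains two groups of investors (signatory = 0 and signatory = 1). Every investor has a country, but the sets of countries in the two groups do not match.

For my next analysis, I need to reduce the dataset to the countries that are present in both groups, so that each group has at least one unit (investor) in every remaining country.

To be clear: if one group has investors in 45 countries and the other has investors in 50 countries, but only 30 of those countries match, I want to keep only those 30 matching countries in a new data frame.

My data looks like this: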

investor	year	activity	country	region	strategy	signatory
123 IM	2002	4.45	France	Europe	VC	1
123 IM	2003	3.2	France	Europe	VC	1
123 IM	2004	7.8	France	Europe	VC	1
21Invest	2002	4.45	France	Europe	VC	0
21Invest	2003	3.2	France	Europe	VC	0
21Invest	2004	7.8	France	Europe	VC	0
Aegon	2005	5.4	Netherlands	Europe	BY	1
Aegon	2006	4.2	Netherlands	Europe	BY	1
Aegon	2007	1.3	Netherlands	Europe	BY	1
ING	2005	5.4	Netherlands	Europe	BY	0
ING	2006	4.2	Netherlands	Europe	BY	0
ING	2007	1.3	Netherlands	Europe	BY	0
aberdeen	2002	4.45	UK	Europe	VC	1
aberdeen	2003	3.2	UK	Europe	VC	1
aberdeen	2004	7.8	UK	Europe	VC	1
JPM	2005	5.4	USA	Europe	BY	0
JPM	2006	4.2	USA	Europe	BY	0
JPM	2007	1.3	USA	Europe	BY	0

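The output I'm looking for is:

investor	year	activity	country	region	strategy	signatory
123 IM	2002	4.45	France	Europe	VC	1
123 IM	2003	3.2	France	Europe	VC	1
123 IM	2004	7.8	France	Europe	VC	1
21Invest	2002	4.45	France	Europe	VC	0
21Invest	2003	3.2	France	Europe	VC	0
21Invest	2004	7.8	France	Europe	VC	0
Aegon	2005	5.4	Netherlands	Europe	BY	1
Aegon	2006	4.2	Netherlands	Europe	BY	1
Aegon	2007	1.3	Netherlands	Europe	BY	1
ING	2005	5.4	Netherlands	Europe	BY	0
ING	2006	4.2	Netherlands	Europe	BY	0
ING	2007	1.3	Netherlands	Europe	BY	0

Note: the firms from the UK and the USA are dropped, while the firms from France and the Netherlands are kept.

This is because both investor samples (signatory = 0 and signatory = 1) have units in France and the Netherlands, whereas the UK and the USA each appear in only one of the samples.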

# Input data
df <- data.frame(
  investor = c("123 IM", "123 IM", "123 IM", "21Invest", "21Invest", "21Invest", "Aegon", "Aegon", "Aegon", "ING", "ING", "ING", "aberdeen", "aberdeen", "aberdeen", "JPM", "JPM", "JPM"),
  year = c(2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007),
  activity = c(4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3),
  country = c("France", "France", "France", "France", "France", "France", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "UK", "UK", "UK", "USA", "USA", "USA"),
  region = c("europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "north america", "north america", "north america"),
  strategy = c("VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY"),
  signatory = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0))

# Desired result (named df_expected so that it does not overwrite the input df;
# a stray 13th investor entry was removed so that all columns have 12 values)
df_expected <- data.frame(
  investor = c("123 IM", "123 IM", "123 IM", "21Invest", "21Invest", "21Invest", "Aegon", "Aegon", "Aegon", "ING", "ING", "ING"),
  year = c(2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007),
  activity = c(4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3),
  country = c("France", "France", "France", "France", "France", "France", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands"),
  region = c("europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe"),
  strategy = c("VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY"),
  signatory = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0))

Any tips would be much appreciated!

Rory

Answer 1

You could do it like this:

library(tidyverse)

signatory_countries <- unique(df[df$signatory==1, 'country'])
non_signatory_countries <- unique(df[df$signatory==0, 'country'])

new_df <- bind_rows(
  df %>% filter(signatory==1, country %in% non_signatory_countries),
  df %>% filter(signatory==0, country %in% signatory_countries)
)
new_df
   investor year activity     country region strategy signatory
1    123 IM 2002     4.45      France europe       VC         1
2    123 IM 2003     3.20      France europe       VC         1
3    123 IM 2004     7.80      France europe       VC         1
4     Aegon 2002     4.45 Netherlands europe       VC         1
5     Aegon 2003     3.20 Netherlands europe       VC         1
6     Aegon 2004     7.80 Netherlands europe       VC         1
7  21Invest 2005     5.40      France europe       BY         0
8  21Invest 2006     4.20      France europe       BY         0
9  21Invest 2007     1.30      France europe       BY         0
10      ING 2005     5.40 Netherlands europe       BY         0
11      ING 2006     4.20 Netherlands europe       BY         0
12      ING 2007     1.30 Netherlands europe       BY         0
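
As a possible alternative, the same idea can be sketched as a single grouped filter: group by country and keep a country only if both signatory values occur in it. The `toy` data frame below is a made-up, cut-down stand-in for the question's `df` (an assumption for illustration, not the real data), and the sketch assumes `signatory` is coded only as 0 and 1.

```r
library(dplyr)

# Hypothetical toy data shaped like the question's df (illustrative values only)
toy <- data.frame(
  investor  = c("A", "B", "C", "D", "E", "F"),
  country   = c("France", "France", "Netherlands", "Netherlands", "UK", "USA"),
  signatory = c(1, 0, 1, 0, 1, 0)
)

# Keep only the countries in which both signatory groups are represented
matched <- toy %>%
  group_by(country) %>%
  filter(all(c(0, 1) %in% signatory)) %>%
  ungroup()

# Base R equivalent: intersect the two sets of countries, then subset
common <- intersect(toy$country[toy$signatory == 1],
                    toy$country[toy$signatory == 0])
matched_base <- toy[toy$country %in% common, ]
```

Unlike the `bind_rows()` version, both variants leave the rows in their original order instead of stacking the signatory = 1 rows above the signatory = 0 rows.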

r

filter

data-wrangling