Python and R return different results for a left join
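I'm using some code to join two files together, and I have tried it in both Python and R. I expected the code below to return the same results, but when I join the datasets and then count the NAs in a particular column, the Python code produces more NAs. Any ideas?
R code: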
subs %>%
select(-revenue) -> subs
subs %>%
left_join(rev, by = "name") -> fullsubs
missingvalues <- map(fullsubs, ~sum(is.na(.)))
still_missing <- missingvalues$revenue
fullsubs %>%
filter(!is.na(revenue)) -> full_filtered
not_missing <- nrow(full_filtered)
results <- c("Matches"=format(as.numeric(not_missing),big.mark=","), "Still Missing"=format(as.numeric(still_missing),big.mark=","))
print(results, big.mark = ",")
Python code:
fulldata = subs.merge(rev, on='name', how = 'left')
missing = fulldata.isnull().sum()
notmissing = fulldata.notnull().sum()
d = {'Matches': [notmissing["revenue_y"]], 'Still Missing': [missing["revenue_y"]]}
df = pd.DataFrame(data=d)
df
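Edit:
The problem turned out to be whitespace. After I trimmed the whitespace from the beginning and end of the column I was joining on, I got the same results from R and Python. Does anyone know why or how R and Python parse whitespace differently?
Updated code that gives the same results: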
```{r message=FALSE, warning = FALSE}
library(tidyverse)
library(lubridate)
library(scales)
library(reticulate)
```
## R CODE
```{r message=FALSE, warning = FALSE}
rev <- read_csv("company_revenues.csv")
subs <- read_csv("subscribers.csv")
subs$company_name <- str_trim(subs$company_name, c("both"))
rev$company_name <- str_trim(rev$company_name, c("both"))
subs %>%
select(-company_revenue) -> subs
subs %>%
left_join(rev, by = "company_name") -> fullsubs
missingvalues <- map(fullsubs, ~sum(is.na(.)))
still_missing <- missingvalues$company_revenue
fullsubs %>%
filter(!is.na(company_revenue)) -> full_filtered
not_missing <- nrow(full_filtered)
results <- c("Matches"=format(as.numeric(not_missing),big.mark=","), "Still Missing"=format(as.numeric(still_missing),big.mark=","))
print(results, big.mark = ",")
```
## PYTHON CODE
```{python}
import pandas as pd
# on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
revp = pd.read_csv("company_revenues.csv", on_bad_lines="skip", index_col=False, dtype='unicode')
subsp = pd.read_csv("subscribers.csv", on_bad_lines="skip", index_col=False, dtype='unicode')
#change company name to the same type
revp['company_name']=revp['company_name'].astype(str)
subsp['company_name']=subsp['company_name'].astype(str)
#strip white space before and after word
revp['company_name']=revp["company_name"].str.strip()
subsp['company_name']=subsp["company_name"].str.strip()
fulldata = subsp.merge(revp, on='company_name', how = 'left')
missing = fulldata.isnull().sum()
notmissing = fulldata.notnull().sum()
d = {'Matches': [notmissing["company_revenue_y"]], 'Still Missing': [missing["company_revenue_y"]]}
df = pd.DataFrame(data=d)
df
```
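When a left join leaves more rows unmatched than expected, it can help to look at the unmatched keys directly. The sketch below is not part of the original code: it uses small made-up frames and the `indicator=True` option of `merge` to count matches, and prints the unmatched keys with `repr` so that stray whitespace becomes visible.

```python
import pandas as pd

# Hypothetical data; "Globex " carries a trailing space on purpose.
subs = pd.DataFrame({"company_name": ["Acme", "Globex ", "Initech"]})
rev = pd.DataFrame({"company_name": ["Acme", "Globex"],
                    "company_revenue": ["100", "200"]})

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for rows that found no partner in the right frame.
merged = subs.merge(rev, on="company_name", how="left", indicator=True)

counts = merged["_merge"].value_counts()
matches = int(counts["both"])
still_missing = int(counts["left_only"])

# repr() puts quotes around each key, so padding shows up as 'Globex '.
unmatched_keys = (
    merged.loc[merged["_merge"] == "left_only", "company_name"]
    .map(repr)
    .tolist()
)

print(matches, still_missing, unmatched_keys)
```

Here "Initech" is unmatched because it is genuinely absent from the revenue data, while "Globex " is unmatched only because of its trailing space.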
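The difference is in the two `read_csv` functions. According to the readr documentation (https://readr.tidyverse.org/reference/read_delim.html), readr's `read_csv` defaults to `trim_ws = TRUE`, which means leading and trailing whitespace is trimmed from each field as the file is read. pandas' `read_csv` has no equivalent option for trimming both sides (its `skipinitialspace` option only skips spaces that follow a delimiter), so you need to run `str.strip` on the join column after reading the data.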
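To illustrate, here is a minimal sketch using two small in-memory CSVs. The data is made up; only the column names follow the question. It shows pandas keeping the padding in the key column, so the left join misses rows until both key columns are stripped.

```python
import io
import pandas as pd

# Hypothetical CSV contents; "Acme " and " Globex" are padded on purpose.
subs_csv = "company_name,plan\nAcme ,basic\n Globex,pro\nInitech,basic\n"
rev_csv = "company_name,company_revenue\nAcme,100\nGlobex,200\nInitech,300\n"

subs = pd.read_csv(io.StringIO(subs_csv), dtype=str)
rev = pd.read_csv(io.StringIO(rev_csv), dtype=str)

# With default settings pandas keeps the padding, so only "Initech" matches.
raw = subs.merge(rev, on="company_name", how="left")
missing_raw = int(raw["company_revenue"].isna().sum())

# Strip both key columns, mirroring what readr's trim_ws = TRUE does on read.
subs["company_name"] = subs["company_name"].str.strip()
rev["company_name"] = rev["company_name"].str.strip()
clean = subs.merge(rev, on="company_name", how="left")
missing_clean = int(clean["company_revenue"].isna().sum())

print(missing_raw, missing_clean)
```

Stripping only the join key keeps the change narrow. If other text columns may be padded as well, the same `str.strip` call can be applied to each of them.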