Python, R return different results for left join

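I'm using some code to join two files together, and I tried it in both Python and R. I thought the code below would return the same results, but when I join the datasets and then count the NAs in a particular column, the Python code produces more NAs. Any ideas?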

R code:

subs %>% 
  select(-revenue) -> subs

subs %>% 
  left_join(rev, by = "name") -> fullsubs

missingvalues <- map(fullsubs, ~sum(is.na(.)))

still_missing <- missingvalues$revenue

fullsubs %>% 
  filter(!is.na(revenue)) -> full_filtered

not_missing <- nrow(full_filtered)

results <- c("Matches"=format(as.numeric(not_missing),big.mark=","), "Still Missing"=format(as.numeric(still_missing),big.mark=","))
print(results, big.mark = ",")

Python code:

fulldata = subs.merge(rev, on='name', how = 'left')

missing = fulldata.isnull().sum()

notmissing = fulldata.notnull().sum()

d = {'Matches': [notmissing["revenue_y"]], 'Still Missing': [missing["revenue_y"]]}

df = pd.DataFrame(data=d)
df

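Edit: The problem turned out to be whitespace. After I trimmed the leading and trailing whitespace from the column I was joining on, I got the same results from R and Python. Does anyone know why or how R and Python parse whitespace differently?

Updated code that gives the same results: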

```{r  message=FALSE, warning = FALSE}
library(tidyverse)
library(lubridate)
library(scales)
library(reticulate)
```

## R CODE

```{r message=FALSE, warning = FALSE}
rev <- read_csv("company_revenues.csv")
subs <- read_csv("subscribers.csv")


subs$company_name <- str_trim(subs$company_name, c("both"))
rev$company_name <- str_trim(rev$company_name, c("both"))


subs %>% 
  select(-company_revenue) -> subs

subs %>% 
  left_join(rev, by = "company_name") -> fullsubs

missingvalues <- map(fullsubs, ~sum(is.na(.)))

still_missing <- missingvalues$company_revenue

fullsubs %>% 
  filter(!is.na(company_revenue)) -> full_filtered

not_missing <- nrow(full_filtered)

results <- c("Matches"=format(as.numeric(not_missing),big.mark=","), "Still Missing"=format(as.numeric(still_missing),big.mark=","))
print(results, big.mark = ",")
```



## PYTHON CODE 

```{python}
import pandas as pd

# on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
revp = pd.read_csv("company_revenues.csv", on_bad_lines="skip", index_col=False, dtype='unicode')
subsp = pd.read_csv("subscribers.csv", on_bad_lines="skip", index_col=False, dtype='unicode')


# cast company_name to str in both frames so the join keys share a type
revp['company_name']=revp['company_name'].astype(str)
subsp['company_name']=subsp['company_name'].astype(str)


# strip leading and trailing whitespace from the join key
revp['company_name']=revp["company_name"].str.strip()
subsp['company_name']=subsp["company_name"].str.strip()

fulldata = subsp.merge(revp, on='company_name', how = 'left')

missing = fulldata.isnull().sum()
notmissing = fulldata.notnull().sum()

d = {'Matches': [notmissing["company_revenue_y"]], 'Still Missing': [missing["company_revenue_y"]]}
df = pd.DataFrame(data=d)
df
```

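The difference is in the read_csv functions. If you look at the readr package documentation: https://readr.tidyverse.org/reference/read_delim.html, the function defaults to trim_ws = TRUE. That means leading and trailing whitespace is trimmed from each field as the file is read. The pandas read_csv function does not trim fields by default (its skipinitialspace option only skips spaces after the delimiter, not trailing ones), so you need to run str.strip on the join column after reading the data.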
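To see the effect in isolation, here is a minimal sketch using two small in-memory CSVs. The data, the `plan` column, and the variable names are made up for illustration; only `company_name` and `company_revenue` come from the question. One key carries a trailing space, which pandas keeps when reading, so the left join misses it until both key columns are stripped.

```python
import io

import pandas as pd

# Hypothetical sample data; note the trailing space after "Acme" in subs_csv.
subs_csv = "company_name,plan\nAcme ,basic\nGlobex,pro\n"
rev_csv = "company_name,company_revenue\nAcme,100\nGlobex,200\n"

subs = pd.read_csv(io.StringIO(subs_csv), dtype=str)
rev = pd.read_csv(io.StringIO(rev_csv), dtype=str)

# pandas preserves the trailing space, so "Acme " does not match "Acme".
raw = subs.merge(rev, on="company_name", how="left")
missing_before = int(raw["company_revenue"].isna().sum())

# Strip the join key in both frames, then join again.
subs["company_name"] = subs["company_name"].str.strip()
rev["company_name"] = rev["company_name"].str.strip()
clean = subs.merge(rev, on="company_name", how="left")
missing_after = int(clean["company_revenue"].isna().sum())

print(missing_before, missing_after)
```

With this sample data the unmatched count drops from 1 to 0 after stripping, which mirrors the gap between the R and Python results in the question.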