R speed up sapply
I have the following loop script:
number_of_rows_similar_addresses <- as.data.table(cbind(
distinct_similar_addresses,
sapply(distinct_similar_addresses, function(x) {
length(similar_addresses[Original_Address == x]$people_names) / length(unique(similar_addresses[Original_Address == x]$people_names))
})
))
The problem is that it slows my loop down considerably.
The data looks like this:
distinct_similar_addresses:
"U 2 5 TIMPERLEY ST NICHOLLS VIC"
"U 1 3 TIMPERLEY ST NICHOLLS VIC"
"U 1 11 TIMPERLEY ST NICHOLLS VIC"
"U 1 33 TIMPERLEY ST NICHOLLS VIC"
"U 1 2 TIMPERLEY ST NICHOLLS VIC"
"U 1 3 TIMPERLEY ST NICHOLLS VIC"
"U 1 5 TIMPERLEY ST NICHOLLS VIC"
similar_addresses:
people_names,Original_Address,Numbers,street_Name,street_type,post_code,suburb,PO,UID
Giuseppe Conte,U 1 3 TIMPERLEY ST NICHOLLS VIC,1,TIMPERLEY,ST,5469,NICHOLLS,,
Giuseppe Conte,U 1 3 TIMPERLEY ST NICHOLLS VIC,TIMPERLEY,ST,5469,NICHOLLS,,
Mario Pertini,U 2 5 TIMPERLEY ST NICHOLLS VIC,TIMPERLEY,ST,5469,NICHOLLS,,
Mario Pertini,U 2 5 TIMPERLEY ST NICHOLLS VIC,5,TIMPERLEY,ST,5469,NICHOLLS,,
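The script evaluates whether an address refers to a unit or to a standalone house.
Is there any way to perform this task faster?
I am adding a result set and an explanation so that what the code does is easier to understand.
Result set: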
distinct_similar_addresses V2
"U 2 5 TIMPERLEY ST NICHOLLS VIC" 2
"U 1 3 TIMPERLEY ST NICHOLLS VIC" 2
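The code simply counts how often names are associated with a single address line: for each address, the number of rows divided by the number of distinct names.
In fact, if the address is repeated, it refers to a unit; otherwise it is a single house.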
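To make that ratio concrete, here is a minimal sketch in base R. It assumes a toy data.frame holding only the two relevant columns of the sample rows above; the variable names `rows_per_address`, `names_per_address` and `ratio` are illustrative, not part of the original script.

```r
# Toy data: only the two columns that matter, copied from the sample rows.
similar_addresses <- data.frame(
  people_names = c("Giuseppe Conte", "Giuseppe Conte",
                   "Mario Pertini", "Mario Pertini"),
  Original_Address = c("U 1 3 TIMPERLEY ST NICHOLLS VIC",
                       "U 1 3 TIMPERLEY ST NICHOLLS VIC",
                       "U 2 5 TIMPERLEY ST NICHOLLS VIC",
                       "U 2 5 TIMPERLEY ST NICHOLLS VIC"),
  stringsAsFactors = FALSE
)

# Number of rows recorded for each address.
rows_per_address <- tapply(similar_addresses$people_names,
                           similar_addresses$Original_Address,
                           length)

# Number of distinct names recorded for each address.
names_per_address <- tapply(similar_addresses$people_names,
                            similar_addresses$Original_Address,
                            function(v) length(unique(v)))

# Rows per distinct name; a value above 1 means the same name repeats.
ratio <- rows_per_address / names_per_address
print(ratio)
```

Each sample address has two rows and one distinct name, so the ratio is 2 for both, which matches the result set.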
Sorry for the inconvenience with the data, and thanks to Roland for the help.
This is the solution:
library(data.table)
library(dplyr)  # provides %>% and select()
# x: rows per address; y: distinct names per address
x <- similar_addresses[, .N, by = Original_Address] %>% select('N')
y <- similar_addresses[, length(unique(people_names)) , by = Original_Address] %>% select('V1')
number_of_rows_similar_addresses <- cbind(unique(similar_addresses$Original_Address), x/y)
Thanks Gregor, this is probably better:
x <- similar_addresses[, .N, by = Original_Address]$N
y <- similar_addresses[, length(unique(people_names)) , by = Original_Address]$V1
number_of_rows_similar_addresses <- cbind(unique(similar_addresses$Original_Address), x/y)
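The same computation can probably be done in a single grouped call, which avoids grouping twice and keeps the ratio numeric (`cbind()` on a character vector and a numeric vector returns a character matrix). The following is only a sketch, assuming `similar_addresses` is a data.table and using a toy copy of the sample rows; the column name `ratio` is my own choice.

```r
library(data.table)

# Toy data: only the two columns that matter, copied from the sample rows.
similar_addresses <- data.table(
  people_names = c("Giuseppe Conte", "Giuseppe Conte",
                   "Mario Pertini", "Mario Pertini"),
  Original_Address = c("U 1 3 TIMPERLEY ST NICHOLLS VIC",
                       "U 1 3 TIMPERLEY ST NICHOLLS VIC",
                       "U 2 5 TIMPERLEY ST NICHOLLS VIC",
                       "U 2 5 TIMPERLEY ST NICHOLLS VIC")
)

# One grouped pass: .N is the row count of the group,
# uniqueN() is the number of distinct names in the group.
number_of_rows_similar_addresses <- similar_addresses[
  , .(ratio = .N / uniqueN(people_names)),
  by = Original_Address
]
print(number_of_rows_similar_addresses)
```

Because the address and the ratio come out of the same grouped call, they stay aligned row by row, without relying on `unique()` returning the addresses in the same order as the groups.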