Flatten a nested list with uneven column numbers into a data frame in R
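I'm struggling to bind a nested list into a data frame for further processing.
Edit: below is a sample of the original nested data, before any attempt to flatten it.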
list(list(list(url = "https://lda.senate.gov/api/v1/filings/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/",
filing_uuid = "5e4bbd96-db94-4ea3-a310-7a7fb1e93fff", filing_type = "Q1",
filing_type_display = "1st Quarter - Report", filing_year = 2021L,
filing_period = "first_quarter", filing_period_display = "1st Quarter (Jan 1 - Mar 31)",
filing_document_url = "https://lda.senate.gov/filings/public/filing/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/print/",
filing_document_content_type = "text/html", income = "15000.00",
expenses = NULL, expenses_method = NULL, expenses_method_display = NULL,
posted_by_name = "Christian Smith", dt_posted = "2021-04-30T10:20:59.217000-04:00",
termination_date = NULL, registrant = list(id = 8214L, url = "https://lda.senate.gov/api/v1/registrants/8214/",
house_registrant_id = 31113L, name = "CAPSTONE NATIONAL PARTNERS",
description = "public affairs", address_1 = "501 Capitol Court NE",
address_2 = "Suite 100", address_3 = NULL, address_4 = NULL,
city = "Washington", state = "DC", state_display = "District of Columbia",
zip = "20002", country = "US", country_display = "United States of America",
ppb_country = "US", ppb_country_display = "United States of America",
contact_name = "", contact_telephone = "", dt_updated = "2022-01-13T14:47:31.828778-05:00"),
client = list(id = 111342L, url = "https://lda.senate.gov/api/v1/clients/111342/",
client_id = 303L, name = "OSHKOSH CORPORATION", general_description = "manufacturing",
client_government_entity = FALSE, client_self_select = NULL,
state = "WI", state_display = "Wisconsin", country = "US",
country_display = "United States of America", ppb_state = "WI",
ppb_state_display = "Wisconsin", ppb_country = "US",
ppb_country_display = "United States of America", effective_date = "2016-04-01"),
lobbying_activities = list(list(general_issue_code = "BUD",
general_issue_code_display = "Budget/Appropriations",
description = "FY22 Appropriations", foreign_entity_issues = "",
lobbyists = list(list(lobbyist = list(id = 63767L, prefix = NULL,
prefix_display = NULL, first_name = "WILLIAM", nickname = NULL,
middle_name = NULL, last_name = "STONE", suffix = NULL,
suffix_display = NULL), covered_position = "Chief of Staff, Dave Obey: House Appropriations Committee",
new = FALSE)), government_entities = list(list(id = 2L,
name = "HOUSE OF REPRESENTATIVES"), list(id = 1L,
name = "SENATE")))), conviction_disclosures = list(),
foreign_entities = list(), affiliated_organizations = list())),
list(list(url = "https://lda.senate.gov/api/v1/filings/177b995a-3be2-4127-b962-795e76974617/",
filing_uuid = "177b995a-3be2-4127-b962-795e76974617",
filing_type = "Q1", filing_type_display = "1st Quarter - Report",
filing_year = 2021L, filing_period = "first_quarter",
filing_period_display = "1st Quarter (Jan 1 - Mar 31)",
filing_document_url = "https://lda.senate.gov/filings/public/filing/177b995a-3be2-4127-b962-795e76974617/print/",
filing_document_content_type = "text/html", income = "22500.00",
expenses = NULL, expenses_method = NULL, expenses_method_display = NULL,
posted_by_name = "Doyce Boesch", dt_posted = "2021-04-30T11:22:12.233000-04:00",
termination_date = NULL, registrant = list(id = 400677020L,
url = "https://lda.senate.gov/api/v1/registrants/400677020/",
house_registrant_id = NULL, name = "MR. DOYCE BOESCH",
description = "Government Relations", address_1 = "4515 W Street NW",
address_2 = NULL, address_3 = NULL, address_4 = NULL,
city = "Washington", state = "DC", state_display = "District of Columbia",
zip = "20007", country = "US", country_display = "United States of America",
ppb_country = "US", ppb_country_display = "United States of America",
contact_name = "DOYCE BOESCH", contact_telephone = "+1 202-731-9995",
dt_updated = "2022-01-13T14:59:12.412096-05:00"),
client = list(id = 194057L, url = "https://lda.senate.gov/api/v1/clients/194057/",
client_id = 75L, name = "INVESTMENT COMPANY INSTITUTE",
general_description = "Stock Market and Financial Services",
client_government_entity = FALSE, client_self_select = FALSE,
state = "DC", state_display = "District of Columbia",
country = "US", country_display = "United States of America",
ppb_state = NULL, ppb_state_display = NULL, ppb_country = "US",
ppb_country_display = "United States of America",
effective_date = "2012-07-01"), lobbying_activities = list(
list(general_issue_code = "FIN", general_issue_code_display = "Financial Institutions/Investments/Securities",
description = "providing awareness of members positions",
foreign_entity_issues = "", lobbyists = list(
list(lobbyist = list(id = 52828L, prefix = NULL,
prefix_display = NULL, first_name = "DOYCE",
nickname = NULL, middle_name = NULL, last_name = "BOESCH",
suffix = NULL, suffix_display = NULL), covered_position = NULL,
new = FALSE)), government_entities = list(
list(id = 2L, name = "HOUSE OF REPRESENTATIVES"),
list(id = 1L, name = "SENATE")))), conviction_disclosures = list(),
foreign_entities = list(), affiliated_organizations = list())))
Call this highly nested data my.data. I then tried to flatten it with:
flat.df <- lapply(my.data, function(i) list(unlist(i, recursive = F)))
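It sort of works, but each element of flat.df still contains several sub-lists, such as "lobbying_activities" and "lobbyists", and these are not expanded (I want the information inside them).
But if I set recursive to TRUE, the flattened list ends up with duplicated columns and, most frustratingly, some columns get mixed up (for example, a person's name ends up in the expenses column).
Ideally, I'd like to flatten such sub-lists within each table and combine everything into one table, then bind them into a data frame with: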
df<- as.data.frame(do.call("rbind", flat.df))
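A likely explanation for the mixed-up columns, sketched on toy data rather than on the real filings: `unlist()` silently drops `NULL` fields, so records end up with different lengths, and `rbind()` on plain vectors aligns values by position and recycles the shorter one. The records `rec1`/`rec2` (including the made-up `expenses` value) and the helper `null_to_na()` are hypothetical names for this illustration. Replacing `NULL` with `NA` before binding is one possible workaround, not necessarily the only one.

```r
# Toy records with the same three fields; `expenses` is NULL in the first.
rec1 <- list(income = "15000.00", expenses = NULL,
             posted_by_name = "Christian Smith")
rec2 <- list(income = "22500.00", expenses = "100.00",
             posted_by_name = "Doyce Boesch")

# unlist() drops the NULL, so the first record has only two values left.
lengths_after_unlist <- c(length(unlist(rec1)), length(unlist(rec2)))

# rbind() on vectors matches by position and recycles the shorter vector
# (with a warning), so the name slides into the second column.
bad <- suppressWarnings(do.call(rbind, list(unlist(rec1), unlist(rec2))))

# Hypothetical helper: turn NULL fields into NA so every record keeps
# the same set of named fields.
null_to_na <- function(rec) {
  lapply(rec, function(x) if (is.null(x)) NA_character_ else x)
}

# Binding one-row data frames aligns columns by name instead of position.
rows <- lapply(list(rec1, rec2), function(rec) {
  as.data.frame(null_to_na(rec), stringsAsFactors = FALSE)
})
good <- do.call(rbind, rows)
```

Inspecting `bad` next to `good` shows the difference: in `good`, the missing expense stays an `NA` in its own column instead of shifting its neighbours.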
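A tibble is a good format here because it supports nested data.frames. I would aim for a 2-row tibble in wide format, in which each nested list element is its own data.frame that can be manipulated later as needed. I would do something like this: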
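library(tidyverse)
# Drop the outer wrapper so that each element of l is one filing
# (assumes the nested list is stored in my.data, as in the question)
l <- unlist(my.data, recursive = FALSE)
# Positions of the fields that are themselves lists (registrant, client, ...)
ind_to_nest <- which(map_lgl(l[[1]], is.list))
non_tbl <- map(l, ~ .x[-ind_to_nest])  # scalar fields
tbl <- map(l, ~ .x[ind_to_nest])       # nested fields
df <- bind_rows(non_tbl) %>%
  mutate(n = 1:n(), .before = 1) %>%
  mutate(data = map(tbl, ~ map(.x, ~ flatten(.x) %>% bind_cols))) %>%
  unnest_wider(data, simplify = FALSE)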
Note that this does throw a bunch of warnings, because there are name conflicts within the lists:
#> New names:
#> * id -> id...5
#> * id -> id...10
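This can be resolved by specifying a naming strategy, or by rethinking how the data is read into R so that naming conflicts are dealt with early.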
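One way to read "specifying a naming strategy", sketched in base R on a toy filing: prefix each nested field with the name of its parent list before combining, so that `id` and `name` from `registrant` and `client` no longer collide. The names `filing`, `prefix_names()`, and the `parent_field` scheme are assumptions made for this illustration.

```r
# Toy filing: both nested lists contain `id` and `name`.
filing <- list(
  filing_uuid = "5e4bbd96-db94-4ea3-a310-7a7fb1e93fff",
  registrant = list(id = 8214L, name = "CAPSTONE NATIONAL PARTNERS"),
  client = list(id = 111342L, name = "OSHKOSH CORPORATION")
)

# Hypothetical helper: rename the fields of x to "<parent>_<field>".
prefix_names <- function(x, parent) {
  names(x) <- paste(parent, names(x), sep = "_")
  x
}

nested <- c("registrant", "client")

# Keep the scalar fields as they are, then append the prefixed nested fields.
flat <- c(
  filing[setdiff(names(filing), nested)],
  do.call(c, lapply(nested, function(nm) prefix_names(filing[[nm]], nm)))
)

# A one-row data frame with unambiguous column names.
flat_df <- as.data.frame(flat, stringsAsFactors = FALSE)
```

In the tidyverse pipeline, the `names_sep` argument of `unnest_wider()` serves a similar purpose and may be worth trying.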
#> Outer names are only allowed for unnamed scalar atomic inputs
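This one is a bit harder to resolve, but the corresponding issue is a starting point.
For analysis, the sub-tibbles can be cleaned up when needed, since different tasks require different shapes.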
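As an illustration of reshaping a nested piece for one specific task, here is a minimal base R sketch that turns the government entities of each lobbying activity into a long table, one row per entity. The toy `filing` is a cut-down copy of the first record, and `entities_long()` is a hypothetical helper, not part of the answer's pipeline.

```r
# Cut-down copy of the first filing, keeping only the fields used here.
filing <- list(
  filing_uuid = "5e4bbd96-db94-4ea3-a310-7a7fb1e93fff",
  lobbying_activities = list(
    list(
      general_issue_code = "BUD",
      government_entities = list(
        list(id = 2L, name = "HOUSE OF REPRESENTATIVES"),
        list(id = 1L, name = "SENATE")
      )
    )
  )
)

# Hypothetical helper: one row per (activity, government entity) pair.
entities_long <- function(filing) {
  rows <- lapply(filing$lobbying_activities, function(act) {
    ents <- act$government_entities
    if (length(ents) == 0) return(NULL)  # skip activities without entities
    data.frame(
      filing_uuid = filing$filing_uuid,
      general_issue_code = act$general_issue_code,
      entity_id = vapply(ents, function(e) e$id, integer(1)),
      entity_name = vapply(ents, function(e) e$name, character(1)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)  # NULL entries are ignored by rbind()
}

long <- entities_long(filing)
```

Applying the helper to every filing with `lapply()` and binding the results with `rbind()` would give a single long table that can be joined back to the wide one by `filing_uuid`.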