Flatten a nested list with uneven column numbers into a data frame in R

I am struggling to bind a nested list into a data frame for further processing.

Edit: below is a sample of the raw nested data, before any attempt to flatten it.

list(list(list(url = "https://lda.senate.gov/api/v1/filings/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/", 
    filing_uuid = "5e4bbd96-db94-4ea3-a310-7a7fb1e93fff", filing_type = "Q1", 
    filing_type_display = "1st Quarter - Report", filing_year = 2021L, 
    filing_period = "first_quarter", filing_period_display = "1st Quarter (Jan 1 - Mar 31)", 
    filing_document_url = "https://lda.senate.gov/filings/public/filing/5e4bbd96-db94-4ea3-a310-7a7fb1e93fff/print/", 
    filing_document_content_type = "text/html", income = "15000.00", 
    expenses = NULL, expenses_method = NULL, expenses_method_display = NULL, 
    posted_by_name = "Christian Smith", dt_posted = "2021-04-30T10:20:59.217000-04:00", 
    termination_date = NULL, registrant = list(id = 8214L, url = "https://lda.senate.gov/api/v1/registrants/8214/", 
        house_registrant_id = 31113L, name = "CAPSTONE NATIONAL PARTNERS", 
        description = "public affairs", address_1 = "501 Capitol Court NE", 
        address_2 = "Suite 100", address_3 = NULL, address_4 = NULL, 
        city = "Washington", state = "DC", state_display = "District of Columbia", 
        zip = "20002", country = "US", country_display = "United States of America", 
        ppb_country = "US", ppb_country_display = "United States of America", 
        contact_name = "", contact_telephone = "", dt_updated = "2022-01-13T14:47:31.828778-05:00"), 
    client = list(id = 111342L, url = "https://lda.senate.gov/api/v1/clients/111342/", 
        client_id = 303L, name = "OSHKOSH CORPORATION", general_description = "manufacturing", 
        client_government_entity = FALSE, client_self_select = NULL, 
        state = "WI", state_display = "Wisconsin", country = "US", 
        country_display = "United States of America", ppb_state = "WI", 
        ppb_state_display = "Wisconsin", ppb_country = "US", 
        ppb_country_display = "United States of America", effective_date = "2016-04-01"), 
    lobbying_activities = list(list(general_issue_code = "BUD", 
        general_issue_code_display = "Budget/Appropriations", 
        description = "FY22 Appropriations", foreign_entity_issues = "", 
        lobbyists = list(list(lobbyist = list(id = 63767L, prefix = NULL, 
            prefix_display = NULL, first_name = "WILLIAM", nickname = NULL, 
            middle_name = NULL, last_name = "STONE", suffix = NULL, 
            suffix_display = NULL), covered_position = "Chief of Staff, Dave Obey: House Appropriations Committee", 
            new = FALSE)), government_entities = list(list(id = 2L, 
            name = "HOUSE OF REPRESENTATIVES"), list(id = 1L, 
            name = "SENATE")))), conviction_disclosures = list(), 
    foreign_entities = list(), affiliated_organizations = list())), 
    list(list(url = "https://lda.senate.gov/api/v1/filings/177b995a-3be2-4127-b962-795e76974617/", 
        filing_uuid = "177b995a-3be2-4127-b962-795e76974617", 
        filing_type = "Q1", filing_type_display = "1st Quarter - Report", 
        filing_year = 2021L, filing_period = "first_quarter", 
        filing_period_display = "1st Quarter (Jan 1 - Mar 31)", 
        filing_document_url = "https://lda.senate.gov/filings/public/filing/177b995a-3be2-4127-b962-795e76974617/print/", 
        filing_document_content_type = "text/html", income = "22500.00", 
        expenses = NULL, expenses_method = NULL, expenses_method_display = NULL, 
        posted_by_name = "Doyce Boesch", dt_posted = "2021-04-30T11:22:12.233000-04:00", 
        termination_date = NULL, registrant = list(id = 400677020L, 
            url = "https://lda.senate.gov/api/v1/registrants/400677020/", 
            house_registrant_id = NULL, name = "MR. DOYCE BOESCH", 
            description = "Government Relations", address_1 = "4515 W Street NW", 
            address_2 = NULL, address_3 = NULL, address_4 = NULL, 
            city = "Washington", state = "DC", state_display = "District of Columbia", 
            zip = "20007", country = "US", country_display = "United States of America", 
            ppb_country = "US", ppb_country_display = "United States of America", 
            contact_name = "DOYCE BOESCH", contact_telephone = "+1 202-731-9995", 
            dt_updated = "2022-01-13T14:59:12.412096-05:00"), 
        client = list(id = 194057L, url = "https://lda.senate.gov/api/v1/clients/194057/", 
            client_id = 75L, name = "INVESTMENT COMPANY INSTITUTE", 
            general_description = "Stock Market and Financial Services", 
            client_government_entity = FALSE, client_self_select = FALSE, 
            state = "DC", state_display = "District of Columbia", 
            country = "US", country_display = "United States of America", 
            ppb_state = NULL, ppb_state_display = NULL, ppb_country = "US", 
            ppb_country_display = "United States of America", 
            effective_date = "2012-07-01"), lobbying_activities = list(
            list(general_issue_code = "FIN", general_issue_code_display = "Financial Institutions/Investments/Securities", 
                description = "providing awareness of members positions", 
                foreign_entity_issues = "", lobbyists = list(
                  list(lobbyist = list(id = 52828L, prefix = NULL, 
                    prefix_display = NULL, first_name = "DOYCE", 
                    nickname = NULL, middle_name = NULL, last_name = "BOESCH", 
                    suffix = NULL, suffix_display = NULL), covered_position = NULL, 
                    new = FALSE)), government_entities = list(
                  list(id = 2L, name = "HOUSE OF REPRESENTATIVES"), 
                  list(id = 1L, name = "SENATE")))), conviction_disclosures = list(), 
        foreign_entities = list(), affiliated_organizations = list())))

Call this highly nested data

my.data

I then tried to flatten it with

flat.df <- lapply(my.data, function(i) list(unlist(i, recursive = FALSE)))

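It sort of works, but each element of the flat.df list still contains several sub-lists, such as "lobbying_activities" and "lobbyists", and they are not expanded (I want the information inside them).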

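But if I set recursive to TRUE, the flattened list ends up with duplicated columns and, most frustratingly, some columns get mixed up (for example, a person's name lands in the expenses column).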
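One likely cause of the mixed-up columns, sketched on two toy records with hypothetical values modelled on the data above: unlist() silently drops NULL entries, so records with NULLs in different places produce vectors of different lengths, and rbind() then recycles the shorter one under the first record's names.

```r
# Two toy records (hypothetical values, not taken from the real filings).
rec1 <- list(income = "15000.00", expenses = "100.00",
             posted_by_name = "Christian Smith")
rec2 <- list(income = "22500.00", expenses = NULL,
             posted_by_name = "Doyce Boesch")

u1 <- unlist(rec1)  # length 3
u2 <- unlist(rec2)  # length 2: the NULL `expenses` entry is dropped

# rbind() takes its column names from u1 and recycles the shorter u2
# (normally with a warning), so values end up under the wrong names.
m <- suppressWarnings(rbind(u1, u2))

# The second record's name now sits in the `expenses` column.
m[2, "expenses"]
```

Replacing NULL with NA before flattening, or binding by name rather than by position, would avoid this kind of shift.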

Ideally, I want to flatten these sub-lists within each table and merge everything into a single table, then bind the result into a data frame with

df <- as.data.frame(do.call("rbind", flat.df))

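A tibble is a good format here because tibbles support nested data.frames. I would aim for a 2-row tibble in wide format, in which each nested list element is its own data.frame that we can manipulate later as needed. I would do something like this: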

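library(tidyverse)

# Drop the outer list level so that each element of `l` is one filing
l <- unlist(my.data, recursive = FALSE)

# Positions of the fields that are themselves lists (registrant, client, ...)
ind_to_nest <- which(map_lgl(l[[1]], is.list))
non_tbl <- map(l, ~ .x[-ind_to_nest])  # scalar fields
tbl <- map(l, ~ .x[ind_to_nest])       # nested fields

# One row per filing; each nested field becomes its own list-column
df <- bind_rows(non_tbl) %>%
  mutate(n = 1:n(), .before = 1) %>%
  mutate(data = map(tbl, ~ map(.x, ~ flatten(.x) %>% bind_cols))) %>%
  unnest_wider(data, simplify = FALSE)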

Note that this does throw a bunch of warnings, because there are name collisions within the lists.

#> New names:
#> * id -> id...5
#> * id -> id...10

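This can be resolved by specifying a naming strategy, or by reconsidering how the data is read into R so that the naming conflicts are dealt with early.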
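As a minimal sketch of one possible naming strategy, each nested list's names can be prefixed with the name of its parent before anything is combined. The helper prefix_names() is hypothetical (it is not part of any package), and the two lists are cut-down versions of the registrant and client entries above.

```r
# Hypothetical helper: prefix every name in a list with its parent's name.
prefix_names <- function(x, prefix) {
  stats::setNames(x, paste(prefix, names(x), sep = "_"))
}

# Cut-down versions of two nested entries that both contain `id` and `name`.
registrant <- list(id = 8214L, name = "CAPSTONE NATIONAL PARTNERS")
client     <- list(id = 111342L, name = "OSHKOSH CORPORATION")

# After prefixing, the names no longer collide when the lists are combined.
row <- c(prefix_names(registrant, "registrant"),
         prefix_names(client, "client"))
one_row <- as.data.frame(row, stringsAsFactors = FALSE)

names(one_row)
```

With unique names there is nothing left to repair, so the "New names" messages should not appear for these fields.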

#> Outer names are only allowed for unnamed scalar atomic inputs 

This one is a bit harder to resolve, but the issue is a starting point.

For analysis, the sub-tibbles can be cleaned up as and when needed, since different tasks call for different shapes.
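For instance, a nested column can be expanded into long format only when a task needs it. The sketch below uses a small hand-built tibble as a stand-in (an assumption: the real result of the code above may be shaped differently), with a government_entities list-column like the one in the data.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in for the wide tibble: one row per filing,
# with one nested tibble per row.
df <- tibble(
  n = 1:2,
  government_entities = list(
    tibble(id = c(2L, 1L), name = c("HOUSE OF REPRESENTATIVES", "SENATE")),
    tibble(id = c(2L, 1L), name = c("HOUSE OF REPRESENTATIVES", "SENATE"))
  )
)

# Expand the nested tibbles into rows; names_sep prefixes the inner
# column names so they cannot clash with outer ones.
long <- unnest(df, government_entities, names_sep = "_")
long
```

The wide tibble stays untouched, so other nested columns can be unnested separately for other tasks.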