How do I combine lines of text in R based on column alignment?

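I'm trying to parse text data from a questionnaire that I extracted from a PDF with {pdftools}. I end up with a data frame that looks like this nightmare of aligned text: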

example <- data.frame(
  lines = c("Beverages", 
            "What beverages did you drink?", 
            "  Please check the box next to each beverage that you drank at least once in the past 12 months.",
            "        \uf06f Tomato juice or vegetable juice", 
            "        \uf06f Orange juice or grapefruit juice", 
            "        \uf06f Grape juice",
            "        \uf06f Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)", 
            "        \uf06f Fruit or vegetable smoothies", 
            "        \uf06f Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry",
            "            cocktail)", 
            "        \uf06f Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and",
            "            coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)", 
            "        \uf06f Chocolate milk or hot chocolate",
            "Tomato juice or vegetable juice",
            "        \uf06f You drank tomato juice or vegetable juice in the past 12 months.",
            "  Over the past 12 months, how often did you drink tomato juice or vegetable juice?",
            "        \uf0a1 1 time per month or less",
            "        \uf0a1 2-3 times per month"
            )
)

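Each response begins with a box, \uf06f, and sometimes a response is long enough to wrap onto a second line.

Can anyone offer advice on how to join the text when a response is split across two lines?

You could use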

library(dplyr)
library(stringr)

example %>%
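  group_by(
    # New category at each line that starts with a non-whitespace character
    category = cumsum(str_detect(lines, "^[^\\s]")),
    # New sub-group at each line indented by exactly two whitespace characters
    group_1  = cumsum(str_detect(lines, "^\\s{2}(?!\\s)")),
    # New response at each line that contains one of the marker characters
    group_3  = cumsum(str_detect(lines, "\uf06f|\uf0a1"))) %>% 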
  mutate(
    lines = ifelse(group_3 > 0 & !str_detect(lines, "\uf06f|\uf0a1"), str_trim(lines), lines),
    lines = case_when(
      group_3 > 0 ~ str_c(lines, collapse = " "),
      TRUE ~ lines
      )
    ) %>% 
  distinct() %>% 
  ungroup() %>% 
  select(lines)

to get

# A tibble: 16 x 1
   lines                                                                                                    
   <chr>                                                                                                    
 1 "Beverages"                                                                                              
 2 "What beverages did you drink?"                                                                          
 3 "  Please check the box next to each beverage that you drank at least once in the past 12 months."       
 4 "        \uf06f Tomato juice or vegetable juice"                                                         
 5 "        \uf06f Orange juice or grapefruit juice"                                                        
 6 "        \uf06f Grape juice"                                                                             
 7 "        \uf06f Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)"
 8 "        \uf06f Fruit or vegetable smoothies"                                                            
 9 "        \uf06f Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry cocktail)"
10 "        \uf06f Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)"
11 "        \uf06f Chocolate milk or hot chocolate"                                                         
12 "Tomato juice or vegetable juice"                                                                        
13 "        \uf06f You drank tomato juice or vegetable juice in the past 12 months."                        
14 "Over the past 12 months, how often did you drink tomato juice or vegetable juice?"                      
15 "        \uf0a1 1 time per month or less"                                                                
16 "        \uf0a1 2-3 times per month" 

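What are we doing here?

  1. First we build a "category" level. These lines do not start with a whitespace character, so the regex is ^[^\s] (written "^[^\\s]" in an R string, where each backslash must be doubled). ^ anchors the match at the start of the line, and [^\s] means "not a whitespace character".
  2. The next grouping level is lines that start with exactly two whitespace characters that are not followed by another one, hence ^\s{2}(?!\s).
  3. The last grouping level is lines that contain one of the Unicode marker characters, "\uf06f|\uf0a1".

Because each level is wrapped in cumsum(), a match starts a new counter value and every following line that does not match keeps that value. A wrapped continuation line therefore lands in the same group as the response above it, and str_c(lines, collapse = " ") joins the two.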
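
To see how the three counters line up, here is a minimal sketch on a shortened, made-up version of the data. It is an illustration rather than part of the answer above: it uses base R's grepl(..., perl = TRUE) in place of str_detect() so that it runs without extra packages, and the names is_category, is_group_1, is_marker and keys are hypothetical.

```r
# Toy input modelled on the question's data (markers written as \u escapes).
lines <- c(
  "Beverages",
  "  Please check the box next to each beverage.",
  "        \uf06f Other fruit drinks (such as lemonade, or cranberry",
  "            cocktail)",
  "        \uf06f Grape juice"
)

# Base R stand-ins for str_detect(); perl = TRUE is needed for the
# (?!...) lookahead used in the second pattern.
is_category <- grepl("^[^\\s]", lines, perl = TRUE)
is_group_1  <- grepl("^\\s{2}(?!\\s)", lines, perl = TRUE)
is_marker   <- grepl("\uf06f|\uf0a1", lines)

# cumsum() turns each match into a running counter, one column per level.
keys <- data.frame(
  category = cumsum(is_category),
  group_1  = cumsum(is_group_1),
  group_3  = cumsum(is_marker)
)
print(keys)

# The wrapped line (row 4) carries the same three keys as the response
# it continues (row 3), which is why group_by() puts them together.
same_group <- all(keys[3, ] == keys[4, ])
```

Printing keys next to the raw lines is also a convenient way to debug the patterns when a PDF uses a different indentation or marker character.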