How do I combine lines of text in R based on column alignment?
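I'm trying to parse text data from a questionnaire that was extracted from a PDF using {pdftools}. I end up with a data frame that looks like this aligned-text nightmare: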
example <- data.frame(
lines = c("Beverages",
"What beverages did you drink?",
" Please check the box next to each beverage that you drank at least once in the past 12 months.",
" Tomato juice or vegetable juice",
" Orange juice or grapefruit juice",
" Grape juice",
" Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)",
" Fruit or vegetable smoothies",
" Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry",
" cocktail)",
" Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and",
" coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)",
" Chocolate milk or hot chocolate",
"Tomato juice or vegetable juice",
" You drank tomato juice or vegetable juice in the past 12 months.",
" Over the past 12 months, how often did you drink tomato juice or vegetable juice?",
" 1 time per month or less",
" 2-3 times per month"
)
)
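Each response begins with a box, \uf06f, and sometimes a response is long enough to wrap onto a second line.
Can anyone offer advice on how to join the text when a response is split across two lines?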
You could use
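library(dplyr)
library(stringr)

example %>%
  group_by(
    # new category at every line that does not start with whitespace
    category = cumsum(str_detect(lines, "^[^\\s]")),
    # new group at every line indented by exactly two whitespace characters
    group_1 = cumsum(str_detect(lines, "^\\s{2}(?!\\s)")),
    # new group at every line that contains a response marker
    group_3 = cumsum(str_detect(lines, "\uf06f|\uf0a1"))) %>%
  mutate(
    # trim continuation lines: inside a marker group but without a marker
    lines = ifelse(group_3 > 0 & !str_detect(lines, "\uf06f|\uf0a1"), str_trim(lines), lines),
    # paste all lines of a marker group into one string
    lines = case_when(
      group_3 > 0 ~ str_c(lines, collapse = " "),
      TRUE ~ lines
    )
  ) %>%
  distinct() %>%
  ungroup() %>%
  select(lines)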
to get
# A tibble: 16 x 1
lines
<chr>
1 "Beverages"
2 "What beverages did you drink?"
3 " Please check the box next to each beverage that you drank at least once in the past 12 months."
4 " \uf06f Tomato juice or vegetable juice"
5 " \uf06f Orange juice or grapefruit juice"
6 " \uf06f Grape juice"
7 " \uf06f Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)"
8 " \uf06f Fruit or vegetable smoothies"
9 " \uf06f Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry cocktail)"
10 " \uf06f Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)"
11 " \uf06f Chocolate milk or hot chocolate"
12 "Tomato juice or vegetable juice"
13 " \uf06f You drank tomato juice or vegetable juice in the past 12 months."
14 "Over the past 12 months, how often did you drink tomato juice or vegetable juice?"
15 " \uf0a1 1 time per month or less"
16 " \uf0a1 2-3 times per month"
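What are we doing here?
- First we build a "category". Category lines do not start with a whitespace character, so we look for "^[^\\s]": ^ means "starts with" and [^\\s] means "not a whitespace character".
- The next grouping level is lines that start with two whitespace characters not followed by another one, hence "^\\s{2}(?!\\s)".
- The last grouping level is lines that contain one of these UTF characters: "\uf06f|\uf0a1".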
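To see how the three grouping ids behave, here is a minimal sketch on a made-up vector. The `toy` lines and their indentation are assumptions for illustration only, not the questionnaire data. It uses base R `tapply()` for the join instead of the dplyr pipeline, and unlike the answer above it trims every line, so the joined strings lose their leading indentation.

```r
library(stringr)

# Hypothetical toy input: one heading, one two-space question, and two
# boxed responses, the first of which wraps onto a second line.
toy <- c(
  "Heading",
  "  Question text",
  "     \uf06f A long response that wraps",
  "       onto a second line",
  "     \uf06f A short response"
)

# str_detect() is TRUE where a line opens a new group;
# cumsum() turns those flags into a running group id.
category <- cumsum(str_detect(toy, "^[^\\s]"))
group_1  <- cumsum(str_detect(toy, "^\\s{2}(?!\\s)"))
group_3  <- cumsum(str_detect(toy, "\uf06f|\uf0a1"))

# One row per input line, showing which ids change where.
print(data.frame(category, group_1, group_3))

# Lines that share all three ids belong together; here only lines 3 and 4 do.
key <- paste(category, group_1, group_3)
joined <- as.vector(tapply(
  str_trim(toy),
  factor(key, levels = unique(key)),  # keep the original line order
  paste, collapse = " "
))
print(joined)
```

The wrapped line carries no marker, so `group_3` does not advance on it and it inherits the id of the response above it. That shared id is what lets the two lines be pasted together.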