Clean up badly formatted questionnaires using regex

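I have a badly formatted questionnaire in which the answer (and an accompanying line break) often lands somewhere in the middle of the question. This is a sentence segmentation problem (a "sentence" here being a question plus its answer), so the model has a hard time extracting information from each Q&A pair!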

Example:

\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\n

Printed version of the example:

01 Do you have preexisting      No
conditions?
02 Within the past 12 months I worried about          Never True
my health would get worse.
03 Within the past 12 months I have had         Never True
high blood pressure.
04 What is your housing situation today?   I have housing
05 How many times have you moved in the past 12        Zero (I did not move)
months?
06 Are you worried that in the next 2 months, you may not    No
have your own housing to live in?
07 Do you have trouble paying your heating or electricity    No
bill?
08 Do you have trouble paying for medicines?                 No
09 Are you currently unemployed and looking for work?        No
10 Are you interested in more education?                     Yes

Expected output:

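  1. If the answer sits somewhere inside the question, move it to the end of the sentence;
  2. Remove unnecessary spaces and line breaks from the question;
  3. Replace the question mark or other punctuation at the end of the question with ":", so that the sentence segmentation model keeps the answer after the ":" with its question, before the next question begins.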

Expected example output:

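\n01 Do you have preexisting conditions: No\n02 Within the past 12 months I worried about my health would get worse: Never True\n03 Within the past 12 months I have had high blood pressure: Never True\n04 What is your housing situation today: I have housing\n05 How many times have you moved in the past 12 months: Zero (I did not move)\n06 Are you worried that in the next 2 months, you may not have your own housing to live in: No\n07 Do you have trouble paying your heating or electricity bill: No\n08 Do you have trouble paying for medicines: No\n09 Are you currently unemployed and looking for work: No\n10 Are you interested in more education: Yes\n\n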

I have been trying to match consecutive occurrences of \n(0[1-9]|1[0-3]) and to use re.sub together with lambda m: m.group(), but with no luck so far. Any suggestions are welcome!

This gets close, I believe:

import re

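# Raw strings keep "\d" and "\s" from being read as (invalid) string escapes.
# A new question starts at a newline followed by a two-digit number and a space.
question_break_re = re.compile(r"\n(?=\d{2} )")
# The answer is the text after a run of 2+ whitespace characters, up to the next newline.
answer_re = re.compile(r"\s{2,}([^\n]+)")
whitespace_re = re.compile(r"\s+")
# An optional "?" or "." at the very end of the question.
end_of_question_mark_re = re.compile(r"(?:\?|\.)?$")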

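def tidy_up_question(question):
    answer = None
    match = answer_re.search(question)
    if match:
        # Cut the answer (and the spaces in front of it) out of the question text.
        answer = match.group(1)
        question = question[:match.start(0)] + question[match.end(0):]
    # Collapse line breaks and repeated spaces into single spaces.
    question = whitespace_re.sub(' ', question).strip()
    if answer is not None:
        # Swap the trailing punctuation for ": <answer>"; count=1 avoids a second, empty match at the end.
        question = end_of_question_mark_re.sub(f": {answer}", question, count=1)
    return question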


text = "\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\n"

q_and_a = [
    tidy_up_question(question)
    for question in question_break_re.split(text)
    if question.strip()
]

print('\n'.join(q_and_a))

Output:

01 Do you have preexisting conditions: No
02 Within the past 12 months I worried about my health would get worse: Never True
03 Within the past 12 months I have had high blood pressure: Never True
04 What is your housing situation today: I have housing
05 How many times have you moved in the past 12 months: Zero (I did not move)
06 Are you worried that in the next 2 months, you may not have your own housing to live in: No
07 Do you have trouble paying your heating or electricity bill: No
08 Do you have trouble paying for medicines: No
09 Are you currently unemployed and looking for work: No
10 Are you interested in more education: Yes

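This fails in some edge cases: for example, if a wrapped line happens to begin with a two-digit number followed by a space (say the "12" of "12 months" lands at the start of the next line), that line is treated as the start of a new question. Likewise, any run of multiple consecutive spaces that does not immediately precede the answer will throw things off, because the first such run is taken as the start of the answer.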
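
One possible way to soften the first edge case, sketched under the assumption that the questions are numbered consecutively starting at 01. The helper name split_sequential and the shortened sample text are made up for illustration; they are not part of the code above. The idea is to accept a question break only when the leading number is the one expected next:

```python
import re

def split_sequential(text):
    """Split text into questions, starting a new one only when a line
    begins with the next expected two-digit number (assumed: 01, 02, ...)."""
    questions = []
    expected = 1
    for line in text.split("\n"):
        m = re.match(r"(\d{2}) ", line)
        if m and int(m.group(1)) == expected:
            questions.append(line)
            expected += 1
        elif questions:
            # Not the expected number: treat it as a continuation line.
            questions[-1] += "\n" + line
    return questions

# Hypothetical sample where "12 months?" wraps onto its own line.
sample = (
    "\n01 How many times have you moved in the past      Zero\n"
    "12 months?\n"
    "02 Do you have trouble paying for medicines?    No\n"
)

naive = [q for q in re.split(r"\n(?=\d{2} )", sample) if q.strip()]
print(len(naive))                     # the plain lookahead split sees an extra "question"
print(len(split_sequential(sample)))  # the sequential check keeps "12 months?" attached
```

Each piece returned this way can be passed to tidy_up_question as before. It still breaks if a wrapped line starts with exactly the next expected number, and it does nothing for the second edge case (stray runs of spaces).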

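The approach I used: split the string into questions, relying on every question starting on a line that begins with a two-digit number; identify the answer as the stretch of text between a run of multiple spaces and a newline; and finally replace the trailing punctuation with a colon followed by the answer.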
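
To make those three steps concrete, here is a small trace on two of the questions from the example, with each intermediate value kept in its own variable. The variable names are only for illustration, and the patterns are the same ones used above, written inline:

```python
import re

text = (
    "\n04 What is your housing situation today?   I have housing\n"
    "05 How many times have you moved in the past 12        Zero (I did not move)\n"
    "months?\n"
)

# Step 1: split at every newline that is followed by "NN " (two digits and a space).
pieces = [p for p in re.split(r"\n(?=\d{2} )", text) if p.strip()]
question = pieces[1]  # question 05, still containing its answer and a line break

# Step 2: the answer is whatever follows the first run of 2+ whitespace
# characters, up to the next newline; cut it out of the question.
match = re.search(r"\s{2,}([^\n]+)", question)
answer = match.group(1)
remainder = question[:match.start()] + question[match.end():]

# Collapse the remaining whitespace (including the newline) into single spaces.
flattened = re.sub(r"\s+", " ", remainder).strip()

# Step 3: replace the trailing "?" or "." with a colon and the answer.
result = re.sub(r"(?:\?|\.)?$", f": {answer}", flattened, count=1)

print(answer)     # Zero (I did not move)
print(flattened)  # 05 How many times have you moved in the past 12 months?
print(result)     # 05 How many times have you moved in the past 12 months: Zero (I did not move)
```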