Data cleaning: dealing with a large number of different formats from user input
I have some dirty data that comes from user input, so it is inconsistent. Each entry is either a single number or a range of numbers.
import re

number_ranges = [
    '11.6', '665.690, 705.715', '740.54-830.18ABC;900-930ABC', '1200',
    '2100 / 2200; 2320 / 2350', '2300-2400 / 2500-2560 / 2730-2740'
]
# Join everything, strip spaces and letters, treat ';' like ',', then split again
number_ranges = ','.join(number_ranges)
number_ranges = number_ranges.replace(' ', '')
number_ranges = re.sub(r"[a-zA-Z]+", "", number_ranges)
number_ranges = re.sub(r"[;]+", ",", number_ranges)
number_ranges = number_ranges.split(',')
This is the resulting list:
[
'11.6', '665.690', '705.715', '740.54-830.18', '900-930', '1200', '2100/2200',
'2320/2350', '2300-2400/2500-2560/2730-2740'
]
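From here, I know the rules I want to apply:
# a '.' between two longer numbers is really a range separator
for i in number_ranges:
    if (len(i) > 5) and ('.' in i) and ('-' not in i):
        i = i.replace('.', '-')
# several ranges separated by '/'
for i in number_ranges:
    if ('-' in i) and ('/' in i):
        i = i.split('/')
# very short numbers are given in thousands
for i in number_ranges:
    if len(i) < 3:
        i = str(int(i) * 1000)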
I also tried this approach:
for n, i in enumerate(number_ranges):
    if (len(i) > 5) and ('.' in i) and ('-' not in i):
        number_ranges[n] = i.replace('.', '-')
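The difference between the two attempts above is worth spelling out: assigning to the loop variable `i` only rebinds that name, while assigning through the index changes the list itself. A minimal illustration, using a hypothetical two-item list `items` rather than the real data:

```python
items = ['665.690', '1200']

# Rebinding the loop variable: the list element is left untouched.
for i in items:
    if '.' in i:
        i = i.replace('.', '-')
unchanged = list(items)

# Assigning through the index: the list element is replaced.
for n, i in enumerate(items):
    if '.' in i:
        items[n] = i.replace('.', '-')

print(unchanged)  # ['665.690', '1200']
print(items)      # ['665-690', '1200']
```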
665.690 should be 665-690, 740.54-830.18ABC should be 741-830, 2100/2200 should be 2100-2200, and 11.6 should be 11600.
The final result should be ranges expressed as tuples of integers, so:
[(11600,), (665, 690), (705, 715), (741, 830), (900, 930), (1200,), (2100, 2200), (2320, 2350), (2300, 2400), (2500, 2560), (2730, 2740)]
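From there, if I need them as range strings, I can use:
range_strings = []
for pair in number_ranges:
    # join works for one-element tuples too, unlike "{}-{}".format(*pair)
    range_strings.append('-'.join(str(x) for x in pair))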
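I know the logic, but not how to implement it.
I guess what I am trying to figure out is how to replace characters and manipulate strings based on specific conditions.
These are the most common formats, so I want to account for them. I know I can never predict everything someone will type, but I think I can account for 95%+ of cases.
I apologize if I have left out any necessary information. I will provide it as soon as possible.
Thanks.
Edit:
I got it working with the code below: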
number_ranges = ','.join(number_ranges)
number_ranges = number_ranges.replace(' ', '')
number_ranges = re.sub(r"[a-zA-Z]+", "", number_ranges)
number_ranges = re.sub(r"[;]+", ",", number_ranges)
number_ranges = str(number_ranges).split(',')
for n, i in enumerate(number_ranges):
    if ('-' in i) and ('/' in i):
        number_ranges[n] = i.replace('/', ',')
for n, i in enumerate(number_ranges):
    if ('-' not in i) and ('/' in i):
        number_ranges[n] = i.replace('/', '-')
for n, i in enumerate(number_ranges):
    if ('-' not in i) and ('.' in i) and (len(i) > 4):
        number_ranges[n] = i.replace('.', '-')
for n, i in enumerate(number_ranges):
    if ('.' in i) and (len(i) <= 4) and (float(i) < 30):
        number_ranges[n] = str(round(float(i) * 1000))
number_ranges = [i.split(',') for i in number_ranges]
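The code in the edit ends with a nested list of range strings (for example `['2300-2400', '2500-2560', '2730-2740']`), not yet the integer tuples described earlier. A minimal sketch of that last step follows. It assumes values are rounded to the nearest integer, as in 740.54 becoming 741; `to_int_tuples` and `cleaned` are hypothetical names, and `cleaned` is what the edit's code appears to produce for the sample input:

```python
def to_int_tuples(nested):
    """Flatten lists of 'a-b' or 'a' strings into tuples of rounded ints."""
    result = []
    for group in nested:
        for text in group:
            # '740.54-830.18' -> (741, 830); '1200' -> (1200,)
            result.append(tuple(round(float(part)) for part in text.split('-')))
    return result

cleaned = [['11600'], ['665-690'], ['705-715'], ['740.54-830.18'], ['900-930'],
           ['1200'], ['2100-2200'], ['2320-2350'],
           ['2300-2400', '2500-2560', '2730-2740']]
print(to_int_tuples(cleaned))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a value such as 740.5 would become 740.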
I tried to find a "pythonic" way to write this set of rules. Maybe it will give you some ideas, and it can certainly be improved.
number_ranges = [
'11.6', '665.690, 705.715', '740.54-830.18ABC;900-930ABC', '1200',
'2100 / 2200; 2320 / 2350', '2300-2400 / 2500-2560 / 2730-2740', '433.454', '345-654'
]
import re
def outer_split(rangetext):
    '''Split the input text into individual range texts.'''
    # Rule:
    # if both characters are present, use the second one to split
    # and switch the first one to '-'
    doubleseparators = ['-/', '.,', '-;', '/;']
    for c in doubleseparators:
        if c[0] in rangetext and c[1] in rangetext:
            outersplit = rangetext.split(c[1])
            outersplit = [s.replace(c[0], '-') for s in outersplit]
            break
    else:
        outersplit = [rangetext, ]
    return outersplit
def inner_split(rangetext):
    '''Clean the range text and split it into [left, right] boundaries.'''
    rangetext = re.sub(r'[a-zA-Z ]+', '', rangetext)
    sep = '-'
    if sep in rangetext:
        innersplit = rangetext.split(sep)
    else:
        innersplit = [rangetext, ]
    # The special '.' case:
    if len(innersplit) == 1 and '.' in innersplit[0]:
        l, r = innersplit[0].split('.')
        if len(l) > 2 or len(r) > 2:
            innersplit = [l, r]
        else:
            innersplit = [str(float(innersplit[0]) * 1000), ]
    return innersplit
individualinputs = [individualinput for text in number_ranges
for individualinput in outer_split(text)]
[inner_split(textrange) for textrange in individualinputs]
The output is:
[['11600.0'],
['665', '690'],
['705', '715'],
['740.54', '830.18'],
['900', '930'],
['1200'],
['2100', '2200'],
['2320', '2350'],
['2300', '2400'],
['2500', '2560'],
['2730', '2740'],
['433', '454'],
['345', '654']]
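The output above is still lists of strings, with '740.54' and '11600.0' left unrounded. One possible way to fold the same rules into a single function that returns integer tuples is sketched below. `parse_entry` is a hypothetical name, rounding with `round()` is an assumption based on the 740.54 to 741 example in the question, and malformed input such as '1.2.3' is not handled:

```python
import re

def parse_entry(raw):
    """Turn one raw user entry into a list of integer tuples."""
    # Drop letters and whitespace, e.g. '830.18ABC' -> '830.18'.
    text = re.sub(r'[A-Za-z\s]+', '', raw)
    result = []
    # ';' and ',' always separate independent items.
    for chunk in re.split(r'[;,]', text):
        if '-' in chunk:
            # '-' is the range separator, so '/' separates whole ranges.
            parts = chunk.split('/')
        else:
            # No '-': a '/' is itself the range separator.
            parts = [chunk.replace('/', '-')]
        for part in parts:
            if not part:
                continue  # skip empty pieces such as a trailing separator
            if '-' in part:
                bounds = part.split('-')
            elif '.' in part:
                left, right = part.split('.')
                if len(left) > 2 or len(right) > 2:
                    bounds = [left, right]         # '665.690' means 665 to 690
                else:
                    bounds = [float(part) * 1000]  # '11.6' means 11600
            else:
                bounds = [part]
            # Assumption: round to the nearest integer (740.54 -> 741).
            result.append(tuple(round(float(b)) for b in bounds))
    return result

number_ranges = [
    '11.6', '665.690, 705.715', '740.54-830.18ABC;900-930ABC', '1200',
    '2100 / 2200; 2320 / 2350', '2300-2400 / 2500-2560 / 2730-2740',
    '433.454', '345-654'
]
tuples = [t for raw in number_ranges for t in parse_entry(raw)]
print(tuples)
```

Keeping the separator decisions in one place makes it easier to add a rule when a new input format shows up, at the cost of a longer function than the two-step split.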