How can I use regex as a delimiter when importing a csv file with extra commas into pandas?
The csv file was sent to me, so I cannot re-delimit the columns:
239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).,2011-07-13 00:00:00,2011-07-13 00:00:00
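I replaced the letters in the strings to mask sensitive information, but the problem is still evident.
This is a sample "problem row" from my csv. It should be split into 8 columns as follows: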
col1: 239845723
col2: 28374
col3: 2384234
col4: AEVNE EFU 5 GN OR WNV
col5: Owinv Vnwo Badvw 5 VIN
col6: Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).
col7: 2011-07-13 00:00:00
col8: 2011-07-13 00:00:00
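As you can see, column 6 is where the problem occurs: the commas inside the string cause pandas to split incorrectly and create new columns. How can I fix this? I was thinking regex would help, perhaps with the following setup. Thanks for your help!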
import csv

csvfile = open(filetrace)
reader = csv.reader(csvfile)
new_list = []
for line in reader:
    for i in line:
        pass  # not sure
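So, without knowing the specifics of your file or data, I can offer a regex solution that should work if the data is consistent (and column 6 ends with a period). We can skip the csv module and use only the re module.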
import re

# make the regex pattern here
pattern = r"([\d\.]*),([\d\.]*),([\d\.]*),([^,]*),([^,]*),(.*\.?),([\d\-\s:]*),([\d\-\s:]*)"

# open the file with 'with' so you don't have to worry about closing it
with open(filetrace) as f:
    for line in f:  # iterate through the lines
        # re.findall returns a list with one tuple per match;
        # strip the newline so it does not end up in the last column
        values = re.findall(pattern, line.rstrip("\n"))[0]
        # record the values somewhere
values is an 8-tuple containing the value of each column in your original csv; just use/store the ones you want.
Good luck!
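To make the regex approach easier to try out, here is a minimal sketch that applies the same pattern to a single line. The helper name parse_line and the abbreviated sample row are made up for illustration, and I use fullmatch so that a line that does not fit the pattern returns None instead of raising an IndexError.

```python
import re

# Same pattern as above, compiled once.
PATTERN = re.compile(
    r"([\d\.]*),([\d\.]*),([\d\.]*),([^,]*),([^,]*),(.*\.?),([\d\-\s:]*),([\d\-\s:]*)"
)

def parse_line(line):
    """Return the 8 column values as a tuple, or None if the line does not fit."""
    match = PATTERN.fullmatch(line.rstrip("\n"))
    return match.groups() if match else None

# Abbreviated, hypothetical stand-in for the problem row
# (column 6 contains two commas of its own).
sample = (
    "239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,"
    "Ginq 2 (sdv seves 4-6), sebsbe (1 evesve, 2 for WVEee.).,"
    "2011-07-13 00:00:00,2011-07-13 00:00:00\n"
)
row = parse_line(sample)
print(len(row))
print(row[5])
```

The greedy (.*\.?) group first grabs everything up to the end of the line and then backtracks until the two date groups can match, which is why the commas inside column 6 stay together.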
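Since you know exactly how many columns you need and only one column is a problem, we can split the first few columns off from the left and the rest from the right. In other words, you don't need regex.
Read the file into a single string
import pandas as pd
csvfile = open(filetrace).read()
Make a pd.Series (splitlines avoids an empty last entry when the file ends with a newline)
s = pd.Series(csvfile.splitlines())
Split s and limit it to 5 splits, which should give 6 columns
df = s.str.split(',', n=5, expand=True)
Now split the last column from the right, limited to 2 splits. Use pd.concat with ignore_index=True here, because both pieces have integer column labels starting at 0 and join would complain about the overlap
df = pd.concat([df.iloc[:, :-1], df.iloc[:, -1].str.rsplit(',', n=2, expand=True)], axis=1, ignore_index=True)
Another way, starting from s
left = s.str.split(',', n=5)
right = left.str[-1].str.rsplit(',', n=2)
df = pd.DataFrame(left.str[:-1].add(right).tolist())
I ran this and took the transpose to make it easier to read on screen
df.T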
0
0 239845723
1 28374
2 2384234
3 AEVNE EFU 5 GN OR WNV
4 Owinv Vnwo Badvw 5 VIN
5 Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd b...
6 2011-07-13 00:00:00
7 2011-07-13 00:00:00
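The split-from-both-ends idea does not depend on pandas, so here is a minimal plain-Python sketch of it. The function name split_row, its parameters and the two abbreviated rows are hypothetical, chosen only for illustration.

```python
def split_row(line, n_left=5, n_right=2):
    """Split n_left fields off the left and n_right fields off the right.

    Whatever remains in the middle is kept as one field, commas included.
    """
    parts = line.split(",", n_left)           # first 5 fields + remainder
    tail = parts[-1].rsplit(",", n_right)     # free text + last 2 fields
    return parts[:-1] + tail

# Abbreviated, hypothetical rows; a real file would be read line by line.
lines = [
    "1,2,3,AAA,BBB,text (a, b), more text.,2011-07-13 00:00:00,2011-07-13 00:00:00",
    "4,5,6,CCC,DDD,no extra commas here.,2011-07-14 00:00:00,2011-07-14 00:00:00",
]
rows = [split_row(line) for line in lines]
for row in rows:
    print(len(row), row[5])
```

The resulting list of lists can be passed straight to pd.DataFrame(rows). This only works under the same assumption as the answer: exactly one column (here the sixth) may contain extra commas.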
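Instead of using regex, I would read the csv with the delimiter ',', extract the last two dates and store them in a list. Then replace the dates with '', join the columns you want and drop the rest.
For example, if you have this csv file: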
239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).,2011-07-13 00:00:00,2011-07-13 00:00:00
239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE.).,2011-07-13 00:00:00,2011-07-13 00:00:00
239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).,2011-07-13 00:00:00,2011-07-13 00:00:00
then
import numpy as np
import pandas as pd

df = pd.read_csv('good.txt', delimiter=',', header=None)
# Get the dates from the whole DataFrame
dates = [[item] for i in df.values for item in i if '2011-' in str(item)]
# Merge the two dates of each row
dates = pd.DataFrame([x + y for x, y in zip(dates[0::2], dates[1::2])])
# Remove the dates from their original cells
df = df.replace({'2011-': np.nan}, regex=True).replace(np.nan, '')
# Get the columns you want to merge
cols = df.columns[4:]
# Merge the columns
df[4] = df[cols].astype(str).apply(lambda x: ','.join(x), axis=1)
# Strip the trailing commas left behind by the empty cells
df[4] = df[4].replace(r',+$', '', regex=True)
# Drop the columns that were merged
df = df.drop(df.columns[5:], axis=1)
# Concat the dates
df = pd.concat([df, dates], axis=1)
Output of print(df):
0 1 2 3 \
0 239845723 28374 2384234 AEVNE EFU 5 GN OR WNV
1 239845723 28374 2384234 AEVNE EFU 5 GN OR WNV
2 239845723 28374 2384234 AEVNE EFU 5 GN OR WNV
4 0 \
0 Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera ... 2011-07-13 00:00:00
1 Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera ... 2011-07-13 00:00:00
2 Owinv Vnwo Badvw 5 VIN sebsbe sve(sevsev esvse... 2011-07-13 00:00:00
1
0 2011-07-13 00:00:00
1 2011-07-13 00:00:00
2 2011-07-13 00:00:00
Output of column 4:
['Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).',
'Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE.).',
'Owinv Vnwo Badvw 5 VIN sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).']
If you want to change the column index
df.columns = [i for i in range(df.shape[1])]
Hope this helps.
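For comparison, the same "read with ',' first, then glue the pieces back together" idea can be sketched with the csv.reader setup from the question. This is a swapped-in variant, not the read_csv code above: it keeps all 8 columns and assumes the first five and last two fields never contain commas and that the free text contains no double quotes. The helper name merge_middle and the sample rows are hypothetical.

```python
import csv
import io

def merge_middle(fields, n_left=5, n_right=2):
    """Re-join everything between the first n_left and last n_right fields."""
    cut = len(fields) - n_right
    middle = ",".join(fields[n_left:cut])
    return fields[:n_left] + [middle] + fields[cut:]

# Abbreviated, hypothetical file contents; use open(filetrace) for a real file.
raw = io.StringIO(
    "1,2,3,AAA,BBB,text (a, b), more text.,2011-07-13 00:00:00,2011-07-13 00:00:00\n"
    "4,5,6,CCC,DDD,no extra commas here.,2011-07-14 00:00:00,2011-07-14 00:00:00\n"
)
rows = [merge_middle(fields) for fields in csv.reader(raw)]
for row in rows:
    print(len(row), row[5])
```

Because csv.reader keeps the space after each comma by default, joining the middle pieces with ',' restores the original text of column 6 exactly.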