Extract data from a file using Python
Input file:
["abc","on time","date","<a href='link'>11111</a>","time","2","2"],
["abc","on time","date","<a href='link'>11111</a>","time","2","6"],
["abc","on time","date","<a href='link'>11111</a>","time","2","9"],
["abc","on time","date","<a href='link'>11111</a>","time","2","0"],
["abc","on time","date","<a href='link'>11111</a>","time","2","5"]
Required output:
abc,on time,date,<a href='link'>11111</a>,time,2,2
abc,on time,date,<a href='link'>11111</a>,time,2,6
abc,on time,date,<a href='link'>11111</a>,time,2,9
abc,on time,date,<a href='link'>11111</a>,time,2,0
abc,on time,date,<a href='link'>11111</a>,time,2,5
Code I tried:
import sys
import re
Lines = [Line.strip() for Line in open (sys.argv[1],'r').readlines()]
for EachLine in Lines:
    Parts = EachLine.split(",")
    for EachPart in Parts:
        EachPart = re.sub(r'[', '', EachPart)
        EachPart = re.sub(r']', '', EachPart)
    print ' '.join(Parts)
Can anyone help me solve this? I am not getting the output I want. Thanks in advance.
I modified your initial solution to:
import sys
import re
Lines = [Line.strip() for Line in open(sys.argv[1], 'r').readlines()]
for EachLine in Lines:
    # Collect every substring enclosed in double quotes.
    matches = re.findall(r'\"(.+?)\"', EachLine)
    print(','.join(matches))
My approach is to use a regular expression to capture every string enclosed in double quotes.
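To make it easier to see what the pattern captures, here is a minimal sketch applied to a single line of the sample input. The helper name `extract_fields` is hypothetical and only used for illustration.

```python
import re

def extract_fields(line):
    # Hypothetical helper: return every substring enclosed in double quotes.
    # The non-greedy (.+?) stops at the next double quote, so each field
    # is captured on its own.
    return re.findall(r'"(.+?)"', line)

sample = '["abc","on time","date","<a href=\'link\'>11111</a>","time","2","2"],'
print(','.join(extract_fields(sample)))
```

Note that `.+?` needs at least one character, so an empty field (`""`) would probably be captured incorrectly; the sample input has no such fields.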
As mentioned earlier, you can use eval().
with open('a.txt') as f:
    for line in f:
        line = line.replace(',\n', '\n').strip()  # remove a trailing `,` if there is one
        if line:  # skip empty lines
            print(','.join(eval(line.strip())))
Input:
["abc","on time","date","<a href='link'>11111</a>","time","2","2"],
["abc","on time","date","<a href='link'>11111</a>","time","2","6"],
["abc","on time","date","<a href='link'>11111</a>","time","2","9"],
["abc","on time","date","<a href='link'>11111</a>","time","2","0"],
["abc","on time","date","<a href='link'>11111</a>","time","2","5"]
Output:
abc,on time,date,<a href='link'>11111</a>,time,2,2
abc,on time,date,<a href='link'>11111</a>,time,2,6
abc,on time,date,<a href='link'>11111</a>,time,2,9
abc,on time,date,<a href='link'>11111</a>,time,2,0
abc,on time,date,<a href='link'>11111</a>,time,2,5
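Because eval() executes arbitrary Python expressions, it is risky on a file you do not fully trust. One possible alternative, sketched here on the assumption that every line is a plain Python list literal of strings, is `ast.literal_eval` from the standard library, which accepts only literals. The function name `parse_line` and the `sample_lines` data are hypothetical.

```python
import ast

def parse_line(line):
    # Hypothetical helper: strip whitespace and a trailing comma,
    # then parse the remaining text as a list literal.
    cleaned = line.strip().rstrip(',')
    if not cleaned:
        return None  # blank line, nothing to parse
    return ast.literal_eval(cleaned)

# Stand-in for the lines of the input file (assumption for this sketch).
sample_lines = [
    '["abc","on time","date","<a href=\'link\'>11111</a>","time","2","2"],\n',
    '\n',
    '["abc","on time","date","<a href=\'link\'>11111</a>","time","2","5"]\n',
]

for raw in sample_lines:
    fields = parse_line(raw)
    if fields is not None:
        print(','.join(fields))
```

When reading from a real file, the loop would iterate over the open file object instead of `sample_lines`.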
Another option that does not use regular expressions:
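with open('a.txt') as f:
    for line in f:
        # Drop the trailing comma and the surrounding brackets, then the double quotes.
        formatted = line.strip().rstrip(',').strip('[]').replace('"', '')
        if formatted:
            print(formatted)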