Extract data from a file using Python

Input file:

["abc","on time","date","<a href='link'>11111</a>","time","2","2"],

["abc","on time","date","<a href='link'>11111</a>","time","2","6"],

["abc","on time","date","<a href='link'>11111</a>","time","2","9"],

["abc","on time","date","<a href='link'>11111</a>","time","2","0"],

["abc","on time","date","<a href='link'>11111</a>","time","2","5"]

Required output:

abc,on time,date,<a href='link'>11111</a>,time,2,2

abc,on time,date,<a href='link'>11111</a>,time,2,6

abc,on time,date,<a href='link'>11111</a>,time,2,9

abc,on time,date,<a href='link'>11111</a>,time,2,0

abc,on time,date,<a href='link'>11111</a>,time,2,5

Code I tried:

import sys
import re

Lines = [Line.strip() for Line in open (sys.argv[1],'r').readlines()]



for EachLine in Lines:
    Parts = EachLine.split(",")
    for EachPart in Parts:

        EachPart = re.sub(r'[', '', EachPart)
        EachPart = re.sub(r']', '', EachPart)
print ' '.join(Parts)

Can anyone help me solve this? I'm not getting the output I want. Thanks in advance.

I modified your initial solution to:

import re
import sys

# Read the file named on the command line, one record per line.
with open(sys.argv[1]) as f:
    for line in f:
        # Capture the text between each pair of double quotes.
        matches = re.findall(r'"(.+?)"', line)
        if matches:  # skip blank lines
            print(','.join(matches))

My approach uses a regular expression to capture every string enclosed in double quotes.
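
To see what the pattern captures, here is a minimal sketch applied to one sample line from the question. The helper name `extract_fields` is hypothetical, and the sketch assumes no field contains an escaped double quote.

```python
import re

def extract_fields(line):
    """Return the text found between each pair of double quotes in line."""
    # .+? is non-greedy, so each match stops at the nearest closing quote.
    return re.findall(r'"(.+?)"', line)

sample = '''["abc","on time","date","<a href='link'>11111</a>","time","2","2"],'''
print(','.join(extract_fields(sample)))
# prints: abc,on time,date,<a href='link'>11111</a>,time,2,2
```

Note that `.+?` needs at least one character, so an empty field (`""`) would make the quotes pair up incorrectly. If empty fields can occur, the pattern `"([^"]*)"` is one way to handle them.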

As mentioned earlier, you can use eval():

with open('a.txt') as f:
    for line in f:
        line = line.strip().rstrip(',')  # drop a trailing comma, if any
        if line:                         # skip empty lines
            print(','.join(eval(line)))

Input:

["abc","on time","date","<a href='link'>11111</a>","time","2","2"],

["abc","on time","date","<a href='link'>11111</a>","time","2","6"],

["abc","on time","date","<a href='link'>11111</a>","time","2","9"],

["abc","on time","date","<a href='link'>11111</a>","time","2","0"],

["abc","on time","date","<a href='link'>11111</a>","time","2","5"]

Output:

abc,on time,date,<a href='link'>11111</a>,time,2,2
abc,on time,date,<a href='link'>11111</a>,time,2,6
abc,on time,date,<a href='link'>11111</a>,time,2,9
abc,on time,date,<a href='link'>11111</a>,time,2,0
abc,on time,date,<a href='link'>11111</a>,time,2,5
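
Since eval() executes whatever expression a line contains, a variant worth considering is ast.literal_eval, which accepts only Python literals. The following is a sketch, assuming each non-empty line is a list literal optionally followed by a comma; the function name `parse_line` is hypothetical.

```python
import ast

def parse_line(line):
    """Parse one '["a","b"],' style line into a list, or None if blank."""
    line = line.strip().rstrip(',')  # drop whitespace and a trailing comma
    if not line:
        return None
    # literal_eval only evaluates literals (lists, strings, numbers, ...).
    return ast.literal_eval(line)

sample_lines = [
    '''["abc","on time","date","<a href='link'>11111</a>","time","2","2"],''',
    '',
    '''["abc","on time","date","<a href='link'>11111</a>","time","2","5"]''',
]
for raw in sample_lines:
    fields = parse_line(raw)
    if fields is not None:
        print(','.join(fields))
```

With the real file, the same function could be called on each line read from `open('a.txt')`.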

Another option that doesn't use regular expressions:

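with open('a.txt') as f:
    for line in f:
        # Drop the trailing comma and the surrounding brackets, then the quotes.
        formatted = line.strip().rstrip(',').strip('[]').replace('"', '')
        if formatted:  # skip empty lines
            print(formatted)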
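
Because each line in the sample is also a valid JSON array, the json module offers one more regex-free route. This is only a sketch, assuming the whole file consists of such arrays separated by commas, as in the input shown; `rows_from_text` is a hypothetical name.

```python
import json

def rows_from_text(text):
    """Parse comma-separated JSON arrays into a list of lists."""
    # Wrapping the text in brackets turns it into one JSON array of arrays.
    return json.loads('[' + text.strip().rstrip(',') + ']')

text = '''["abc","on time","date","<a href='link'>11111</a>","time","2","2"],

["abc","on time","date","<a href='link'>11111</a>","time","2","6"]
'''
for row in rows_from_text(text):
    print(','.join(row))
```

Unlike plain string replacement, this keeps any commas or quotes that appear inside a field intact until the final join.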