Python re.match and re.sub
I am processing a csv file. It contains a list of sources (plain ssl links), places, websites (non-ssl links), Direcciones and emails. When a piece of data is unavailable, it simply does not appear. Like this:
httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, googplace@yourplace.com
However, the website 'a html tag' links always appear twice, followed by a run of commas. Also, after the commas comes sometimes Direcciones and sometimes a source (https). So if the process does not stop at EOF, it can keep 'replacing' for hours and produce an output file with GBs of redundant and misplaced information. Take four entries as a sample Reutput.csv:
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,,
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,,
> ,,Direcciones, Montain View, Email, googplace@yourplace.com
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,,
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,,
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com
So the idea is to remove the unwanted website 'a html tag' link and the extra commas, while respecting the newlines \n and without getting stuck in a loop. Like this:
> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, googplace@yourplace.com
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace,Website, <a href='httpgoogledotcom'></a>"
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com
This is the latest version of the code:
with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = str(reuf.read())
    for lines in text:
        d = re.match('</a>".*D?', text, re.DOTALL)
        if d is not None:
            if not 'https' in d:
                replace = re.sub(d, '</a>",Direc', lines)
        h = re.match('</a>".*?http', text, re.DOTALL|re.MULTILINE)
        if h is not None:
            if not 'Direc' in h:
                replace = re.sub(h, '</a>"\nhttp', lines)
        replace = str(replace)
        putuf.write(replace)
Now I get a Put.csv file in which the last line repeats forever. Why the loop? I have tried this code in several ways, but sadly I am still stuck on this. Thanks in advance.
When there is no match, groups will be None. You need to guard against that (or restructure the regex so that it always matches something).
groups = re.search('</a>".*?Direc', lines, re.DOTALL)
if groups is not None:
    if not 'https' in groups:
Note the added not None check and the indentation of the following lines it controls.
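A minimal, self-contained sketch of that guard pattern (the sample lines are made up for illustration):

```python
import re

# made-up sample lines: one that should match, one that should not
lines = ['</a>",,,Direcciones, Montain View',
         'no anchor tag here']

results = []
for line in lines:
    m = re.search('</a>".*?Direc', line, re.DOTALL)
    if m is not None:                  # guard: re.search returns None when nothing matches
        if 'https' not in m.group(0):  # inspect the matched text, not the match object
            results.append(m.group(0))

print(results)
```

Note that the `'https' not in ...` test is applied to `m.group(0)`, the matched substring; testing membership against the match object itself raises a TypeError.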
In the end I worked out the code myself. I am posting it here in case anyone finds it useful. Thanks anyway for the help and the downvotes!
import re
with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = str(reuf.read())
    d = re.findall('</a>".*?Direc', text, re.DOTALL|re.MULTILINE)
    if d is not None:
        for elements in d:
            elements = str(elements)
            if not 'https' in elements:
                s = re.compile('</a>".*?Direc', re.DOTALL)
                replace = re.sub(s, '</a>",Direc', text)
    h = re.findall('</a>".*?https', text, re.DOTALL|re.MULTILINE)
    if h is not None:
        for elements in h:
            if not 'Direc' in elements:
                s = re.compile('</a>".*?https', re.DOTALL)
                replace = re.sub(s, '</a>"\nhttps', text)
    replace = str(replace)
    putuf.write(replace)
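One small note on the code above: re.findall never returns None; when nothing matches it returns an empty list, so the `is not None` checks always pass and a plain truthiness test (`if d:`) would do. A quick demonstration:

```python
import re

# re.findall with no matches yields an empty list, not None
no_hits = re.findall('</a>".*?Direc', 'nothing to see here')
print(no_hits)          # an empty list
print(no_hits is None)  # False: the is-not-None guard never fires
```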