Delete index in list if multiple strings are matched

Question

I scraped a website that contains a table, and I want to format the headers for the final result I'm after.

headers = []

for row in table.findAll('tr'):
    for item in row.findAll('th'):
        for link in item.findAll('a', text=True):
            headers.append(link.contents[0])

print headers

Which returns:

[u'Rank ', u'University Name ', u'Entry Standards', u'Click here to read more', u'Student Satisfaction', u'Click here to read more', u'Research Quality', u'Click here to read more', u'Graduate Prospects', u'Click here to read more', u'Overall Score', u'Click here to read more', u'\r\n            2016\r\n        ']

I don't want the "Click here to read more" or "2016" headers, so I did the following:

for idx, i in enumerate(headers):
    if 'Click' in i:
        del headers[idx]
for idx, i in enumerate(headers):
    if '2016' in i:
        del headers[idx]

Which returns:

[u'Rank ', u'University Name ', u'Entry Standards', u'Student Satisfaction', u'Research Quality', u'Graduate Prospects', u'Overall Score']

Perfect. But is there a better/neater way to remove the unwanted items? Thanks!

Answer 1

You could consider using a list comprehension to get a new, filtered list, for example:

new_headers = [header for header in headers if '2016' not in header]
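The line above only drops the '2016' entry. Here is a minimal sketch that extends the same idea to both unwanted substrings, using sample data abbreviated from the question. The names `unwanted` and `adjacent` are illustrative, and the `strip()` call is an optional extra that also trims the trailing spaces. The first part shows why building a new list tends to be safer than calling `del` inside an `enumerate` loop: once an item is deleted, the following item shifts into its slot and is skipped. The loop in the question happens to work only because no two unwanted entries sit next to each other.

```python
# Pitfall: deleting while iterating skips the element after each deletion.
adjacent = ['Rank', 'Click here', 'Click here', 'Score']  # hypothetical data
for idx, item in enumerate(adjacent):
    if 'Click' in item:
        del adjacent[idx]
print(adjacent)  # one 'Click here' survives

# Sample data abbreviated from the question's output.
headers = [u'Rank ', u'University Name ', u'Entry Standards',
           u'Click here to read more', u'Overall Score',
           u'Click here to read more', u'\r\n            2016\r\n        ']

# Hypothetical name: substrings that mark a header as unwanted.
unwanted = ('Click', '2016')

# Build a new list instead of mutating the one being iterated.
new_headers = [h.strip() for h in headers
               if not any(word in h for word in unwanted)]
print(new_headers)
```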

Answer 2

headers = filter(lambda h: 'Click' not in h and '2016' not in h, headers)

If you want something more general:

banned = ['Click', '2016']
headers = filter(lambda h: not any(b in h for b in banned), headers)
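One caveat if this is run on Python 3 rather than the Python 2 used in the question: `filter` returns a lazy iterator there, not a list, so the result needs wrapping in `list(...)` when a list is required. A small sketch with abbreviated sample data (the name `kept` is illustrative):

```python
# Abbreviated sample data; the real list comes from the scraped table.
headers = ['Rank ', 'Entry Standards', 'Click here to read more',
           '\r\n            2016\r\n        ']
banned = ['Click', '2016']

# list(...) materialises the iterator that filter returns on Python 3.
kept = list(filter(lambda h: not any(b in h for b in banned), headers))
print(kept)
```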

Answer 3

import re

pattern = '^Click|^2016'

new = [x for x in headers if not re.match(pattern, x.strip())]
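As a variation on the same approach, the pattern can be compiled once and matched with `search`, which looks anywhere in the string, so the leading whitespace around '2016' does not need stripping first. This is only a sketch on abbreviated sample data. The names `unwanted` and `cleaned` are illustrative, and unlike the anchored pattern above, this one also removes headers that contain 'Click' or '2016' in the middle of the text.

```python
import re

# Abbreviated sample data from the question.
headers = ['Rank ', 'University Name ', 'Click here to read more',
           'Overall Score', '\r\n            2016\r\n        ']

# Compile once; search() matches at any position in the string.
unwanted = re.compile(r'Click|2016')

cleaned = [h for h in headers if not unwanted.search(h)]
print(cleaned)
```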

Answer 4

If you can be sure that '2016' is always the last item:

>>> [x for x in headers[:-1] if 'Click here' not in x]
['Rank ', 'University Name ', 'Entry Standards', 'Student Satisfaction', 'Research Quality', 'Graduate Prospects', 'Overall Score']

