Need to extract data from a website and store in list using regex

Question

So I have an assignment that requires me to extract data from a website to form a 'top 10 list'. I chose the IMDb Top 250 page, http://www.imdb.com/chart/top.

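In other words, I need some help using regex to isolate the names of the movies and then store them in a list. I already have the HTML stored as a string in a variable (if that is the wrong way to go about it, please let me know).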

Also, I am limited to using the modules urlopen, re and htmlparser.

import HTMLParser
from urllib import urlopen
import re

site = urlopen("http://www.imdb.com/chart/top?tt0468569")
content = site.read()

print content

Answer 1

You really shouldn't use regex for this, but you stated in the comments that you have to, so here is the regex:

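import re
import requests

respText = requests.get("http://www.imdb.com/chart/top").text

# Lazily skip to the end of the first tag inside each titleColumn cell, then capture the text up to the next "<".
for title in re.findall(r'<td class="titleColumn">.+?>(.+?)<', respText, re.DOTALL):
    print(title)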
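Since the question is limited to urlopen and re, and wants the result in a list, here is a rough sketch of the same regex wrapped in a function that returns the first 10 matches. It assumes Python 3 and that the page still uses the `<td class="titleColumn">` markup this answer relies on. The names `fetch_html`, `extract_titles` and `SAMPLE_HTML` are made up for illustration, and the sample markup is a hand-written stand-in, not real IMDb output.

```python
import re
from urllib.request import urlopen  # Python 3 location of urlopen

# Same pattern as above: skip to the end of the first tag inside the
# titleColumn cell (the <a ...>), then capture the link text.
TITLE_RE = re.compile(r'<td class="titleColumn">.+?>(.+?)<', re.DOTALL)

def fetch_html(url):
    """Download a page and return it as text (defined here, not called below)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_titles(html, limit=10):
    """Return up to `limit` titles found in `html`, in page order."""
    titles = [t.strip() for t in TITLE_RE.findall(html)]
    return titles[:limit]

# Hypothetical sample markup, shaped the way the pattern expects.
SAMPLE_HTML = """
<tr><td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.)">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
</td></tr>
<tr><td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.)">The Godfather</a>
</td></tr>
"""

top = extract_titles(SAMPLE_HTML)
print(top)  # ['The Shawshank Redemption', 'The Godfather']
```

For the real page you would pass `fetch_html("http://www.imdb.com/chart/top")` to `extract_titles` instead of the sample string. If the site's markup differs from what the pattern assumes, the list simply comes back empty.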

With BeautifulSoup (which you can't use):

from bs4 import BeautifulSoup

soup = BeautifulSoup(respText, "html.parser")
# Print the link text of every titleColumn cell.
for item in soup.find_all("td", {"class" : "titleColumn"}):
    print(item.find("a").text)
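The question also allows htmlparser, so a parser-based version without BeautifulSoup is possible too. The following is only a sketch under a few assumptions: Python 3 (where the module is `html.parser` rather than `HTMLParser`), the same `titleColumn` markup as above, and a hand-written sample string instead of the live page. `TitleParser` and `titles_from_html` are hypothetical names.

```python
from html.parser import HTMLParser  # Python 2 spelled this module `HTMLParser`

class TitleParser(HTMLParser):
    """Collect the text of the first <a> inside each <td class="titleColumn">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_cell = False  # inside a titleColumn <td>, no link seen yet
        self._in_link = False  # inside the first <a> of that cell

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "titleColumn") in attrs:
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            self._in_link = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_link:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Done with this cell's first link; ignore any later links in it.
            self._in_link = False
            self._in_cell = False
        elif tag == "td":
            self._in_cell = False

def titles_from_html(html, limit=10):
    """Return up to `limit` titles parsed from `html`, in page order."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles][:limit]

# Hypothetical sample markup for demonstration only.
SAMPLE_HTML = """
<table>
<tr><td class="posterColumn"><a href="/title/tt0111161/">poster</a></td>
    <td class="titleColumn">1. <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span>(1994)</span></td></tr>
<tr><td class="titleColumn">2. <a href="/title/tt0068646/">The Godfather</a></td></tr>
</table>
"""

print(titles_from_html(SAMPLE_HTML))
```

This is more code than the regex, but it does not depend on exactly how whitespace and attributes are laid out between the cell and the link.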

python

regex

html-parsing