Getting an empty result when extracting data from a JavaScript website using Scrapy
I am trying to extract data from a website that provides information like this:
<html>
<head>
......
<script>
window.dataLayer = window.dataLayer || [];
window.dataLayer.push({
'author': 'author name',
'editor': 'editor name',
'article_id': '954301',
'article_title': " Article Title ",
'pagination': 'page 1',
'total_page': '1',
'publish_date': '20210803',
'publish_year': '2021',
'publish_month': '08',
'publish_day': '03',
'publish_time': '20:04',
'channel': 'finance',
'sub-channel': 'macro',
'regional': '',
'type' : 'article',
'content_type' : 'article',
'topics' : 'topic1, topic2, topic3',
'page_type' : 'article_page',
'tags' : 'tag1,tag2,tag3',
'user_id' : '',
'register_date' : '',
'data_source' : 'Non AMP'
});
</script>
I used the following commands:
data_content = response.xpath('//script[contains(text(),"author")]/text()').re(r'"author":"(\d+)"')
data = json.loads(data_content)
author = data["author"]
The result is empty.
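The regex passed to .re() cannot match: it expects double quotes, no space after the colon, and a digits-only value (\d+), while the page contains 'author': 'author name'. In addition, .re() returns a list, which json.loads() cannot parse. Instead, take the whole script text, cut out the object passed to window.dataLayer.push(), and turn it into valid JSON:
import json
import re

# Get the text of the <script> element that contains the data.
data_content = response.xpath('//script[contains(text(),"author")]/text()').get()
# Capture the object literal passed to window.dataLayer.push().
match = re.search(r'window\.dataLayer\.push\((\{.+\})\);', data_content, re.DOTALL)
# JSON requires double quotes; this assumes no value contains an apostrophe.
data_content = match.group(1).replace("'", '"')
data = json.loads(data_content)
author = data['author']
print(author)
Output:
author name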
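Replacing every single quote breaks as soon as a value contains an apostrophe. As a possible alternative, here is a minimal sketch that parses the object with ast.literal_eval instead of json.loads. It assumes the object contains only quoted keys and string values, as in the sample above (JavaScript true, false, null, or unquoted keys would make it fail). SCRIPT_TEXT and parse_data_layer are hypothetical names used only for this illustration; SCRIPT_TEXT is a shortened copy of the script from the question.

```python
import ast
import re

# Shortened copy of the script text from the question (illustration only).
SCRIPT_TEXT = """
window.dataLayer = window.dataLayer || [];
window.dataLayer.push({
'author': 'author name',
'article_id': '954301',
'article_title': " Article Title ",
'type' : 'article',
'data_source' : 'Non AMP'
});
"""


def parse_data_layer(script_text):
    """Return the object passed to window.dataLayer.push() as a dict, or None."""
    match = re.search(
        r"window\.dataLayer\.push\((\{.*?\})\);", script_text, re.DOTALL
    )
    if match is None:
        return None
    # With only quoted keys and string values, the JavaScript object literal
    # is also a valid Python dict literal. literal_eval parses literals only
    # and does not execute any code.
    return ast.literal_eval(match.group(1))


data = parse_data_layer(SCRIPT_TEXT)
print(data["author"])
```

In a spider, the same function could be given the script text from response.xpath('//script[contains(text(),"dataLayer.push")]/text()').get(), after checking that .get() did not return None.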