Automatic wget download of a PDF paper: response has header text/html; charset=UTF-8
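I'm looking for a way to download PDF files from an online library using Python wget. A sample URL would be https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8006280
The content type of the response I get is 'text/html; charset=UTF-8'. Using the download function gives me a stamp.jsp file containing some HTML content. Does anyone know a way to get the PDF?
Thanks, everyone.
HTML received: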
<script type="text/javascript" src="/assets/vendor/jquery/jquery.js?cv=20191217_000002" charset="utf-8"></script>
<!-- Fingerprint Cookie -->
<script type="text/javascript" src="/assets/vendor/js-cookie/src/js.cookie.js?cv=20191217_000002"></script>
<script type="text/javascript" src="/assets/vendor/fingerprintjs2/fingerprint2.js?cv=20191217_000002"></script>
<script type="text/javascript" src="/assets/js/lib/core/fingerprint.js?cv=20191217_000002"></script>
<script type="text/javascript">Xplore.Fingerprint.init();</script>
<!-- BEGIN: tealium in stamp/stamp.jsp. NOTE stamp.jsp does not use template.jsp, nor include common/assets.jsp, so including tealiumAnalytics.jsp here -->
<!-- BEGIN: TealiumAnalytics.jsp -->
<script type ="text/javascript">
// tealium config vars
var TEALIUM_CONFIG_TAGGING_ENABLED = true;
var TEALIUM_CONFIG_CDN_URL = '//tags.tiqcdn.com/utag/';
var TEALIUM_CONFIG_ACCOUNT_PROFILE_ENV = 'ieeexplore/main/prod';
// tealium utag_data values for user
var TEALIUM_userType = 'Anonymous';
var TEALIUM_userInstitutionId = '';
var TEALIUM_userId = '';
var TEALIUM_user_third_party = '';
var TEALIUM_products = '';
</script>
<script type="text/javascript">
// asynchronously load tealium's utag.js , which declares tealium JS variables like; utag_data, utag
(function(a,b,c,d){
a=TEALIUM_CONFIG_CDN_URL + TEALIUM_CONFIG_ACCOUNT_PROFILE_ENV + '/utag.js';
b=document;c='script';d=b.createElement(c);d.src=a;
d.type='text/java'+c;d.async=true;
a=b.getElementsByTagName(c)[0];a.parentNode.insertBefore(d,a);
})();
</script>
<script type="text/javascript" src="/assets/js/analytics/tealiumTagsData.js?cv=20191217_000002"></script>
<script type="text/javascript" src="/assets/js/analytics/tealiumAnalytics.js?cv=20191217_000002"></script>
<!-- END: TealiumAnalytics.jsp -->
<!-- END: tealium in stamp/stamp.jsp -->
<html lang="en-US">
<head>
<title>IEEE Xplore Full-Text PDF: </title>
<style>
html {
margin: 0;
padding: 0;
overflow: hidden;
}
body {
margin: 0;
padding: 0;
}
iframe {
display: block;
position: fixed;
width: 100%;
height: 100%;
}
</style>
</head>
<body>
<iframe src="https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf?tp=&arnumber=8006280&isnumber=8326752&ref=" frameborder=0></iframe>
</body>
</html>
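I'm afraid the URL you are pointing to won't work.
By looking at the network messages (e.g. using Chrome's Inspect tool) I can actually see the correct .pdf
link: https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf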
>>> import wget
>>> url = "https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf"
>>> response = wget.download(url=url, out="/path/to/your/directory")
100% [..........................................................................] 2817205 / 2817205>>>
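The same link also appears in the HTML from the question, as the src attribute of the iframe, so it can be pulled out of the stamp.jsp response without opening a browser. The following is only a minimal sketch using the standard library: the names IframeSrcParser and extract_pdf_url are made up for illustration, and it assumes the page keeps embedding the PDF in an iframe as shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_pdf_url(html_text):
    """Return the first iframe src whose path ends in .pdf, without its query string.

    Returns None if no such iframe is found.
    """
    parser = IframeSrcParser()
    parser.feed(html_text)
    for src in parser.sources:
        parts = urlsplit(src)
        if parts.path.lower().endswith(".pdf"):
            # Drop the query and fragment, keeping scheme, host and path
            return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return None


# Shortened copy of the HTML body shown in the question
sample_html = (
    '<html lang="en-US"><body>'
    '<iframe src="https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf'
    '?tp=&arnumber=8006280&isnumber=8326752&ref=" frameborder=0></iframe>'
    "</body></html>"
)
pdf_url = extract_pdf_url(sample_html)
```

The resulting pdf_url can then be passed to wget.download as in the session above. Whether the server actually returns the PDF for that URL may depend on your access to the library, which this sketch does not check.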