Automatic wget download of a PDF paper: response header is text/html; charset=UTF-8

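I am looking for a way to download PDF files from an online library using the Python wget module. A sample URL would be https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8006280

The content type of the response is 'text/html; charset=UTF-8'. Using the download function produces a stamp.jsp file that contains some HTML instead of the paper. Does anyone have a way to get the PDF?

Thanks, everyone.

HTML received: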

<script type="text/javascript" src="/assets/vendor/jquery/jquery.js?cv=20191217_000002" charset="utf-8"></script>

<!-- Fingerprint Cookie -->
<script type="text/javascript" src="/assets/vendor/js-cookie/src/js.cookie.js?cv=20191217_000002"></script>
<script type="text/javascript" src="/assets/vendor/fingerprintjs2/fingerprint2.js?cv=20191217_000002"></script>
<script type="text/javascript" src="/assets/js/lib/core/fingerprint.js?cv=20191217_000002"></script>
<script type="text/javascript">Xplore.Fingerprint.init();</script>

<!-- BEGIN: tealium in stamp/stamp.jsp. NOTE stamp.jsp does not use template.jsp, nor include common/assets.jsp, so including tealiumAnalytics.jsp here -->

  <!-- BEGIN: TealiumAnalytics.jsp -->

   <script type ="text/javascript">
     // tealium config vars
    var TEALIUM_CONFIG_TAGGING_ENABLED = true;  
    var TEALIUM_CONFIG_CDN_URL = '//tags.tiqcdn.com/utag/';
    var TEALIUM_CONFIG_ACCOUNT_PROFILE_ENV = 'ieeexplore/main/prod';
    
    // tealium utag_data values for user 
    var TEALIUM_userType = 'Anonymous';
    var TEALIUM_userInstitutionId = '';
    var TEALIUM_userId = '';
    var TEALIUM_user_third_party = '';
    
    var TEALIUM_products = '';
   </script>


   <script type="text/javascript">
   // asynchronously load tealium's utag.js , which declares tealium JS variables like; utag_data, utag
   (function(a,b,c,d){
   
    a=TEALIUM_CONFIG_CDN_URL + TEALIUM_CONFIG_ACCOUNT_PROFILE_ENV + '/utag.js';
    b=document;c='script';d=b.createElement(c);d.src=a;
    d.type='text/java'+c;d.async=true;
    a=b.getElementsByTagName(c)[0];a.parentNode.insertBefore(d,a);
   })();
   </script>

   <script type="text/javascript" src="/assets/js/analytics/tealiumTagsData.js?cv=20191217_000002"></script>
   <script type="text/javascript" src="/assets/js/analytics/tealiumAnalytics.js?cv=20191217_000002"></script>


  
   
  <!-- END: TealiumAnalytics.jsp -->
    

<!-- END: tealium in stamp/stamp.jsp -->
  



<html lang="en-US">
 <head> 
  <title>IEEE Xplore Full-Text PDF: </title>
  <style>
   html {
       margin: 0;
       padding: 0;
       overflow: hidden;
   }
   body {
       margin: 0;
       padding: 0;
   }
   iframe {
    display: block;
    position: fixed;
    width: 100%;
    height: 100%;
   }
  </style>
 </head>
 <body>
  <iframe src="https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf?tp=&arnumber=8006280&isnumber=8326752&ref=" frameborder=0></iframe>
 </body>
</html>

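I'm afraid the URL you are pointing to won't work: it returns an HTML wrapper page that embeds the PDF in an iframe, not the PDF itself.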

By looking at the network traffic (for example with Chrome's Inspect tool), I could see the actual .pdf link: https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf

>>> import wget
>>> url = "https://ieeexplore.ieee.org/ielx7/6979/8326752/08006280.pdf"
>>> response = wget.download(url=url, out="/path/to/your/directory")
100% [..........................................................................] 2817205 / 2817205>>>
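
If you would rather not look up the link by hand each time, the same .pdf URL appears in the `src` attribute of the `<iframe>` in the HTML you received, so it can be pulled out in code. The following is only a minimal sketch using the standard library; the names `IframeSrcParser` and `extract_pdf_url` are made up for illustration, and it assumes the wrapper page keeps the layout shown in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_pdf_url(html, base_url):
    """Return the first iframe src that points at a .pdf file, or None.

    Assumption: the wrapper page embeds the PDF in an <iframe>, as in the
    HTML pasted in the question.
    """
    parser = IframeSrcParser()
    parser.feed(html)
    for src in parser.sources:
        # Resolve relative links against the page URL.
        absolute = urljoin(base_url, src)
        # Drop the query string so the result is the bare .pdf link.
        bare = absolute.split("?", 1)[0]
        if bare.lower().endswith(".pdf"):
            return bare
    return None
```

With `page` holding the HTML text of the stamp.jsp response and `stamp_url` the URL it came from, `wget.download(url=extract_pdf_url(page, stamp_url), out="/path/to/your/directory")` would then fetch the file as in the session above. Whether the site serves the same wrapper page, or the PDF, to a script without a browser session or a subscription is something this sketch cannot guarantee.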