Scrape alert text from window alert when alert is shown

Question

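I am using the Python requests library and BeautifulSoup. There is a URL that, when the request is invalid, returns HTML containing an alert() popup. The problem is that with BeautifulSoup I cannot get the window.alert popup text.

I tried the regular expression approach from this answer, but it does not seem to work.

So when I do: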

for script in soup.find_all("script"):
    alert = re.findall(r'(?<=alert\(\").+(?=\")', script.text)

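`script` never ends up being the script that actually gets executed, the one that calls alert().

This is the markup I am extracting from: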

<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>


<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>

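I expect to get the alert text, User ID. I noticed that when the html tag is present, I cannot grab the script below it. If I remove that tag or move the script inside the body tag, I can get: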

<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>

Answer 1

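Solved by using the html5lib parser library. According to the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/, html5lib parses the page the same way a web browser does, so the script outside the body tag can be retrieved.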

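import re
from bs4 import BeautifulSoup

# payload is the HTML text returned by the request.
# html5lib builds the tree the way a browser does, so a <script> that
# appears after </html> is kept and moved into <body>.
soup = BeautifulSoup(payload, 'html5lib')
errors = None
for scr in soup.find_all("script"):
    # The page assigns the message with single quotes: var err='User ID';
    alert = re.findall(r"err\s*=\s*['\"](.*?)['\"]", scr.text)
    if alert:
        errors = alert[0]

print(errors)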
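
If installing html5lib is not an option, something similar can be sketched with the standard library's html.parser.HTMLParser, which reports tags in document order and so also sees a <script> placed after </html>. This is only an illustration of the idea, not part of the answer above: the class name ScriptCollector and the shortened payload are made up for the example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element, wherever it appears."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # finished script bodies, in document order
        self._in_script = False  # True while inside a <script> element
        self._chunks = []        # text pieces of the current script

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


# Hypothetical payload shaped like the markup in the question.
payload = """<script language="JavaScript">
if(top.frames.length != 0) { location.href="frame_break.jsp" }
</script>
<html><body></body></html>
<script>
    var err='User ID';
    alert(err);
</script>"""

collector = ScriptCollector()
collector.feed(payload)
collector.close()

# Keep only the scripts that call alert().
alert_scripts = [s for s in collector.scripts if "alert(" in s]
```

Here collector.scripts holds both script bodies, and alert_scripts keeps only the one that calls alert(), which can then be searched with the same regular expression as above.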

Answer 2

Running BeautifulSoup's diagnose() on your data gives me the following:

data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>


<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>'''

from bs4.diagnose import diagnose

diagnose(data)

Prints:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
<html>
 <body>
 </body>
</html>
<script>
 var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
</script>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
  <script>
   var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
--------------------------------------------------------------------------------

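From this I can see that the lxml parser does not parse the last <script>, so you can never reach it through BeautifulSoup. The solution is to use a different parser, for example html.parser: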

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')


# Note: recent soupsieve releases rename this selector to :-soup-contains().
for script in soup.select('script:contains(alert)'):
    alert = re.findall(r'(?<=alert\().+(?=\))', script.text)
    print(alert)

Prints:

['err']
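
The output above is the argument expression, err, rather than the message itself. When the argument is a variable that is assigned a string literal in the same script, as in the question's markup, the value can be looked up with a second regular expression. The following is a minimal sketch under that assumption. The helper name resolve_alert_text is hypothetical, and since it does not evaluate JavaScript, concatenated or computed messages come back as None.

```python
import re


def resolve_alert_text(script_text):
    """Return the string shown by the first alert(...) call, or None."""
    call = re.search(r"alert\(\s*(.+?)\s*\)", script_text)
    if call is None:
        return None
    arg = call.group(1)

    # Case 1: the argument is itself a quoted literal, e.g. alert("Oops").
    literal = re.fullmatch(r"""(['"])((?:(?!\1).)*)\1""", arg)
    if literal:
        return literal.group(2)

    # Case 2: the argument is a variable assigned a quoted literal
    # somewhere in the same script, e.g. var err='User ID';
    assign = re.search(
        r"""\b%s\s*=\s*(['"])(.*?)\1""" % re.escape(arg), script_text
    )
    return assign.group(2) if assign else None


script = """
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
"""
print(resolve_alert_text(script))
```

For the script in the question this prints User ID. In the loop above, the helper could be called on script.text in place of the re.findall line.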
