BeautifulSoup 4: spans containing '@' return strange results
I was able to get the desired list of spans using:
attrs = soup.find_all("span")
This returns lists of spans as key and value pairs:
[
<span>back camera resolution</span>,
<span class="even">12 MP</span>
]
[
<span>front camera resolution</span>,
<span class="even">16 MP</span>
]
[
<span>video resolution</span>,
<span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return 
t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script>
</span>
]
The original HTML is:
Why is 'video resolution' transformed like this?
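The site is using the CloudFlare email protection feature, which appears to have replaced every string containing an @ with an obfuscated (XOR-encrypted) value, to thwart scrapers that harvest email addresses. Each replacement includes JavaScript code to decode it.
BeautifulSoup does not execute JavaScript, but your browser did execute it and replaced each <a class="__cf_email__"> tag with the resulting decrypted text.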
You can do the same thing with a small Python 3 function; all the JavaScript code does is 'decrypt' the (hex-encoded) value, using the first byte as the key in a simple XOR decryption routine:
def decode(cfemail):
    # the first byte is the XOR key; the remaining bytes are the obfuscated text
    enc = bytes.fromhex(cfemail)
    return bytes([c ^ enc[0] for c in enc[1:]]).decode('utf8')

def deobfuscate_cf_email(soup):
    for encrypted_email in soup.select('a.__cf_email__'):
        decrypted = decode(encrypted_email['data-cfemail'])
        # remove the <script> tag from the tree, if one follows the link
        script_tag = encrypted_email.find_next_sibling('script')
        if script_tag is not None:
            script_tag.decompose()
        # replace the <a class="__cf_email__"> tag with the decoded result
        encrypted_email.replace_with(decrypted)
To do the above in Python 2, replace bytes with bytearray.
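To make the XOR step concrete, here is a minimal standalone sketch that decodes the first data-cfemail value from the question using only the standard library. The names xor_decode and xor_encode are hypothetical and chosen for illustration; xor_encode is just an assumed inverse used to check the round trip, not something CloudFlare documents.

```python
def xor_decode(cfemail_hex):
    """Decode a data-cfemail value: byte 0 is the key, the rest is XOR-ed text."""
    raw = bytes.fromhex(cfemail_hex)
    key, payload = raw[0], raw[1:]
    return bytes(b ^ key for b in payload).decode("utf8")


def xor_encode(text, key):
    """Hypothetical inverse of xor_decode, used only to check the round trip."""
    payload = bytes(b ^ key for b in text.encode("utf8"))
    return (bytes([key]) + payload).hex()


value = "b98b888f89c9f98a89dfc9ca"
print(hex(bytes.fromhex(value)[0]))              # key byte: 0xb9
print(xor_decode(value))                         # 2160p@30fps
print(xor_encode("2160p@30fps", 0xB9) == value)  # True
```

Because XOR is its own inverse, applying the same key byte twice returns the original text, which is why the round trip holds.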
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <span>video resolution</span>,
... <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return 
t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script>
... </span>
... ''')
>>> deobfuscate_cf_email(soup)
>>> soup
<html><body><span>video resolution</span>,
<span class="even">2160p@30fps - 1080p@30fps - 720@120fps
</span>
</body></html>
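If you would rather clean the raw markup before it reaches a parser, the same decoding can be applied with a regular expression. This is only a rough sketch under the assumption that the markup looks like the snippet in the question (an <a> carrying data-cfemail, optionally followed by a <script data-cfhash=...> decoder); regexes are brittle against HTML in general, and CF_EMAIL_RE and deobfuscate_html are hypothetical names.

```python
import html
import re

# Assumed markup shape, based on the snippet in the question: an <a> tag with
# a data-cfemail attribute, optionally followed by the inline decoder script.
CF_EMAIL_RE = re.compile(
    r'<a\b[^>]*\bdata-cfemail="([0-9a-fA-F]+)"[^>]*>.*?</a>'
    r'(?:\s*<script\b[^>]*\bdata-cfhash=[^>]*>.*?</script>)?',
    re.DOTALL,
)


def decode(cfemail):
    # first byte is the XOR key, the remaining bytes are the obfuscated text
    enc = bytes.fromhex(cfemail)
    return bytes(c ^ enc[0] for c in enc[1:]).decode("utf8")


def deobfuscate_html(markup):
    """Replace each protected <a> (and its decoder script, if present) with text."""
    # escape the decoded text because it is being put back into HTML
    return CF_EMAIL_RE.sub(
        lambda m: html.escape(decode(m.group(1)), quote=False), markup
    )


sample = (
    '<span class="even">'
    '<a class="__cf_email__" data-cfemail="5067626010616260362023" '
    'href="/cdn-cgi/l/email-protection">[email protected]</a>'
    '<script data-cfhash="f9e31" type="text/javascript">/* decoder */</script>'
    '</span>'
)
print(deobfuscate_html(sample))  # <span class="even">720@120fps</span>
```

The cleaned string can then be passed to BeautifulSoup as usual; the tree-based deobfuscate_cf_email above remains the more robust choice when the markup varies.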