Content from Puppeteer script not being fetched on Ubuntu
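Scenario
- I have a Puppeteer script that takes a URL and a JSON object as command-line arguments.
- It is called from a PHP script embedded in an HTML file.
- The Puppeteer script goes to the URL, gets the page content, and prints it with console.log, so the content becomes available in the HTML that contains the PHP script.
Problem
- On Windows it runs perfectly and gives the desired output. However, I have now moved my project to Ubuntu, and that is where the trouble started.
- The content is not loaded, and I end up with a blank page.
- When I run the Puppeteer script from the console, it works fine and logs the page content. When I call it from the PHP script using system(), it does not. The same PHP script works perfectly on Windows.
Here is my PHP code: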
<?php
if(isset($_POST['url'])){
$split_url = str_replace('.', '_', explode('/', $_POST['url']));
$dir = "site_config/";
$content = '';
if( is_dir($dir) ){
if ($dh = opendir($dir)){
while (($file = readdir($dh)) !== false){
$filename = str_replace('.txt','',$file);
if($filename === $split_url[2]){
$content = file_get_contents($dir.$split_url[2].'.txt');
}
}
closedir($dh);
}
}
echo '<script>document.getElementById("website-input-form").style.display = "none";</script>';
/*to run with phantomjs*/
//system('phantomjs get_page_phantomjs.js "'.$_REQUEST['url'].'" '.$content);
/*to run with puppeteer*/
system('node get_page_puppeteer.js "'.$_REQUEST['url'].'" '.$content);
// system('node sample.js "'.$_REQUEST['url'].'" '.$content);
}
?>
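You can see in the last line that I also tried running another sample Node.js script. It executes perfectly.
So I don't know, maybe something is wrong with the Puppeteer script?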
const puppeteer = require('puppeteer');
const url = process.argv[2]; //url from command line argument
const json = process.argv[3]; //config content from command line argument
/*_________________________STEP 1____________________________________*/
async function run() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, {
waitUntil: 'networkidle2',
timeout: 3000000
});
await page.evaluate(function(json){
//removing unwanted elements from html content
Array.prototype.slice.call(document.getElementsByTagName("script")).filter(function(script) {
return script.type != "application/ld+json";
}).forEach(function(script) {
script.parentNode.removeChild(script);
});
Array.prototype.slice.call(document.getElementsByTagName("style")).filter(function(style) {
return style.type != "application/ld+json";
}).forEach(function(style) {
style.parentNode.removeChild(style);
});
Array.prototype.slice.call(document.getElementsByTagName("iframe")).filter(function(iframe) {
return iframe.type != "application/ld+json";
}).forEach(function(iframe) {
iframe.parentNode.removeChild(iframe);
});
Array.prototype.slice.call(document.getElementsByTagName("video")).filter(function(video) {
return video.type != "application/ld+json";
}).forEach(function(video) {
video.parentNode.removeChild(video);
});
Array.prototype.slice.call(document.getElementsByTagName("img")).filter(function(img) {
img.setAttribute('style','max-width: 50% !important;');
return img.src.endsWith('.svg') === true;
}).forEach(function(img) {
img.parentNode.removeChild(img);
});
//providing the site's config through an element
var inp = document.createElement('div');
inp.setAttribute('textcontent', json);
inp.setAttribute('id', 'config_available');
var XMLS = new XMLSerializer();
var inp_xmls = XMLS.serializeToString(inp);
document.body.insertAdjacentHTML('afterbegin', inp_xmls);
//injecting the logic script
inp = document.createElement('script');
inp.setAttribute('src', './scraperJavascript.js');
inp.setAttribute('type', 'text/javascript');
XMLS = new XMLSerializer();
inp_xmls = XMLS.serializeToString(inp);
document.body.insertAdjacentHTML('afterbegin', inp_xmls);
}, json)
//rendering page's html
const renderedContent = await page.evaluate(() => new XMLSerializer().serializeToString(document));
console.log(renderedContent);
await browser.close();
}
run();
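But if the script were the problem, why does it run from the console (on both Ubuntu and Windows) and from the PHP script (on Windows), but not from the PHP script (on Ubuntu)?
Update
I added an exception check on the Puppeteer side. An exception is indeed thrown, and this is its message: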
Error: Failed to launch chrome! [0608/095818.625603:ERROR:icu_util.cc(133)] Invalid file descriptor to ICU data received. [0608/095818.625662:FATAL:content_main_delegate.cc(57)] Check failed: false. #0 0x55dc5336182c base::debug::StackTrace::StackTrace() #1 0x55dc532e8290 logging::LogMessage::~LogMessage() #2 0x55dc51598de3 content::ContentMainDelegate::TerminateForFatalInitializationError() #3 0x55dc53017941 content::ContentMainRunnerImpl::Initialize() #4 0x55dc53021c12 service_manager::Main() #5 0x55dc53016184 content::ContentMain() #6 0x55dc571eea39 headless::(anonymous namespace)::RunContentMain() #7 0x55dc571eeac2 headless::HeadlessBrowserMain() #8 0x55dc5301ef8f headless::HeadlessShellMain() #9 0x55dc515971ac ChromeMain #10 0x7f5204329830 __libc_start_main #11 0x55dc5159702a _start TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md at onClose (/var/www/html/master/scraper_puppeteer/node_modules/puppeteer/lib/Launcher.js:255:14) at Interface.helper.addEventListener (/var/www/html/master/scraper_puppeteer/node_modules/puppeteer/lib/Launcher.js:244:50) at emitNone (events.js:111:20) at Interface.emit (events.js:208:7) at Interface.close (readline.js:370:8) at Socket.onend (readline.js:149:10) at emitNone (events.js:111:20) at Socket.emit (events.js:208:7) at endReadableNT (_stream_readable.js:1055:12) at _combinedTickCallback (internal/process/next_tick.js:138:11)
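A minimal sketch of such an exception check, assuming the goal is to make a launch failure visible in the PHP output instead of ending in a blank page. The helper name runWithReport and the PUPPETEER_ERROR prefix are hypothetical and not part of Puppeteer; the message is written to standard output on the assumption that this is what system() passes through to the page.

```javascript
// Hypothetical helper: run an async task and turn any thrown error into a
// printable report instead of an unhandled promise rejection.
async function runWithReport(task) {
  try {
    const output = await task();
    return { ok: true, output: output };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, output: 'PUPPETEER_ERROR: ' + message };
  }
}

// Stand-in for the run() function above; it fails the way a launch failure would.
async function failingRun() {
  throw new Error('Failed to launch chrome!');
}

runWithReport(failingRun).then(function (result) {
  if (!result.ok) {
    // Print to stdout so the message shows up in the PHP-rendered page.
    console.log(result.output);
  }
});
```

In the real script, run would be passed in place of failingRun.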
Your server may not allow the system() function. To check:
<?php echo ini_get('disable_functions');
Here is a possible result:
exec,passthru,shell_exec,system,proc_open,popen,curl_multi_exec,parse_ini_file,show_source
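Another possible reason: if safe mode is enabled, you can only execute files inside that mode's special directory (safe mode was removed in PHP 5.4, so this applies only to older versions):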
<?php
if(ini_get("safe_mode")) {
echo "Can only execute files inside of this dir: " . ini_get("safe_mode_exec_dir");
}
If the checks above pass, it may be a permissions problem. Modify your system() call to capture the exit code; if it is non-zero, something went wrong:
system('node get_page_puppeteer.js "'.$_REQUEST['url'].'" '.$content, $return_code);
var_dump($return_code);
0: no error
126: the command was found but is not executable
127: the system cannot find the command you are calling
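To see the same exit codes outside of PHP, a small Node.js sketch can run a command and interpret its status. describeExitCode is a hypothetical helper whose table covers only the codes listed above, and the child process here is Node itself exiting with a chosen code, standing in for the real command.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: map an exit code to the meanings listed above.
function describeExitCode(code) {
  if (code === 0) return 'no error';
  if (code === 126) return 'command was found but is not executable';
  if (code === 127) return 'command was not found';
  return 'command failed with exit code ' + code;
}

// Stand-in child process: Node exiting with code 127.
const child = spawnSync(process.execPath, ['-e', 'process.exit(127)']);
console.log(child.status + ': ' + describeExitCode(child.status));
```

Any other non-zero code falls through to the generic message, which is usually the cue to look at the command's own error output.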
Update
From your Puppeteer error:
Failed to launch chrome! [...] Invalid file descriptor to ICU data received
According to this issue, it may be a permissions problem. That user's solution was:
cd /usr/local/lib/node_modules/puppeteer/.local-chromium
find . -type d | xargs -L1 -Ixx sudo chmod 755 xx
find . -type f -perm /u+x | xargs -L1 -Ixx sudo chmod 755 xx
find . -type f -not -perm /u+x | xargs -L1 -Ixx sudo chmod 644 xx
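Before changing permissions, it may help to confirm that the user running the script can actually read and execute the browser binary. The sketch below assumes the web server runs PHP as a different user than your shell (often www-data on Ubuntu), which would explain why the console works and system() does not. checkAccess is a hypothetical helper, and the path checked here is the Node binary as a stand-in; in the real script you could pass puppeteer.executablePath() instead. Note that the stack trace in the question shows Puppeteer installed under /var/www/html/master/scraper_puppeteer/node_modules/puppeteer, so the .local-chromium directory to fix may be there rather than under /usr/local/lib.

```javascript
const fs = require('fs');

// Hypothetical helper: report whether the current user can read and execute a file.
function checkAccess(filePath) {
  const result = { readable: true, executable: true };
  try {
    fs.accessSync(filePath, fs.constants.R_OK);
  } catch (err) {
    result.readable = false;
  }
  try {
    fs.accessSync(filePath, fs.constants.X_OK);
  } catch (err) {
    result.executable = false;
  }
  return result;
}

// Stand-in path: the Node binary itself. Swap in the Chromium path to test it.
const uid = process.getuid ? 'uid ' + process.getuid() : 'uid unavailable';
console.log(uid, checkAccess(process.execPath));
```

Running it once from the console and once through system() lets you compare the user id and access results in the two environments.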