Perl: Scraping a website and downloading PDF files from it using Selenium::Chrome
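I'm learning to scrape websites with Selenium::Chrome in Perl, and I'd like to know how to download all the PDF files for the years 2017 to 2021 from this website and store them in a folder: https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021. This is what I have so far: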
use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path qw(make_path remove_tree);
use LWP::Simple;
my $collection_name = "mre_zen_test3";
make_path("$collection_name");
#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;
#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);
#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");
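This script only downloads the full content of the page. I hope someone here can help and teach me. Thank you very much.
Here is some working code that will hopefully help you get started: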
use warnings;
use strict;
use feature 'say';

use Path::Tiny;  # only convenience
use Selenium::Chrome;

my $base_url = q(https://www.fda.gov/drugs/)
    . q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);

my $show = 1;  # to see navigation. set to false for headless operation

# A little demo of how to set some browser options
my %chrome_capab = do {
    my @cfg = ($show)
        ? ('window-position=960,10', 'window-size=950,1180')
        : 'headless';
    'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
};

my $drv = Selenium::Chrome->new( %chrome_capab );

my @years = 2017..2021;
foreach my $year (@years) {
    my $url = $base_url . "untitled-letters-$year";

    $drv->get($url);
    say "\nPage title: ", $drv->get_title;
    sleep 1 if $show;

    # Locate the first PDF link labeled 'Untitled Letter' on the page
    my $elem = $drv->find_element(
        q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
    );
    sleep 1 if $show;

    # Downloading the file is surprisingly not simple with Selenium (see text)
    # But as we found the link we can get its url and then use Selenium-provided
    # user-agent (it's LWP::UserAgent)
    my $href = $elem->get_attribute('href');
    say "pdf's url: $href";

    my $response = $drv->ua->get($href);
    die $response->status_line if not $response->is_success;
    say "Downloading 'Content-Type': ", $response->header('Content-Type');

    my $filename = "download_$year.pdf";
    say "Save as $filename";
    # A PDF is binary data, so write it without any I/O layers
    path($filename)->spew_raw( $response->decoded_content );
}
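This takes shortcuts, switches approaches, and sidesteps some issues (which would need to be resolved to make fuller use of this useful tool). It downloads one pdf from each page; to download all of them we need to change the XPath expression used to locate them: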
my @hrefs =
    map { $_->get_attribute('href') }
    $drv->find_elements(
        # There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
        q{//li[contains(text(), '(PDF)')]}
        . q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]}
    );
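Now loop over the links, form the filenames more carefully, and download each one as in the program above. I can fill in the gaps further if there's a need for that.
The code puts the pdf files on disk in its working directory. Please review that before running it, to make sure that nothing gets overwritten!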
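As a rough sketch of that loop (not part of the program above): the helper names `filename_for` and `download_all`, the `untitled_letter_<year>_<id>.pdf` naming scheme, and the assumption that each link looks like `.../media/<id>/download` are all illustrative choices. It saves into a folder and skips files that already exist, so nothing is overwritten. The `:content_file` option of `LWP::UserAgent`'s `get` writes the response body straight to disk.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: build a local file name from a link such as
# https://www.fda.gov/media/12345/download (assumed URL shape).
sub filename_for {
    my ($href, $year, $seq) = @_;
    my ($id) = $href =~ m{/media/(\d+)/download};
    $id //= sprintf 'n%03d', $seq;    # fall back to a running number
    return "untitled_letter_${year}_$id.pdf";
}

# Hypothetical helper: download every link into $dir using the given
# user agent (for example $drv->ua, which is an LWP::UserAgent).
sub download_all {
    my ($ua, $dir, $year, @hrefs) = @_;
    make_path($dir);                  # no-op if the folder already exists
    my $seq = 0;
    for my $href (@hrefs) {
        my $file = File::Spec->catfile($dir, filename_for($href, $year, ++$seq));
        if (-e $file) {               # never overwrite an existing file
            warn "Skipping existing $file\n";
            next;
        }
        my $res = $ua->get($href, ':content_file' => $file);
        warn "Failed $href: ", $res->status_line, "\n" unless $res->is_success;
    }
    return;
}
```

Inside the per-year loop this could be called as `download_all($drv->ua, "letters_$year", $year, @hrefs);`, where the folder name is again only an example.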
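See Selenium::Remote::Driver for starters.
Note: there is no need for Selenium for this particular task; it's all straight-up HTTP requests, no JavaScript. So LWP::UserAgent or Mojo would do just fine. But I take it that you want to learn how to use Selenium, since it is often needed and is useful.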
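For comparison, a minimal sketch of the Selenium-free route with LWP::UserAgent might look like the following. The helper names `media_links` and `fetch_year` are hypothetical, and the regex assumes the links appear in the HTML as `href="/media/<id>/download..."`; a proper HTML parser would be more robust than a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the '/media/<id>/download' hrefs out of a
# page's HTML, drop duplicates, and make them absolute with $host.
sub media_links {
    my ($html, $host) = @_;
    my %seen;
    return map  { $host . $_ }
           grep { !$seen{$_}++ }
           $html =~ m{href="(/media/\d+/download[^"]*)"}g;
}

# Hypothetical helper: fetch one year's page with an LWP::UserAgent
# object and return the absolute PDF links found on it.
sub fetch_year {
    my ($ua, $year) = @_;
    my $host = 'https://www.fda.gov';
    my $page = "$host/drugs/"
        . 'warning-letters-and-notice-violation-letters-pharmaceutical-companies/'
        . "untitled-letters-$year";
    my $res = $ua->get($page);
    die $res->status_line unless $res->is_success;
    return media_links($res->decoded_content, $host);
}
```

With `use LWP::UserAgent;` loaded, `my @links = fetch_year(LWP::UserAgent->new, 2021);` would return the links for one year, which can then be downloaded with the same user agent.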