Parse HTML Audio Tags Using HTML::TokeParser
I am trying to write a spider in Perl that parses all audio tags in a domain and attempts to download the audio/mpeg content referenced by each audio tag it finds.
Below is my code snippet, which uses HTML::TokeParser to parse the HTML and extract the links from a tags:
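use LWP::UserAgent;
use HTML::TokeParser;
use URI;

# Fetch the page first: base, content_ref and request are methods of the response object, not of a URL string.
my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://example.com/page-with-some-audio-content');
die $response->status_line unless $response->is_success;
my $base    = URI->new( $response->base )->canonical;
my $stream  = HTML::TokeParser->new( $response->content_ref );
my $pageURL = URI->new( $response->request->uri );
# Print the href of every <a> start tag that has one.
while ( my $tag = $stream->get_tag('a') ) {
    next unless defined( my $url = $tag->[1]{'href'} );
    print $url . "\n";
}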
The snippet above extracts all links from a given HTML page. It runs in a loop, together with a hash of URLs, to crawl every page in a given domain.
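A minimal sketch of such a crawl loop, not the actual spider: the `%pages` hash is a hypothetical in-memory stand-in for fetching pages over HTTP, and `extract_links`, `%seen`, and `@queue` are names chosen for illustration. It shows how a hash of already-seen URLs keeps the loop from revisiting pages or leaving the domain.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical in-memory site (URL => HTML); a real spider would fetch these.
my %pages = (
    'http://example.com/'  => '<a href="/a">A</a> <a href="http://other.example.org/">ext</a>',
    'http://example.com/a' => '<a href="/">home</a> <a href="b">B</a>',
    'http://example.com/b' => '<p>no links here</p>',
);

# Return every href on a page as an absolute, canonical URL string.
sub extract_links {
    my ( $html, $base ) = @_;
    my $stream = HTML::TokeParser->new( \$html ) or die "cannot parse HTML";
    my @links;
    while ( my $tag = $stream->get_tag('a') ) {
        next unless defined( my $href = $tag->[1]{href} );
        push @links, URI->new_abs( $href, $base )->canonical->as_string;
    }
    return @links;
}

my $start = 'http://example.com/';
my $host  = URI->new($start)->host;
my %seen  = ( $start => 1 );    # URLs that have already been queued
my @queue = ($start);
my @visited;

while ( defined( my $url = shift @queue ) ) {
    my $html = $pages{$url};
    next unless defined $html;    # unknown page: nothing to parse
    push @visited, $url;
    for my $link ( extract_links( $html, $url ) ) {
        my $uri = URI->new($link);
        next unless $uri->can('host') && $uri->host eq $host;    # stay on this domain
        next if $seen{$link}++;                                   # skip repeats
        push @queue, $link;
    }
}

print "$_\n" for @visited;
```

With the sample data this visits the three example.com pages once each and never queues the external link.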
Below is another snippet, almost identical to the first, except that I am trying to extract audio tags instead of a tags:
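use LWP::UserAgent;
use HTML::TokeParser;
use URI;

# Fetch the page first: base, content_ref and request are methods of the response object, not of a URL string.
my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://example.com/page-with-some-audio-content');
die $response->status_line unless $response->is_success;
my $base    = URI->new( $response->base )->canonical;
my $stream  = HTML::TokeParser->new( $response->content_ref );
my $pageURL = URI->new( $response->request->uri );
# Print the onplaying attribute of every <audio> start tag that has one.
while ( my $tag = $stream->get_tag('audio') ) {
    next unless defined( my $url = $tag->[1]{'onplaying'} );
    print $url . "\n";
}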
For some reason, the audio tags are not detected. Is there something I am missing?
From reading the HTML::TokeParser documentation, I gather that I cannot extract attributes of nested HTML elements.
Consider the following markup:
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>
I want to parse the whole HTML and extract only the src attribute for every audio tag found. So if the HTML looks like this:
<body>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 2.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%202.mp3">
</audio>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 3.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%203.mp3">
</audio>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 4.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%204.mp3">
</audio>
</body>
The expected output should look like this:
http://www.example.com/mp3/Some%20Mp3%20File.mp3
http://www.example.com/mp3/Some%20Mp3%20File%202.mp3
http://www.example.com/mp3/Some%20Mp3%20File%203.mp3
http://www.example.com/mp3/Some%20Mp3%20File%204.mp3
So I need to parse HTML files and extract only the src attribute for each audio tag present.
I am not familiar with HTML::TokeParser, but Mojo::DOM from Mojolicious can be used to easily find and extract the links using familiar CSS syntax:
use Mojo::DOM;
my $html = '<body> ... ';
my $dom = Mojo::DOM->new($html);
my @src = map { $_->{src} }
$dom->find('audio[onplaying] source[src]')->each;
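As a self-contained variant of the snippet above, here is a sketch with two of the question's audio elements inlined in place of the elided `$html`. The heredoc contents and the final `print` are additions for illustration; the selector is the same one.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup copied from the question (first two audio elements only).
my $html = <<'HTML';
<body>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 2.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%202.mp3">
</audio>
</body>
HTML

my $dom = Mojo::DOM->new($html);

# Select every <source> with a src attribute nested in an <audio onplaying=...>.
my @src = map { $_->attr('src') }
    $dom->find('audio[onplaying] source[src]')->each;

print "$_\n" for @src;
```

Dropping `[onplaying]` from the selector would also match audio elements that lack that attribute.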
If you need to fetch the HTML files or the audio files from the web, you can also combine this with Mojo::UserAgent.
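For comparison, the same extraction can be sketched with HTML::TokeParser itself. It is a tokenizer that returns tags in document order, so `get_tag` accepts several tag names and a nested `source` tag comes back like any other start tag. The inlined markup and the `@src` array are assumptions for illustration, and this does not diagnose why the original spider saw no audio tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup copied from the question (first two audio elements only).
my $html = <<'HTML';
<body>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File.mp3">
</audio>
<audio onplaying="podPress_html5_count('http://www.example.com/mp3/Some Mp3 File 2.mp3', this.id)">
<source src="http://www.example.com/mp3/Some%20Mp3%20File%202.mp3">
</audio>
</body>
HTML

my $stream = HTML::TokeParser->new( \$html ) or die "cannot parse HTML";
my @src;

# Skip ahead to each <audio> start tag.
while ( $stream->get_tag('audio') ) {
    # Read on until the matching </audio>, keeping the src of each <source>.
    while ( my $tag = $stream->get_tag( 'source', '/audio' ) ) {
        last if $tag->[0] eq '/audio';
        push @src, $tag->[1]{src} if defined $tag->[1]{src};
    }
}

print "$_\n" for @src;
```

The inner loop stops at `/audio`, so a `source` tag that belongs to some other element, such as `video`, is not collected.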