How to extract all the text under a particular header from an HTML file
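I am trying to extract all the text under a particular header from an HTML file.
I would like to use the xmllint utility for this.
I am working in a Linux environment.
Here is the HTML file: https://kernelnewbies.org/Linux_3.6#Block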
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="viewport" content="width=device-width,initial-scale=1.0">
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<meta name="keywords" content="Linux, kernel, operating system, changes, changelog, file system, Linus Torvalds, open source, device drivers">
<meta name="description" content="Summary of the changes and new features merged in the Linux kernel during the 3.6 development cycle">
<meta name="robots" content="index,nofollow">
<title>Linux_3.6 - Linux Kernel Newbies</title>
<script type="text/javascript" src="/moin_static199/common/js/common.js"></script>
<script type="text/javascript">
<!--
var search_hint = "Search";
.............
.............
<h1 id="Block">5. Block</h1>
<span class="anchor" id="line-118"></span><ul><li><p class="line862">Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use <a class="http" href="http://git.kernel.org/linus/c83f6bf98dc1f1a194118b3830706cebbebda8c4">(commit)</a> <span class="anchor" id="line-119"></span></li><li><p class="line862">Device mapper RAID: Add support for MD RAID10 <a class="http" href="http://git.kernel.org/linus/63f33b8dda88923487004b20fba825486d009e7b">(commit)</a> <span class="anchor" id="line-120"></span></li><li><p class="line862">Device mapper thin: add read-only and fail I/O modes <a class="http" href="http://git.kernel.org/linus/e49e582965b3694f07a106adc83ddb44aa4f0890">(commit)</a> <span class="anchor" id="line-121"></span></li><li><p class="line862">Device mapper: remove persistent data debug space map checker <a class="http" href="http://git.kernel.org/linus/3caf6d73d4dc163b2d135e0b52b052a2b63e5216">(commit)</a> <span class="anchor" id="line-122"></span></li><li><p class="line862">md/raid1: prevent merging too large request <a class="http" href="http://git.kernel.org/linus/12cee5a8a29e7263e39953f1d941f723c617ca5f">(commit)</a> <span class="anchor" id="line-123"></span><span class="anchor" id="line-124"></span></li></ul><p class="line867">
.............
.............
},
clickWrapper: function () {
if ( ($(this).attr('href') === location.hash)
|| !('onhashchange' in window.document.body) ) {
setTimeout(function () { $(window).trigger("hashchange"); }, 1);
}
},
};
$('#pagebox a[href^="#"]:not([href="#"])').on("click", mdAnchorFix.clickWrapper);
$(window).on("hashchange", mdAnchorFix.jump);
if (location.hash) setTimeout(function () { mdAnchorFix.jump(); }, 100);
}(jQuery);
</script>
<!-- End of JavaScript -->
</body>
</html>
I want the following output:
5. Block
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use (commit)
Device mapper RAID: Add support for MD RAID10 (commit)
Device mapper thin: add read-only and fail I/O modes (commit)
Device mapper: remove persistent data debug space map checker (commit)
md/raid1: prevent merging too large request (commit)
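Basically, I want to write a script that extracts the descriptions of the block layer changes in different kernel versions from the Kernel Newbies website.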
As mentioned in the comments on your question, you can use xidel instead of xmllint to do most of the work. So try this:
xidel https://kernelnewbies.org/Linux_3.6 --extract '//h1[@id="Block"]/following-sibling::ul[1]//p/text()'
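If xidel is not available, the same idea can be sketched with Python's standard library alone. This is only a minimal sketch under stated assumptions, not a definitive implementation: `SectionExtractor`, `extract_section`, and the shortened `sample` markup are names and data made up for this illustration. It approximates the XPath above by taking the first `<ul>` that appears after the matching `<h1>` in document order (not strictly a sibling). Note that `p/text()` returns only the text nodes directly inside each `<p>`, so link text such as "(commit)" is left out; the sketch below keeps it and also emits the header line.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect the text of <h1 id=...> and of each <li> in the first <ul> after it.

    Hypothetical helper for illustration; it is event based and does not build a DOM.
    """

    def __init__(self, header_id):
        super().__init__(convert_charrefs=True)
        self.header_id = header_id
        self.lines = []        # finished output lines
        self._buf = None       # text pieces of the element being collected, or None
        self._state = "seek"   # seek -> header -> after_header -> list -> done
        self._ul_depth = 0     # nesting depth inside the target <ul>

    def _flush(self):
        # Join the collected pieces, normalise whitespace, and store non-empty text.
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.lines.append(text)
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if self._state == "seek" and tag == "h1" and dict(attrs).get("id") == self.header_id:
            self._state, self._buf = "header", []
        elif self._state == "after_header" and tag == "ul":
            self._state, self._ul_depth = "list", 1
        elif self._state == "list":
            if tag == "ul":
                self._ul_depth += 1
            elif tag == "li" and self._ul_depth == 1:
                self._flush()      # a previous <li> without a closing tag ends here
                self._buf = []

    def handle_endtag(self, tag):
        if self._state == "header" and tag == "h1":
            self._flush()
            self._state = "after_header"
        elif self._state == "list":
            if tag == "li" and self._ul_depth == 1:
                self._flush()
            elif tag == "ul":
                self._ul_depth -= 1
                if self._ul_depth == 0:
                    self._flush()
                    self._state = "done"

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def extract_section(html, header_id):
    """Return the header text followed by one string per top-level list item."""
    parser = SectionExtractor(header_id)
    parser.feed(html)
    parser.close()
    return parser.lines


# Shortened sample based on the markup shown in the question (two of the five items).
sample = '''<h1 id="Block">5. Block</h1>
<span class="anchor" id="line-118"></span><ul><li><p class="line862">Device mapper RAID: Add support for MD RAID10 <a class="http" href="http://git.kernel.org/linus/63f33b8dda88923487004b20fba825486d009e7b">(commit)</a> <span class="anchor" id="line-120"></span></li><li><p class="line862">md/raid1: prevent merging too large request <a class="http" href="http://git.kernel.org/linus/12cee5a8a29e7263e39953f1d941f723c617ca5f">(commit)</a> <span class="anchor" id="line-123"></span><span class="anchor" id="line-124"></span></li></ul><p class="line867">'''

print("\n".join(extract_section(sample, "Block")))
```

For the real page, the HTML could be fetched with `urllib.request.urlopen` and decoded before being passed to `extract_section`; covering several kernel versions would then be a matter of changing the URL.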