Regex to Parse XML - RSS Feed

Question

<atom:link rel="self" href="http://www.independent.co.uk/"/>
<item>
<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>

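From the content above, I want to extract the title, link, and description. How should I write a regex to capture these?

The end goal is to dump the extracted content into a predefined SQL database that I have created.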

Answer 1

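As suggested in the comments, you should most likely use an XML parser rather than a regex. That said, since the format of an RSS feed is likely to be consistent and quite simple, a regex solution could also work.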

For the current example, you could use:

<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>

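Explanation:

<(.+)> - matches the opening tag and captures its name
\s* - matches optional whitespace (the newlines in your example)
(?:<!\[CDATA\[)? - non-capturing group for <![CDATA[, matched 0 or 1 times
\s* - matches optional whitespace
(.*) - capturing group that captures any characters (except line breaks)
\s* - matches optional whitespace
(?:]]>)? - non-capturing group for ]]> (the CDATA close), matched 0 or 1 times
\s* - matches optional whitespace
<\/\1> - matches a closing tag with the same name as the opening tag (\1 is a backreference to the first capturing group)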
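
To see the backreference in isolation, here is a minimal sketch with a simplified pattern. The name `tagPair` and the sample strings are made up for illustration; `\w+` is used for the tag name instead of `.+`, which is an assumption that the tags carry no attributes.

```javascript
// \1 must repeat exactly what group 1 captured, so a closing tag
// with a different name is rejected.
const tagPair = /<(\w+)>(.*?)<\/\1>/;

console.log(tagPair.test("<title>Hello</title>"));     // true
console.log(tagPair.test("<title>Hello</link>"));      // false
console.log("<title>Hello</title>".match(tagPair)[2]); // Hello
```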

let input = `<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>`;

let regex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/g;

let result;
do {
  result = regex.exec(input);
  if (result) {
    console.log(result[1] + ": " + result[2]);
  }
} while (result);
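
Since the end goal is a SQL insert, the matches could be collected into an object instead of being logged. The following is only a sketch under a few assumptions: the helper names `decodeEntities`, `extractFields`, and `toInsert` are hypothetical, as are the table name `articles` and its columns. The pattern is a slightly adapted variant (`\w+` for the tag name, a lazy `[\s\S]*?` so multi-line values are allowed), and it expects a fragment of leaf elements like the one above, not a whole `<item>...</item>` wrapper.

```javascript
// Variant of the pattern above: tag name without attributes, lazy multi-line value.
const FIELD_REGEX = /<(\w+)>\s*(?:<!\[CDATA\[)?\s*([\s\S]*?)\s*(?:\]\]>)?\s*<\/\1>/g;

// Decode the five predefined XML entities in a single pass.
function decodeEntities(text) {
  const entities = { "&apos;": "'", "&quot;": '"', "&lt;": "<", "&gt;": ">", "&amp;": "&" };
  return text.replace(/&(?:apos|quot|lt|gt|amp);/g, (m) => entities[m]);
}

// Build an object such as { title: "...", link: "...", description: "..." }.
function extractFields(xml) {
  const fields = {};
  for (const match of xml.matchAll(FIELD_REGEX)) {
    fields[match[1]] = decodeEntities(match[2]);
  }
  return fields;
}

// Hypothetical table and column names. Placeholder syntax ("?") varies by driver.
function toInsert(fields) {
  return {
    sql: "INSERT INTO articles (title, link, description) VALUES (?, ?, ?)",
    params: [fields.title, fields.link, fields.description],
  };
}
```

Calling `toInsert(extractFields(input))` gives a statement and a parameter array that can be handed to whichever database driver you use. Passing the values as parameters, rather than concatenating them into the SQL string, avoids quoting problems with text such as "Trump's desk".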

regex

rss

parsing

xml-parsing