Regex to Parse XML - RSS Feed
<atom:link rel="self" href="http://www.independent.co.uk/"/>
<item>
<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump's desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>
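From the content above, I want to extract the title, link, and description.
How do I write a regex rule that captures these?
The end goal is to dump the extracted content into a predefined SQL database I have created.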
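As suggested in the comments, you should most likely use an XML parser rather than a regex. However, since the format of the RSS feed is probably consistent and very simple, a regex solution can also work.
For the current example, you can use: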
<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>
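Explanation:
<(.+)>
- matches the opening tag and captures its name
\s*
- matches optional whitespace (the newlines in your example)
(?:<!\[CDATA\[)?
- non-capturing group for <![CDATA[, matched 0 or 1 times
\s*
- matches optional whitespace
(.*)
- capturing group for the content; since . does not match newlines, it captures a single line
\s*
- matches optional whitespace
(?:]]>)?
- non-capturing group for ]]> (the CDATA close), matched 0 or 1 times
\s*
- matches optional whitespace
<\/\1>
- matches the closing tag with the same name as the opening tag (\1 is a backreference to the first capturing group)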
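The pattern explained above relies on each value sitting on its own line. As a hedged sketch of one possible variant (not part of the original answer), the tag name can be limited to the three wanted fields and the content captured lazily, so that values spanning several lines, or CDATA written on a single line, are still extracted. The names `fieldRegex` and `extractFields` are made up for this illustration.

```javascript
// Assumed variant of the pattern above:
// - only title, link and description are matched, so wrapper tags such as <item> are skipped
// - [\s\S]*? is lazy and matches newlines, so it stops before an optional "]]>" and the closing tag
const fieldRegex = /<(title|link|description)>\s*(?:<!\[CDATA\[)?\s*([\s\S]*?)\s*(?:\]\]>)?\s*<\/\1>/g;

// Returns an object keyed by tag name; a later tag with the same name overwrites an earlier one.
function extractFields(xml) {
  const fields = {};
  for (const match of xml.matchAll(fieldRegex)) {
    fields[match[1]] = match[2];
  }
  return fields;
}

const sampleXml =
  "<item>\n<title>\nHello\n</title>\n" +
  "<description><![CDATA[Some text]]></description>\n</item>";
const extracted = extractFields(sampleXml);
console.log(extracted);
```

Because the object is keyed by tag name, this sketch only suits input that holds a single item; a feed with many items would need to be split per item first.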
let input = `<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump's desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>`;
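// \1 is a backreference to the tag name captured by the first group
let regex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/g;
let result;
// With the g flag, each exec() call resumes after the previous match
while ((result = regex.exec(input)) !== null) {
  console.log(result[1] + ": " + result[2]);
}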
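Since the stated goal is to load the extracted values into a SQL database, the matches can be gathered into one row and paired with a parameterized INSERT statement. The following is only a sketch under assumptions: the table name `feed_items`, the column names, and the helper `toInsert` are hypothetical, and no particular database driver is assumed. The resulting `sql` and `params` would be passed to whatever client library is in use.

```javascript
// Hypothetical table and column names; adjust them to the real schema.
const TABLE = "feed_items";
const COLUMNS = ["title", "link", "description"];

// Same pattern as in the answer, with the \1 backreference to the tag name.
const tagRegex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/g;

// Builds a parameterized INSERT for one item; missing fields become null.
function toInsert(xml) {
  const row = {};
  for (const [, name, value] of xml.matchAll(tagRegex)) {
    if (COLUMNS.includes(name)) {
      row[name] = value.trim();
    }
  }
  const placeholders = COLUMNS.map(() => "?").join(", ");
  return {
    sql: `INSERT INTO ${TABLE} (${COLUMNS.join(", ")}) VALUES (${placeholders})`,
    params: COLUMNS.map((column) => row[column] ?? null),
  };
}

const statement = toInsert(
  "<title>\nExample title\n</title>\n<link>\nhttps://example.com/a\n</link>"
);
console.log(statement.sql);
console.log(statement.params);
```

Using placeholders instead of concatenating the values into the SQL string keeps quotes in titles (such as the apostrophe in "Trump's") from breaking the statement.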