Strip tags and replace all br and p tags with a single space

Question

What is the regex to strip all HTML tags, replace <br> and <p> tags with a single space, and remove all line breaks?

For example:

<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>

should become:

Heading hyperlink paragraph1 paragraph2

I have tried the following:

$string = preg_replace( ["/<br\s*\/?>/i","/<\/p\s*>/i"]," ",$string);
$string = preg_replace(["/<\/?[^>]+>/", "/\r?\n|\r/"],"",$string);

This gives me:

Heading              hyperlink         paragraph1 paragraph2

Any ideas for a working one-liner or a more elegant solution?

Answer 1

You can group runs of tags, together with any whitespace around them, and replace each run with a single space. The regex to replace is:

(\s*<[^>]+>\s*)+

This gives you a single space in place of each run of tags. Finally, use trim() to get rid of the leading and trailing spaces, which you probably don't want.

Demo

Here is the PHP code for the demo:

$html = '<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';

echo trim(preg_replace("/(\s*<[^>]+>\s*)+/", " ", $html));

Prints:

Heading hyperlink paragraph1 paragraph2

Answer 2

You can use this:

<\s*\/?\s*br[^>]*>|<\s*\/?\s*p[^>]*>|\n

Explanation

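<\s*\/?\s*br[^>]*> - matches an opening or closing br tag (for example <br>, </br> or <br />), allowing any amount of whitespace and any attributes.
<\s*\/?\s*p[^>]*> - matches an opening or closing p tag (for example <p>, </p> or < p >), allowing any amount of whitespace and any attributes.
\n - matches a newline.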

Demo
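
The pattern above only covers br tags, p tags and newlines, so on its own it leaves the other tags in place. Here is a minimal sketch of how it might be combined with two further steps to get the output the question asks for. The function name stripBrAndP and steps 2 and 3 are assumptions added for illustration, not part of this answer. Note also that p[^>]* matches any tag whose name starts with "p" (such as <pre>), which may or may not be what you want.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer).
function stripBrAndP(string $html): string
{
    // Step 1: the pattern from this answer; br tags, p tags and newlines become a space.
    $text = preg_replace('/<\s*\/?\s*br[^>]*>|<\s*\/?\s*p[^>]*>|\n/i', ' ', $html);
    // Step 2 (assumed): remove every remaining tag without inserting anything.
    $text = preg_replace('/<[^>]+>/', '', $text);
    // Step 3 (assumed): collapse runs of whitespace and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = "<h1>Heading</h1>\n<br>\n<br />\n<a href=\"#\">hyperlink</a>\n<p></p>\n<p>paragraph1</p>\n<p>paragraph2</p>";

echo stripBrAndP($html); // Heading hyperlink paragraph1 paragraph2
```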

Answer 3

You can keep what you already have and then collapse the extra whitespace:

$stripped = preg_replace('/\s+/', ' ', $string);

That returns:

Heading hyperlink paragraph1 paragraph2
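
For reference, this is a sketch of how that line slots in after the two calls from the question. The trim() call is an addition that is not in this answer: the question's first call turns the final </p> into a space, so a trailing space would otherwise remain.

```php
<?php
// Sample input from the question.
$string = '<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';

// The two calls from the question.
$string = preg_replace(['/<br\s*\/?>/i', '/<\/p\s*>/i'], ' ', $string);
$string = preg_replace(['/<\/?[^>]+>/', '/\r?\n|\r/'], '', $string);

// Collapse runs of whitespace; trim() (an addition) drops the trailing space.
$stripped = trim(preg_replace('/\s+/', ' ', $string));

echo $stripped; // Heading hyperlink paragraph1 paragraph2
```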

Answer 4

This is what I would do:

$a = '<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';


echo trim(preg_replace(['/<[^>]*>/','/\s+/'],' ', $a));

Output

Heading hyperlink paragraph1 paragraph2

Sandbox

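The first regex removes the tags, replacing each with a space; the second takes runs of whitespace and collapses each to a single space.

This works fine, but I can see ways in which it might deviate from the specific requirements:

What is the regex to strip all html tags and where there are <br> and <p> tags replace with a single space and remove all line breaks

So if you want the "full" solution, you can do this: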

$a = '<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p></p>
<p><big>p</big>aragraph1</p><p>paragraph2</p>';

echo preg_replace([
    '/<(?:br|p)[^>]*>/i', //replace br p with ' '
    '/<[^>]*>/',  //replace any tag with ''
    '/\s+/', //remove run on space
    '/^\s+|\s+$/' //trim
],[
    ' ', '', ' ', ''
], $a);

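Note that I added a <big> tag and removed the whitespace between the <p> tags. I did this to highlight a couple of things.

For example, if you take the text from the second example and run it through the first example, you get this (because of the big tag):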

Heading hyperlink p aragraph1 paragraph2

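The updated example outputs this correctly. But, and this is a big but, I changed the input text, so it may not be necessary to overcomplicate things.

The <p> tags are just there to show that a space is put between them before all the remaining HTML tags are replaced with ''.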

Sandbox

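Update

@ArtisticPhoenix how would I accommodate &nbsp;?

First, I would convert the string with html_entity_decode, but there are some gotchas with that, and they have to do with encoding. So this is the correct way to do it: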

$a = '<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p>&nbsp;</p>
<p><big>p</big>aragraph1</p><p>paragraph2</p>';

 //convert entities using UTF-8
$a = html_entity_decode($a, ENT_QUOTES, 'UTF-8');

echo preg_replace([
    '/<(?:br|p)[^>]*>/i', //replace br p with ' '
    '/<[^>]*>/',  //replace any tag with ''
    '/\s+/u', //remove run on space - replace using the unicode flag
    '/^\s+|\s+$/u' //trim - replace using the unicode flag
],[
    ' ', '', ' ', ''
], $a);

Note the u flag added to two of the regexes above: /\s+/u and /^\s+|\s+$/u.

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

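The problem comes from &nbsp; being decoded to character 160 (the non-breaking space, U+00A0) rather than character 32 (an ordinary space). Either way, we can sort it out with UTF-8 as shown above.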

Sandbox
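
To make the encoding point concrete, here is a small sketch (the sample string 'a&nbsp;&nbsp;b' is made up for illustration). It prints the bytes as hex so the non-breaking space is visible, then collapses it using the u flag. What \s matches without the u flag can depend on the PCRE build and locale, so that case is deliberately left out.

```php
<?php
// Decode the entity: &nbsp; becomes U+00A0, which UTF-8 encodes as the bytes C2 A0.
$decoded = html_entity_decode('a&nbsp;&nbsp;b', ENT_QUOTES, 'UTF-8');
echo bin2hex($decoded), "\n"; // 61c2a0c2a062

// With the u flag the subject is read as UTF-8 and \s also matches U+00A0,
// so the two non-breaking spaces collapse into one ordinary space (hex 20).
$collapsed = preg_replace('/\s+/u', ' ', $decoded);
echo bin2hex($collapsed), "\n"; // 612062
```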

Answer 5

The approach is to use two patterns:

P1 : <[\/\d\w]+.*?> This strips all tags.

P2 : [\n\s]+ Replace this with a single space.

Example:

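// PCRE patterns need delimiters. Without the slashes, PHP would treat < > and [ ] as the delimiters themselves.
$string = preg_replace('/<[\/\d\w]+.*?>/', '', $string); // P1: strip all tags
$string = preg_replace('/[\n\s]+/', ' ', $string);       // P2: collapse whitespace to a single space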

Answer 6

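Treating HTML as a string and running regular expressions over it is never a good idea. The only decent solution that doesn't involve a DOM parser is PHP's built-in strip_tags function (which uses a state machine, so it is still vulnerable to potential problems with broken HTML). You can then use a regex to collapse the resulting whitespace: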

<?php
$html = '<h1>Heading</h1>
<br>
<br />
<a href="#">hyperlink</a>
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';

echo preg_replace("/\s+/", " ", strip_tags($html));

Output:

Heading hyperlink paragraph1 paragraph2
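
One caveat: strip_tags inserts nothing where a tag used to be, so the words in the example above stay separate only because of the newlines between the tags. With adjacent tags such as <p>paragraph1</p><p>paragraph2</p> the text runs together. A possible workaround, sketched below, is to turn br and p tags into spaces first and let strip_tags handle the rest. The function name htmlToText and the pre-processing regex are assumptions for illustration, not part of this answer.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer).
function htmlToText(string $html): string
{
    // Assumption: only opening or closing <br> and <p> tags should become a space.
    // \b stops the pattern from matching longer tag names such as <pre>.
    $spaced = preg_replace('/<\/?(?:br|p)\b[^>]*>/i', ' ', $html);
    // strip_tags removes the remaining tags without inserting anything.
    $text = strip_tags($spaced);
    // Collapse runs of whitespace and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$adjacent = '<p><big>p</big>aragraph1</p><p>paragraph2</p>';

echo strip_tags($adjacent), "\n"; // paragraph1paragraph2
echo htmlToText($adjacent), "\n"; // paragraph1 paragraph2
```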


php

regex

preg-replace