How to prevent Simple HTML DOM plaintext from concatenating words together for sequential <div> elements

Question

I am parsing a web page that contains the following excerpt:

<div>foo</div><div>bar</div>

using the following code:

$html = file_get_html("http://example.com");
$pt = $html->plaintext;
echo $pt;

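$pt returns "foobar". What I want is "foo bar", i.e. a space added between words that sit in separate elements.

I see this behavior in other elements besides <div>, so the solution has to work for every element type that can contain visible text.

Is there a way to manipulate the $html object to add a space between elements, or to make plaintext add a space after every word it finds? I can deal with a double space in the resulting $pt.

I tried $html = str_replace ( "</" , " </" , $html );, but the result was empty, probably because I am trying to edit an object rather than a string, and the object then gets destroyed.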

Update

Based on some feedback, I tried the following:

$webString = file_get_contents("http://example.com");
$webString = str_replace ( "</" , " </" , $webString );  // add a space before all <tag> closures.

$html = new simple_html_dom();
$html->load($webString);

$pt = $html->plaintext;
echo $pt;

This gives the result I want, but I don't know whether there is a more efficient way.

Answer 1

When you use plaintext, the text of all elements is concatenated. The following should give you an array of the div elements instead.

$html = file_get_html("http://example.com");
$pt = $html->find('div');
print_r($pt);

Answer 2

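If you use file_get_contents to get a string rather than an HTML object, you can use preg_match_all to get all the div tags and then apply strip_tags to each match with array_walk, which leaves you with just the values.

Try this: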

$str = file_get_contents("some_file_with_your_html.php");
// Assume the above returns something like the below
$str = "<div>sdsd</div><div id='some_id_1' attribute>test</div><div><div>inside</div></div><div><h1>header</h1></div><p>sdscdsds</p><div>another</div>";

// Match every div tag, with or without attributes, up to its first closing tag
$tagsFound = preg_match_all("|<div[^>]*>(.*)</div>|U", $str, $matches);
if ($tagsFound) {
    // Strip the tags from each matched string in place
    array_walk($matches[0], function (&$value) {
        $value = strip_tags($value);
    });
}

This leaves you with an array of the text found in the HTML:

print ('<pre>');
print_r($matches[0]);
print ('</pre>');

Array
    (
        [0] => sdsd
        [1] => test
        [2] => inside
        [3] => header
        [4] => another
    )

Then, if needed, you can implode the resulting array to separate the words with spaces.
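
As a minimal sketch of that last step, assuming the stripped strings are already available (the literal array below is a stand-in that simply mirrors the printed output above, and the variable names are illustrative):

```php
<?php
// Stand-in for $matches[0] after array_walk() has stripped the tags.
$texts = ['sdsd', 'test', 'inside', 'header', 'another'];

// Join the pieces with a single space between them.
$joined = implode(' ', $texts);

echo $joined, PHP_EOL; // sdsd test inside header another
```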

References:

http://be2.php.net/manual/en/function.preg-match-all.php

http://be2.php.net/manual/en/function.array-walk.php

http://be2.php.net/manual/en/function.strip-tags.php

http://php.net/manual/en/pcre.pattern.php

Answer 3

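Because you can't be sure which elements will produce output in plaintext, you can read the whole page as a string and run str_replace to add a space before every closing tag (</htmltag>).

The other answers suggested here depend on knowing which elements contain readable text, and that cannot be known in advance.

This seems to produce the desired effect: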

$webString = file_get_contents("http://example.com");
$webString = str_replace ( "</" , " </" , $webString );  // add a space before all <tag> closures.

$html = new simple_html_dom();
$html->load($webString);

$pt = $html->plaintext;
echo $pt;
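
The replacement leaves doubled spaces wherever the markup already had whitespace in front of a closing tag. If that matters, a normalisation pass afterwards is one option. The sketch below is self-contained, so it uses PHP's built-in strip_tags as a stand-in for Simple HTML DOM's plaintext; the helper name spaced_text is hypothetical.

```php
<?php
// Hypothetical helper: pad closing tags, drop the markup, tidy the whitespace.
function spaced_text(string $markup): string
{
    // Same trick as above: put a space in front of every closing tag.
    $padded = str_replace('</', ' </', $markup);

    // strip_tags() stands in for $html->plaintext in this sketch.
    $text = strip_tags($padded);

    // Collapse runs of whitespace into one space and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo spaced_text('<div>foo</div><div>bar </div><p>baz</p>'), PHP_EOL; // foo bar baz
```

The same preg_replace and trim calls can be applied directly to $pt from the code above. Keep in mind that the blanket str_replace also touches any "</" that appears inside inline scripts or comments.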

Answer 4

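I ran into this problem too: I wanted the plain text together with the bold text, but hit the same issue. To get around it, do the following. First, find all the bold text and store it in an array. Next, grab the inner text of the element you want. Finally, strip the tags. (One extra step that applies only to my case is replacing all the text in the bold array with the text from the table here.)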

$elements = $html->find('p');
foreach ($elements as $key => $element) {
    $text = $element->innertext;
    $text = strip_tags($text);
    // one extra step for me only: I replace the bold texts
}

php

simple-html-dom