Preprocessing HTML data with PowerShell

Question

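I have some HTML source containing customer data that needs to be cleaned of HTML tags before deployment, with lines that get split using concatenated strings.

I would like to be able to target specific types of information. For example, suppose a customer has a list of categories on their page. Each 'category' sits inside an easily distinguishable tag: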

<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>

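Is it possible to remove everything else that is not nested inside a similar HTML tag?

Say, for example, I want to keep everything that occurs inside such a tag, so that every other tag and its contents get removed, while the contents of all matching tags stay, without the tags. Is that something I can do in PowerShell? Let's avoid paste.exe and Cygwin-type tools. I am looking for a standard native Windows approach (cmd or PowerShell).

Likewise, I want to remove all tags.

The only content I don't delete should be what is found inside specific tags, for example "Shopping": everything that fits that tag's profile.

Leave only the content. No tags.

From: <span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>

To: Home and Graden

I am really looking for an answer on how to do this in PowerShell without having to install anything or modify the OS (Windows 10).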

Answer 1

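Before asking on Stack Overflow, please try to research the problem first. Did you know that PowerShell has a -replace operator that lets you use RegEx? Did you check whether RegEx could help you solve your problem?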
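For illustration, here is a minimal sketch of the -replace operator stripping the tags from a single line. It assumes no attribute value contains a literal > character, and it only removes the tags themselves; it does not filter out the content of unwanted tags.

```powershell
$html = '<span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>'

# Replace anything that looks like a tag (<...>) with an empty string.
$text = $html -replace '<[^>]+>', ''
$text
```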

Anyway, here is one approach you can take.

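$html = '<span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>'
# -match returns $true on success and fills the automatic variable $Matches.
if ($html -match '(<span.*>)(?<Category>.+)(</span>)') {
    # The named group 'Category' holds the text between the span tags.
    $Matches.Category
}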

Home and Graden

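The -match operator tests a string against a RegEx. The RegEx (<span.*>)(?<Category>.+)(</span>) creates three groups, one of which is named Category. The category sits between the span tags. For your input, you have to make sure that all categories are inside span tags. If -match returns true, the automatic variable $Matches is populated. Since we named the second group Category, we can easily access it as a property: $Matches.Category.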
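Note that -match only reports the first match in a string. As a hedged sketch of how the same idea could be extended to a whole page, the example below uses [regex]::Matches to collect every category. The sample markup and the stricter pattern (restricted to class="category-name") are assumptions made for illustration, not part of the original answer.

```powershell
# Hypothetical sample input: category spans mixed with other markup.
$html = @'
<div class="header"><h1>Profile</h1></div>
<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>
<p>Some other text</p>
<span _ngcontent-jal-c67="" class="category-name">Shopping</span>
'@

# [^>]* stays inside one opening tag; [^<]+ captures text up to the next tag.
$pattern = '<span[^>]*class="category-name"[^>]*>(?<Category>[^<]+)</span>'

# Collect the named group from every match, dropping everything else.
$categories = [regex]::Matches($html, $pattern) |
    ForEach-Object { $_.Groups['Category'].Value }

$categories
```

Only the text inside the matching spans is kept; the heading and paragraph are discarded along with all tags.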

Alternatively, for more complex HTML files, you can parse the HTML with PowerShell; see "Powershell Tip : Parsing HTML from a local File or a String".

Answer 2

Rather than using fragile regular expressions for this, you might just use the [System.Net.WebUtility]::HtmlDecode method:

$Html = '<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>'
([Xml][System.Net.WebUtility]::HtmlDecode($Html)).GetElementsByTagName('span').'#text'

Result:

Cryptocurrency
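The [Xml] cast only works when the input is well-formed XML with a single root element, which real-world HTML often is not. Under that assumption, a sketch like the following could select only the category spans from a larger fragment; the sample markup and the XPath filter on class are illustrative assumptions.

```powershell
# Hypothetical, well-formed sample fragment with a single root element.
$Html = @'
<div>
  <p>Other text</p>
  <span class="category-name">Cryptocurrency</span>
  <span class="other">Ignore me</span>
  <span class="category-name">Shopping</span>
</div>
'@

$Xml = [Xml][System.Net.WebUtility]::HtmlDecode($Html)

# XPath: every span whose class attribute is exactly 'category-name'.
$names = $Xml.SelectNodes('//span[@class="category-name"]') |
    ForEach-Object { $_.InnerText }

$names
```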
