Regex is not removing websites from text data in preprocessing

Question

I am preprocessing text that contains website addresses. I want to remove them, but I have not been able to.

Here is a sample of the text:

\n\nWorldwide web (www)\n\nName for the entirety of documents linked through hyperlinks on the Internet; often used as a synonym for the latter26.\n\n\n\n\n\n\n\n24\xe2\x80\x83\twww.sicherheitskultur.at, Information Security Glossary\n\n25\xe2\x80\x83\tSource of text (partly): KS\xc3\x96: Cyber Risk Matrix - Glossary\n\n26\xe2\x80\x83\twww.sicherheitskultur.at, Information Security Glossary\n\n\n\n\n\n23\n'

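The website addresses are visible in the text (www.sicherheitskultur.at, shown in bold in the original post), and I want to remove them.

I have already tried some code (from the Stack Overflow answer "Python code to remove HTML tags from a string"), but it does not remove the websites.

The code is as follows: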

import re

def remove_web(text):
    cleanr = re.compile('<.*?.*#>')
    text = re.sub(cleanr, '', text)
    return text

Thanks in advance!

Answer 1

If you only want to remove this particular URL, you can use this regex:

www\.[a-z]+\.at

(Go with David Amar's solution.)
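As a minimal sketch of how this pattern could be applied with re.sub, the snippet below runs it on a simplified one-line excerpt of the sample (the excerpt and the variable names are assumptions made for illustration). It also runs the pattern from the question for contrast: that pattern needs a literal `<` to start a match, and the sample contains none, so it leaves the text unchanged.

```python
import re

# Simplified excerpt of the sample text from the question (assumption: one footnote line).
sample = "26\twww.sicherheitskultur.at, Information Security Glossary\n"

# Pattern from the question: every match must start with '<' and end with '#>'.
html_like = re.compile(r"<.*?.*#>")

# Pattern from this answer: "www.", lowercase letters, then ".at".
site = re.compile(r"www\.[a-z]+\.at")

unchanged = html_like.sub("", sample)  # no '<' in the text, so nothing is removed
cleaned = site.sub("", sample)         # the host name is deleted

print(unchanged == sample)
print(repr(cleaned))
```

Note that this pattern only covers lowercase hosts ending in .at, so it fits this one site rather than URLs in general.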

Answer 2

www(\.\w+)+

Explanation:
- First, the literal www
- Then at least one block of this form: a dot followed by some text (letters, digits, underscores)
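The pattern above could be dropped into the question's remove_web function roughly as follows. This is a sketch, not a tested preprocessing step; the constant name WWW_HOST and the test strings are assumptions. Because at least one dot-plus-text block is required, a bare "www" such as the one in "Worldwide web (www)" is left alone.

```python
import re

# The pattern from this answer, compiled once and reused.
WWW_HOST = re.compile(r"www(\.\w+)+")

def remove_web(text):
    """Delete every match of www(\\.\\w+)+ from text."""
    return WWW_HOST.sub("", text)

with_site = remove_web("see www.sicherheitskultur.at, Information Security Glossary")
bare_www = remove_web("Worldwide web (www)")

print(with_site)  # see , Information Security Glossary
print(bare_www)   # Worldwide web (www)
```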

To match more characters in the URL (hyphens, for example), replace \w with a character class such as [a-zA-Z0-9_-].
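To illustrate the difference, here is a small comparison of the two variants on a host name containing a hyphen. The host www.my-site.example.org is made up for this example, and the variable names are assumptions. With \w the match stops at the hyphen and leaves a fragment behind, while the wider character class removes the whole host.

```python
import re

# Original pattern: \w covers letters, digits and underscore, but not '-'.
plain = re.compile(r"www(\.\w+)+")
# Variant with the wider character class; '-' is last so it is taken literally.
with_hyphen = re.compile(r"www(\.[a-zA-Z0-9_-]+)+")

text = "docs at www.my-site.example.org today"  # hypothetical host name

partial = plain.sub("", text)       # match ends at "www.my"
full = with_hyphen.sub("", text)    # whole host name is removed

print(partial)  # docs at -site.example.org today
print(full)     # docs at  today
```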

regex

text

python-3.7