Need to split #tags to text
I need to split #tags into meaningful words in an automated manner.
Sample input:
- iloveusa
- mycrushlike
- mydadhero
Sample output:
- i love usa
- my crush like
- my dad hero
Is there any utility or open API I can use to achieve this?
Check out the Word Segmentation Task from Norvig's work.
from collections import Counter
import nltk

# Requires the Brown corpus: nltk.download('brown')
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x] / N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L) + 1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    candidates = ([first] + segment(rest)
                  for (first, rest) in splits(text, 1))
    return max(candidates, key=Pwords)

print(segment('iloveusa'))     # ['i', 'love', 'us', 'a']
print(segment('mycrushlike'))  # ['my', 'crush', 'like']
print(segment('mydadhero'))    # ['my', 'dad', 'hero']
For a better solution than this, you can use a bigram/trigram model.
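As a rough illustration of the bigram idea, here is a minimal sketch that scores each word conditioned on the previous one, backing off to the unigram model for unseen pairs. It reuses splits from the code above; the helper names cPw, Pwords2, segment2 and the '<S>' start token are my own illustrative choices, not taken from the linked notebook, and the unsmoothed backoff is deliberately crude.

from collections import Counter
from functools import lru_cache
import nltk

WORDS = list(nltk.corpus.brown.words())
COUNTS1 = Counter(WORDS)
COUNTS2 = Counter(zip(WORDS, WORDS[1:]))  # counts of adjacent word pairs
N = sum(COUNTS1.values())

def Pw(word):
    "Unigram probability of word."
    return COUNTS1[word] / N

def cPw(word, prev):
    "P(word | prev): bigram estimate, backing off to the unigram model."
    if COUNTS2[(prev, word)] > 0:
        return COUNTS2[(prev, word)] / COUNTS1[prev]
    return Pw(word)

def Pwords2(words, prev='<S>'):
    "Probability of a word sequence under the bigram model."
    p = 1
    for w in words:
        p *= cPw(w, prev)
        prev = w
    return p

@lru_cache(maxsize=None)
def segment2(text, prev='<S>'):
    "Most probable segmentation of text, given the previous word."
    if not text:
        return []
    candidates = ([first] + segment2(rest, first)
                  for (first, rest) in splits(text, 1))
    return max(candidates, key=lambda words: Pwords2(words, prev))

print(segment2('iloveusa'))

The lru_cache memoization matters here: without it the recursion revisits the same suffixes exponentially often, while with it each (suffix, previous word) pair is solved once.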
More examples are at: Word Segmentation Task
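If you would rather use an off-the-shelf utility, the wordsegment package on PyPI (pip install wordsegment) implements this same Norvig-style approach with English unigram and bigram data bundled in. A minimal usage sketch, assuming its current load()/segment() API:

from wordsegment import load, segment

load()  # load the bundled unigram/bigram counts once per process
print(segment('iloveusa'))  # e.g. ['i', 'love', 'usa']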