Checking the language of trends extracted from Twitter
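I am extracting trending hashtags from Twitter using the tweepy module in Python. My main problem is that I want to check whether each tag is English. Non-English tags should be removed.
Example:
tags = ['AskOrange', 'CharlestonShooting', 'ReplyToASong', 'UberLIVE', 'Otecmatkasyn']
Here Otecmatkasyn should not be kept.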
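What you need is a language detection API. Google's is a good one, but it is not free. Another good option is the one offered by Language Detection API.
Once you have chosen the API that suits you best, you need to prepare the text so that it reads as a sentence. For example, the tag 'AskOrange' must be split into 'Ask Orange'. You can iterate over each character of the string, check whether it is uppercase, and insert a space before it: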
new_tags = []
for tag in tags:
    new_word = tag
    uppercases = 0  # spaces inserted so far; shifts the later insert positions
    for i in range(1, len(tag)):  # range replaces Python 2's xrange
        if tag[i].isupper():
            # Insert a space before each uppercase letter after the first character
            new_word = new_word[:i + uppercases] + ' ' + new_word[i + uppercases:]
            uppercases += 1
    new_tags.append(new_word)
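The loop above puts a space before every uppercase letter, so a tag such as 'UberLIVE' comes out as 'Uber L I V E'. As a variation, here is a minimal sketch that uses a regular expression and keeps runs of capitals together. The function name split_camel_case and the pattern are illustrative assumptions, not part of the original answer.

```python
import re

# Break between a lowercase letter and an uppercase one, or between two
# capitals when the second one starts a capitalized word (e.g. "A|Song").
_BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def split_camel_case(tag):
    """Insert spaces at the word boundaries of a CamelCase hashtag."""
    return _BOUNDARY.sub(' ', tag)

tags = ['AskOrange', 'CharlestonShooting', 'ReplyToASong', 'UberLIVE', 'Otecmatkasyn']
spaced = [split_camel_case(t) for t in tags]
print(spaced)
# ['Ask Orange', 'Charleston Shooting', 'Reply To A Song', 'Uber LIVE', 'Otecmatkasyn']
```

Tags written entirely in lowercase are left unchanged by either approach.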
Finally, send your new_tags list to the API to detect the language.
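The filtering step itself might look like the sketch below. It assumes the chosen service can be wrapped in a function that takes a phrase and returns a language code such as 'en'. The names keep_english and fake_detect are hypothetical, and fake_detect is only a stand-in so that the example runs without a network call.

```python
def keep_english(tags, phrases, detect):
    """Return the original tags whose spaced-out phrase is detected as 'en'.

    detect is any callable mapping a phrase to a language code; in real use
    it would wrap a request to the language detection API you chose.
    """
    return [tag for tag, phrase in zip(tags, phrases) if detect(phrase) == 'en']

def fake_detect(phrase):
    """Stand-in detector for demonstration only (NOT a real API call)."""
    known_english = {'ask', 'orange', 'reply', 'to', 'a', 'song'}
    words = phrase.lower().split()
    return 'en' if any(w in known_english for w in words) else 'und'

tags = ['AskOrange', 'ReplyToASong', 'Otecmatkasyn']
new_tags = ['Ask Orange', 'Reply To A Song', 'Otecmatkasyn']
english_tags = keep_english(tags, new_tags, fake_detect)
print(english_tags)  # ['AskOrange', 'ReplyToASong']
```

Passing the detector in as an argument keeps the filtering logic independent of whichever service you end up using.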
import re
import urllib.request

def find_words(each_func):
    """Split a string into words by repeatedly taking the longest dictionary prefix."""
    wordsineach_func = []
    while len(each_func) > 0:
        word_found = longest_word(each_func)
        wordsineach_func.append(word_found)
        # Remove only the matched prefix; str.replace would also delete later repeats
        each_func = each_func[len(word_found):]
    return wordsineach_func

def longest_word(phrase):
    """Return the longest prefix of phrase found in words, or the whole phrase if none is."""
    words_found = []
    outerstring = ""
    for char in phrase:
        outerstring = outerstring + char
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) == 0:
        words_found.append(phrase)
    return max(words_found, key=len)

# Load the dictionary, one word per line
data = urllib.request.urlopen('https://s3.amazonaws.com/hr-testcases/479/assets/words.txt')
words = set()
for line in data:
    words.add(line.decode('utf-8').strip())

string = "#honesthournow20"
string = string.replace("#", "")
new_words = re.split(r'(\d+)', string)  # keep runs of digits as separate pieces
output = []
for each in new_words:
    each_words = find_words(each)
    for each_word in each_words:
        output.append(each_word)
print(output)
Then check the language.
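One possible way to do that check without an external service is to measure how many of the segmented words appear in the same dictionary. This is only a sketch under assumptions: the names english_ratio and looks_english and the 0.8 threshold are illustrative, and the tiny vocabulary set stands in for the full words list.

```python
def english_ratio(found_words, vocabulary):
    """Fraction of alphabetic segments that appear in the vocabulary."""
    alpha = [w.lower() for w in found_words if w.isalpha()]
    if not alpha:
        return 0.0
    hits = sum(1 for w in alpha if w in vocabulary)
    return hits / len(alpha)

def looks_english(found_words, vocabulary, threshold=0.8):
    """Treat a tag as English when enough of its segments are dictionary words."""
    return english_ratio(found_words, vocabulary) >= threshold

vocabulary = {'honest', 'hour', 'now'}  # stand-in for the full words list
print(looks_english(['honest', 'hour', 'now', '20'], vocabulary))  # True
print(looks_english(['Otecmatkasyn'], vocabulary))                 # False
```

Digit segments are ignored here so that a tag like 'honesthournow20' is judged on its words alone.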