Segregating filtered tweets based on matched keywords : Twitter4j API
I have created a Twitter stream that is filtered by the following keywords.
TwitterStream twitterStream = getTwitterStreamInstance();
FilterQuery filtre = new FilterQuery();
String[] keywordsArray = { "iphone", "samsung" , "apple", "amazon"};
filtre.track(keywordsArray);
twitterStream.filter(filtre);
twitterStream.addListener(listener);
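What is the best way to segregate tweets based on the keywords they matched? For example, all tweets that match "iphone" should be stored in the "IPHONE" table, all tweets that match "samsung" should be stored in the "SAMSUNG" table, and so on. Note: there are about 500 filter keywords.
Well, you could create a class similar to an ArrayList, but set up so that you can create an array of them; call it TweetList. The class will need an insert function.
Then use two for loops to search through the tweets and find matching keywords, which are held in a normal ArrayList, and add each tweet to the TweetList whose index matches the index of the keyword in the keyword ArrayList: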
for (int i = 0; i < tweets.length; i++)
{
    String[] split = tweets[i].split(" "); // split the tweet into words
    for (int j = 0; j < split.length; j++)
        if (keywords.contains(split[j])) // check each word against the keyword list
            list[keywords.indexOf(split[j])].insert(tweets[i]); // add the tweet to the TweetList at the keyword's index
}
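The two-loop idea above can also be written with a Map from keyword to list, which avoids keeping two parallel index structures. This is only a minimal sketch: the class name TweetGrouper and the method group are made up for illustration, keywords are assumed to be stored in lower case, and splitting on non-alphanumeric characters is an assumption about what counts as a word, not a description of how Twitter itself matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Twitter4j.
final class TweetGrouper {
    private TweetGrouper() {}

    // Groups tweets by every keyword that occurs in them as a whole word.
    // Assumes the keywords are already lower case.
    static Map<String, List<String>> group(List<String> tweets, Set<String> keywords) {
        Map<String, List<String>> byKeyword = new HashMap<>();
        for (String tweet : tweets) {
            // Split on anything that is not a letter or digit, so "iPhone!" yields "iphone".
            String[] words = tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            Set<String> seen = new HashSet<>(); // store a tweet once per keyword
            for (String word : words) {
                if (keywords.contains(word) && seen.add(word)) {
                    byKeyword.computeIfAbsent(word, k -> new ArrayList<>()).add(tweet);
                }
            }
        }
        return byKeyword;
    }
}
```

Because the keyword set is a hash set, each word costs one lookup, so the work per tweet depends on the length of the tweet rather than on the 500 keywords. Multi-word keywords would need a different approach.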
Here is how you could use a StatusListener to interrogate the received Status objects:
final Set<String> keywords = new HashSet<String>();
keywords.add("apple");
keywords.add("samsung");
// ...
final StatusListener listener = new StatusAdapter() {
    @Override
    public void onStatus(Status status) {
        final String statusText = status.getText();
        for (String keyword : keywords) {
            // note: contains() is case-sensitive
            if (statusText.contains(keyword)) {
                dao.insert(keyword, status); // pass the Status itself, as StatusDao expects
            }
        }
    }
};
final TwitterStream twitterStream = getTwitterStreamInstance();
final FilterQuery fq = new FilterQuery();
fq.track(keywords.toArray(new String[0]));
twitterStream.addListener(listener);
twitterStream.filter(fq);
I would see the DAO being defined along these lines:
public interface StatusDao {
    void insert(String tableSuffix, Status status);
}
You would then have one database table per keyword. The implementation would use the tableSuffix to store the Status in the correct table; the SQL would look roughly like this:
INSERT INTO status_$tableSuffix$ VALUES (...)
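Since a table name cannot be passed as a bind parameter, the suffix ends up concatenated into the SQL string, so it is worth validating it first. The following is a rough sketch under assumptions: the class TableNames, its methods, and the column names tweet_text and created_at are all invented for illustration, and the allowed character set is deliberately conservative.

```java
import java.util.Locale;

// Hypothetical helper for building per-keyword table names.
final class TableNames {
    private TableNames() {}

    // Builds a table name such as "status_iphone" from a keyword.
    // Rejects anything other than ASCII letters, digits, and underscores.
    static String forKeyword(String keyword) {
        String suffix = keyword.toLowerCase(Locale.ROOT);
        if (!suffix.matches("[a-z0-9_]+")) {
            throw new IllegalArgumentException("Unsafe table suffix: " + keyword);
        }
        return "status_" + suffix;
    }

    // Only the validated table name is concatenated; values stay as bind parameters.
    // The column names are assumptions, not taken from the thread.
    static String insertSql(String keyword) {
        return "INSERT INTO " + forKeyword(keyword) + " (tweet_text, created_at) VALUES (?, ?)";
    }
}
```

The resulting string could then be handed to a PreparedStatement. Keywords that contain spaces or punctuation would need an explicit keyword-to-table mapping instead.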
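Notes:
This implementation will insert a Status into multiple tables if the tweet contains both 'apple' and 'samsung', for example.
Additionally, this is quite a naive implementation; you might want to consider batching the inserts into the tables, but that depends on the volume of tweets you will be receiving.
As noted in the comments, the API also considers other attributes when matching, such as URLs and an embedded tweet (if present), so searching only the status text for a keyword match may not be sufficient.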
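The batching mentioned in the notes could look roughly like the buffer below. It is a sketch only: BatchingBuffer and its writer callback are hypothetical names, and the callback stands in for whatever actually writes to the database (for instance a JDBC batch insert), which is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical buffer that collects tweets per keyword and writes them in batches.
final class BatchingBuffer {
    private final int batchSize;
    private final BiConsumer<String, List<String>> writer; // assumed sink: (keyword, tweets)
    private final Map<String, List<String>> pending = new HashMap<>();

    BatchingBuffer(int batchSize, BiConsumer<String, List<String>> writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.writer = writer;
    }

    // Synchronized in case the listener is invoked from more than one thread.
    synchronized void add(String keyword, String tweetText) {
        List<String> batch = pending.computeIfAbsent(keyword, k -> new ArrayList<>());
        batch.add(tweetText);
        if (batch.size() >= batchSize) {
            pending.remove(keyword);       // hand the full batch over and start a new one
            writer.accept(keyword, batch);
        }
    }

    // Writes whatever is still buffered, e.g. before shutting the stream down.
    synchronized void flushAll() {
        for (Map.Entry<String, List<String>> entry : pending.entrySet()) {
            writer.accept(entry.getKey(), entry.getValue());
        }
        pending.clear();
    }
}
```

In onStatus you would call add(keyword, status.getText()) instead of inserting directly. A time-based flush is probably also needed for rarely matched keywords, otherwise their tweets could sit in the buffer for a long time.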
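It seems that the only way to find out which keyword a tweet belongs to is to iterate over several properties of the Status object. The following code requires a database service with a method insertTweet(String tweetText, Date createdAt, String keyword). If more than one keyword is found, the tweet is stored in the database once per keyword. If at least one keyword is found in the tweet text, the additional properties are not searched for further keywords.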
// creates a map of the keywords with a compiled pattern, which matches the keyword
private Map<String, Pattern> keywordsMap = new HashMap<>();
private TwitterStream twitterStream;
private DatabaseService databaseService; // implement and add this service

public void start(List<String> keywords) {
    stop(); // stop the streaming first, if it is already running
    if (keywords.size() > 0) {
        for (String keyword : keywords) {
            // quote the keyword so that regex metacharacters in it are matched literally
            keywordsMap.put(keyword, Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE));
        }
        twitterStream = new TwitterStreamFactory().getInstance();
        StatusListener listener = new StatusListener() {
            @Override
            public void onStatus(Status status) {
                insertTweetWithKeywordIntoDatabase(status);
            }
            /* add the unimplemented methods from the interface */
        };
        twitterStream.addListener(listener);
        FilterQuery filterQuery = new FilterQuery();
        filterQuery.track(keywordsMap.keySet().toArray(new String[keywordsMap.keySet().size()]));
        filterQuery.language(new String[]{"en"});
        twitterStream.filter(filterQuery);
    }
    else {
        System.err.println("Could not start querying because there are no keywords.");
    }
}

public void stop() {
    keywordsMap.clear();
    if (twitterStream != null) {
        twitterStream.shutdown();
    }
}

private void insertTweetWithKeywordIntoDatabase(Status status) {
    // search for keywords in tweet text
    List<String> keywords = getKeywordsFromTweet(status.getText());
    if (keywords.isEmpty()) {
        // nothing in the text itself: collect the other attributes the API may have matched on
        StringBuilder additionalDataFromTweets = new StringBuilder();
        // get extended urls
        if (status.getURLEntities() != null) {
            for (URLEntity url : status.getURLEntities()) {
                if (url != null && url.getExpandedURL() != null) {
                    additionalDataFromTweets.append(url.getExpandedURL());
                }
            }
        }
        // get retweeted status -> text
        if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getText() != null) {
            additionalDataFromTweets.append(status.getRetweetedStatus().getText());
        }
        // get retweeted status -> quoted status -> text
        if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getQuotedStatus() != null
                && status.getRetweetedStatus().getQuotedStatus().getText() != null) {
            additionalDataFromTweets.append(status.getRetweetedStatus().getQuotedStatus().getText());
        }
        // get retweeted status -> quoted status -> extended urls
        if (status.getRetweetedStatus() != null && status.getRetweetedStatus().getQuotedStatus() != null
                && status.getRetweetedStatus().getQuotedStatus().getURLEntities() != null) {
            for (URLEntity url : status.getRetweetedStatus().getQuotedStatus().getURLEntities()) {
                if (url != null && url.getExpandedURL() != null) {
                    additionalDataFromTweets.append(url.getExpandedURL());
                }
            }
        }
        // get quoted status -> text
        if (status.getQuotedStatus() != null && status.getQuotedStatus().getText() != null) {
            additionalDataFromTweets.append(status.getQuotedStatus().getText());
        }
        // get quoted status -> extended urls
        if (status.getQuotedStatus() != null && status.getQuotedStatus().getURLEntities() != null) {
            for (URLEntity url : status.getQuotedStatus().getURLEntities()) {
                if (url != null && url.getExpandedURL() != null) {
                    additionalDataFromTweets.append(url.getExpandedURL());
                }
            }
        }
        String additionalData = additionalDataFromTweets.toString();
        keywords = getKeywordsFromTweet(additionalData);
    }
    if (keywords.isEmpty()) {
        System.err.println("ERROR: No Keyword found for: " + status.toString());
    } else {
        // insert into database, once per matched keyword
        for (String keyword : keywords) {
            databaseService.insertTweet(status.getText(), status.getCreatedAt(), keyword);
        }
    }
}

// returns a list of keywords which are found in a tweet
private List<String> getKeywordsFromTweet(String tweet) {
    List<String> result = new ArrayList<>();
    for (String keyword : keywordsMap.keySet()) {
        Pattern p = keywordsMap.get(keyword);
        if (p.matcher(tweet).find()) {
            result.add(keyword);
        }
    }
    return result;
}
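With around 500 keywords, running one pattern per keyword over every tweet may become noticeable. One possible alternative, sketched here under assumptions, is to compile all keywords into a single alternation and scan the text once. KeywordMatcher is a hypothetical class, the keywords are assumed to be lower case, and the lookarounds that stop "apple" from matching inside "pineapple" are my own choice rather than a description of Twitter's matching rules.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical matcher that finds all keywords in one scan of the text.
final class KeywordMatcher {
    private final Pattern pattern;

    // Assumes a non-empty list of lower-case keywords.
    KeywordMatcher(List<String> keywords) {
        if (keywords.isEmpty()) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        // Quote each keyword so metacharacters (as in "c++") are matched literally.
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Lookarounds require that the match is not surrounded by letters or digits.
        this.pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns the distinct keywords found in the text, lower-cased, in order of appearance.
    Set<String> match(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }
}
```

One limitation to be aware of: matches do not overlap, so if one keyword is contained in another (say "apple" and "apple watch"), only one of them is reported at a given position.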