Unable to read a CSV file produced by FileWriter and find duplicates in Java
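I am trying to read the first CSV file, which was created by a FileWriter.
The first CSV file contains the rows whose name entity (column [1]) occurs more than 10 times.
After reading the first CSV file, I try to check column [5] (the tweet tokens) for duplicates and write the result to a second CSV file.
I tried using the .contains method, but it does not detect the duplicates.
Update: I have successfully read the file, but I am not able to remove duplicates in eventDetectionToken().
Here is the code: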
import java.io.*;
import java.util.*;

public class EventDetectioncopy {

    public static void main(String[] args) throws FileNotFoundException, IOException {
        //1st csv file
        System.out.print("Enter a name for new Tweet Cluster sorting by name entity: ");
        BufferedReader scanName = new BufferedReader(new InputStreamReader(System.in));
        String newNamefile = scanName.readLine();
        //2nd csv file
        System.out.print("Enter a name for new Tweet Cluster sorting by tweet tokens: ");
        BufferedReader scanToken = new BufferedReader(new InputStreamReader(System.in));
        String newTokenfile = scanToken.readLine();
        try {
            eventDetectionName(newNamefile);
            eventDetectionToken(newNamefile, newTokenfile);
        }
        catch (FileNotFoundException e) {
            System.out.println(e);
        }
        catch (IOException e) {
        }
    }

    public static void eventDetectionToken(String fileInput, String fileOutput) throws FileNotFoundException, IOException {
        FileWriter newCsv = new FileWriter(fileOutput + "." + "csv");
        BufferedWriter newCsvBW = new BufferedWriter(newCsv);
        BufferedReader reader = new BufferedReader(new FileReader(fileInput + ".csv"));
        String data;
        try {
            String temp = null;
            List<String> tempList = new ArrayList<String>();
            do
            {
                data = reader.readLine();
                String tweetToken = null;
                if (data != null)
                {
                    String[] splitText = data.split(",");
                    tweetToken = splitText[5];
                }
                if (temp != null)
                {
                    if (data == null || tweetToken.contains(tweetToken))
                    {
                        if (!(temp.equals(tweetToken)))
                        {
                            for (int i = 0; i < tempList.size(); i++)
                            {
                                newCsvBW.append(tempList.get(i));
                                newCsvBW.append("\n");
                                System.out.println(tempList.get(i));
                            }
                        }
                        tempList.clear();
                        temp = tweetToken;
                    }
                }
                else
                {
                    temp = tweetToken;
                }
                tempList.add(data);
            }
            while (data != null);
        }
        finally
        {
            newCsvBW.close();
            reader.close();
        }
    }

    public static void eventDetectionName(String filename) throws FileNotFoundException, IOException {
        String csv = "1day/clusters.sortedby.clusterid.csv";
        FileWriter newCsv = new FileWriter(filename + "." + "csv");
        BufferedWriter newCsvBW = new BufferedWriter(newCsv);
        BufferedReader reader = new BufferedReader(new FileReader(csv));
        String data;
        try {
            String temp = null;
            List<String> tempList = new ArrayList<String>();
            List<Long> tempTime = new ArrayList<Long>();
            do
            {
                data = reader.readLine();
                String nameEntity = null;
                if (data != null)
                {
                    String[] splitText = data.split(",");
                    nameEntity = splitText[1];
                }
                if (temp != null)
                {
                    if (data == null || !(nameEntity.equals(temp)))
                    {
                        if (tempList.size() >= 10)
                        {
                            for (int i = 0; i < tempList.size(); i++)
                            {
                                newCsvBW.append(tempList.get(i));
                                newCsvBW.append("\n");
                                System.out.println(tempList.get(i));
                            }
                        }
                        tempList.clear();
                        temp = nameEntity;
                    }
                }
                else
                {
                    temp = nameEntity;
                }
                tempList.add(data);
            }
            while (data != null);
        }
        finally
        {
            reader.close();
            newCsvBW.close();
        }
    }
}
Below is part of the original CSV file, "clusters.sortedby.clusterid.csv", before running EventDetectioncopy.java, with duplicate tweet tokens (column [5]):
[clusterid], [name entity], [tweetid], [timestamp], [userid], [tweet token], [tweet text]
1 rick ross 2.5582E+17 1.34983E+12 389746870 rick ross dice pineappl Rick Ross x diced pineapples
1 rick ross 2.5582E+17 1.34983E+12 56082039 dice pineappl uhhh rick ross voic Diced Pineapples. UHHH *Rick Ross voice*
1 rick ross 2.55821E+17 1.34983E+12 870278689 rick ross trend Why is Rick Ross trending?
1 rick ross 2.55822E+17 1.34983E+12 379948188 lmfao rick ross grunt Lmfao he did that rick ross grunt .
1 rick ross 2.55822E+17 1.34983E+12 276594374 play rick ross they played w| rick ross !
1 rick ross 2.55822E+17 1.34983E+12 386219877 rick ross ugli Rick Ross So Ugly ..
1 rick ross 2.55822E+17 1.34983E+12 53327754 wanna play rick ross belli I Wanna Play in Rick Ross Belly..!
1 rick ross 2.55824E+17 1.34983E+12 19690034 rick ross dice pineappl ft wale amp drake video via laleak Rick Ross - Diced Pineapples ft. Wale & Drake (Video) via @laleakers
1 rick ross 2.55825E+17 1.34983E+12 357250991 husband rick ross where my husband rick ross î„…î‰
1 rick ross 2.55825E+17 1.34983E+12 53734179 throw rick ross kirko bangz *Throws Rick ross At Kirko Bangz*
1 rick ross 2.55825E+17 1.34983E+12 462179553 rick ross stay fresh Rick Ross Stay Fresh!!!!
1 rick ross 2.55827E+17 1.34983E+12 46744853 offici music video dice pineappl rick ross drake wale Official Music Video " Diced Pineapples" Rick Ross / Drake / Wale
1 rick ross 2.55829E+17 1.34983E+12 461725574 saw rick ross uhhh ifxckgaygirl dadd i saw rick ross their .. uhhh @ifxckgaygirls dadd :p
1 rick ross 2.55832E+17 1.34983E+12 283244204 rick ross wavi fat guy Rick Ross is a wavy fat guy
1 rick ross 2.55832E+17 1.34983E+12 528834435 rick ross dice pineappl Rick Ross - Diced Pineapples
1 rick ross 2.55835E+17 1.34983E+12 463279022 rick ross featur wale amp drake dice pineappl ricki ross experi downtim less 24 hour Rick Ross featuring Wale & Drake – Diced Pineapples: Ricky Ross experiences no downtime as less than 24 hours ...
1 rick ross 2.55835E+17 1.34983E+12 28460245 yuck lalasodiddi need husband rick ross take award home hiphiopaward YUCK! RT @LalaSoDiddy: I need my husband Rick Ross to take some awards home #HipHiopAwards
1 rick ross 2.55836E+17 1.34983E+12 330811468 kingkennzi rick ross round “@KingKennzie: Rick Ross is very round.†ðŸ
1 rick ross 2.55836E+17 1.34983E+12 124024753 rick ross titti Rick Ross Titties!
1 rick ross 2.55836E+17 1.34983E+12 765822380 rick ross titti tho Rick Ross and them titties tho!!!
2 tyler oakley 2.55821E+17 1.34983E+12 867420925 know someth trend new asktyl tyleroakley live HOW DO YOU KNOW WHEN SOMETHING IS TRENDING? IM NEW TO THIS... #aSKTYLER
2 tyler oakley 2.55822E+17 1.34983E+12 504044044 asktyl get perfect quiff tyleroakley live #AskTyler How do you get a perfect quiff :)?
2 tyler oakley 2.55822E+17 1.34983E+12 709347721 asktyl realli homework right now tyleroakley live #asktyler i really should be doing homework right now
2 tyler oakley 2.55822E+17 1.34983E+12 171667747 obsess right now asktyl tyleroakley live what is your obsession right now? #asktyler
3 wiz khalifa 2.5582E+17 1.34983E+12 588829718 dont like wiz khalifa look sexi I don't like Wiz Khalifa but he looks sexy.
3 wiz khalifa 2.55856E+17 1.34984E+12 502086440 feel like wiz khalifa right now I feel like wiz Khalifa right now..
3 wiz khalifa 2.55866E+17 1.34984E+12 446056049 like wiz khalifa hes ador realli look like hot cheeto man thingi I like Wiz Khalifa he's adorable, but he really do look like the hot cheeto man thingy
3 wiz khalifa 2.55883E+17 1.34984E+12 67747115 np ne yo ft wiz khalifa dont make em like #Np Ne-Yo ft. Wiz Khalifa - They don't make em like you
Update: How can I remove the duplicates from it?
Why can't FileReader read newNamefile?
That is because the variable newNamefile in
BufferedReader reader = new BufferedReader(new FileReader(newNamefile));
does not exist within the scope of EventDetectioncopy#eventDetectionToken.
Suggested solution
Change the variable to match the method's parameter:
BufferedReader reader = new BufferedReader(new FileReader(filename));
// csvFilePath1 (input) and csvFilePath (output) are assumed to be defined elsewhere.
String csvFile = csvFilePath1;
BufferedReader br = null;
BufferedReader br1 = null;
String line = "";
String csv = csvFilePath;
FileWriter fileWriter = null;
try {
    fileWriter = new FileWriter(csv);
} catch (IOException e) {
    e.printStackTrace();
}
// A HashSet remembers every distinct line seen so far.
HashSet<String> lines = new HashSet<>();
try {
    br = new BufferedReader(new FileReader(csvFile));
    br1 = new BufferedReader(new FileReader(csvFilePath1));
    // Copy the first headerRow + 1 lines to the output unchanged.
    int headerRow = 10;
    for (int i = 0; i <= headerRow; i++) {
        fileWriter.append(br1.readLine() + "\n");
    }
    br1.close();
    while ((line = br.readLine()) != null) {
        // add() returns false for a line that is already in the set,
        // so a repeated line is never written a second time.
        if (lines.add(line) && lines.size() >= 5) {
            fileWriter.append(line);
            fileWriter.append("\n");
        }
    }
    fileWriter.flush();
    fileWriter.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (br != null) {
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
EDITED: This removes all duplicates and keeps only one occurrence of each.
public static void eventDetectionToken(String fileInput, String fileOuput)
        throws FileNotFoundException, IOException {
    FileWriter newCsv = new FileWriter(fileOuput + "." + "csv");
    BufferedWriter newCsvBW = new BufferedWriter(newCsv);
    BufferedReader reader = new BufferedReader(new FileReader(fileInput + ".csv"));
    String data;
    try {
        List<String> existanceTokens = new ArrayList<String>();
        do {
            data = reader.readLine();
            String tweetToken = null;
            if (data != null) {
                String[] splitText = data.split(",");
                tweetToken = splitText[5];
                if (!(existanceTokens.contains(tweetToken))) {
                    newCsvBW.append(data);
                    newCsvBW.append("\n");
                    existanceTokens.add(tweetToken);
                }
            }
        } while (data != null);
    } finally {
        newCsvBW.close();
        reader.close();
    }
}
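As a side note on why the original check never fired: tweetToken.contains(tweetToken) asks whether a string contains itself, which is always true, so it cannot tell a repeated token from a new one. The idea of remembering the tokens already seen can also be sketched with a HashSet, whose add method returns false for a value that is already present. The class name TokenDedup, the method dedupeByColumn, and the sample rows below are made up for illustration, and the sketch assumes plain comma-separated rows without quoted commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenDedup {

    // Keeps the first row for each distinct value found in the given column.
    // Rows with too few columns are skipped; that is an assumption, since the
    // thread does not say how such rows should be treated.
    static List<String> dedupeByColumn(List<String> rows, int column) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",", -1);
            if (fields.length <= column) {
                continue;
            }
            // Set.add returns false when the value was already in the set.
            if (seen.add(fields[column])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows in the seven-column layout described in the question.
        List<String> rows = Arrays.asList(
            "1,rick ross,id1,ts1,user1,rick ross dice pineappl,text A",
            "1,rick ross,id2,ts2,user2,rick ross trend,text B",
            "1,rick ross,id3,ts3,user3,rick ross dice pineappl,text C");
        for (String row : dedupeByColumn(rows, 5)) {
            System.out.println(row);
        }
    }
}
```

Compared with ArrayList.contains, the set lookup does not scan every earlier token, which may matter for large files. A real CSV parser would be needed if the tweet text can contain commas.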
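However, if you want to first create a CSV file containing the [name entity] duplicates and then, based on that file, create a second file containing the [tweet token] duplicates, you need to change inputCSV to newNamefile in the second eventDetection call, as follows: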
eventDetection(inputCSV, newNamefile, 1);
eventDetection(newNamefile, newTokenfile, 5);
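The three-argument eventDetection is not shown in this thread. One possible shape for such a helper is sketched below under the hypothetical name filterByColumn, assuming it keeps the first row for each distinct value in the given column; the real method may well filter differently. The point of the sketch is the chaining: try-with-resources closes the writer of the first pass before the second pass opens the same file, so the second pass sees the complete output rather than a partially flushed buffer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

class ColumnFilter {

    // Hypothetical helper: copies input to output, keeping only the first row
    // for each distinct value in the given column. Returns the rows written.
    static int filterByColumn(Path input, Path output, int column) throws IOException {
        Set<String> seen = new HashSet<>();
        int written = 0;
        // Both streams are closed (and the writer flushed) when this block ends.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1);
                if (fields.length > column && seen.add(fields[column])) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.out.println("usage: ColumnFilter <input.csv> <first.csv> <second.csv>");
            return;
        }
        Path source = Paths.get(args[0]);
        Path first = Paths.get(args[1]);
        Path second = Paths.get(args[2]);
        // The second pass reads the file that the first pass has finished writing.
        filterByColumn(source, first, 1);
        filterByColumn(first, second, 5);
    }
}
```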
Hope this helps.