Read and store a text file with string and float data in a hash map
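I have a text file in which each line starts with a word followed by 50 floats that represent the word's vector description (embedding). I am trying to read the file and store each word and its embedding in a hash table. The problem is that I get a NumberFormatException, or sometimes an ArrayIndexOutOfBoundsException. How can I read each word and store it with its embedding in a hash map?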
sNode class:
public class sNode { // Node class for hash map
    public String word;
    public float[] embedding;
    public sNode next;

    public sNode(String S, float[] E, sNode N) { // Constructor
        word = S;
        embedding = new float[50];
        for (int i = 0; i < 50; i++)
            embedding[i] = E[i]; // copy the 50 embedding values
        next = N;
    }
}
hashTableStrings class:
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Scanner;

public class hashTableStrings {
    private static sNode[] H;
    private int TABLE_SIZE;
    private int size;

    public hashTableStrings(int n) { // Initialize all lists to null
        size = 0;
        TABLE_SIZE = n;
        H = new sNode[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; i++)
            H[i] = null;
    }

    public int getSize() { // Function to get number of key-value pairs
        return size;
    }

    public static void main(String[] args) throws IOException {
        Scanner scanner = new Scanner(new FileReader("glove.6B.50d.txt"));
        HashMap<String, Float> table = new HashMap<String, Float>();
        while (scanner.hasNextLine()) {
            String[] words = scanner.nextLine().split("\t\t"); // split space between word and float number embedding
            for (int i = 0; i < 50; i++) {
                table.put(words[0], Float.parseFloat(words[i]));
            }
        }
        System.out.println(table);
    }
}
Example of the text file:
The file can be found at the link below:
https://nlp.stanford.edu/projects/glove/
Download glove.6B.zip and open the glove.6B.50d.txt text file.
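The reason you get the "array index out of bounds" exception is that you split the string on "\t\t" (two tab characters), while the fields are separated by a single space. So each line is not split into multiple words; it comes back as one whole string, and you get an array of length 1.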
String[] words = scanner.nextLine().split("\t\t");
// words.length is 1, because the array holds a single String (the whole line).
Replace split("\t\t") with split(" ") and that should fix the problem. By the way, each line has 51 tokens in total (counting the leading word), so the loop condition should be i < 51, not i < 50.
for (int i = 1; i < 51; i++) {
    // Do your work...
}
// i starts at 1 because index 0 holds the word itself and the floats start at index 1.
// Passing words[0] (the word) to Float.parseFloat is what throws NumberFormatException.
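To see the difference between the two delimiters, here is a minimal sketch. The class name SplitDemo, the helper parseEmbedding, and the shortened sample line are made up for illustration; only String.split and Float.parseFloat come from the code above.

```java
import java.util.Arrays;

class SplitDemo {
    // Parses one line of the form "<word> <f1> <f2> ..." and returns only the floats.
    // Hypothetical helper, not part of the original code.
    static float[] parseEmbedding(String line) {
        String[] tokens = line.split(" ");
        float[] values = new float[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) { // index 0 is the word, so start at 1
            values[i - 1] = Float.parseFloat(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "truecar.com -0.23163 0.39098 -0.7428"; // shortened sample line
        System.out.println(line.split("\t\t").length); // 1: no double tab in the line, so nothing is split
        System.out.println(line.split(" ").length);    // 4: the word plus three floats
        System.out.println(Arrays.toString(parseEmbedding(line)));
    }
}
```

Sizing the array from tokens.length instead of hard-coding 50 or 51 means the same loop also works for the other GloVe files, assuming they use the same space-separated layout.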
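However, as @Satish Thulva pointed out, there is still a problem with your code. The way you use the HashMap, the key (the word) ends up with only the last float on the line as its value, not the whole list of floats.
For example,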
truecar.com -0.23163 0.39098 -0.7428 1.5123 -1.2368 -0.89173 -0.051826 -1.1305 0.96384 -0.12672 -0.8412 -0.76053 0.10582 -0.23173 0.11274 0.26327 0.053071 0.66657 0.9423 -0.78162 1.6225 0.097435 -0.67124 0.46235 0.3226 1.3423 0.87102 0.2217 -0.068228 0.73468 -1.0692 -0.85722 -0.49683 -1.4468 -1.1979 -0.49506 -0.36319 0.53553 -0.046529 1.5829 -0.1326 -0.55717 -0.17242 0.99214 0.73551 -0.51421 0.29743 0.19933 0.87613 0.63135
In your case, the result would be
Key: truecar.com value: 0.63135
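The overwrite can be reproduced in isolation. This is only a sketch: OverwriteDemo and putEachValue are hypothetical names, and the two-float input line is a shortened sample.

```java
import java.util.HashMap;
import java.util.Map;

class OverwriteDemo {
    // Mimics the loop from the question: every put uses the same key,
    // so each call replaces the value stored by the previous one.
    static Map<String, Float> putEachValue(String line) {
        String[] tokens = line.split(" ");
        Map<String, Float> table = new HashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            table.put(tokens[0], Float.parseFloat(tokens[i]));
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Float> table = putEachValue("truecar.com 0.87613 0.63135");
        System.out.println(table.size());             // 1: a single entry for the word
        System.out.println(table.get("truecar.com")); // the last float on the line
    }
}
```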
To store all of a key's float values as its value, use HashMap<String, Float[]>
String[] words = scanner.nextLine().split(" "); // split on the single space between the word and each float
// An array of Float that will hold the values for the word.
Float[] values = new Float[words.length - 1]; // one slot fewer, because the word itself is not stored as a value
for (int i = 1; i < words.length; i++) {
    values[i - 1] = Float.parseFloat(words[i]);
}
// All the values are now stored in the array,
// so store it in the map.
table.put(words[0], values);
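Putting the pieces together, a complete program could look roughly like the sketch below. The class name EmbeddingLoader and the load method are assumptions made for illustration, and it stores a primitive float[] instead of Float[] to avoid boxing each value; the rest follows the snippet above.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class EmbeddingLoader {
    // Reads "<word> <f1> ... <fn>" lines and maps each word to its float vector.
    // Hypothetical helper; float[] is used here in place of the Float[] shown above.
    static Map<String, float[]> load(Readable source) {
        Map<String, float[]> table = new HashMap<>();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] words = line.split(" ");
                float[] values = new float[words.length - 1];
                for (int i = 1; i < words.length; i++) {
                    values[i - 1] = Float.parseFloat(words[i]);
                }
                table.put(words[0], values);
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to glove.6B.50d.txt as the first argument.
        // Without an argument, a tiny made-up sample is parsed instead.
        Readable source = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("foo 0.5 0.25\nbar 1.5 2.5\n");
        Map<String, float[]> table = load(source);
        System.out.println(table.size() + " words loaded");
    }
}
```

Note that System.out.println(table) would print array references rather than the numbers; use Arrays.toString(table.get(word)) to inspect a single vector.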