Read and store a text file with string and float data into a hash map

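I have a text file in which every line starts with a word, followed by 50 floats that represent the word's vector description (embedding). I am trying to read the file and store each word and its embedding in a hash table. The problem I am facing is that I get a NumberFormatException, or sometimes an array index out of bounds exception. How can I read and store each word and its embedding in a hash map?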

sNode class:

public class sNode { // Node class for hash map
    public String word;
    public float[] embedding;
    public sNode next;

    public sNode(String S, float[] E, sNode N) { // Constructor
        word = S;
        embedding = new float[50];
        for (int i = 0; i < 50; i++)
            embedding[i] = E[i]; // copy the 50 embedding values
        next = N; // link to the next node in the same bucket
    }
}

hashTableStrings class:

public class hashTableStrings{ 
private static sNode [] H;
private int TABLE_SIZE;
private int size; 

public hashTableStrings(int n){ // Initialize all lists to null
    size = 0;
    TABLE_SIZE = n;
    H = new sNode[TABLE_SIZE]; 
    for(int i=0;i<TABLE_SIZE;i++) 
        H[i] = null;
}

public int getSize(){ // Function to get number of key-value pairs
    return size;
}


public static void main (String [] args) throws IOException{
    Scanner scanner = new Scanner(new FileReader("glove.6B.50d.txt"));

    HashMap<String, Float> table = new HashMap<String, Float>();

    while (scanner.hasNextLine()) {
        String[] words = scanner.nextLine().split("\t\t"); // split space between word and float number embedding
        for (int i=0; i<50;i++){
            table.put(words[0], Float.parseFloat(words[i]));
        }
    }

    System.out.println(table);

}
} // end of class hashTableStrings

Sample of the text file:

The file can be found at the link below: https://nlp.stanford.edu/projects/glove/

Download glove.6B.zip and open the text file glove.6B.50d.txt.

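The reason you are getting the "Array out of Bound" exception is that you split the string on "\t\t" (a double tab), while the fields are separated by a single space. So each line is not split into separate words. It stays one complete string, and you get an array of length 1.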

String[] words = scanner.nextLine().split("\t\t");
// words.length is 1 here, because the array holds a single String (the whole line).
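To see the difference for yourself, here is a minimal sketch that counts the tokens each delimiter produces. The class name SplitDemo and the helper countTokens are made up for this illustration, and the input is a shortened version of a line from the file, so treat it as a quick check rather than part of your program.

```java
class SplitDemo {
    // Hypothetical helper: number of tokens produced by splitting on the given regex.
    static int countTokens(String line, String delimiterRegex) {
        return line.split(delimiterRegex).length;
    }

    public static void main(String[] args) {
        // Shortened line in the same "word float float ..." shape as the file.
        String line = "truecar.com -0.23163 0.39098 -0.7428";

        // No double tab occurs in the line, so the whole line stays one token.
        System.out.println(countTokens(line, "\t\t")); // prints 1

        // Splitting on a single space yields the word plus its three floats.
        System.out.println(countTokens(line, " "));    // prints 4
    }
}
```

If you are not sure whether a file uses spaces or tabs, splitting on the regex "\\s+" (one or more whitespace characters) handles both.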

Replacing split("\t\t") with split(" ") should fix the problem. By the way, each line has 51 tokens in total (counting the leading word), so the loop condition should be i < 51, not i < 50.

for (int i = 1; i < 51; i++) {
    // Do your work...
}

// i starts at index 1 because index 0 holds the word and the floats start at index 1.
// Calling Float.parseFloat on the word at index 0 is what throws NumberFormatException.

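However, as @Satish Thulva pointed out, there is still a problem with your code. The way you use the HashMap, each key (word) ends up with only the last float on the line as its value, not all of the floats on the line. For example,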

truecar.com  -0.23163  0.39098  -0.7428  1.5123  -1.2368  -0.89173  -0.051826  -1.1305  0.96384  -0.12672  -0.8412  -0.76053  0.10582  -0.23173  0.11274  0.26327  0.053071  0.66657  0.9423  -0.78162  1.6225  0.097435  -0.67124  0.46235  0.3226  1.3423  0.87102  0.2217  -0.068228  0.73468  -1.0692  -0.85722  -0.49683  -1.4468  -1.1979  -0.49506  -0.36319  0.53553  -0.046529  1.5829  -0.1326  -0.55717  -0.17242  0.99214  0.73551  -0.51421  0.29743  0.19933  0.87613  0.63135

In your case, the result would be

 Key: truecar.com  value: 0.63135
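The reason is that put() with a key that is already present replaces the old value. A minimal sketch of that behaviour follows. OverwriteDemo and putEachValue are names invented for this example, and the input is a shortened version of the line above.

```java
import java.util.HashMap;
import java.util.Map;

class OverwriteDemo {
    // Mimics the loop from the question: every put() uses the same key,
    // so each call replaces the value stored by the previous one.
    static Map<String, Float> putEachValue(String[] tokens) {
        Map<String, Float> table = new HashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            table.put(tokens[0], Float.parseFloat(tokens[i]));
        }
        return table;
    }

    public static void main(String[] args) {
        String[] tokens = "truecar.com -0.23163 0.39098 0.63135".split(" ");
        Map<String, Float> table = putEachValue(tokens);

        System.out.println(table.size());             // one entry, not three
        System.out.println(table.get("truecar.com")); // only the last float survives
    }
}
```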

To store all of a key's float values as its value, use HashMap<String, Float[]>

String[] words = scanner.nextLine().split(" "); // split on the space between the word and its floats

// An array of Float that holds the values for this word.
// Its length is words.length - 1 because the word itself is not stored as a value.
Float[] values = new Float[words.length - 1];
for (int i = 1; i < words.length; i++) {
    values[i - 1] = Float.parseFloat(words[i]);
}

// All the values are now in the array, so store it in the map.
table.put(words[0], values);
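Putting the pieces together, a complete version could look like the sketch below. A few things here are my own choices and not from your code: the class name EmbeddingLoader and the method load are hypothetical, I used a primitive float[] where the snippet above uses Float[] to avoid boxing every number, I read with BufferedReader where you used Scanner, and I split on "\\s+" so that tabs or repeated spaces are also accepted. The main method uses a short in-memory sample so it runs without the download.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class EmbeddingLoader {
    // Reads lines shaped like "word f1 f2 ... fn" and maps each word to its vector.
    static Map<String, float[]> load(Reader source) throws IOException {
        Map<String, float[]> table = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] tokens = line.split("\\s+");
                // tokens[0] is the word; everything after it is a float.
                float[] values = new float[tokens.length - 1];
                for (int i = 1; i < tokens.length; i++) {
                    values[i - 1] = Float.parseFloat(tokens[i]);
                }
                table.put(tokens[0], values);
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the file, shortened to two floats per word.
        String sample = "truecar.com -0.23163 0.39098\n";
        Map<String, float[]> table = load(new StringReader(sample));
        System.out.println(table.get("truecar.com").length); // prints 2
    }
}
```

To read the real file, pass new java.io.FileReader("glove.6B.50d.txt") to load. Note that printing the map directly shows array references, not the numbers, so use java.util.Arrays.toString on a value when you want to inspect it.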