Insert Large Data into a Cloud Spanner Table

I want to insert a large amount of data into a Google Cloud Spanner table.

This is what I do in my node.js application, but it fails because the txt file is too large (almost 2 GB):

1. Load the txt file

2. Read it line by line

3. Split each line by "|"

4. Build the data object

5. Insert the data into the Cloud Spanner table
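
A rough sketch of those steps with the @google-cloud/spanner Node.js client is below. The project, instance, database, table, and column names are placeholders (not from the question), and the whole-file read in step 1 is exactly what breaks for a ~2 GB file:

```js
const fs = require('fs');
const {Spanner} = require('@google-cloud/spanner');

// Placeholder resource names -- substitute your own.
const spanner = new Spanner({projectId: 'my-project'});
const database = spanner.instance('my-instance').database('my-database');
const table = database.table('MyTable');

async function loadFile(path) {
  // 1. Load the txt file (reads the whole ~2 GB file into memory, which is what fails).
  const contents = fs.readFileSync(path, 'utf8');

  // 2. Read it line by line.
  for (const line of contents.split('\n')) {
    if (!line) continue;

    // 3. Split the line by "|".
    const fields = line.split('|');

    // 4. Build the data object (these column names are assumptions).
    const row = {Id: fields[0], Name: fields[1], Value: fields[2]};

    // 5. Insert the data into the Cloud Spanner table, one row per commit.
    await table.insert(row);
  }
}
```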

MySQL supports inserting data from an .sql file. Does Cloud Spanner support something similar?

Cloud Spanner does not currently expose a bulk import method. It sounds like you are inserting each row individually, which is not the best approach. The documentation describes best (and bad) practices for efficient bulk loading:

To get optimal write throughput for bulk loads, partition your data by primary key with this pattern:

Each partition contains a range of consecutive rows. Each commit contains data for only a single partition. A good rule of thumb for your number of partitions is 10 times the number of nodes in your Cloud Spanner instance. So if you have N nodes, with a total of 10*N partitions, you can assign rows to partitions by:

- Sorting your data by primary key.
- Dividing it into 10*N separate sections.
- Creating a set of worker tasks that upload the data.

Each worker will write to a single partition. Within the partition, it is recommended that your worker write the rows sequentially. However, writing data randomly within a partition should also provide reasonably high throughput.

As more of your data is uploaded, Cloud Spanner automatically splits and rebalances your data to balance load on the nodes in your instance. During this process, you may experience temporary drops in throughput.

Following this pattern, you should see a maximum overall bulk write throughput of 10-20 MiB per second per node.
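
To make that pattern concrete, here is a minimal sketch. It assumes a 3-node instance, data already sorted by a single Id primary-key column, and a Spanner Table object like the one in the earlier sketch; the batch size and all names are assumptions, not part of the documentation:

```js
const NODES = 3;                  // assumed instance size
const PARTITIONS = 10 * NODES;    // rule of thumb: 10 partitions per node
const BATCH_SIZE = 500;           // rows per commit; an assumed value

// 'rows' is an array of row objects already sorted by primary key,
// 'table' is a Spanner Table object as in the earlier sketch.
async function bulkLoad(rows, table) {
  // Divide the sorted data into 10 * N consecutive sections.
  const sectionSize = Math.ceil(rows.length / PARTITIONS);
  const sections = [];
  for (let i = 0; i < rows.length; i += sectionSize) {
    sections.push(rows.slice(i, i + sectionSize));
  }

  // One worker per section; each worker writes its rows sequentially in
  // batches, so every commit contains data for only a single partition.
  await Promise.all(sections.map(async section => {
    for (let i = 0; i < section.length; i += BATCH_SIZE) {
      await table.insert(section.slice(i, i + BATCH_SIZE));
    }
  }));
}
```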

You also appear to be loading the entire large file into memory before processing it. For large files, you should look at loading and processing chunks rather than the whole file. I am not a Node expert, but you should probably try reading it as a stream instead of keeping everything in memory.
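
For example, Node's built-in readline module can iterate over the file as a stream. The sketch below follows the "|" layout from the question, while the column names, batch size, and the `table` object are assumptions carried over from the sketches above:

```js
const fs = require('fs');
const readline = require('readline');

async function streamFile(path, table) {
  // Stream the file instead of loading all ~2 GB into memory.
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });

  let batch = [];
  for await (const line of rl) {
    if (!line) continue;
    const fields = line.split('|');
    batch.push({Id: fields[0], Name: fields[1], Value: fields[2]}); // assumed columns

    // Commit in batches instead of one insert per row.
    if (batch.length >= 500) {
      await table.insert(batch);
      batch = [];
    }
  }
  if (batch.length > 0) {
    await table.insert(batch);
  }
}
```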