Set-based bulk import of denormalized data into normalized SQL Server 2014 database tables

The following simplified model works well for a bulk, set-based insert of the denormalized data in #BulkData (suggestions for improvement are welcome):

IF OBJECT_ID('tempdb..#Things') IS NOT NULL 
   DROP TABLE #Things

IF OBJECT_ID('tempdb..#Categories') IS NOT NULL 
   DROP TABLE #Categories

IF OBJECT_ID('tempdb..#ThingsToCategories') IS NOT NULL 
   DROP TABLE #ThingsToCategories

IF OBJECT_ID('tempdb..#BulkData') IS NOT NULL 
   DROP TABLE #BulkData

CREATE TABLE #Things
(
    ThingId INT IDENTITY(1,1) PRIMARY KEY,
    ThingName NVARCHAR(255)
)

CREATE TABLE #Categories
(
    CategoryId INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName NVARCHAR(255)
)

CREATE TABLE #ThingsToCategories
(
    ThingId INT,
    CategoryId INT
)

CREATE TABLE #BulkData
(
    ThingName NVARCHAR(255),
    CategoryName NVARCHAR(255)
)

-- the following would be done from a flat file via a bulk import 
INSERT INTO #BulkData
    SELECT N'Thing1', N'Category1'
        UNION 
    SELECT N'Thing2', N'Category1'
        UNION 
    SELECT N'Thing3', N'Category2'

INSERT INTO #Categories
    SELECT DISTINCT CategoryName 
    FROM #BulkData 
    WHERE CategoryName NOT IN (SELECT DISTINCT CategoryName 
                               FROM #Categories)

INSERT INTO #Things
    SELECT DISTINCT ThingName 
    FROM #BulkData 
    WHERE ThingName NOT IN (SELECT DISTINCT ThingName FROM #Things)

INSERT INTO #ThingsToCategories
    SELECT ThingId, CategoryId
    FROM #BulkData 
    INNER JOIN #Things ON #BulkData.ThingName = #Things.ThingName
    INNER JOIN #Categories ON #BulkData.CategoryName = #Categories.CategoryName

SELECT * FROM #Categories
SELECT * FROM #Things
SELECT * FROM #ThingsToCategories

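One problem I have is that the data in #Things is accessible before the data has been inserted into #ThingsToCategories.

Can I wrap the above in a transaction (?) so that #Things only becomes available once the entire bulk import has completed?

Like so: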

BEGIN TRANSACTION X
 -- insert into all normalised tables
COMMIT TRANSACTION X

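Would this work for a few million records?

I guess I could also lower the logging level?

  1. Can I wrap the above in a transaction (?) so that #Things only becomes available once the entire bulk import has completed? Like so: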

BEGIN TRANSACTION X
 -- insert into all normalised tables
COMMIT TRANSACTION X

The answer is yes. From the documentation on transactions:

A transaction is a single unit of work. If a transaction is successful, all of the data modifications made during the transaction are committed and become a permanent part of the database. If a transaction encounters errors and must be canceled or rolled back, then all of the data modifications are erased.

Transactions have the following four standard properties, usually referred to by the acronym ACID. Quoting from "SQL Transactions" on tutorialspoint.com:

Atomicity: ensures that all operations within the work unit are completed successfully; otherwise, the transaction is aborted at the point of failure, and previous operations are rolled back to their former state.

Consistency: ensures that the database properly changes states upon a successfully committed transaction.

Isolation: enables transactions to operate independently of and transparent to each other.

Durability: ensures that the result or effect of a committed transaction persists in case of a system failure.
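
Applied to the script from the question, such a wrapper might look like the sketch below. It reuses the question's three INSERT statements and only adds explicit column lists. SET XACT_ABORT ON and the TRY/CATCH block are my additions, on the assumption that any run-time error should undo the whole import. Keep in mind that local temp tables such as #Things are only visible to the session that created them, so the isolation question only really arises once the targets are permanent tables.

```sql
-- Sketch only: run after the CREATE TABLE / #BulkData part of the question's script.
SET XACT_ABORT ON;  -- assumption: a run-time error should abort the whole transaction

BEGIN TRY
    BEGIN TRANSACTION;

    -- 1. New categories
    INSERT INTO #Categories (CategoryName)
        SELECT DISTINCT CategoryName
        FROM #BulkData
        WHERE CategoryName NOT IN (SELECT CategoryName FROM #Categories);

    -- 2. New things
    INSERT INTO #Things (ThingName)
        SELECT DISTINCT ThingName
        FROM #BulkData
        WHERE ThingName NOT IN (SELECT ThingName FROM #Things);

    -- 3. Link rows, resolved through the generated identity values
    INSERT INTO #ThingsToCategories (ThingId, CategoryId)
        SELECT t.ThingId, c.CategoryId
        FROM #BulkData AS b
        INNER JOIN #Things AS t ON b.ThingName = t.ThingName
        INNER JOIN #Categories AS c ON b.CategoryName = c.CategoryName;

    COMMIT TRANSACTION;  -- all three inserts become visible together
END TRY
BEGIN CATCH
    -- Undo everything if any of the three inserts failed
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;  -- re-raise the original error to the caller
END CATCH;
```

The transaction is left unnamed here. A name like X in the question is allowed but has no effect on what a plain ROLLBACK undoes.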


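  2. Would this work for a few million records?

Again, yes. The number of records doesn't matter. This time in my own words:

  • Atomicity: if the transaction succeeds, all operations in it take effect at once when the transaction completes, i.e. when it is committed. If at least one operation in the transaction fails, all operations are rolled back (in other words, none of them persist). The number of operations in the transaction doesn't matter.

  • Isolation: other transactions will not see a transaction's operations until they have been committed.

There are different transaction isolation levels, though. SQL Server's default is READ COMMITTED: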

Specifies that statements cannot read data that has been modified but not committed by other transactions. [...]

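This level is a trade-off that balances performance against consistency. Ideally you would want everything to be SERIALIZABLE (see the documentation; it is too long to copy and paste here), but that isolation level buys consistency (+) at the cost of performance (-). In many cases READ COMMITTED is sufficient, but you should understand how it works and compare that with how your transaction needs to behave relative to other transactions that are running at the same time.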
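
As a rough illustration, the isolation level is a per-session setting that can be inspected and changed. The sketch below assumes the question's script has already been run in the same session, so that #Things exists. The SELECT inside the transaction is only a placeholder for whatever statements need the stricter guarantees.

```sql
-- Show the isolation level the current session is using.
SELECT CASE transaction_isolation_level
           WHEN 1 THEN 'READ UNCOMMITTED'
           WHEN 2 THEN 'READ COMMITTED'
           WHEN 3 THEN 'REPEATABLE READ'
           WHEN 4 THEN 'SERIALIZABLE'
           WHEN 5 THEN 'SNAPSHOT'
           ELSE 'UNSPECIFIED'
       END AS IsolationLevel
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID;

-- Raise it for one transaction that needs stricter guarantees...
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;
    -- placeholder: reads that must stay stable until COMMIT go here
    SELECT COUNT(*) AS ThingCount FROM #Things;
COMMIT TRANSACTION;

-- ...and switch back to the default afterwards.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

The setting stays in effect for the session until it is changed again, which is why the sketch resets it at the end.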

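Also note that a transaction locks database objects (rows, tables, schema...), and those locks will block other transactions that want to read or modify the same objects (depending on the type of lock). It is therefore best to keep the number of operations in a transaction low. Sometimes, though, a transaction simply has to do a lot of work and cannot be broken up.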
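
To get a feel for what a transaction holds on to, the locks of the current session can be listed while the transaction is still open. This is only a sketch: it assumes #Things from the question exists in the session, the inserted value N'Thing4' is made up for the example, and querying sys.dm_tran_locks normally requires the VIEW SERVER STATE permission.

```sql
BEGIN TRANSACTION;

-- Any modification will do; this row is hypothetical and is rolled back below.
INSERT INTO #Things (ThingName) VALUES (N'Thing4');

-- Locks currently requested or held by this session, grouped by kind.
SELECT resource_type,
       request_mode,
       request_status,
       COUNT(*) AS LockCount
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
GROUP BY resource_type, request_mode, request_status;

ROLLBACK TRANSACTION;  -- releases the locks and discards the test row
```

Running the same query during a large import gives an idea of how many locks the transaction accumulates and at which granularity.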