Load CSV with non-standard formatting using SSIS
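I am tasked with loading accounting transaction records from a csv file. The file contains one row of header information that applies to the whole file, but for some reason it groups the data by account number, and the account number sits on a line above the transaction data, in the same column as the ID.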
"ID","Name","Date","Debit","Credit","Balance"
,,,,,
"1150 - Cash in Bank",,,,,
"Starting Balance",,,,,"59,612.78"
615892,"Account Name 1","5/5/2018","2,100.00",,"61,712.78"
645761,"Account Name 2","5/7/2018",,7,"61,705.78"
615892,"Account Name 3","5/8/2018",,"2,144.33","59,561.45"
713300,"Account Name 4","5/8/2018","2,144.33",,"61,705.78"
713300,"Account Name 5","5/8/2018",,"2,144.33","59,561.45"
693615,"Account Name 6","5/9/2018",,"1,650.00","57,911.45"
"Net Change",,,,,"-1,701.33"
,,,"4,244.33","5,945.66","57,911.45"
"3150 - Owner Contribution",,,,,
"Starting Balance",,,,,0
713300,"Account Name 4","5/8/2018",,"2,144.33","-2,144.33"
"Net Change",,,,,"-2,144.33"
,,,0,"2,144.33","-2,144.33"
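Can anyone tell me how to approach this? I know how to do it logically with just a few variables and row-by-row processing, but I am not a C# or front-end developer at all. My biggest problem seems to be that you can't write a piece and test it the way you can in SQL. There I can query a table, look at the data and keep building on it, but with C# I need the whole script to work together. How do I start with a small piece and expand? Even just reading the first account name into a variable and showing it as a variable in the Data Flow Task would help, anything where I can send code and get something back. Every script I find online seems to have some compile error, and I don't yet know enough to work through them.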
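This should get all of it into a DataTable structure that you can then use to assign or do whatever you need with. Let me know if you need a different type of end object.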
//Place these at the top of the script file; TextFieldParser needs a reference to the Microsoft.VisualBasic assembly
using System.Data;
using System.IO;
using Microsoft.VisualBasic.FileIO;

var data = string.Empty;                                 //String var to hold file
var tbl = new DataTable("MyData");                       //Tmp DataTable object
using (var fs = new StreamReader(@"C:\Temp\test.csv"))   //Open file
    data = fs.ReadToEnd();                               //Read entirely into data variable
using (var parser = new TextFieldParser(new StringReader(data)))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;             //Handles quoted cells such as "2,100.00" as well as unquoted and empty cells
    var cnt = 0;                                         //Counter to know header
    while (!parser.EndOfData)                            //Iterate rows; completely blank lines are skipped
    {
        var cells = parser.ReadFields();                 //Split row into cells. Empty cells come back as empty strings.
        if (cnt == 0)
            foreach (var cell in cells)                  //If is header row add columns
                tbl.Columns.Add(new DataColumn(cell));
        else                                             //Else data row
        {
            var dataRow = tbl.NewRow();                  //New row
            dataRow.ItemArray = cells;                   //Assign cell values
            tbl.Rows.Add(dataRow);                       //Add row to table
        }
        cnt++;
    }
}
EDIT: Cleaned up the usings and added comments.
EDIT2: If the file is too large, here is a streaming version:
//Requires using System.Data; and using Microsoft.VisualBasic.FileIO; (reference the Microsoft.VisualBasic assembly)
var cnt = 0;                                                   //Row counter
var tbl = new DataTable("MyData");                             //Tmp DataTable object
using (var parser = new TextFieldParser(@"C:\Temp\test.csv"))  //Open file; rows are read one at a time
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;                   //Handles quoted, unquoted and empty cells
    while (!parser.EndOfData)                                  //Loop until the end of the file
    {
        var cells = parser.ReadFields();                       //Read the next line and split it into cells
        if (cnt == 0)                                          //If is header row
        {
            foreach (var cell in cells)                        //For each header
                tbl.Columns.Add(new DataColumn(cell));         //Add column
        }
        else                                                   //Not header row
        {
            var dataRow = tbl.NewRow();                        //Create new row based on tmp table
            dataRow.ItemArray = cells;                         //Assign cell values
            tbl.Rows.Add(dataRow);                             //Add the DataRow (not the raw line) to the table
        }
        cnt++;
    }
}
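As a side illustration (in Python, not part of the answer above): the sample rows mix quoted cells that contain commas with unquoted and empty cells, so whatever splits a row into cells has to be quote-aware. This is only a minimal sketch using Python's standard csv module; the helper names naive_split and csv_split are made up for the example.

```python
import csv
import io

# One transaction row copied from the sample file in the question.
LINE = '615892,"Account Name 1","5/5/2018","2,100.00",,"61,712.78"'


def naive_split(line):
    """Split on the three-character sequence quote-comma-quote."""
    return line.split('","')


def csv_split(line):
    """Split with a quote-aware CSV parser from the standard library."""
    return next(csv.reader(io.StringIO(line)))


print(naive_split(LINE))  # cells are merged because not every cell is quoted
print(csv_split(LINE))    # six cells, with the empty Credit cell preserved
```

The naive split returns three pieces for this row, while the CSV parser returns the six cells that match the header.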
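Solution overview
I provided my answer in VB.Net because it may be easier to follow, especially since you are not a C# developer.
- In the Dataflow task, add a Script Component after the Flat File Source
- Mark all columns as input columns and add 8 output columns
- In Input0_ProcessInputRow, check whether the ID column is not empty and contains an integer; if so, create an output row. Otherwise, if it contains the account number or the starting balance, store that value in a variable. Otherwise ignore the row.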
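Detailed solution
- Add a Flat File Connection Manager and select the text file
- Change the text qualifier to "
- Add a Data Flow Task
- In the Data Flow Task, add a Flat File Source, a Script Component and an OLEDB Destination
- In the Script Component, select all columns as input columns
- Add 8 output columns (the main columns + Account + StartingBalance), all of type DT_STR
- Change the OutputBuffer SynchronousInput property to None
- Select Visual Basic as the script language
- Write the following script in the script editor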
Private AccountName As String = ""
Private StartingBalance As String = ""

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    'Skip rows whose ID is null or blank (separator and totals rows)
    If Not Row.ID_IsNull AndAlso
       Not String.IsNullOrEmpty(Row.ID.Trim) Then
        If Integer.TryParse(Row.ID, New Integer) Then
            'Transaction row: output it with the current account and starting balance
            Output0Buffer.AddRow()
            Output0Buffer.ID = Row.ID
            Output0Buffer.Name = Row.Name
            Output0Buffer.Date = Row.Date
            Output0Buffer.Debit = Row.Debit
            Output0Buffer.Credit = Row.Credit
            Output0Buffer.Balance = Row.Balance
            Output0Buffer.Account = AccountName
            Output0Buffer.StartingBalance = StartingBalance
        ElseIf Row.ID.Contains("Starting Balance") Then
            StartingBalance = Row.Balance
        ElseIf Row.ID.Contains("-") Then
            AccountName = Row.ID
        Else
            'Ignore Row
            Exit Sub
        End If
    End If
End Sub
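- Map the output columns to the destination columns
- The output will be one row per transaction, with the Account and StartingBalance of its group added as extra columns.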
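To make the expected result concrete, here is a rough sketch of the same row-by-row logic in Python rather than VB.NET, run against the sample file from the question. It is only an illustration under a few assumptions: the function name flatten is made up, str.isdigit stands in for Integer.TryParse, and the header row is skipped by hand because the Flat File Source would normally consume it.

```python
import csv
import io

# Sample file copied from the question.
SAMPLE = '''"ID","Name","Date","Debit","Credit","Balance"
,,,,,
"1150 - Cash in Bank",,,,,
"Starting Balance",,,,,"59,612.78"
615892,"Account Name 1","5/5/2018","2,100.00",,"61,712.78"
645761,"Account Name 2","5/7/2018",,7,"61,705.78"
615892,"Account Name 3","5/8/2018",,"2,144.33","59,561.45"
713300,"Account Name 4","5/8/2018","2,144.33",,"61,705.78"
713300,"Account Name 5","5/8/2018",,"2,144.33","59,561.45"
693615,"Account Name 6","5/9/2018",,"1,650.00","57,911.45"
"Net Change",,,,,"-1,701.33"
,,,"4,244.33","5,945.66","57,911.45"
"3150 - Owner Contribution",,,,,
"Starting Balance",,,,,0
713300,"Account Name 4","5/8/2018",,"2,144.33","-2,144.33"
"Net Change",,,,,"-2,144.33"
,,,0,"2,144.33","-2,144.33"
'''


def flatten(text):
    """Return the transaction rows with Account and StartingBalance appended."""
    account = ""
    starting_balance = ""
    out = []
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        if not row or not row[0].strip():
            continue  # blank separator row or totals row
        row_id = row[0].strip()
        if row_id.isdigit():
            # Transaction row: carry the current group values along.
            out.append(row + [account, starting_balance])
        elif "Starting Balance" in row_id:
            starting_balance = row[5]
        elif "-" in row_id:
            account = row_id
        # Anything else (for example "Net Change") is ignored.
    return out


for r in flatten(SAMPLE):
    print(r)
```

For the sample this yields seven rows. The six rows of the first group end with "1150 - Cash in Bank" and "59,612.78", and the single row of the second group ends with "3150 - Owner Contribution" and "0".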
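I just saw this post. I went through a very similar experience just 1 day ago, and I would recommend running the macro below (it can be run from Excel or from the CSV, but you cannot save the code if you save the changes with the CSV extension).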
' Add reference to Microsoft ActiveX Data Objects 2.8 Library
Sub testexportsql()
    Dim Cn As ADODB.Connection
    Dim ServerName As String
    Dim DatabaseName As String
    Dim TableName As String
    Dim UserID As String
    Dim Password As String
    Dim rs As ADODB.Recordset
    Dim RowCounter As Long
    Dim NoOfFields As Integer
    Dim StartRow As Long
    Dim EndRow As Long
    Dim ColCounter As Integer
    Set rs = New ADODB.Recordset
    ServerName = "server_name" ' Enter your server name here
    DatabaseName = "db_name" ' Enter your database name here
    TableName = "table_name" ' Enter your Table name here
    UserID = "" ' Enter your user ID here
    ' (Leave ID and Password blank if using Windows Authentication)
    Password = "" ' Enter your password here
    NoOfFields = 10 ' Enter number of fields to update (eg. columns in your worksheet)
    StartRow = 2 ' Enter row in sheet to start reading records
    EndRow = 100 ' Enter row of last record in sheet
    ' CHANGES
    Dim shtSheetToWork As Worksheet
    Set shtSheetToWork = ActiveWorkbook.Worksheets("sheet_name")
    '********
    Set Cn = New ADODB.Connection
    Cn.Open "Driver={SQL Server};Server=" & ServerName & ";Database=" & DatabaseName & _
        ";Uid=" & UserID & ";Pwd=" & Password & ";"
    rs.Open TableName, Cn, adOpenKeyset, adLockOptimistic
    'EndRow = shtSheetToWork.Cells(Rows.Count, 1).End(xlUp).Row
    For RowCounter = StartRow To EndRow
        rs.AddNew
        ' Copy each worksheet cell of the current row into the matching recordset field
        For ColCounter = 1 To NoOfFields
            rs(ColCounter - 1) = shtSheetToWork.Cells(RowCounter, ColCounter)
        Next ColCounter
        Debug.Print RowCounter
    Next RowCounter
    rs.UpdateBatch
    ' Tidy up
    rs.Close
    Set rs = Nothing
    Cn.Close
    Set Cn = Nothing
End Sub
I hope this solution works for you. It definitely worked for me.