Increase performance for checking file delimiters
After spending some time looking for the clearest way to check whether a file's body has the same number of delimiters as its header, I came up with this code:
Param #user enters the directory path and delimiter they are checking for
(
[string]$source,
[string]$delim
)
#try {
$lineNum = 1
$thisOK = 0
$badLine = 0
$noDelim = 0
$archive = ("*archive*","*Archive*","*ARCHIVE*");
foreach ($files in Get-ChildItem $source -Exclude $archive) #folder directory may have sub folders, as a temp workaround just made sure to exclude any folder with archive
{
$read2 = New-Object System.IO.StreamReader($files.FullName)
$DataLine = (Get-Content $files.FullName)[0]
$validCount = ([char[]]$DataLine -eq $delim).count #count of delimiters in the header
$lineNum = 1 #used to write to host which line is bad in the file
$thisOK = 0 #used in the if condition to let the host know the file's delimiters line up with the header
$badLine = 0 #used so the write-host doesn't meet the if condition and report the file is ok after throwing an error
while (!$read2.EndOfStream)
{
$line = $read2.ReadLine()
$total = $line.Split($delim).Length - 1;
if ($total -eq $validCount)
{
$thisOK = 1
}
elseif ($total -ne $validCount)
{
Write-Output "Error on line $lineNum for file $files. Line number $lineNum has $total delimiters and the header has $validCount"
$thisOK = 0
$badLine = 1
break; #break or else it will repeat each line that is bad
}
$lineNum++
}
if ($thisOK -eq 1 -and $badLine -eq 0 -and $validCount -ne 0)
{
Write-Output "$files is ok"
}
if ($validCount -eq 0)
{
Write-Output "$files does not contain the entered delimiter: $delim"
}
$read2.Close()
$read2.Dispose()
} #end foreach loop
#} catch {
# $ErrorMessage = $_.Exception.Message
# $FailedItem = $_.Exception.ItemName
#}
It works for everything I've tested so far. However, with larger files it takes quite a long time. I'd like to know what I can do to this code, or change about it, to make it process these text/CSV files faster.
Also, my try..catch statements are commented out because when I include them the script seems to run with no errors, just returning a new command line. As a further idea, I'm looking into a simple GUI so other users can double-check files.
Sample file:
HeaderA|HeaderB|HeaderC|HeaderD //header line
DataLnA|DataLnBBDataLnC|DataLnD|DataLnE //bad line
DataLnA|DataLnB|DataLnC|DataLnD| //bad line
DataLnA|DataLnB|DataLnC|DataLnD //good line
Now that I look at it, I think there may be a problem where the delimiter count is correct but the columns don't line up, like this:
HeaderA|HeaderB|HeaderC|HeaderD
DataLnA|DataLnBDataLnC|DataLnD|
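The concern above can be confirmed in a console; this quick sketch (using the two sample lines) shows that a count-based check alone reports the same number of delimiters for both, so it cannot flag the shifted columns:

```powershell
# Both lines report 3 delimiters, so the count comparison passes
# even though the data is in the wrong columns.
$header = 'HeaderA|HeaderB|HeaderC|HeaderD'
$badRow = 'DataLnA|DataLnBDataLnC|DataLnD|'
$header.Split('|').Length - 1  # 3
$badRow.Split('|').Length - 1  # 3 -- same count, mismatched columns
```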
The main problem I see is that you're reading the file twice -- once with the call to Get-Content, which reads the entire file into memory, and a second time with the while loop. You can double the processing speed by replacing this line:
$DataLine = (Get-Content $files.FullName)[0] #inefficient
with this:
$DataLine = Get-Content $files.FullName -First 1 #efficient
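Going one step further, you can drop Get-Content entirely and read the header through the same StreamReader, so each file is opened exactly once. A minimal sketch of that single-pass idea, assuming the $files and $delim variables from the original loop:

```powershell
# Sketch: one pass per file -- the header comes from the same reader
# as the body, so the file is never read twice.
$reader = New-Object System.IO.StreamReader($files.FullName)
try {
    $header = $reader.ReadLine()                     # first line = header
    $validCount = $header.Split($delim).Length - 1   # delimiters in header
    $lineNum = 2                                     # body starts on line 2
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        $total = $line.Split($delim).Length - 1
        if ($total -ne $validCount) {
            Write-Output "Error on line $lineNum for file $files. Line has $total delimiters, header has $validCount"
            break
        }
        $lineNum++
    }
}
finally {
    $reader.Dispose()   # releases the file handle even if an exception is thrown
}
```

Wrapping the reader in try/finally also addresses the commented-out error handling: the file handle is released no matter how the loop exits.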