Sorting files according to keywords, need a more database-y solution

Question

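I'm writing a script that sorts video files into folders by checking each file name for known keywords. As the number of keywords has grown out of control, the script has become very slow, taking several seconds per file.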

@echo off    
cd /d d:\videos\shorts
if /i not "%cd%"=="d:\videos\shorts" echo invalid shorts dir. && exit /b

:: auto detect folder name via anchor file
for /r %%i in (*spirit*science*chakras*) do set conspiracies=%%~dpi
if not exist "%conspiracies%" echo conspiracies dir missing. && pause && exit /b
for /r %%i in (*modeselektor*evil*) do set musicvideos=%%~dpi
if not exist "%musicvideos%" echo musicvideos dir missing. && pause && exit /b

for %%s in (*) do set "file=%%~nxs" & set "full=%%s" & call :count "%%s"
for %%v in (*) do echo can't sort "%%~nv"
exit /b

:count
set oldfile="%file%"
set newfile=%oldfile:&=and%
if not %oldfile%==%newfile% ren "%full%" %newfile%

set count=0
set words= & rem
echo "%~n1" | findstr /i /c:"music" >nul && set words=%words%, music&& set /a count+=1
echo "%~n1" | findstr /i /c:"official video" >nul && set words=%words%, official video&& set /a count+=2
set words=%words:has, =has %
set words=%words: , =%
if not %count%==0 echo "%file%" has "%words%" %count%p for music videos
set musicvideoscount=%count%

set count=0
set words= & rem
echo "%~n1" | findstr /i /c:"misinform" >nul && set words=%words%, misinform&& set /a count+=1
echo "%~n1" | findstr /i /c:"antikythera" >nul && set words=%words%, antikythera&& set /a count+=2
set words=%words:has, =has %
set words=%words: , =%
if not %count%==0 echo "%file%" has "%words%" %count%p for conspiracies
set conspiraciescount=%count%

set wanted=3
set winner=none

:loop
:: count points and set winner (in case of tie lowest in this list wins, sort accordingly)
if %conspiraciescount%==%wanted% set winner=%conspiracies%
if %musicvideoscount%==%wanted% set winner=%musicvideos%
set /a wanted+=1
if not %wanted%==15 goto loop

if not "%winner%"=="none" move "%full%" "%winner%" >nul && echo "%winner%%file%" && echo.

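Note the "weight value" of each keyword. The script totals the points for each category, finds the highest total, and moves the file to the folder assigned to that category. It also shows the words it found, and at the end it lists the files it couldn't sort, so I can add keywords or tweak the weight values.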

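I've cut the number of folders and keywords in this example down to a minimum. The full script has six folders and is 64k in size with all the keywords (and still growing).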

Answer 1

@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "tempfile=%temp%\somename"
SET "categories=music conspiracies"
REM SET "categories=conspiracies music"
(
 FOR /f "tokens=1,2,*delims=," %%s IN (q45196316.txt) DO (
 FOR /f "delims=" %%a IN (
  'dir /b /a-d "%sourcedir%\*%%u*" 2^>nul'
  ) DO (
   ECHO %%a^|%%s^|%%t
 )
)
)>"%tempfile%"

SET "lastname="

FOR /f "tokens=1,2,*delims=|" %%a IN ('sort "%tempfile%"') DO (
 CALL :resolve %%b %%c "%%a"
)
:: and the last entry...
CALL :resolve dummy 0 

GOTO :EOF

:resolve
IF "%~3" equ "%lastname%" GOTO accum
:: report and reset accumulators
IF NOT DEFINED lastname GOTO RESET
SET "winner="
SET /a maxfound=0
FOR %%v IN (%categories%) DO (
 FOR /f "tokens=1,2delims=$=" %%w IN ('set $%%v') DO CALL :compare %%w %%x
)
IF DEFINED winner ECHO %winner% %lastname:&=and%
:RESET
FOR %%v IN (%categories%) DO SET /a $%%v=0
SET "lastname=%~3"
:accum
SET /a $%1+=%2

GOTO :eof

:compare
IF %2 lss %maxfound% GOTO :EOF 
IF %2 gtr %maxfound% GOTO setwinner
:: equal scores use categories to determine
IF DEFINED winner GOTO :eof
:Setwinner
SET "winner=%1"
SET maxfound=%2
GOTO :eof

You will need to change the setting of sourcedir to suit your circumstances.

I used a file named q45196316.txt containing this category data for my testing.

music,6,music
music,8,Official video
conspiracies,3,misinform
conspiracies,6,antikythera
missing,0,not appearing in this directory

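I believe your problem is that repeatedly executing findstr is time-consuming.

This approach uses a data file containing lines of category,weight,mask. The categories variable contains the list of categories in order of preference (used when scores are equal).

The data file is read, assigning the category to %%s, the weight to %%t and the mask to %%u, and a directory scan is then made using the mask. For each name match found, a line in the format name|category|weight is echoed to the temporary file. dir appears to be quite fast after the first scan.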
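The record-building step described above can be sketched outside of batch. The following is only an illustration in Python, not part of the batch script: the function name build_records and the sample file names are hypothetical, and the rules are a subset of the q45196316.txt sample. It works on a list of names instead of scanning a directory.

```python
import csv
import io
from fnmatch import fnmatch

# Same layout as the q45196316.txt sample: category,weight,mask
RULES_TEXT = """music,6,music
music,8,Official video
conspiracies,3,misinform
conspiracies,6,antikythera
"""


def build_records(filenames, rules_text=RULES_TEXT):
    """Return one (name, category, weight) tuple per rule that matches a name."""
    records = []
    for category, weight, mask in csv.reader(io.StringIO(rules_text)):
        # Wrapping the mask in * mirrors the "*%%u*" pattern passed to dir.
        pattern = f"*{mask.lower()}*"
        for name in filenames:
            # Lower-casing both sides imitates dir's case-insensitive matching.
            if fnmatch(name.lower(), pattern):
                records.append((name, category, int(weight)))
    return records


if __name__ == "__main__":
    sample = ["Some Band - Song (Official Video) music.mp4", "Antikythera.mp4"]
    for name, category, weight in build_records(sample):
        # Same name|category|weight layout as the temporary file.
        print(f"{name}|{category}|{weight}")
```

The point of the design is that each rule is applied once to the whole file list, rather than every rule being run once per file.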

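The resulting temporary file therefore has one line for each filename+category+weight match, so a filename that belongs to more than one category produces more than one entry.

We then scan a sorted version of that file and resolve the scores.

First, if the filename changes, we can report on the previous filename. This is done by comparing the values in the $categoryname variables. Since these are scanned in the order given in %categories%, the first category in the list is chosen if scores are equal. The scores are then reset and lastname is initialised to the new filename.

We then accumulate the score into $categoryname.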
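The sort-then-accumulate idea can be sketched as follows. This is a Python illustration of the logic only, not a translation of the batch code: pick_winners and CATEGORIES are hypothetical names, and reporting no winner for a zero or negative total is an assumption.

```python
from itertools import groupby

# Priority order, used to break ties (plays the role of %categories%).
CATEGORIES = ["music", "conspiracies"]


def pick_winners(records, categories=CATEGORIES):
    """Map each file name to its best-scoring category."""
    winners = {}
    # Sorting puts all records for one file next to each other, as SORT does.
    for name, group in groupby(sorted(records), key=lambda rec: rec[0]):
        totals = dict.fromkeys(categories, 0)
        for _, category, weight in group:
            if category in totals:  # ignore categories not listed
                totals[category] += weight
        # max() returns the first maximal item, so earlier categories win ties.
        best = max(categories, key=lambda cat: totals[cat])
        if totals[best] > 0:  # assumption: a non-positive total means no winner
            winners[name] = best
    return winners


if __name__ == "__main__":
    demo = [
        ("b.mp4", "conspiracies", 6),
        ("a.mp4", "music", 6),
        ("a.mp4", "conspiracies", 6),
        ("b.mp4", "music", 3),
    ]
    print(pick_winners(demo))
```

Because the records are grouped by name after sorting, each file's totals are complete by the time the name changes, which is why a single pass is enough.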

So I believe this will be somewhat faster.

Revision

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
SET "sourcedir=U:\sourcedir"
SET "tempfile=%temp%\somename"
SET "categories="rock music" music conspiracies"
REM SET "categories=conspiracies music"
:: set up sorting categories
SET "sortingcategories="
FOR %%a IN (%categories%) DO SET "sortingcategories=!sortingcategories!,%%~a"
SET "sortingcategories=%sortingcategories: =_%"
:: Create "tempfile" containing lines of name|sortingcategory|weight
(
 FOR /f "tokens=1,2,*delims=," %%s IN (q45196316.txt) DO (
 SET "sortingcategory=%%s"
 SET "sortingcategory=!sortingcategory: =_!"
 FOR /f "delims=" %%a IN (
  'dir /b /a-d "%sourcedir%\*%%u*" 2^>nul'
  ) DO (
   ECHO %%a^|!sortingcategory!^|%%t^|%%s^|%%u
 )
)
)>"%tempfile%"

SET "lastname="

SORT "%tempfile%">"%tempfile%.s"

FOR /f "usebackqtokens=1,2,3delims=|" %%a IN ("%tempfile%.s") DO (

 CALL :resolve %%b %%c "%%a"
)
:: and the last entry...
CALL :resolve dummy 0 

GOTO :EOF
:: resolve by totalling weights (%2) in sortingcategories (%1) 
:: for each name (%3)
:resolve
IF "%~3" equ "%lastname%" GOTO accum
:: report and reset accumulators
IF NOT DEFINED lastname GOTO RESET
SET "winner=none"
SET /a maxfound=0
FOR %%v IN (%sortingcategories%) DO (
 FOR /f "tokens=1,2delims=$=" %%w IN ('set $%%v') DO IF %%x gtr !maxfound! (SET "winner=%%v"&SET /a maxfound=%%x)
)
ECHO %winner:_= % %lastname:&=and%
:RESET
FOR %%v IN (%sortingcategories%) DO SET /a $%%v=0
SET "lastname=%~3"
:accum
SET /a $%1+=%2

GOTO :eof

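I've added some significant comments.

You can now have spaces in a category name. You need to quote such a name in the SET "categories=..." statement (the quoted form is used for reporting).

sortingcategories is derived automatically. It is used only for sorting and is simply the categories list with any space in a name replaced by an underscore.

When the temporary file is created, each category is converted to its underscore form (the sorting category), and when the final placement is resolved, the underscores are replaced by spaces again to give back the category name.

Negative weights should now be handled correctly.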
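A small sketch of the underscore round trip and of the "strictly greater than the running maximum" rule may help here. It is a Python illustration only, with hypothetical names (sorting_name, display_name, resolve); the totals are passed in as a dictionary rather than read from $ variables.

```python
def sorting_name(category):
    """Category name as used for sorting: spaces become underscores."""
    return category.replace(" ", "_")


def display_name(sorting_category):
    """Reverse step used when reporting: underscores become spaces."""
    return sorting_category.replace("_", " ")


def resolve(totals, categories):
    """Pick the category whose total is strictly greater than the best so far.

    totals is keyed by sorting name. Starting from ("none", 0) means a file
    whose totals are all zero or negative gets no category.
    """
    winner, max_found = "none", 0
    for category in categories:
        score = totals.get(sorting_name(category), 0)
        if score > max_found:
            winner, max_found = sorting_name(category), score
    return display_name(winner)


if __name__ == "__main__":
    cats = ["rock music", "music", "conspiracies"]
    print(resolve({"rock_music": 2, "conspiracies": -5}, cats))
    print(resolve({"conspiracies": 3 - 5}, cats))
```

One consequence of this scheme is that a category whose real name contains an underscore would be reported with a space instead, so such names are best avoided.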

Further revision for "not append *"

 FOR /f "tokens=1-5delims=," %%s IN (q45196316.txt) DO (
 SET "sortingcategory=%%s"
 SET "sortingcategory=!sortingcategory: =_!"
 FOR %%z IN ("!sortingcategory!") DO (
  SETLOCAL disabledelayedexpansion
  FOR /f "delims=" %%a IN (
   'dir /b /a-d "%sourcedir%\%%~v%%u%%~w" 2^>nul'

and

add 2 extra columns to the q45196316.txt file:

music,6,music,*,*
music,8,Official video,"",*
conspiracies,3,misinform,*,*
conspiracies,6,kythera,*anti,*
missing,0,not appearing in this directory,*,*
rock music,2,metal,*,*
conspiracies,-5,negative,*,*

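The for /f ... %%s loop now also produces %%v and %%w, which contain the last two columns (because tokens is now 1-5).

These are used in the dir command as the prefix and suffix of %%u. Note that "" should be used for "nothing", because two consecutive commas are parsed as a single separator. The ~ before v or w in %%~v and %%~w means "remove the quotes".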
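To see why "" is needed as a placeholder, here is a Python sketch that imitates the two behaviours just described: runs of delimiters count as one separator, and the ~ modifier strips surrounding quotes. The helper names (for_f_tokens, strip_quotes, build_mask) are hypothetical, and the sketch only approximates FOR /F for lines of exactly five fields.

```python
import re


def for_f_tokens(line, delims=","):
    """Split like FOR /F: a run of delimiters counts as a single separator."""
    return [tok for tok in re.split(f"[{re.escape(delims)}]+", line) if tok]


def strip_quotes(token):
    """Imitate the ~ modifier, which removes surrounding double quotes."""
    if len(token) >= 2 and token[0] == token[-1] == '"':
        return token[1:-1]
    return token


def build_mask(line):
    """Return prefix + mask + suffix, as used in the dir command."""
    category, weight, mask, prefix, suffix = for_f_tokens(line)
    return strip_quotes(prefix) + mask + strip_quotes(suffix)


if __name__ == "__main__":
    print(build_mask('music,8,Official video,"",*'))
    print(build_mask("conspiracies,6,kythera,*anti,*"))
    # With an empty column instead of "", only four tokens remain,
    # so the suffix would slide into the prefix position:
    print(for_f_tokens("music,8,Official video,,*"))
```

With the "" placeholder the first line yields a mask with no leading wildcard, so only names that start with the keyword match.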
