More efficient sed on huge input files

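I regularly need to shrink huge database SQL dumps (over 100 GB) to a more manageable size by removing unnecessary INSERT statements. I use the following script to do this. My concern is that the script involves multiple iterations over the source file, which is obviously computationally expensive. Is there a way to combine all of my sed statements into one, so that the source file only needs to be processed once, or to process it in a more efficient way?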

sed '/INSERT INTO `attendance_log`/d' input.sql | \
sed '/INSERT INTO `analytics_models_log`/d' | \
sed '/INSERT INTO `backup_logs`/d' | \
sed '/INSERT INTO `config_log`/d' | \
sed '/INSERT INTO `course_completion_log`/d' | \
sed '/INSERT INTO `errorlog`/d' | \
sed '/INSERT INTO `log`/d' | \
sed '/INSERT INTO `logstore_standard_log`/d' | \
sed '/INSERT INTO `mnet_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `prog_completion_log`/d' | \
sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | \
sed '/INSERT INTO `totara_sync_log`/d' | \
sed '/INSERT INTO `prog_messagelog`/d' | \
sed '/INSERT INTO `stats_daily`/d' | \
sed '/INSERT INTO `course_modules_completion`/d' | \
sed '/INSERT INTO `question_attempt_step_data`/d' | \
sed '/INSERT INTO `scorm_scoes_track`/d' | \
sed '/INSERT INTO `question_attempts`/d' | \
sed '/INSERT INTO `grade_grades_history`/d' | \
sed '/INSERT INTO `task_log`/d' > reduced.sql 

Is this idea heading in the right direction?

cat input.sql | sed '/INSERT INTO `analytics_models_log`/d' | sed '/INSERT INTO `backup_logs`/d' | sed '/INSERT INTO `config_log`/d' | sed '/INSERT INTO `course_completion_log`/d' | sed '/INSERT INTO `errorlog`/d' | sed '/INSERT INTO `log`/d' | sed '/INSERT INTO `logstore_standard_log`/d' | sed '/INSERT INTO `mnet_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `prog_completion_log`/d' | sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | sed '/INSERT INTO `totara_sync_log`/d' | sed '/INSERT INTO `prog_messagelog`/d' | sed '/INSERT INTO `stats_daily`/d' | sed '/INSERT INTO `course_modules_completion`/d' | sed '/INSERT INTO `question_attempt_step_data`/d' | sed '/INSERT INTO `scorm_scoes_track`/d' | sed '/INSERT INTO `question_attempts`/d' | sed '/INSERT INTO `grade_grades_history`/d' | sed '/INSERT INTO `task_log`/d' > reduced.sql 

If you have multiple sed ... | sed ... stages, you can combine them by writing sed -e ... -e ... or sed '...;...'. In this case, however, there is an even more efficient way:

sed -E '/INSERT INTO `(attendance_log|analytics_models_log|...)`/d'
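As a minimal sketch of the three spellings mentioned above, the following runs each of them on a hypothetical three-line sample (the sample rows and the variable names are made up for illustration) and should yield the same result every time, with a single sed process reading the input once:

```shell
# Hypothetical three-line sample standing in for the real dump.
sample=$(printf '%s\n' \
  'INSERT INTO `log` VALUES (1);' \
  'INSERT INTO `users` VALUES (2);' \
  'INSERT INTO `errorlog` VALUES (3);')

# Form 1: several -e expressions in one sed invocation.
out_e=$(printf '%s\n' "$sample" | sed -e '/INSERT INTO `log`/d' -e '/INSERT INTO `errorlog`/d')

# Form 2: commands separated by a semicolon in one script.
out_semi=$(printf '%s\n' "$sample" | sed '/INSERT INTO `log`/d;/INSERT INTO `errorlog`/d')

# Form 3: one extended regex with an alternation of table names.
out_alt=$(printf '%s\n' "$sample" | sed -E '/INSERT INTO `(log|errorlog)`/d')

printf '%s\n' "$out_alt"
```

Because the closing backtick is part of each pattern, `log` here does not also delete rows for tables that merely start with log.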

Alternatively, switching to grep might be even faster:

grep -vE 'INSERT INTO `(attendance_log|analytics_models_log|...)`'

grep -vFf <(printf 'INSERT INTO `%s`\n' attendance_log analytics_models_log ...)
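The fixed-string variant can also read its patterns from a regular file, which avoids the `<(...)` process substitution (not available in plain POSIX sh). A rough sketch, assuming a hypothetical tables.txt with one table name per line; all file names and sample rows here are invented for the example:

```shell
# Hypothetical file names; a temp directory keeps the sketch self-contained.
dir=$(mktemp -d)
printf '%s\n' log errorlog > "$dir/tables.txt"
printf '%s\n' \
  'INSERT INTO `log` VALUES (1);' \
  'INSERT INTO `logstore_kept` VALUES (2);' \
  'INSERT INTO `errorlog` VALUES (3);' > "$dir/sample.sql"

# Turn each table name into a fixed-string pattern, one per line.
sed 's/.*/INSERT INTO `&`/' "$dir/tables.txt" > "$dir/patterns.txt"

# -F: fixed strings, -f: read patterns from a file, -v: keep non-matching lines.
grep_out=$(grep -vFf "$dir/patterns.txt" "$dir/sample.sql")
printf '%s\n' "$grep_out"
rm -r "$dir"
```

Whether this is actually faster than sed on a 100 GB dump depends on the implementation and is worth measuring on a slice of the real data.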

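You could even try to cover all the ..._log and ...logs tables with a single regex, if that is what you want. With this, you only have to list explicitly the tables whose names do not end in log or logs: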

INSERT INTO `([^`]*logs?|local_amosdatasend_log_entry|stats_daily|...)`
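To see what such a pattern catches, here is a small sketch that plugs a shortened version of it into sed -E. The sample rows are hypothetical; note that the regex matches any table name ending in log or logs, so a table such as `catalog` would be dropped as well:

```shell
# Hypothetical sample rows; only the table names matter here.
sample=$(printf '%s\n' \
  'INSERT INTO `backup_logs` VALUES (1);' \
  'INSERT INTO `task_log` VALUES (2);' \
  'INSERT INTO `stats_daily` VALUES (3);' \
  'INSERT INTO `log_display` VALUES (4);' \
  'INSERT INTO `catalog` VALUES (5);')

# Delete INSERTs for every table ending in log/logs, plus one listed by name.
regex_out=$(printf '%s\n' "$sample" | sed -E '/INSERT INTO `([^`]*logs?|stats_daily)`/d')
printf '%s\n' "$regex_out"
```

Only the `log_display` row survives in this sample, so it is worth checking the list of real table names against the pattern before relying on it.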

For ease of maintenance it may make sense to keep a list of /INSERT INTO <table>/d commands (in a file) that sed can use to filter the SQL script.

The sed commands are stored in a file, e.g.:

$ cat sed.cmds
/INSERT INTO attendance_log/d
/INSERT INTO analytics_models_log/d
/INSERT INTO backup_logs/d
/INSERT INTO config_log/d
/INSERT INTO course_completion_log/d

Sample SQL script:

$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...

Invoking sed with the command file:

$ sed -f sed.cmds sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...

For ease of maintenance it may make sense to keep a list of tables (in a file) that awk can use to filter the SQL script.

List of (database) tables to skip ...

$ cat table.list
attendance_log
analytics_models_log
backup_logs
config_log
course_completion_log

Sample SQL script:

$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...

awk does the pruning for us:

$ awk 'FNR==NR {table[$1];next} /^INSERT INTO / && ($3 in table) {next} 1' table.list sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...

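Notes:

  • this is based solely on the fact that the question only mentions INSERT INTO commands
  • assumes the lines (of interest) start with INSERT INTO (otherwise remove the ^)
  • this solution would need additional checks/coding to handle other SQL statements the OP wants to remove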
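One such additional check: the dump in the question quotes table names in backticks, so the third field would be `` `attendance_log` `` and not attendance_log. A possible adaptation, offered only as a sketch (the sample rows and the temp directory are assumptions, and it further assumes table names contain no whitespace), strips the backticks before the lookup:

```shell
# Hypothetical inputs in a temp directory.
dir=$(mktemp -d)
printf '%s\n' attendance_log backup_logs > "$dir/table.list"
printf '%s\n' \
  'INSERT INTO `attendance_log` VALUES (1);' \
  'INSERT INTO `bubblegum` VALUES (2);' \
  'INSERT INTO `backup_logs` VALUES (3);' \
  'CREATE TABLE `backup_logs` (id int);' > "$dir/sample.sql"

awk_out=$(awk '
  FNR == NR { table[$1]; next }   # first file: remember the table names
  /^INSERT INTO / {
    name = $3
    gsub(/`/, "", name)           # drop the backtick quoting
    if (name in table) next       # skip INSERTs for listed tables
  }
  1                               # print everything else
' "$dir/table.list" "$dir/sample.sql")
printf '%s\n' "$awk_out"
rm -r "$dir"
```

The FNR==NR idiom relies on table.list being non-empty; with an empty list the first rule would fire on the SQL file instead.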