More efficient sed on huge input files
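I often need to cut huge DB SQL dumps (over 100 GB) down to a more manageable file size by removing unnecessary INSERT statements. I use the following script to do that.
I'm concerned that my script involves multiple iterations over the source file, which is obviously computationally expensive.
Is there a way to combine all my sed statements into one so that the source file only has to be processed once, or is there a more efficient way to do this?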
sed '/INSERT INTO `attendance_log`/d' input.sql | \
sed '/INSERT INTO `analytics_models_log`/d' | \
sed '/INSERT INTO `backup_logs`/d' | \
sed '/INSERT INTO `config_log`/d' | \
sed '/INSERT INTO `course_completion_log`/d' | \
sed '/INSERT INTO `errorlog`/d' | \
sed '/INSERT INTO `log`/d' | \
sed '/INSERT INTO `logstore_standard_log`/d' | \
sed '/INSERT INTO `mnet_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `prog_completion_log`/d' | \
sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | \
sed '/INSERT INTO `totara_sync_log`/d' | \
sed '/INSERT INTO `prog_messagelog`/d' | \
sed '/INSERT INTO `stats_daily`/d' | \
sed '/INSERT INTO `course_modules_completion`/d' | \
sed '/INSERT INTO `question_attempt_step_data`/d' | \
sed '/INSERT INTO `scorm_scoes_track`/d' | \
sed '/INSERT INTO `question_attempts`/d' | \
sed '/INSERT INTO `grade_grades_history`/d' | \
sed '/INSERT INTO `task_log`/d' > reduced.sql
Is this idea heading in the right direction?
cat input.sql | sed '/INSERT INTO `analytics_models_log`/d' | sed '/INSERT INTO `backup_logs`/d' | sed '/INSERT INTO `config_log`/d' | sed '/INSERT INTO `course_completion_log`/d' | sed '/INSERT INTO `errorlog`/d' | sed '/INSERT INTO `log`/d' | sed '/INSERT INTO `logstore_standard_log`/d' | sed '/INSERT INTO `mnet_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `prog_completion_log`/d' | sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | sed '/INSERT INTO `totara_sync_log`/d' | sed '/INSERT INTO `prog_messagelog`/d' | sed '/INSERT INTO `stats_daily`/d' | sed '/INSERT INTO `course_modules_completion`/d' | sed '/INSERT INTO `question_attempt_step_data`/d' | sed '/INSERT INTO `scorm_scoes_track`/d' | sed '/INSERT INTO `question_attempts`/d' | sed '/INSERT INTO `grade_grades_history`/d' | sed '/INSERT INTO `task_log`/d' > reduced.sql
If you have several sed ... | sed ... stages, you can combine them by writing sed -e ... -e ... or sed ...;... In this case, though, there is an even more efficient way:
sed -E '/INSERT INTO `(attendance_log|analytics_models_log|...)`/d'
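As a quick sanity check, here is a minimal sketch comparing the -e form and the alternation form on a three-line stand-in for the dump. The file name input.sql and the log table names come from the question; the users table and the VALUES payloads are made up for illustration.

```shell
# Tiny stand-in for the real dump; `users` is a hypothetical table
# that should survive the filtering.
tmp=$(mktemp -d)
cat > "$tmp/input.sql" <<'EOF'
INSERT INTO `attendance_log` VALUES (1);
INSERT INTO `users` VALUES (1);
INSERT INTO `backup_logs` VALUES (1);
EOF

# One sed process, one -e expression per table.
multi=$(sed -e '/INSERT INTO `attendance_log`/d' \
            -e '/INSERT INTO `backup_logs`/d' "$tmp/input.sql")

# One sed process, a single extended regex with alternation.
alt=$(sed -E '/INSERT INTO `(attendance_log|backup_logs)`/d' "$tmp/input.sql")

printf '%s\n' "$multi"
rm -rf "$tmp"
```

Both variants read the input once and should leave only the users line; which one is faster on a 100 GB dump is something you would have to measure.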
Alternatively, switch to grep, which may be faster:
grep -vE 'INSERT INTO `(attendance_log|analytics_models_log|...)`'
or
grep -vFf <(printf 'INSERT INTO `%s`\n' attendance_log analytics_models_log ...)
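If the table names live in a file, the fixed-string variant can be sketched without process substitution as well. The file names tables.txt and patterns.txt and the tables log_display and users are assumptions for this example, not something from the question.

```shell
tmp=$(mktemp -d)

# Hypothetical list of tables whose INSERTs should be dropped.
printf '%s\n' attendance_log log > "$tmp/tables.txt"

cat > "$tmp/input.sql" <<'EOF'
INSERT INTO `attendance_log` VALUES (1);
INSERT INTO `log` VALUES (1);
INSERT INTO `log_display` VALUES (1);
INSERT INTO `users` VALUES (1);
EOF

# Turn every table name into the fixed string: INSERT INTO `name`
sed 's/.*/INSERT INTO `&`/' "$tmp/tables.txt" > "$tmp/patterns.txt"

# -F: patterns are fixed strings, -f: read them from a file,
# -v: print only the lines that match none of them.
kept=$(grep -vFf "$tmp/patterns.txt" "$tmp/input.sql")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Because the closing backtick is part of each pattern, the entry log removes only the log table and leaves log_display alone.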
You could even cover all the ..._log and ...logs tables with a regex, if that is what you want. With that, you only have to list the non-log tables explicitly:
INSERT INTO `([^`]*logs?|local_amosdatasend_log_entry|stats_daily|...)`
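To see what such a pattern actually catches, here is a small sketch with grep -vE. The shortened pattern and the tables log_display, blog and users are assumptions chosen to show the edge cases.

```shell
tmp=$(mktemp -d)
cat > "$tmp/input.sql" <<'EOF'
INSERT INTO `attendance_log` VALUES (1);
INSERT INTO `backup_logs` VALUES (1);
INSERT INTO `errorlog` VALUES (1);
INSERT INTO `stats_daily` VALUES (1);
INSERT INTO `log_display` VALUES (1);
INSERT INTO `blog` VALUES (1);
INSERT INTO `users` VALUES (1);
EOF

# Any table name ending in "log" or "logs", plus one explicitly listed table.
pattern='INSERT INTO `([^`]*logs?|stats_daily)`'

kept=$(grep -vE "$pattern" "$tmp/input.sql")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Only log_display and users survive here. Note that blog is removed too, since its name ends in log, so it is worth checking the table list for such names before relying on the wildcard.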
For ease of maintenance it may make sense to keep a list of /INSERT INTO <table>/d commands (in a file) that sed can use to filter the SQL script.
Store the sed commands in a file, e.g.:
$ cat sed.cmds
/INSERT INTO attendance_log/d
/INSERT INTO analytics_models_log/d
/INSERT INTO backup_logs/d
/INSERT INTO config_log/d
/INSERT INTO course_completion_log/d
Sample SQL script:
$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...
Running sed with the command file:
$ sed -f sed.cmds sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...
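The dump in the question quotes table names in backticks, so a command file for it would need them as well. A minimal sketch, assuming the table names are kept one per line in a file (called table.list here) and the command file is generated from it:

```shell
tmp=$(mktemp -d)

# Hypothetical list of tables to drop, one name per line.
printf '%s\n' attendance_log backup_logs > "$tmp/table.list"

# Wrap each name as: /INSERT INTO `name`/d
# ("|" is the s-command delimiter so the "/" characters need no escaping.)
sed 's|.*|/INSERT INTO `&`/d|' "$tmp/table.list" > "$tmp/sed.cmds"
cmds=$(cat "$tmp/sed.cmds")

cat > "$tmp/sample.sql" <<'EOF'
INSERT INTO `attendance_log` VALUES (1);
INSERT INTO `bubblegum` VALUES (1);
INSERT INTO `backup_logs` VALUES (1);
INSERT INTO `more_nonsense` VALUES (1);
EOF

kept=$(sed -f "$tmp/sed.cmds" "$tmp/sample.sql")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Keeping the closing backtick in each address also avoids deleting tables whose names merely start with a listed name.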
For ease of maintenance it may make sense to keep a list of tables (in a file) that awk can use to filter the SQL script.
List of (database) tables to skip:
$ cat table.list
attendance_log
analytics_models_log
backup_logs
config_log
course_completion_log
Sample SQL script:
$ cat sample.sql
INSERT INTO attendance_log ...
INSERT INTO bubblegum ...
INSERT INTO backup_logs ...
INSERT INTO more_nonsense ...
Let awk do the pruning for us:
$ awk 'FNR==NR {table[$1];next} /^INSERT INTO / && $3 in table{next}1' table.list sample.sql
INSERT INTO bubblegum ...
INSERT INTO more_nonsense ...
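Notes:
- This is based solely on the fact that the question only mentions INSERT INTO commands
- I'm assuming the lines (of interest) start with INSERT INTO (otherwise remove the ^)
- This solution would need additional checks/coding to handle other SQL statements the OP wants to remove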
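One more thing to watch for: in the question's dump the table names are wrapped in backticks, so the third field would not match the bare names in table.list as-is. A minimal sketch of one way to handle that, assuming the table name is always the third whitespace-separated field (the sample lines and their VALUES payloads are made up):

```shell
tmp=$(mktemp -d)
printf '%s\n' attendance_log backup_logs > "$tmp/table.list"

cat > "$tmp/sample.sql" <<'EOF'
CREATE TABLE `attendance_log` (id INT);
INSERT INTO `attendance_log` VALUES (1);
INSERT INTO `bubblegum` VALUES (1);
INSERT INTO `backup_logs` VALUES (1);
EOF

kept=$(awk '
  # First file: remember every table name to skip.
  FNR == NR { table[$1]; next }
  # Second file: on INSERT lines, strip the backticks from field 3
  # and drop the line if the bare name is in the skip list.
  /^INSERT INTO / {
    name = $3
    gsub(/`/, "", name)
    if (name in table) next
  }
  # Everything else is printed unchanged.
  1
' "$tmp/table.list" "$tmp/sample.sql")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

The CREATE TABLE line for attendance_log is kept, so the table structure survives and only its data is dropped.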