PHP preg_match pattern to remove time from subtitle srt file
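I need a preg_match expression to remove all the timings from an .srt subtitle file (imported as a string), but I have never quite managed to get my head around regex patterns. So, for example, it would change: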
5
00:05:50,141 --> 00:05:54,771
This is what was said
to
This is what was said
So, considering that This is what was said starts with a capital letter and can be text with punctuation, I propose the following:
$re = '/.*([A-Z]{1}[A-Za-z0-9 _.,?!"\/\'$]*)/';
$str = '5
00:05:50,141 --> 00:05:54,771
This is what was said.';
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE, 0);
// Print the entire match result
var_dump($matches);
Not sure where you got stuck; it is really only \d+ plus colons and commas.
$re = '/\d+.\d+:\d+:\d+,\d+\s-->\s\d+:\d+:\d+,\d+./s';
//$re = '/\d+.[0-9:,]+\s-->\s[\d+:,]+./s'; // slightly more compact version of the regex
$str = '5
00:05:50,141 --> 00:05:54,771
This is what was said';
$subst = '';
$result = preg_replace($re, $subst, $str);
echo $result;
Working demo here.
With the more compact pattern it looks like this: https://regex101.com/r/QY9QXG/2
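The patterns above are demonstrated on a single cue. As a rough sketch of how the same idea might be applied to a whole file, the version below anchors the match to the start of a line with the m flag and uses \R so that both \n and \r\n line endings are handled. The function name stripSrtTimings and the sample text are made up for illustration, and the pattern assumes every cue has the usual hh:mm:ss,mmm timing format.

```php
<?php
// Hypothetical helper (not part of the original answer): removes the index
// line and the timing line of every cue in an SRT string.
function stripSrtTimings(string $srt): string
{
    // ^\d+\R      the cue number on its own line (\R = any line break)
    // timing line two hh:mm:ss,mmm stamps separated by " --> "
    // .*\R?       anything left on the timing line, plus its line break
    $pattern = '/^\d+\R\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}.*\R?/m';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $srt) ?? $srt;
}

$srt = "1\n00:05:50,141 --> 00:05:54,771\nThis is what was said\n\n"
     . "2\n00:05:55,000 --> 00:05:58,250\nAnd this came next";

echo stripSrtTimings($srt);
```

Because the pattern only matches a number line that is immediately followed by a timing line, subtitle text that happens to contain digits should be left alone.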
Just for fun and the challenge, here is a non-regex answer: https://3v4l.org/r7hbO
$str = "1
00:05:50,141 --> 00:05:54,771
This is what was said1
2
00:05:50,141 --> 00:05:54,771
This is what was said2
3
00:05:50,141 --> 00:05:54,771
This is what was said3
4
00:05:50,141 --> 00:05:54,771
This is what was said4
LLLL
5
00:05:50,141 --> 00:05:54,771
This is what was said5";
$count = explode(PHP_EOL.PHP_EOL, $str);
foreach($count as &$line){
$line = implode(PHP_EOL, array_slice(explode(PHP_EOL, $line), 2));
}
echo implode(PHP_EOL.PHP_EOL, $count);
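The non-regex version first splits the string on double newlines, so each subtitle group becomes its own item in the array.
It then loops over the items and explodes each one again on newlines.
The first two lines are not needed, so array_slice cuts them off.
If a subtitle spans more than one line, the lines have to be joined again, which is done by imploding on newlines.
As a last step, the string is rebuilt by imploding on double newlines.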
As Casimir wrote in the comments below, I have used PHP_EOL as the newline, and that works in the example.
But on a real srt file the line ending may be different.
If the code does not work as expected, try replacing PHP_EOL with the line ending the file actually uses.
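One way to sidestep the PHP_EOL problem might be to split with preg_split and \R, which matches \r\n, \n and \r alike. This is only a sketch of that variant; the function name stripSrtHeaders is invented here, and the output is normalized to \n, which may or may not be what you want.

```php
<?php
// Hypothetical variant of the explode() approach above that does not depend
// on PHP_EOL matching the file's line endings.
function stripSrtHeaders(string $srt): string
{
    // Two or more consecutive line breaks separate the cues.
    $blocks = preg_split('/\R{2,}/', trim($srt));

    foreach ($blocks as &$block) {
        // Split the cue into lines and drop the index and timing lines.
        $lines = preg_split('/\R/', $block);
        $block = implode("\n", array_slice($lines, 2));
    }
    unset($block); // break the reference left over from the loop

    // Rebuild the text with a blank line between cues.
    return implode("\n\n", $blocks);
}

$srt = "1\r\n00:05:50,141 --> 00:05:54,771\r\nThis is what was said1\r\n\r\n"
     . "2\r\n00:05:50,141 --> 00:05:54,771\r\nThis is what was said2\r\nLLLL\r\n";

echo stripSrtHeaders($srt);
```

Like the original, this relies on every cue starting with exactly two header lines.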
PHP code:
$str = '5
00:05:50,141 --> 00:05:54,771
This is what was said';
$reg = '/(.{0,}[0,1]{0,}\s{0,}[0-9]{0,}.{0,}[0-9]+[0-9]+:[0-9]{0,}.{0,})/';
echo(trim(preg_replace($reg, '', $str)));
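Since srt files always have the same format, you can skip the first two lines of each block of lines and return the result once you reach an empty line. To do that, and to avoid loading the whole file into memory, you can read the file line by line and use a generator: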
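function getSubtitleLine($handle) {
    $flag = 0;       // number of header lines (index, timing) skipped so far
    $subtitle = '';
    while ( false !== $line = stream_get_line($handle, 1024, "\n") ) {
        $line = rtrim($line);   // also strips a trailing "\r"
        if ( $line === '' ) {   // strict check: empty("0") would be true
            // a blank line ends the current block
            yield $subtitle;
            $subtitle = '';
            $flag = 0;
        } elseif ( $flag == 2 ) {
            // both header lines were skipped: this is subtitle text
            $subtitle .= $subtitle === '' ? $line : "\n$line";
        } else {
            $flag++;
        }
    }
    // last block when the file does not end with a blank line
    if ( $subtitle !== '' )
        yield $subtitle;
}
if ( false !== $handle = fopen('./test.srt', 'r') ) {
    foreach (getSubtitleLine($handle) as $line) {
        echo $line, PHP_EOL;
    }
    fclose($handle);
}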