Bash Script To Add Date Threshold to S3 `cp` function

I want to be able to give `aws s3 cp` a date threshold, but it has no switch for that.

So I'd like to write a Bash script for it. Calling `aws s3 ls` with the `--recursive` switch gives me a directory listing that includes dates and times, which I think I can use to get what I want. Here's a sample of the output:

2016-12-01 18:06:40          0 sftp/
2016-12-01 20:35:39       1024 sftp/.ssh/.id_rsa.swp
2016-12-01 20:35:39       1679 sftp/.ssh/id_rsa
2016-12-01 20:35:39        405 sftp/.ssh/id_rsa.pub

What's the most efficient way to iterate over all the files but only copy the ones newer than a specified date?

Here's the (incomplete) script I have so far:

#!/bin/bash

while [[ $# -gt 0 ]]
do
    key="$1"

    case $key in
        -m|--mtime)
            MTIME="$2"
            shift 2;;
        -s|--source)
            SRCDIR="$2"
            shift 2;;
        -d|--dest)
            DSTDIR="$2"
            shift 2;;
        *)
            #echo "Unknown argument: \"$key\""; exit 1;;
            break;;
    esac
done

if [ ! -d "$DSTDIR" ]; then
    echo "the directory does not exist!";
    exit 1;
fi

GTDATE="$(date "+%Y%m%d%H%M%S" -d "$MTIME days ago")"
#echo "Threshold: $GTDATE"

for f in $(aws s3 ls $SRCDIR --recursive | awk '{ ??? }'); do
    #aws s3 cp
done

One approach: format the cutoff date the same way `aws s3 ls` prints its timestamps, sort it together with the `s3 ls` results, and stop processing the listing once the cutoff line is reached:

GTDATE="$(date "+%Y-%m-%d %H:%M:%S" -d "$MTIME days ago")"  # same format as the `aws s3 ls` output

cat <(echo "$GTDATE CUTOFF") <(aws s3 ls "$SRCDIR" --recursive) |
  sort -dr |
  awk '{if ($3=="CUTOFF") {exit} print $4}' |
  xargs -I{} echo aws s3 cp "$SRCDIR/{}" "$DSTDIR/{}"

(I've left the `echo` in on the last line so you can test it and see what commands would run. Remove the `echo` to actually execute the `s3 cp` commands.)

(You could also use `s3 sync` instead of `cp` to avoid re-downloading files that are already up to date.)
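
On the testing point: `aws s3 cp` and `aws s3 sync` also accept a `--dryrun` flag that prints what would be transferred without doing it, so instead of the `echo` trick the last line of the pipeline could read:

xargs -I{} aws s3 cp --dryrun "$SRCDIR/{}" "$DSTDIR/{}"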

It matters whether the timestamps are local time or UTC.
If the local timezone is America/Los_Angeles, `date` can misinterpret the time (note the 03 versus the 11):

$ date  -d '20161201T18:06:40' +'%Y%m%dT%H:%M:%S'
20161201T03:06:40

$ date -ud '20161201T18:06:40' +'%Y%m%dT%H:%M:%S'
20161201T11:06:40    

Using `-u` also avoids problems with DST and local time changes.
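
To reproduce the difference above no matter what zone your own machine is set to, you can force `TZ` (assuming GNU `date`):

$ TZ=America/Los_Angeles date -d '20161201T18:06:40' +'%Y%m%dT%H:%M:%S'
$ TZ=UTC date -d '20161201T18:06:40' +'%Y%m%dT%H:%M:%S'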

Life gets much easier if the dates are recorded in UTC, in a format the `date` command can read back, and with no embedded spaces so that awk and friends can parse them easily. For example:

$ date -ud '2016-12-01 18:06:40' +'%Y-%m-%dT%H:%M:%S'
2016-12-01T18:06:40

That's easier both for the computer running `date` and for the humans reading it.
But your timestamps are a bit different.

Assuming no file name contains a newline,
the script after the option processing should look something like this:

#!/bin/bash

SayError(){ local a="$1"; shift; printf '%s\n' "$0: $*" >&2; exit "$a"; }

[[ ! -d $dstdir ]] && SayError 1 "The directory $dstdir does not exist!"
[[ -z $srcdir ]] && SayError 2 "The source S3 path (srcdir) has not been set."
[[ -z $mtime ]] && SayError 3 "The value of mtime has not been set."

gtdate="$(date -ud "$mtime days ago" "+%Y-%m-%dT%H:%M:%S" )"
#echo "Threshold: $gtdate"

readarray -t files < <(aws s3 ls "$srcdir" --recursive)
limittime=$(date -ud "$gtdate" +"%s")

for f in "${files[@]}"; do
    IFS=' ' read -r day time size name <<<"$f"
    filetime=$( date -ud "${day}T${time}" +"%s" )
    if [[ $filetime -gt $limittime ]]; then
        aws s3 cp "$srcdir/$name" "$dstdir/"
    fi
done

Warning: untested code. Look it over carefully.
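
For illustration, a hypothetical invocation, assuming the option parser from the question sits above this block and fills in the lowercase `mtime`, `srcdir`, and `dstdir` variables, with the script saved as `s3cpnewer.sh` (a made-up name):

./s3cpnewer.sh -m 7 -s "s3://user-data-files/sftp" -d /home/user/sftp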

For posterity's sake, here's a draft of the final script (we're still reviewing it; in practice, you'll want to comment out all the echo calls except the last two):

#!/bin/bash
#
# S3ToDirSync
#
# This script is a custom SFTP-S3 synchronizer utilizing the AWS CLI.
# 
# Usage: S3ToDirSync.sh -m 7 -d "/home/user" -b "user-data-files" -s "sftp"
# 
# Notes: 
#    The script is hardcoded to exclude syncing of any folder or file on a path containing "~archive"
#    See http://docs.aws.amazon.com/cli/latest/reference/s3/index.html#available-commands for AWS CLI documentation on commands
#
# Updated: 12/05/2016

while [[ $# -gt 0 ]]
do
    key="$1"

    case $key in
        -m|--mtime)
            MTIME="$2" # nb of days (from now) to copy
            shift 2;;
        -b|--bucket)
            BUCKET="$2" # the S3 bucket, no trailing slashes
            shift 2;;
        -s|--source) # the S3 prefix/path, slashes at start and end of string will be added if not present
            SRCDIR="$2"
            shift 2;;
        -d|--dest) # the root destination folder
            DSTDIR="$2"
            shift 2;;
        *)
            #echo "Unknown argument: \"$key\""; exit 1;;
            break;;
    esac
done

# validations
if [ ! -d $DSTDIR ]; then
    echo "The destination directory does not exist.";
    exit 1;
fi
if [[ $DSTDIR != *"/" ]]; then
    DSTDIR=$DSTDIR\/
fi
echo "DSTDIR: $DSTDIR"

if [ -z $BUCKET ]; then
    echo "The bucket value has not been set.";
    exit 1;
fi

if [[ $BUCKET == *"/"* ]]; then
    echo "No slashes (/) in bucket arg.";
    exit 1;
fi
# add trailing slash
BUCKET=$BUCKET\/
echo "BUCKET: $BUCKET"

if [ -z $MTIME ]; then
    echo "The mtime value has not been set.";
    exit 1;
fi

# $SRCDIR may be empty, to copy everything in a bucket, but add a trailing slash if missing
if [ ! -z $SRCDIR ] && [[ $SRCDIR != *"/" ]]; then
    SRCDIR=$SRCDIR\/
fi
echo "SRCDIR: $SRCDIR"

SRCPATH=s3://$BUCKET$SRCDIR
echo "SRCPATH: $SRCPATH"

LIMITTIME=$(date -ud "$MTIME days ago" "+%s")
#echo "Threshold UTC Epoch: $LIMITTIME"

readarray -t files < <(aws s3 ls "$SRCPATH" --recursive) # TODO: ls will return up to a limit of 1000 rows, which could timeout or not be enough
for f in "${files[@]}"; do
    IFS=' ' read -r day time size name <<<"$f"
    FILETIME=$(date -ud "${day}T${time}" "+%s")
    # if S3 file more recent than threshold AND if not in an "~archive" folder
    if [[ $FILETIME -gt $LIMITTIME ]] && [[ $name != *"~archive"* ]]; then
        name="${name/$SRCDIR/}" # truncate ls returned name by $SRCDIR, since S3 returns full path
        echo "$SRCPATH  $name"
        destpath="$DSTDIR$name"
        echo "to $destpath"
        # if a directory (trailing slash), mkdir in destination if necessary
        if [[ $name == *"/" ]] && [ ! -d "$destpath" ]; then
            echo "mkdir -p $destpath"
            mkdir -p "$destpath"
        # else a file, use aws s3 sync to benefit from rsync-like checks
        else
            echo "aws s3 sync $SRCPATH$name $destpath"
            aws s3 sync "$SRCPATH$name" "$destpath"
            # NOTE: if a file was on S3 then deleted inside MTIME window, do we delete on SFTP server? If so, add --delete
        fi
    fi
done
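
A note on the TODO above: `aws s3 ls` paginates under the hood, so the 1000-key limit applies per underlying API call rather than to the command's output. That said, if you'd rather push the date filter into the CLI itself instead of the bash loop, the lower-level s3api interface takes a JMESPath --query. A rough sketch of that variant, untested, and string comparisons in --query can behave differently across CLI versions:

GTDATE=$(date -ud "$MTIME days ago" "+%Y-%m-%dT%H:%M:%S")
aws s3api list-objects-v2 \
    --bucket "${BUCKET%/}" \
    --prefix "$SRCDIR" \
    --query "Contents[?LastModified>='${GTDATE}'].Key" \
    --output text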