S3 uploads from Java SDK much slower than AWS CLI
Uploading a relatively small file (15 MB) with the Java SDK is much slower than with the AWS CLI, keeping everything else the same: same laptop, same AWS account, same region.
My code follows more or less the same basic pattern as the AWS documentation:
// inputStream is ByteArrayInputStream, all in memory
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType("text/plain");
metadata.setContentLength(contentLength);
PutObjectRequest request = new PutObjectRequest(bucketName, s3keyName, inputStream, metadata);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().build();
s3Client.putObject(request);
Performance difference:
- AWS CLI (aws s3 cp ...) takes about 15 seconds
- The Java SDK takes about a minute
The CLI tools are doing their best to take advantage of multipart uploads for the fastest upload performance. You can achieve similar performance in Java by using the TransferManager class instead of the AmazonS3Client class.
Note that this has not yet been implemented in version 2.0 of the AWS SDK for Java; at the time of writing it is still under development.
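A minimal sketch of that approach (the bucket name, key, and file path below are placeholders, not part of the original answer); TransferManager decides on its own whether to split the upload into parts:
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;
public class TransferManagerQuickStart {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder bucket, key, and path; multipart upload is chosen
        // automatically when the file is large enough.
        TransferManager tm = TransferManagerBuilder.standard().build();
        Upload upload = tm.upload("my-bucket", "my-key", new File("/tmp/file.txt"));
        upload.waitForCompletion(); // blocks until the transfer finishes
        tm.shutdownNow();
    }
}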
The AWS CLI actually uses boto (the Python SDK) itself, not the Java SDK.
aws s3 sync is faster than aws s3 cp, and aws s3 cp is faster than the AWS Java SDK, because the AWS CLI uses multiple threads to copy several files at the same time, which makes copy operations faster.
Using the AWS SDK for Java
For Amazon S3 operations you can achieve similar behavior with TransferManager.
Here is a sample code snippet:
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;
File f = new File(file_path);
TransferManager xfer_mgr = TransferManagerBuilder.standard().build();
try {
    Upload xfer = xfer_mgr.upload(bucket_name, key_name, f);
    // loop with Transfer.isDone()
    XferMgrProgress.showTransferProgress(xfer);
    // or block with Transfer.waitForCompletion()
    XferMgrProgress.waitForCompletion(xfer);
} catch (AmazonServiceException e) {
    System.err.println(e.getErrorMessage());
    System.exit(1);
}
xfer_mgr.shutdownNow();
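TransferManager can also upload several files concurrently, which is closer to what aws s3 cp/sync does with multiple files. A rough sketch, with placeholder bucket, key prefix, and local paths (none of these come from the original answer):
import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import java.io.File;
import java.util.Arrays;
import java.util.List;
public class UploadFileListSketch {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder paths; the files are uploaded in parallel under the given key prefix.
        List<File> files = Arrays.asList(new File("/tmp/a.txt"), new File("/tmp/b.txt"));
        TransferManager xfer_mgr = TransferManagerBuilder.standard().build();
        MultipleFileUpload xfer =
                xfer_mgr.uploadFileList("my-bucket", "my-prefix", new File("/tmp"), files);
        xfer.waitForCompletion(); // block until every file has finished
        xfer_mgr.shutdownNow();
    }
}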
For reference, from the AWS documentation:
Using the AWS SDK for JavaScript:
You can achieve similar behavior with the upload method.
upload(params = {}, [options], [callback]) ⇒ AWS.S3.ManagedUpload
Uploads an arbitrarily sized buffer, blob, or stream, using intelligent concurrent handling of parts if the payload is large enough. You can configure the concurrent queue size by setting options. Note that this is the only operation for which the SDK can retry requests with stream bodies.
Examples:
- Uploading a stream object
var params = {Bucket: 'bucket', Key: 'key', Body: stream};
s3.upload(params, function(err, data) {
console.log(err, data);
});
- Uploading a stream with a concurrency of 1 and a partSize of 10 MB
var params = {Bucket: 'bucket', Key: 'key', Body: stream};
var options = {partSize: 10 * 1024 * 1024, queueSize: 1};
s3.upload(params, options, function(err, data) {
console.log(err, data);
});
In the second example, you can see that queueSize defines the concurrency used to achieve parallelism.
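The Java v1 SDK exposes the equivalent knobs on TransferManagerBuilder; a minimal sketch, where the thread count and part size are illustrative values rather than recommendations from the original answer:
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import java.util.concurrent.Executors;
public class TransferManagerTuningSketch {
    public static void main(String[] args) {
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(AmazonS3ClientBuilder.standard().build())
                .withMultipartUploadThreshold(10L * 1024 * 1024) // switch to multipart above 10 MB
                .withMinimumUploadPartSize(10L * 1024 * 1024)    // roughly the partSize option above
                .withExecutorFactory(() -> Executors.newFixedThreadPool(10)) // roughly queueSize
                .build();
        // ... use tm.upload(...) as in the earlier snippets, then tm.shutdownNow()
    }
}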
I tried several different options. I even generated a presigned URL and used curl, but for me everything took around 15-16 seconds for a 15 MB file, while the CLI took 9 seconds. As @Mark B pointed out, though, TransferManager is the key. Starting from this code, but using withMultipartUploadThreshold to tell the code how I wanted the upload split, got me down to about 6.5 seconds.
Three things to note:
- I am letting the library read a file. When I used a ByteArrayInputStream, the program did not appear to multithread and I was still at 15 seconds. Letting the AWS library read the file seems to let it split the file on its own.
- withMultipartUploadThreshold is tricky. If I give it 15 MB there is no improvement. If I give it 5 MB there is no improvement. But at 1 MB there is a significant improvement. I haven't figured out where the lower bound is, though.
- I did not set the content type. That has to be done in a separate S3 call (see the sketch after this list).
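One way to set the content type afterward is a copy-in-place that replaces the object metadata. A sketch, assuming a placeholder bucket and key (these names are not from the original answer):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
public class SetContentTypeAfterUpload {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentType("text/plain");
        // Copy the object onto itself with new metadata so the content type is replaced.
        CopyObjectRequest request =
                new CopyObjectRequest("my-bucket", "file.txt", "my-bucket", "file.txt")
                        .withNewObjectMetadata(metadata);
        s3.copyObject(request);
    }
}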
If you want them, I also have four other examples that did not perform as well.
import com.amazonaws.SdkClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;
public class HighLevelMultipartUpload {
    public static void main(String[] args) throws Exception {
        Regions clientRegion = Regions.US_EAST_2;
        String bucketName = "<my-bucket>";
        String fileObjKeyName = "file.txt";
        String fileName = "/tmp/file.txt";
        long startTime = System.currentTimeMillis();
        // this made it slower
        // String fileContent = Files.readString(Path.of(fileName));
        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withRegion(clientRegion)
                    .withCredentials(new ProfileCredentialsProvider())
                    .build();
            TransferManager tm = TransferManagerBuilder.standard()
                    .withMultipartUploadThreshold(1024L * 1024)
                    .withS3Client(s3Client)
                    .build();
            // TransferManager processes all transfers asynchronously,
            // so this call returns immediately.
            Upload upload = tm.upload(bucketName, fileObjKeyName, new File(fileName));
            // used if reading the file into memory and supplying metadata
            // Upload upload = tm.upload(bucketName, fileObjKeyName, new ByteArrayInputStream(fileContent.getBytes()), metadata);
            System.out.println("Object upload started");
            // Optionally, wait for the upload to finish before continuing.
            upload.waitForCompletion();
            System.out.println("Object upload complete");
        } catch (SdkClientException e) {
            e.printStackTrace();
        }
        System.out.println("run took " + (System.currentTimeMillis() - startTime) + "ms");
        System.exit(0);
    }
}
- The AWS CLI makes native calls (boto, Python), which will be faster.
- Also, when using the AWS CLI the client connection is already established (so the call here goes straight to S3), and only the time needed for the upload itself is measured, unlike in Java, where client setup is part of the timing (a small timing sketch follows below).
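To make that comparison fairer, one can start the clock only after the client is built. A sketch, with placeholder bucket, key, and path (these are assumptions for illustration, not part of the original answer):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;
public class UploadTimingSketch {
    public static void main(String[] args) {
        // Build the client (and let it resolve credentials/region) before starting the clock,
        // so only the upload itself is measured -- closer to how the CLI comparison works.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        long start = System.currentTimeMillis();
        s3.putObject("my-bucket", "file.txt", new File("/tmp/file.txt")); // placeholder bucket/key/path
        System.out.println("upload took " + (System.currentTimeMillis() - start) + "ms");
    }
}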
To copy an Amazon S3 object larger than 5 GB with the AWS SDK for Java, use the low-level Java API.
To copy an object using the low-level Java API, do the following (multipart upload):
- Initiate a multipart upload by calling the AmazonS3Client.initiateMultipartUpload() method.
- Save the upload ID from the response object that AmazonS3Client.initiateMultipartUpload() returns. You provide this upload ID for each part upload operation.
- Copy all of the parts. For each part you need to copy, create a new instance of the CopyPartRequest class. Provide the part information, including the source and destination bucket names, the source and destination object keys, the upload ID, the locations of the first and last bytes of the part, and the part number.
- Save the responses of the AmazonS3Client.copyPart() calls. Each response includes the ETag value and part number of the uploaded part. You need this information to complete the multipart upload.
- Call the AmazonS3Client.completeMultipartUpload() method to complete the copy operation.
import com.amazonaws.AmazonServiceException;
import com.amazonaws.SdkClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class LowLevelMultipartCopy {
    public static void main(String[] args) throws IOException {
        Regions clientRegion = Regions.DEFAULT_REGION;
        String sourceBucketName = "*** Source bucket name ***";
        String sourceObjectKey = "*** Source object key ***";
        String destBucketName = "*** Target bucket name ***";
        String destObjectKey = "*** Target object key ***";
        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withCredentials(new ProfileCredentialsProvider())
                    .withRegion(clientRegion)
                    .build();
            // Initiate the multipart upload.
            InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(destBucketName, destObjectKey);
            InitiateMultipartUploadResult initResult = s3Client.initiateMultipartUpload(initRequest);
            // Get the object size to track the end of the copy operation.
            GetObjectMetadataRequest metadataRequest = new GetObjectMetadataRequest(sourceBucketName, sourceObjectKey);
            ObjectMetadata metadataResult = s3Client.getObjectMetadata(metadataRequest);
            long objectSize = metadataResult.getContentLength();
            // Copy the object using 5 MB parts.
            long partSize = 5 * 1024 * 1024;
            long bytePosition = 0;
            int partNum = 1;
            List<CopyPartResult> copyResponses = new ArrayList<CopyPartResult>();
            while (bytePosition < objectSize) {
                // The last part might be smaller than partSize, so check to make sure
                // that lastByte isn't beyond the end of the object.
                long lastByte = Math.min(bytePosition + partSize - 1, objectSize - 1);
                // Copy this part.
                CopyPartRequest copyRequest = new CopyPartRequest()
                        .withSourceBucketName(sourceBucketName)
                        .withSourceKey(sourceObjectKey)
                        .withDestinationBucketName(destBucketName)
                        .withDestinationKey(destObjectKey)
                        .withUploadId(initResult.getUploadId())
                        .withFirstByte(bytePosition)
                        .withLastByte(lastByte)
                        .withPartNumber(partNum++);
                copyResponses.add(s3Client.copyPart(copyRequest));
                bytePosition += partSize;
            }
            // Complete the upload request to concatenate all uploaded parts and make the copied object available.
            CompleteMultipartUploadRequest completeRequest = new CompleteMultipartUploadRequest(
                    destBucketName,
                    destObjectKey,
                    initResult.getUploadId(),
                    getETags(copyResponses));
            s3Client.completeMultipartUpload(completeRequest);
            System.out.println("Multipart copy complete.");
        } catch (AmazonServiceException e) {
            // The call was transmitted successfully, but Amazon S3 couldn't process
            // it, so it returned an error response.
            e.printStackTrace();
        } catch (SdkClientException e) {
            // Amazon S3 couldn't be contacted for a response, or the client
            // couldn't parse the response from Amazon S3.
            e.printStackTrace();
        }
    }
    // This is a helper function to construct a list of ETags.
    private static List<PartETag> getETags(List<CopyPartResult> responses) {
        List<PartETag> etags = new ArrayList<PartETag>();
        for (CopyPartResult response : responses) {
            etags.add(new PartETag(response.getPartNumber(), response.getETag()));
        }
        return etags;
    }
}
For objects smaller than 5 GB, use a single-operation copy:
import com.amazonaws.AmazonServiceException;
import com.amazonaws.SdkClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import java.io.IOException;
public class CopyObjectSingleOperation {
    public static void main(String[] args) throws IOException {
        Regions clientRegion = Regions.DEFAULT_REGION;
        String bucketName = "*** Bucket name ***";
        String sourceKey = "*** Source object key *** ";
        String destinationKey = "*** Destination object key ***";
        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withCredentials(new ProfileCredentialsProvider())
                    .withRegion(clientRegion)
                    .build();
            // Copy the object into a new object in the same bucket.
            CopyObjectRequest copyObjRequest = new CopyObjectRequest(bucketName, sourceKey, bucketName, destinationKey);
            s3Client.copyObject(copyObjRequest);
        } catch (AmazonServiceException e) {
            // The call was transmitted successfully, but Amazon S3 couldn't process
            // it, so it returned an error response.
            e.printStackTrace();
        } catch (SdkClientException e) {
            // Amazon S3 couldn't be contacted for a response, or the client
            // couldn't parse the response from Amazon S3.
            e.printStackTrace();
        }
    }
}
- In my current project I use the multipart upload described above to upload files of up to 15 GB; it does have project-specific modifications (a bare-bones sketch of the low-level upload path follows below). aws docs
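For completeness, a minimal sketch of what a low-level multipart upload (as opposed to copy) can look like. The bucket, key, file path, and part size are placeholders, and production code would add retries and an abort path on failure:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
public class LowLevelMultipartUploadSketch {
    public static void main(String[] args) {
        // Placeholder bucket, key, file, and part size.
        String bucketName = "my-bucket";
        String keyName = "big-file.bin";
        File file = new File("/tmp/big-file.bin");
        long partSize = 100L * 1024 * 1024; // 100 MB parts, illustrative only
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        InitiateMultipartUploadResult init =
                s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucketName, keyName));
        List<PartETag> partETags = new ArrayList<PartETag>();
        long filePosition = 0;
        for (int partNum = 1; filePosition < file.length(); partNum++) {
            // The last part may be smaller than partSize.
            long curPartSize = Math.min(partSize, file.length() - filePosition);
            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName(bucketName)
                    .withKey(keyName)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNum)
                    .withFileOffset(filePosition)
                    .withFile(file)
                    .withPartSize(curPartSize);
            partETags.add(s3.uploadPart(uploadRequest).getPartETag());
            filePosition += curPartSize;
        }
        // Stitch the parts together into the final object.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucketName, keyName, init.getUploadId(), partETags));
    }
}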