PDF/TIFF Document Text Detection gcsDestinationBucketName
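I am using the Google Cloud Vision API to convert a PDF to a text file.
I got the initial code help from there. Image-to-text conversion works fine, and I obtained the JSON key by registering and activating.
This is the code I got for PDF-to-text conversion: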
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Google.Cloud.Storage.V1;
using Google.Cloud.Vision.V1;
using Google.Protobuf;

private static object DetectDocument(string gcsSourceUri,
    string gcsDestinationBucketName, string gcsDestinationPrefixName)
{
    var client = ImageAnnotatorClient.Create();
    var asyncRequest = new AsyncAnnotateFileRequest
    {
        InputConfig = new InputConfig
        {
            GcsSource = new GcsSource
            {
                Uri = gcsSourceUri
            },
            // Supported mime_types are: 'application/pdf' and 'image/tiff'
            MimeType = "application/pdf"
        },
        OutputConfig = new OutputConfig
        {
            // How many pages should be grouped into each json output file.
            BatchSize = 2,
            GcsDestination = new GcsDestination
            {
                Uri = $"gs://{gcsDestinationBucketName}/{gcsDestinationPrefixName}"
            }
        }
    };
    asyncRequest.Features.Add(new Feature
    {
        Type = Feature.Types.Type.DocumentTextDetection
    });
    List<AsyncAnnotateFileRequest> requests =
        new List<AsyncAnnotateFileRequest>();
    requests.Add(asyncRequest);
    var operation = client.AsyncBatchAnnotateFiles(requests);
    Console.WriteLine("Waiting for the operation to finish");
    operation.PollUntilCompleted();
    // Once the request has completed and the output has been
    // written to GCS, we can list all the output files.
    var storageClient = StorageClient.Create();
    // List objects with the given prefix.
    var blobList = storageClient.ListObjects(gcsDestinationBucketName,
        gcsDestinationPrefixName);
    Console.WriteLine("Output files:");
    foreach (var blob in blobList)
    {
        Console.WriteLine(blob.Name);
    }
    // Process the first output file from GCS.
    // Select the first JSON file from the objects in the list.
    var output = blobList.Where(x => x.Name.Contains(".json")).First();
    var jsonString = "";
    using (var stream = new MemoryStream())
    {
        storageClient.DownloadObject(output, stream);
        jsonString = System.Text.Encoding.UTF8.GetString(stream.ToArray());
    }
    var response = JsonParser.Default
        .Parse<AnnotateFileResponse>(jsonString);
    // The actual response for the first page of the input file.
    var firstPageResponses = response.Responses[0];
    var annotation = firstPageResponses.FullTextAnnotation;
    // Here we print the full text from the first page.
    // The response contains more information:
    // annotation/pages/blocks/paragraphs/words/symbols
    // including confidence scores and bounding boxes
    Console.WriteLine($"Full text: \n {annotation.Text}");
    return 0;
}
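This function takes 3 parameters:
string gcsSourceUri,
string gcsDestinationBucketName,
string gcsDestinationPrefixName
I don't understand which values I should set for these 3 parameters.
I have never worked with a third-party API before, so this is a bit confusing for me.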
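Suppose you own a GCS bucket named 'giri_bucket' and have put a PDF at the root of the bucket as 'test.pdf'. If you want to write the results of the operation to the same bucket, you can set the parameters to
- gcsSourceUri: 'gs://giri_bucket/test.pdf'
- gcsDestinationBucketName: 'giri_bucket'
- gcsDestinationPrefixName: 'async_test'
When the operation completes, there will be 1 or more output files in your GCS bucket at giri_bucket/async_test.
If you want, you can even write the output to a different bucket. You just need to make sure your gcsDestinationBucketName + gcsDestinationPrefixName is unique.
You can read more about the request format in the docs: AsyncAnnotateFileRequest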
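As a rough illustration of how those three values fit together, here is a minimal C# sketch. It does not call the Vision API: `GcsArguments`, `ObjectUri`, and `ExpectedOutputFiles` are hypothetical helpers written for this example, the 5-page document is made up, and the file-count estimate assumes each JSON output file holds at most `BatchSize` pages, as the comment in the question's code suggests.

```csharp
using System;

// Hypothetical helpers for illustration only; they are not part of any Google library.
public static class GcsArguments
{
    // Builds a gs:// URI from a bucket name and an object name (or prefix).
    public static string ObjectUri(string bucket, string objectName)
    {
        return $"gs://{bucket}/{objectName}";
    }

    // Estimates the number of JSON output files, assuming (not confirmed here)
    // that each file holds up to batchSize pages.
    public static int ExpectedOutputFiles(int pageCount, int batchSize)
    {
        if (pageCount < 0) throw new ArgumentOutOfRangeException(nameof(pageCount));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return (pageCount + batchSize - 1) / batchSize; // integer ceiling division
    }
}

public static class Program
{
    public static void Main()
    {
        // Values taken from the answer above.
        string bucket = "giri_bucket";
        string prefix = "async_test";

        // First argument: the full gs:// URI of the input PDF.
        string gcsSourceUri = GcsArguments.ObjectUri(bucket, "test.pdf");

        // The function in the question builds the destination URI the same way
        // from the second and third arguments.
        string destinationUri = GcsArguments.ObjectUri(bucket, prefix);

        Console.WriteLine(gcsSourceUri);   // prints "gs://giri_bucket/test.pdf"
        Console.WriteLine(destinationUri); // prints "gs://giri_bucket/async_test"

        // A made-up 5-page PDF with BatchSize = 2 would give 3 output files.
        Console.WriteLine(GcsArguments.ExpectedOutputFiles(5, 2)); // prints "3"
    }
}
```

With these values, the call in the question would be `DetectDocument("gs://giri_bucket/test.pdf", "giri_bucket", "async_test")`. Note that only the first argument is a full `gs://` URI; the other two are a plain bucket name and a plain prefix.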