How to convert a JDF file to a PDF (Removing text from a multi-encoded document)
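I am trying to convert a JDF file to a PDF file using C#.
Looking at the JDF format, I can see that the file is simply XML placed on top of a PDF document.
I tried using StreamWriter / StreamReader in C#, but because the PDF document also contains binary data and mixed line endings (\r\n and \n), the resulting file cannot be opened: some of the binary data in the PDF gets destroyed. Here is the code I tried, without success.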
using (StreamReader reader = new StreamReader(_jdf.FullName, Encoding.Default))
{
    using (StreamWriter writer = new StreamWriter(_pdf.FullName, false, Encoding.Default))
    {
        writer.NewLine = "\n"; //Tried without this and with \r\n
        bool IsStartOfPDF = false;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            if (line.IndexOf("%PDF-") != -1)
            {
                IsStartOfPDF = true;
            }
            if (!IsStartOfPDF)
            {
                continue;
            }
            writer.WriteLine(line);
        }
    }
}
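I am answering this question myself because it may be a fairly common problem, and the solution could be useful to others.
Because the file contains both binary data and text, we cannot simply use a StreamWriter to write the binary part back to another file. Even if you only use a StreamReader to read the file and a StreamWriter to write everything to another file unchanged, you will find differences between the two files.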
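To make that difference visible, here is a minimal sketch that copies a few bytes through StreamReader / StreamWriter and compares the result with the input. The class and method names (TextRoundTripDemo, RoundTripAsText) are made up for this illustration, and it assumes UTF-8 without a BOM rather than Encoding.Default, whose meaning depends on the runtime and machine.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class TextRoundTripDemo
{
    // Hypothetical helper: writes the bytes to a temp file, copies that file
    // line by line as text, and returns the bytes of the copy.
    public static byte[] RoundTripAsText(byte[] original, Encoding encoding)
    {
        string source = Path.GetTempFileName();
        string target = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(source, original);
            using (var reader = new StreamReader(source, encoding))
            using (var writer = new StreamWriter(target, false, encoding))
            {
                writer.NewLine = "\n";
                string line;
                // ReadLine drops the line terminator, so "\r\n" and "\n" look the same here
                while ((line = reader.ReadLine()) != null)
                {
                    writer.WriteLine(line);
                }
            }
            return File.ReadAllBytes(target);
        }
        finally
        {
            File.Delete(source);
            File.Delete(target);
        }
    }

    public static void Main()
    {
        // "%PDF", a CRLF line ending, three bytes that are not valid UTF-8, then LF
        byte[] original = { 0x25, 0x50, 0x44, 0x46, 0x0D, 0x0A, 0xFF, 0xFE, 0x80, 0x0A };
        byte[] copy = RoundTripAsText(original, new UTF8Encoding(false));

        Console.WriteLine("original:  " + BitConverter.ToString(original));
        Console.WriteLine("copy:      " + BitConverter.ToString(copy));
        Console.WriteLine("identical: " + original.SequenceEqual(copy));
    }
}
```

The copy should differ in two ways: the \r\n is rewritten as \n, and the bytes that cannot be decoded as text are replaced during decoding. Either change is enough to break the binary streams inside a PDF.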
Instead, you can use a BinaryReader to search the multipart document and a BinaryWriter to write each byte to the other document exactly as you found it.
//Using a BinaryReader/BinaryWriter because the file mixes text (XML) and binary (PDF) data
//Requires: using System.IO; using System.Text;
using (var reader = new BinaryReader(File.Open(_file.FullName, FileMode.Open)))
{
    using (var writer = new BinaryWriter(File.Open(tempFileName.FullName, FileMode.CreateNew)))
    {
        //We are searching for the start of the PDF
        bool searchingForStartOfPDF = true;
        //Compare raw bytes: ReadChar() decodes text (UTF-8 by default), so it can
        //consume several bytes at once or fail on data that is not valid text
        var startOfPDF = Encoding.ASCII.GetBytes("%PDF-");
        var length = reader.BaseStream.Length;
        //While we haven't reached the end of the stream
        while (reader.BaseStream.Position != length)
        {
            //If we are still searching for the start of the PDF
            if (searchingForStartOfPDF)
            {
                //Read the current byte
                var current = reader.ReadByte();
                //If it matches the first byte of the PDF signature
                if (current == startOfPDF[0])
                {
                    //Check the next few bytes to see if they match, remembering
                    //our current position in the stream in case they do not
                    var currBasePos = reader.BaseStream.Position;
                    for (var i = 1; i < startOfPDF.Length; i++)
                    {
                        //If we hit the end of the stream or a byte that isn't in the PDF signature,
                        //resume the while loop to start searching again from the next position
                        if (reader.BaseStream.Position == length || reader.ReadByte() != startOfPDF[i])
                        {
                            reader.BaseStream.Position = currBasePos;
                            break;
                        }
                        //If we've reached the end of the PDF signature then we've found a match
                        if (i == startOfPDF.Length - 1)
                        {
                            //Success
                            //Set the position back to the start of the PDF signature
                            searchingForStartOfPDF = false;
                            reader.BaseStream.Position -= startOfPDF.Length;
                            //We are no longer searching for the PDF signature, so
                            //the remaining bytes in the file will be written directly
                            //using the binary writer
                        }
                    }
                }
            }
            else
            {
                //We are writing the binary now
                writer.Write(reader.ReadByte());
            }
        }
    }
}
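This code sample uses a BinaryReader to scan the file sequentially. When it finds a match for the string %PDF- (the PDF start signature), it moves the reader position back to the % and then writes the rest of the document using writer.Write(reader.ReadByte()).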
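For files that fit comfortably in memory, the same idea can be sketched more compactly: load all bytes, find the first %PDF- signature, and write everything from that offset in a single call. This is an alternative sketch, not the code above; JdfPdfExtractor, IndexOf and TryExtractPdf are names invented for the illustration, and it assumes the first occurrence of %PDF- really is the start of the embedded PDF (it would be fooled if the XML part happened to contain that text).

```csharp
using System;
using System.IO;
using System.Text;

public static class JdfPdfExtractor
{
    // The PDF header signature as raw ASCII bytes
    private static readonly byte[] PdfSignature = Encoding.ASCII.GetBytes("%PDF-");

    // Returns the index of the first occurrence of needle in haystack, or -1 if absent.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length)
            {
                return i;
            }
        }
        return -1;
    }

    // Copies everything from the first PDF signature to the end of the file.
    // Returns false (and writes nothing) when no signature is found.
    public static bool TryExtractPdf(string jdfPath, string pdfPath)
    {
        byte[] data = File.ReadAllBytes(jdfPath);
        int start = IndexOf(data, PdfSignature);
        if (start < 0)
        {
            return false;
        }

        using (var output = new FileStream(pdfPath, FileMode.Create, FileAccess.Write))
        {
            // Bytes are written untouched: no decoding, no newline conversion
            output.Write(data, start, data.Length - start);
        }
        return true;
    }
}
```

It could be called as JdfPdfExtractor.TryExtractPdf("job.jdf", "job.pdf"), where both paths are placeholders. The trade-off is memory: the whole file is held in a byte array, whereas the reader/writer loop above streams it.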