Parse Outlook emails using the outlook-message-parser library
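I am trying to load emails from the INBOX of a remote mailbox and parse them to extract the attachments and convert the body to HTML.
I am using the code snippet below to parse them with the outlook-message-parser jar: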
ResultSuccess insertMessage(Message currentMsg) {
    final OutlookMessageParser msgp = new OutlookMessageParser();
    final OutlookMessage msg = msgp.parseMsg(currentMsg.getInputStream());
}
Here currentMsg is of type javax.mail.Message.
The code snippet that fetches the mail from the server is as follows:
Properties props = new Properties();
Message currentMessage;
Session session = Session.getInstance(props, null);
session.setDebug(debug);
store = session.getStore(PROTOCOL);
store.connect(host, username, password);
Message message[] = inboxfolder.getMessages();
Message copyMessage[] = new Message[1];
int n = message.length;
for (int j = 0; j < n; j++) {
    currentMessage = message[j];
    ResultSuccess result = insertMessage(currentMessage);
}
The exception details are as follows:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x615F3430305F2D2D, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:151)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:285)
at org.simplejavamail.outlookmessageparser.OutlookMessageParser.parseMsg(OutlookMessageParser.java:133)
at com.email.Email_Parse.loadMessages(Email_Parse.java:38)
at com.email.Email_Parse.getMessages(Email_Parse.java:116)
at com.email.Email_Parse.main(Email_Parse.java:26)
However, the problem does not occur when I load emails from the local disk and parse them.
Any idea how to solve this?
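I assume you have been using outlook-message-parser to parse emails stored on disk.
Mails retrieved from a mail server are not in the Outlook file format (even if the remote server is a Microsoft Exchange server or Microsoft's Outlook email service), so outlook-message-parser will not be able to parse them.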
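The exception message is consistent with this. POI prints the first eight bytes of the stream as a little-endian number, so the value can be turned back into text. The small sketch below (the class name `SignatureDecoder` is made up for illustration) decodes the value from the stack trace. The result looks like the start of a MIME boundary line, which suggests the stream holds a MIME message rather than an OLE2 `.msg` file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SignatureDecoder {

    // Converts the 64-bit signature reported by POI back into the eight bytes
    // as they appear in the stream (little-endian), read as ASCII text.
    public static String decode(long signature) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value taken from the "Invalid header signature; read 0x..." message above.
        System.out.println(decode(0x615F3430305F2D2DL)); // prints "--_004_a"
    }
}
```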
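You should use the JavaMail API to retrieve the mail body and its attachments.
This page describes the steps needed to read a mail with attachments (with a few examples). Here is an excerpt: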
Q: How do I read a message with an attachment and save the
attachment?
A: As described above, a message with an attachment is
represented in MIME as a multipart message. In the simple case, the
results of the Message object's getContent method will be a
MimeMultipart object. The first body part of the multipart object will
be the main text of the message. The other body parts will be
attachments. The msgshow.java demo program shows how to traverse all
the multipart objects in a message and extract the data of each of the
body parts. The getDisposition method will give you a hint as to
whether the body part should be displayed inline or should be
considered an attachment (but note that not all mailers provide this
information). So to save the contents of a body part in a file, use
the saveFile method of MimeBodyPart.
To save the data in a body part into a file (for example), use the
getInputStream method to access the attachment content and copy the
data to a FileOutputStream. Note that when copying the data you can
not use the available method to determine how much data is in the
attachment. Instead, you must read the data until EOF. The saveFile
method of MimeBodyPart will do this for you. However, you should not
use the results of the getFileName method directly to name the file to
be saved; doing so could cause you to overwrite files unintentionally,
including system files.
Note that there are also more complicated cases to be handled as well.
For example, some mailers send the main body as both plain text and
html. This will typically appear as a multipart/alternative content
(and a MimeMultipart object) in place of a simple text body part.
Also, messages that are digitally signed or encrypted are even more
complex. Handling all these cases can be challenging. Please refer to
the various MIME specifications and other resources listed on our main
page.
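The warning above about not using the result of getFileName directly can be handled in several ways. The following is only a minimal sketch of one of them, using nothing but the JDK: the class `SafeAttachmentName`, the method `safeTarget`, and the fallback name `attachment.bin` are all hypothetical names chosen for this example. It drops any directory components from the sender-supplied name, replaces unusual characters, and checks that the result stays inside the destination directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class SafeAttachmentName {

    // Maps an untrusted attachment name to a path directly inside destDir.
    public static Path safeTarget(Path destDir, String rawName) {
        String name = (rawName == null) ? "" : rawName.replace('\\', '/');
        // Keep only the last path component, so "../../etc/passwd" becomes "passwd".
        name = name.substring(name.lastIndexOf('/') + 1);
        // Replace everything except a conservative set of characters.
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "attachment.bin"; // hypothetical fallback for missing or unusable names
        }
        Path base = destDir.toAbsolutePath().normalize();
        Path target = base.resolve(name).normalize();
        // Defensive check: the resolved path must still be inside the destination directory.
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("Unsafe attachment name: " + rawName);
        }
        return target;
    }

    public static void main(String[] args) {
        Path dir = Paths.get("/tmp/messages");
        System.out.println(safeTarget(dir, "../../etc/passwd"));
        System.out.println(safeTarget(dir, "my report (1).pdf"));
        System.out.println(safeTarget(dir, null));
    }
}
```

This sketch does not prevent two attachments with the same name from overwriting each other; prefixing the name with a message and part index is one way to avoid that.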
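Emails are not always HTML; sometimes they are just plain text. Most of the time they are "multipart". For example, an email can have an HTML part, which is displayed by email clients that support HTML (Gmail, Thunderbird, ...), and a plain-text part that can be used by email clients that cannot display HTML (think of text-based email clients).
So before dumping the content of an email, you have to check its content type (or, if it has multiple parts, the content type of each part).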
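To illustrate the content-type check, here is a rough sketch of how a body could be chosen from a multipart structure. It deliberately does not use JavaMail so that it runs with the JDK alone: `MimeNode` is a hypothetical stand-in for `javax.mail.Part`, and `pickBody` is an invented helper. With JavaMail you would use `Part.isMimeType` and `getContent` instead. The sketch prefers an HTML part and falls back to plain text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

public class BodyPicker {

    // Hypothetical stand-in for javax.mail.Part: a content type plus text or child parts.
    public static final class MimeNode {
        final String contentType;
        final String text;             // null for multipart nodes
        final List<MimeNode> children; // empty for leaf nodes

        private MimeNode(String contentType, String text, List<MimeNode> children) {
            this.contentType = contentType;
            this.text = text;
            this.children = children;
        }

        public static MimeNode leaf(String contentType, String text) {
            return new MimeNode(contentType, text, Collections.emptyList());
        }

        public static MimeNode multipart(String contentType, MimeNode... children) {
            return new MimeNode(contentType, null, Arrays.asList(children));
        }

        boolean isType(String type) {
            return contentType.toLowerCase(Locale.ROOT).startsWith(type);
        }
    }

    // Returns the first HTML body found in the tree, otherwise the first plain-text body.
    public static Optional<String> pickBody(MimeNode root) {
        Optional<String> html = findFirst(root, "text/html");
        return html.isPresent() ? html : findFirst(root, "text/plain");
    }

    // Depth-first search for the first node whose content type starts with the given type.
    private static Optional<String> findFirst(MimeNode node, String type) {
        if (node.isType(type)) {
            return Optional.ofNullable(node.text);
        }
        for (MimeNode child : node.children) {
            Optional<String> hit = findFirst(child, type);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        MimeNode mail = MimeNode.multipart("multipart/alternative",
                MimeNode.leaf("text/plain; charset=utf-8", "Hello"),
                MimeNode.leaf("text/html; charset=utf-8", "<p>Hello</p>"));
        System.out.println(pickBody(mail).orElse("(no text body)")); // prints "<p>Hello</p>"
    }
}
```

Note that this simplification ignores the disposition of each part, so a text file sent as an attachment could be mistaken for the body.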
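For HTML parts, dumping the content verbatim may give you the desired result, depending on how images are referenced.
If images are referenced with http URLs (like <img src="https://example.com/a.png"/>), the result can be displayed in a browser without further work.
If images are referenced with Content-Id URLs (like <img src="cid:image002.gif@01D44EB0.904DB790"/>), you have to do extra work to display the result correctly in a browser.
You have to look for the right image among the parts of the email and decide how to include it in the final result.
For example, save it to disk and replace the reference in the HTML with its path on disk, so that <img src="cid:image002.gif@01D44EB0.904DB790"/> becomes <img src="/path/to/saved/images/imagexyz.png"/>.
Or convert it to base64 and replace the reference in the HTML with a data URI, so that <img src="cid:image002.gif@01D44EB0.904DB790"/> becomes <img src="data:image/gif;base64,..."/>.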
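The data URI variant could look roughly like the sketch below. It uses only the JDK, and everything in it is an assumption made for illustration: the class `CidInliner`, the holder `InlineImage`, and the idea that the caller has already collected the inline images into a map keyed by Content-ID. Since the Content-ID header value is normally wrapped in angle brackets, the caller would strip them before filling the map. The regular expression only handles double-quoted `src` attributes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CidInliner {

    // Hypothetical holder for one inline image: its MIME type and raw bytes.
    public static final class InlineImage {
        final String mimeType;
        final byte[] data;

        public InlineImage(String mimeType, byte[] data) {
            this.mimeType = mimeType;
            this.data = data;
        }
    }

    // Matches src="cid:..." and captures the Content-ID.
    private static final Pattern CID_SRC = Pattern.compile("src=\"cid:([^\"]+)\"");

    // Replaces every cid: reference that has a matching image with a base64 data URI.
    // References without a matching image are left unchanged.
    public static String inline(String html, Map<String, InlineImage> imagesByContentId) {
        Matcher m = CID_SRC.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            InlineImage img = imagesByContentId.get(m.group(1));
            String replacement = (img == null)
                    ? m.group()
                    : "src=\"data:" + img.mimeType + ";base64,"
                            + Base64.getEncoder().encodeToString(img.data) + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, InlineImage> images = new HashMap<>();
        // Placeholder bytes; a real caller would pass the bytes of the image part.
        images.put("image002.gif@01D44EB0.904DB790",
                new InlineImage("image/gif", "GIF".getBytes(StandardCharsets.US_ASCII)));
        String html = "<img src=\"cid:image002.gif@01D44EB0.904DB790\"/>";
        System.out.println(inline(html, images));
        // prints: <img src="data:image/gif;base64,R0lG"/>
    }
}
```

Data URIs keep the result in a single file, at the cost of a larger HTML document; saving the images to disk and rewriting the paths is the better fit when the images are large.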
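I don't know whether there is a Java library that does this automatically.
The JavaMail API website provides samples that you can read to learn how to use it. You can look at msgshow.java among the samples to see how to use the API to retrieve the content of a message.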
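Here is a simple sample program that downloads the last message from a Gmail inbox into a local directory (it may contain bugs; don't forget to enter your own account and password and to replace "/tmp/messages" with a valid directory on your machine).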
import javax.mail.*;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

public class MessageDownloader {
    private final File destDir;

    public MessageDownloader(File destDir) {
        this.destDir = destDir;
    }

    // Recursively walks a message (or one of its parts) and saves each leaf part to destDir.
    public void download(Part message, String basename) throws MessagingException, IOException {
        System.out.println("Type : " + message.getContentType());
        if (message.isMimeType("text/plain")) {
            downloadTextPart((String) message.getContent(), basename + ".txt");
        } else if (message.isMimeType("text/html")) {
            downloadTextPart((String) message.getContent(), basename + ".html");
        } else if (message.isMimeType("image/*") || Part.ATTACHMENT.equalsIgnoreCase(message.getDisposition())) {
            downloadDataPart(message, basename);
        } else if (message.isMimeType("multipart/*")) {
            downloadMultiPart((Multipart) message.getContent(), basename);
        } else {
            System.out.println("Unrecognized type");
        }
    }

    // Saves an image or attachment. The sender-supplied file name may be missing, and it is
    // reduced to its last path component so that it cannot point outside destDir.
    private void downloadDataPart(Part dataPart, String basename) throws IOException, MessagingException {
        String fileName = dataPart.getFileName();
        String safeName = (fileName == null) ? "unnamed" : new File(fileName).getName();
        File dataFile = new File(destDir, basename + "_" + safeName);
        try (InputStream in = dataPart.getInputStream()) {
            Files.copy(in, dataFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    private void downloadTextPart(String textContent, String filename) throws MessagingException, IOException {
        File textFile = new File(destDir, filename);
        Files.writeString(textFile.toPath(), textContent);
    }

    // Downloads every body part, appending the part index to the base name.
    private void downloadMultiPart(Multipart multiPartMessage, String basename) throws MessagingException, IOException {
        for (int partIdx = 0; partIdx < multiPartMessage.getCount(); partIdx++) {
            BodyPart part = multiPartMessage.getBodyPart(partIdx);
            download(part, String.format("%s_%d", basename, partIdx));
        }
    }

    public static void main(String[] args) throws MessagingException, IOException {
        Store store = getStore();
        Folder folder = store.getFolder("Inbox");
        folder.open(Folder.READ_ONLY);
        MessageDownloader msgDownloader = new MessageDownloader(new File("/tmp/messages"));
        // Message numbers start at 1, so the last message is number getMessageCount().
        Message lastMessage = folder.getMessage(folder.getMessageCount());
        msgDownloader.download(lastMessage, "last_message");
        folder.close(false); // false: do not expunge deleted messages
        store.close();
    }

    private static Store getStore() throws MessagingException {
        // The "imaps" protocol already uses SSL/TLS, so no SSL property is set here
        // (the SMTP property in the earlier version had no effect on IMAP).
        Properties props = new Properties();
        Session session = Session.getInstance(props, null);
        Store store = session.getStore("imaps");
        store.connect("imap.gmail.com", "account@gmail.com", "password");
        return store;
    }
}