Extracting From and To addresses from the body of an email with reply chains (JavaMail API)

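I am trying to extract content from the Enron dataset. I thought I would use the JavaMail API because it makes parsing easy. However, I am new to JavaMail and have been working from a few references I found online.

I am able to create a MimeMessage object from the file and extract the various header fields. Calling getContent() on that object gives me the text of the body.

What I want to do is extract the From and To addresses that appear inside the body. I do not know how to do that.

I read the following about creating a Multipart object and extracting from it: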

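  1. Use javax.mail.Message.getContent() to get the content of the message. This should return the content of the whole message in an object of type javax.mail.Multipart.

  2. Use the methods on javax.mail.Multipart to retrieve a particular part of the message. This should be encapsulated in an object of type javax.mail.BodyPart.

  3. Use the methods on javax.mail.BodyPart to retrieve the specific portion of the message you are interested in.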

The MIME type specified in my case is not multipart. When I try the approach above anyway, I get: "Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to javax.mail.Message"

What should I do?


Below are the contents of the file I am trying to parse.

Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
From: heather.dunton@enron.com
To: k..allen@enron.com
Subject: RE: West Position
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Dunton, Heather </O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON>
X-To: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\Inbox
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst


Please let me know if you still need Curve Shift.

Thanks,
Heather
 -----Original Message-----
From:   Allen, Phillip K.  
Sent:   Friday, December 07, 2001 5:14 AM
To: Dunton, Heather
Subject:    RE: West Position

Heather,

Did you attach the file to this email?

 -----Original Message-----
From:   Dunton, Heather  
Sent:   Wednesday, December 05, 2001 1:43 PM
To: Allen, Phillip K.; Belden, Tim
Subject:    FW: West Position

Attached is the Delta position for 1/16, 1/30, 6/19, 7/13, 9/21


 -----Original Message-----
From:   Allen, Phillip K.  
Sent:   Wednesday, December 05, 2001 6:41 AM
To: Dunton, Heather
Subject:    RE: West Position

Heather,

This is exactly what we need.  Would it possible to add the prior day for each of the dates below to the pivot table.  In order to validate the curve shift on the dates below we also need the prior days ending positions.

Thank you,

Phillip Allen

 -----Original Message-----
From:   Dunton, Heather  
Sent:   Tuesday, December 04, 2001 3:12 PM
To: Belden, Tim; Allen, Phillip K.
Cc: Driscoll, Michael M.
Subject:    West Position


Attached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24



 << File: west_delta_pos.xls >> 

Let me know if you have any questions.


Heather

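This is the code I am using:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Arrays;
import java.util.Properties;
import javax.mail.Address;
import javax.mail.BodyPart;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;

private void mailParser() throws IOException, MessagingException {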
    File mailFiles = new File("/xxx/xx/xx/x/x/inbox/1");
    String host = "host.com";
    Properties properties = System.getProperties();

    properties.setProperty("mail.smtp.host", host);
    Session session = Session.getDefaultInstance(properties);

    MimeMessage email = null;
    // try-with-resources closes the stream once the message has been parsed
    try (FileInputStream fis = new FileInputStream(mailFiles)) {
        email = new MimeMessage(session, fis);

        //Message ID
        System.out.println("message id: " + email.getMessageID());

        //Date
        System.out.println("sent date : " + email.getSentDate());

        //From
        Address[] add = email.getFrom();
        if (add != null) {
            for (int i = 0; i < add.length; i++) {
                System.out.println("FROM  : " + add[i].toString());
            }
        }

        //Subject
        System.out.println("\nsubject: " + email.getSubject());

        //TO (print the recipient list once rather than once per address)
        Address[] to = email.getRecipients(Message.RecipientType.TO);
        if (to != null) {
            System.out.println("\nrecipients to: " + Arrays.asList(to));
        }

        //CC
        Address[] cc = email.getRecipients(Message.RecipientType.CC);
        if (cc != null) {
            System.out.println("\nrecipients cc: " + Arrays.asList(cc));
        }

        //BCC
        Address[] bcc = email.getRecipients(Message.RecipientType.BCC);
        if (bcc != null) {
            System.out.println("\nrecipients bcc: " + Arrays.asList(bcc));
        }

        //Content type
        System.out.println("content type: " + email.getContentType());

        //Content Encoding
        System.out.println("encoding: " + email.getEncoding());

        //Content of email
        // NOTE: this cast is the line that throws the ClassCastException described above.
        // For a text/plain message, getContent() returns a String, not a Message.
        Message message = (Message) email.getContent();

        if (message instanceof MimeMessage) {
            MimeMessage m = (MimeMessage) message;
            Object contentObject = m.getContent();
            if (contentObject instanceof Multipart) {
                BodyPart clearTextPart = null;
                Multipart content = (Multipart) contentObject;
                // Take the first body part, if there is one
                if (content.getCount() > 0) {
                    clearTextPart = content.getBodyPart(0);
                }

                if (clearTextPart != null) {
                    String result = (String) clearTextPart.getContent();
                    System.out.println(result);
                }
            }
        }

        System.out.println("Content of email" + email.getContent().toString());
    } catch (MessagingException e) {
        throw new IllegalStateException("illegal state issue", e);
    } catch (FileNotFoundException e) {
        throw new IllegalStateException("file not found: " + mailFiles.getAbsolutePath(), e);
    }
}

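What you are seeing is a reply to a reply, in which the text of the original message and some of its header information are included as new text in the reply. As far as MIME is concerned, the original message's text appears in the reply exactly as if you had typed it yourself, like any other part of the reply text. The "Original Message" separator is not something MIME knows about. The top-level message is just a plain text message, not a multipart message, and it has no MIME structure.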
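Because the message is text/plain, getContent() hands back a java.lang.String, which is why the cast to Message fails. A minimal sketch of guarding the cast follows. The class and method names (ContentTypeCheck, asPlainText) are made up for illustration, and the argument is assumed to be whatever email.getContent() returned. The method takes a plain Object so the sketch runs without JavaMail on the classpath.

```java
class ContentTypeCheck {
    /**
     * Returns the body text when the content is a String (the text/plain case),
     * or null for anything else. With JavaMail available, the "anything else"
     * branch is where a javax.mail.Multipart would be walked part by part.
     */
    static String asPlainText(Object content) {
        if (content instanceof String) {
            return (String) content;
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for: Object content = email.getContent();
        Object content = "Please let me know if you still need Curve Shift.";
        System.out.println(asPlainText(content));
    }
}
```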

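Because JavaMail parses only the MIME structure of a message, it does no special processing of the message content. I'm afraid you're pretty much on your own to parse the message content and extract the included/replied-to message text.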
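One possible shape for that hand-rolled parsing is sketched below. It is not part of JavaMail. The class name ReplyChainParser and both regular expressions are assumptions based only on the sample message above: each quoted message is taken to start with a "-----Original Message-----" line, followed by a block of From/Sent/To/Cc/Subject lines that ends at the first line that is not one of those fields. Other mail clients format quoted replies differently, so treat this as a starting point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplyChainParser {
    // Separator line inserted before each quoted message (assumed format)
    private static final Pattern SEPARATOR =
            Pattern.compile("(?m)^[ \\t]*-----Original Message-----[ \\t]*$");
    // A pseudo-header line such as "From:   Allen, Phillip K."
    private static final Pattern FIELD =
            Pattern.compile("^(From|Sent|To|Cc|Subject):\\s*(.*?)\\s*$");

    /** Returns one field map per quoted message, newest first. */
    static List<Map<String, String>> parse(String body) {
        List<Map<String, String>> quoted = new ArrayList<>();
        String[] chunks = SEPARATOR.split(body);
        // chunks[0] is the newest reply text; it has no pseudo-headers of its own
        for (int i = 1; i < chunks.length; i++) {
            Map<String, String> fields = new LinkedHashMap<>();
            for (String line : chunks[i].split("\\r?\\n")) {
                Matcher m = FIELD.matcher(line);
                if (m.matches()) {
                    fields.putIfAbsent(m.group(1), m.group(2));
                } else if (!fields.isEmpty()) {
                    break; // first non-field line after the field block ends it
                }
            }
            quoted.add(fields);
        }
        return quoted;
    }

    public static void main(String[] args) {
        String body = "Thanks,\nHeather\n -----Original Message-----\n"
                + "From:   Allen, Phillip K.  \nTo: Dunton, Heather\n\nHeather,\n";
        for (Map<String, String> fields : parse(body)) {
            System.out.println(fields.get("From") + " -> " + fields.get("To"));
        }
    }
}
```

The body string passed to parse would be the String returned by getContent() for the top-level text/plain message.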

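You'll also notice that the From and To values in the body are just names, not email addresses, and are not in RFC 2822 format at all. The date format is not RFC 2822 either. The mail reader (most likely Outlook) simply included the text of the original message in the reply in a "human readable format" for convenience.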
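Since these values are display text rather than RFC 2822 headers, JavaMail's address and date parsing is not a good fit for them. The sketch below uses only the standard library. The class name QuotedHeaderValues, the semicolon-separated name list, and the date pattern are all assumptions read off the sample message (for example "Allen, Phillip K.; Belden, Tim" and "Friday, December 07, 2001 5:14 AM"); other messages in the dataset may use different layouts.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class QuotedHeaderValues {
    // Matches dates such as "Friday, December 07, 2001 5:14 AM" (assumed display format)
    private static final DateTimeFormatter SENT_FORMAT =
            DateTimeFormatter.ofPattern("EEEE, MMMM dd, yyyy h:mm a", Locale.ENGLISH);

    /** Splits "Last, First; Last, First" into individual display names. */
    static List<String> splitNames(String value) {
        List<String> names = new ArrayList<>();
        // Names contain commas themselves, so the separator is the semicolon
        for (String name : value.split(";")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** Parses the value of a quoted "Sent:" line; no time zone is present in the text. */
    static LocalDateTime parseSent(String value) {
        return LocalDateTime.parse(value.trim(), SENT_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(splitNames("Allen, Phillip K.; Belden, Tim"));
        System.out.println(parseSent("Friday, December 07, 2001 5:14 AM"));
    }
}
```

Mapping a display name such as "Allen, Phillip K." back to an address like the one in the real To header would need a separate lookup, for instance one built from the X-From and X-To headers across the dataset; that step is not shown here.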