Moving Legacy Archives to Office 365 Is Painful

Migration of legacy archives like Enterprise Vault is often left as the last part of the journey from on-premises servers to Office 365. A wide range of third-party migration tools is available, but many complex twists and turns await the unwary as they cope with journal reports, archive mailboxes, splitting and explosions, and the sheer amount of data to be migrated. It’s an interesting problem to solve.

The need for third-party archives such as Veritas Enterprise Vault was obvious a decade or so ago when Exchange 2003 depended on expensive SAN-quality storage and supported no compliance features apart from basic mailbox search. Reducing the demand for mailbox storage by offloading copies of messages to third-party archives, leaving small “stubs” behind, delivered a good solution. Companies could keep copies of email for as long as necessary and the third-party archives usually had much better compliance and eDiscovery features than existed inside Exchange.

Time marches on and software evolves. Exchange now uses low-cost storage and provides sufficient mailbox quota to allow users to keep email as long as necessary and the product sports a wide range of compliance features to help companies satisfy regulatory or legal retention requirements. Search is much better and eDiscovery, especially in Office 365, is as powerful as many specialized software products. In the new reality, there’s no surprise that companies are discarding third-party archives and want to move data to Office 365, where the data becomes more discoverable, compliant, and accessible to users.

Two years ago, J. Peter Bruzzese wrote about the pain of migrating archive data to Office 365. Things are a little easier today, not least because of two years’ additional experience of migration projects. The goal is to decommission an often-expensive third-party archive system by transferring the data contained in the archive to Exchange Online mailboxes. However, the migration of legacy archive data is usually left as one of the last things to do when companies move to Office 365, with complexity, cost, and lack of standard tool sets being amongst the common reasons.

Exchange archive concepts

Before plunging into the detail of why this situation exists, it’s important to understand what data exists and the techniques that can be used to move the data to Office 365. Here’s a summary:

Two types of information can be found in third-party archives. The first is archived copies of messages extracted from user mailboxes and replaced with shortcuts pointing to the archive. The shortcuts are also referred to as stubs and have been used since the introduction of the original Enterprise Vault product in 1998. Microsoft defines stubbing as when “a third-party archival product takes a large message and turns it into a smaller item, or stub. It typically does this by deleting the attachments and modifying the message body to be smaller.” The process of converting a stub to a message is called rehydration. Some migration companies believe that rehydration is a difficult process, but it can be straightforward depending on the selected toolset.
The second kind of data is journal reports. A journal report is a legally-defensible copy of a message that contains all its content and information about who received the message. Journal rules running inside Exchange direct copies of messages to a journal recipient (identified as an SMTP address). The journal recipient is usually a special mailbox in the third-party archive where the messages are captured and collected by a “harvester” process that inserts the messages into the repository for long-term storage. Since Exchange 2003, Exchange has used “envelope journaling” as the format for its journal reports, defined as, “the body of a journal report containing information from the original message such as the sender email address, message subject, message-ID, and recipient email addresses.” The original message is included unaltered as an attachment to the journal report.
Journal splitting is a technique whereby journal data are migrated to a set of target mailboxes. A recent note published by Bob Spurzem of Archive360 titled, “Are you moving Exchange journal data to Office 365,” explained that this means: “… create separate Office 365 mailboxes (e.g., Journal Mail 001, Journal Mail 002, Journal Mail 003) and move the journal data for groups of custodians into those.” Although not explicitly stated, I assume that the targets are shared mailboxes rather than personal because they are designated for use by “groups of custodians.” Shared mailboxes can be archive-enabled as long as those mailboxes are then assigned a suitable Office 365 license in the same way as personal mailboxes.
Journal explosion is a method that does not use shared mailboxes as the migration target. Instead, individual journal reports are reconstituted as items that are recreated in the mailboxes of users who received the original message.
From Exchange 2013 CU7 onwards (and Exchange Online from late 2014), fully-expanded distribution list and BCC recipient data is preserved in message headers to better support compliance searches. When older messages are moved to Office 365, some processing might be required to populate this data.
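To make the journal splitting idea in the summary above more concrete, here is a minimal sketch in Python. It is not how any of the migration products work internally. The function names, the fixed group size, the dictionary layout of a report, and the decision to store a report in every mailbox that covers one of its custodians are all assumptions made for illustration. Only the “Journal Mail 001” naming pattern comes from the Archive360 note.

```python
from collections import defaultdict


def assign_custodians(custodians, group_size):
    """Map each custodian address to a numbered journal mailbox name.

    Hypothetical helper: custodians are sorted and cut into fixed-size groups.
    """
    assignment = {}
    ordered = sorted({address.lower() for address in custodians})
    for index, address in enumerate(ordered):
        mailbox_number = index // group_size + 1
        assignment[address] = f"Journal Mail {mailbox_number:03d}"
    return assignment


def split_journal(reports, assignment):
    """Group journal report IDs by the target mailbox of each known custodian.

    Assumption: each report is a dict with 'message_id', 'sender', and
    'recipients' keys. Addresses without an assignment are ignored.
    """
    targets = defaultdict(list)
    for report in reports:
        participants = {report["sender"], *report["recipients"]}
        mailboxes = {
            assignment[p.lower()] for p in participants if p.lower() in assignment
        }
        for mailbox in sorted(mailboxes):
            targets[mailbox].append(report["message_id"])
    return dict(targets)


assignment = assign_custodians(
    ["joe@contoso.com", "ann@contoso.com", "bob@contoso.com"], group_size=2
)
reports = [
    {
        "message_id": "<1@contoso.com>",
        "sender": "ann@contoso.com",
        "recipients": ["joe@contoso.com", "ext@example.org"],
    }
]
result = split_journal(reports, assignment)
```

In this example, Ann and Bob land in the first journal mailbox and Joe in the second, so the single report is filed in both. A real project would pick the grouping rule (department, region, legal hold) with its legal team rather than sorting alphabetically.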

Specialist archive migration companies

The world of archiving and the migration of data to Office 365 from legacy archives is quite a specialized area, so it pays to seek advice from someone skilled in the art before making any decisions. Table 1 lists some of the companies active in the space together with their Enterprise Vault migration products (full disclosure: I am a non-executive director of Quadrotech). Of course, Enterprise Vault is not the only third-party archive that might hold legacy data suitable for migration to Office 365. EMC SourceOne, ZANTAZ EAS, HP Autonomy, and Daegis AXS-One are other examples.

Company	Product
Quadrotech	Archive Shuttle
TransVault	Migrator for Enterprise Vault
Dell	Migration Manager for Email Archives
Archive360	Email Archive Migration for Enterprise Vault

Table 1: Companies active in the archive migration market

Only the Quadrotech and TransVault products are certified by Veritas to ensure that the data extracted from Enterprise Vault pass the acid test of being identical to what was archived, something that is critical when proving a legal chain of custody for email. Not being certified by Veritas does not mean that other products do not extract data in a way that preserves all its characteristics intact. It simply means that they have not been through an independent certification procedure to verify that this is so.

After you have defined your needs and understand the amount and type of data to be migrated, you can review the migration products that are available to ensure that you find a solution that matches your requirements. For example, because journal reports often serve as proof that email was sent, it’s important that legacy journal data is extracted from the archive and then imported into Office 365 in a legally defensible manner that preserves the chain of custody.

If you have a small number of user archives to deal with, it is relatively straightforward to export user archives to PST files and ingest them into Office 365 using the Import service. The problem is that PST export and ingestion can be a tedious process, especially at the level of multi-tens of terabyte volumes often seen in enterprise archives. A wide range of commercial products exist that specialize in the tasks of locating, sanitizing, deduplicating, and processing PST imports. These products automate the workflow required to process the extracted PSTs and greatly reduce the overall time required to move data to Office 365.

The problem with journals and Exchange Online

Extracting legacy journal data from third-party archives and importing that data into Office 365 requires a lot more care and attention than is necessary for mail archives. Unlike in on-premises organizations, Microsoft does not allow Exchange Online mailboxes to be used as journal recipients and explicitly prohibits the use of transport rules, journal rules, auto-forwarding, or other methods to move information into a mailbox from multiple sources for archiving purposes.

In fact, switching journals to Office 365 is a more complicated affair than simply moving data to a new repository. It’s a fundamental switch from using a separate archive populated through journaling as the basis for retention of email for compliance purposes to embrace the concept that messages remain in large user mailboxes instead of using a separate archive. Management of email that has to be retained is done through retention policies and litigation or in-place holds. Microsoft provides large mailboxes (50 GB basic quota with 100 GB archives) to enable the data to be stored. It all sounds very logical.

Journal splitting and Exchange Online

That is, until you consider some of the restrictions that exist. Techniques such as journal splitting are used all the time to migrate legacy archive data to Office 365, but some doubt exists as to whether Microsoft is happy with what’s happening. In the Exchange Online Archiving Service Description, Microsoft says, “A user’s archive mailbox is intended for just that user. Microsoft reserves the right to deny unlimited archiving in instances where a user’s archive mailbox is used to store archive data for other users.” They also explicitly point to the archive’s 100 GB quota as being “large enough to accommodate and enforce reasonable use, including the import of one user’s historical email.”

Their documentation makes it clear that Microsoft does not want Exchange Online mailboxes to store anything other than a user’s personal email. Even if the practice is tolerated and facilitated by Microsoft in order to complete customer migrations to Office 365, stuffing a heap of legacy information retrieved from a third-party archive into Exchange Online is technically forbidden. The exception is when data is extracted from the third-party archive and split into sets for each user before being moved to individual mailboxes. In other words, you can extract data for Joe’s mailbox from the third-party archive and import that data into Joe’s Exchange Online mailbox because the data belongs to Joe. This technique is journal explosion.

No sane Microsoft sales representative is going to tell a customer that they can’t decommission their expensive third-party archive system and move that data into Office 365, so it’s good advice to discuss the migration path with your Microsoft sales representative before committing to a final plan. Bright account managers will always find a way to make migrations like this possible. Nevertheless, to make things simpler all round, it would be good if Microsoft created and communicated a definitive set of guidelines for the migration of legacy archive data to Office 365 with which all the migration vendors could comply. Until that happens, migration vendors, customers, and Microsoft account teams will operate in an unsatisfactory murky area.

Journal explosion and Exchange Online

A journal report is a single copy of a message and it is usually sufficient to use journal splitting to transfer those items to shared mailboxes. If you decide that copies of messages need to be imported into user mailboxes, the journal explosion approach can be used to expand a journal report into individual copies of the P2-version of the message for all recipients. TransVault’s Compliance TimeMachine product is an example of a migration product that uses journal explosion. TransVault explains the process as follows:

“Individual journal messages will be analyzed to retrieve the relevant compliance information
All valid recipients (….) will be identified
Each valid recipient will then receive the relevant copy of that journal message in their Office 365 inbox in a format that Office 365 recognises – with all the relevant compliance information retained.”
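The three steps in that description can be sketched in a few lines of Python. This is only an illustration of the fan-out idea and not TransVault’s implementation. The `JournalReport` class, the `explode` function, and the notion of a `valid_mailboxes` set are hypothetical names, and the sketch assumes that the envelope recipients have already been parsed out of the journal report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalReport:
    """Hypothetical, simplified view of an envelope journal report."""

    message_id: str
    sender: str
    recipients: tuple  # envelope recipients, including expanded list and BCC members
    original: bytes    # the unaltered original message carried as the attachment


def explode(report, valid_mailboxes):
    """Return one (mailbox, original message) pair per valid recipient.

    Recipients without a target mailbox (external addresses, leavers with no
    mailbox) are skipped, and duplicates are collapsed case-insensitively.
    """
    seen = set()
    copies = []
    for address in report.recipients:
        key = address.lower()
        if key in valid_mailboxes and key not in seen:
            seen.add(key)
            copies.append((key, report.original))
    return copies


report = JournalReport(
    message_id="<1@contoso.com>",
    sender="ann@contoso.com",
    recipients=("Joe@contoso.com", "ext@example.org", "joe@contoso.com"),
    original=b"Subject: Q3 numbers\r\n\r\nSee attached.",
)
copies = explode(report, valid_mailboxes={"joe@contoso.com", "ann@contoso.com"})
```

Here only Joe receives a copy: the external address has no target mailbox and the second spelling of Joe’s address is treated as a duplicate. What counts as a valid recipient (for example, whether leavers map to inactive mailboxes) is a policy decision that the sketch leaves to the caller.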

In effect, journal explosion recreates messages in user mailboxes in the same way that Exchange delivers separate copies to each recipient when a new message is sent. Care is taken to ensure that journaled items are properly handled when expanded, even when they are addressed to people who have since left the company. The net effect is that when you perform an eDiscovery search and find a message in Joe’s mailbox, it is good evidence that Joe either sent or received that message. That’s correct, but an equally convincing argument can be made that it is more sensible to use journal splitting and move the legacy data into a set of mailboxes without attempting to expand them into separate copies. This approach minimizes the storage need and relies on the integrity of the journal report as a legally defensible item that continues to prove that a message was sent.

TransVault recognizes that their method ends up with “a shedload more data than you started with” but says that storage is cheap and that “this is Microsoft’s problem, and not yours.” Expansion can lead to a huge increase in storage requirements. For instance, take a 50 TB archive that you want to move to Office 365 and assume that an average of four recipients exist for each journal report. When expansion occurs, you might need 200 TB of mailbox storage to absorb the “exploded” reports. It’s doubtful that the Office 365 administrators appreciate such a laissez-faire attitude to their storage.
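The arithmetic behind that estimate is simple enough to write down. The following is a back-of-the-envelope sketch only. The function name is made up, and it assumes that every exploded copy is as large as the original journal report and that no deduplication or compression applies on either side.

```python
def exploded_storage_tb(archive_tb, avg_recipients_per_report):
    """Rough storage needed after journal explosion, in terabytes.

    Assumption: each recipient receives a full-size copy of every report.
    """
    return archive_tb * avg_recipients_per_report


estimate = exploded_storage_tb(50, 4)
print(estimate)  # 200
```

Running the same calculation with the average recipient count measured from a sample of your own journal reports gives a more useful planning figure than the assumed value of four.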

In addition, the fan-out of messages to individual mailboxes is not going to be a fast operation and might be throttled by the controls Office 365 has in place to prevent background processing absorbing too many resources.

Important processing exceptions

Whether you decide to use journal splitting or journal explosion, some additional complexities that affect the choice of migration tactic include what to do with journal reports that reference people who have since left the organization. A variation on the theme includes the situation where an SMTP address previously used by an ex-employee has been reassigned to a new employee. And if you migrate user archives, an important issue to consider is how to deal with the “stubs” that point to old archive locations. These stubs need to be replaced with links to the new locations in a seamless manner.

Should Azure be used instead of Office 365?

The Archive360 note referenced above makes a case for moving legacy archive data to “cool” Azure Blob storage instead of Office 365 and offers a justification that “Office 365 has a hard storage limit of 512 names per distribution list. If your company has a distribution list with greater than 512 names (e.g., “All Company”), it will be truncated – losing important data.” To address the problem, Archive360 proposes a product (Archive2Azure), saying that “even if the “All Company” distribution list contains 10,000 names, it will be 100% preserved with Archive2Azure.”

If such a limitation existed, it would be a good reason to consider Azure rather than Office 365, but it does not. TechNet explains that, “Up to 10,000 members of a distribution group is preserved,” which has been the case since Microsoft started to preserve information about all recipients in message headers in late 2014. It is true that if a message has more than 10,000 recipients, the expansion of distribution list and BCC addressees into the message header will surpass the capacity of Exchange to capture the data. This might or might not be a problem. For example, if the affected messages are “All Company” bulletins that don’t contain information of interest for compliance purposes, then not being able to fully expand the distribution might not be a problem. On the other hand, if a conspirator sent a message about an illegal activity to 10,000 co-conspirators that is later uncovered by an eDiscovery search, it might.
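If you want to know whether the limit matters for your own archive, one way is to count the expanded recipients for each journal report before migration and flag the outliers. The sketch below assumes that you already have such counts keyed by message ID. The function name and the input format are hypothetical, and the only figure taken from the text is the 10,000-member limit.

```python
EXPANSION_LIMIT = 10_000  # the figure quoted from TechNet in the text above


def over_limit(recipient_counts, limit=EXPANSION_LIMIT):
    """Return the messages whose expanded recipient count exceeds the limit.

    Assumption: recipient_counts maps a message ID to its number of
    fully expanded recipients (distribution list members plus BCC).
    """
    return {
        message_id: count
        for message_id, count in recipient_counts.items()
        if count > limit
    }


flagged = over_limit({"<a@contoso.com>": 512, "<b@contoso.com>": 10_000, "<c@contoso.com>": 10_001})
```

Only the third message is flagged here. A short list of flagged items can then be reviewed to decide whether they are harmless bulletins or messages with compliance value.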

I am equally unconvinced about the truth of Archive360’s claim that Office 365 tenants should be afraid that Microsoft will start to charge for inactive mailboxes. Although this notion helps to support the idea of migrating legacy archive content to Azure, it’s simply FUD at its finest. Microsoft has shown no sign of wanting to charge for inactive mailboxes because they want data to reside inside Office 365, including that which belongs to employees who leave a company. Given the sheer volume of storage that Microsoft has deployed in its 12 Office 365 datacenter regions and the rapid reduction in procurement and operational costs for that storage, removing the free status of inactive mailboxes is not at the top of their agenda.

I am not against the notion of selling Azure as a destination for legacy journal data, if that’s what you want to use. However, moving the data to Azure Blob storage (at some additional expense for the storage) means that you don’t have a single set of data inside Office 365 that can be managed, indexed, and searched using common interfaces. It also means that a separate set of compliance functionality spanning everything from litigation holds to comprehensive large-scale search capabilities has to be built to operate against the Azure-based repository. These features already exist inside Office 365 and Microsoft is continuing to invest in its ability to satisfy compliance requirements for even the most demanding of circumstances (like SEC Rule 17a-4), so I’m not persuaded that the Azure option is a good one.

For instance, a complex eDiscovery case that goes back multiple years becomes significantly more complicated to process if the legacy data is spread across two repositories. On the other hand, if the legacy archive data is moved into Office 365, tools like Advanced eDiscovery (based on the Equivio technology acquired in January 2015) can process searches that extend over tens of millions of documents.

Moving legacy email archive data into Office 365 is complicated

The complexities described here prove that moving legacy journal data from a third-party archive to Office 365 is not a project that can happen overnight. Careful planning and a great deal of research are needed to identify your requirements (like how much data actually needs to be moved), the most appropriate method, edge conditions (like people who don’t work for the company any more), and the right destination.

The best solution might be to migrate legacy mail archives as that’s a relatively straightforward process and leave the legacy journal reports on the archive server. This approach avoids all of the complications that can arise during journal migration with the notable downside of having to perform two separate searches should the need arise during an eDiscovery project. On the other hand, it would then be possible to reduce the operating expenses by consolidating the legacy archive servers down to the minimum necessary to support the journal reports and your migration will be quicker and simpler. Later on, after the usefulness of the legacy archive expires, the legacy servers can be fully decommissioned and all of the data needed for compliance purposes will reside inside Office 365. Depending on the industry and applicable regulations, that period might be five to seven years – or even longer.

Above all, whatever plan is adopted for the migration of legacy archives, be sure to involve your legal team, just to make sure that they’re happy with what you plan to do and that all of the migrated data is useful when it reaches its new home.

Follow Tony on Twitter @12Knocksinna.


Tony Redmond Petri Contributor

Tony Redmond has written thousands of articles about Microsoft technology since 1996. He covers Office 365 and associated technologies for Petri.com and is also the lead author for the Office 365 for IT Pros eBook, updated monthly to keep pace with c...