On 5/26/2016 at approximately 8:30PM ET, the production authoring system became unavailable. The outage continued throughout the day on 5/27/2016. Although the system was restored on the evening of 5/27, user access remained disabled until complete testing could be conducted on 5/31/2016.
What happened? The graphs below show large spikes in CPU, disk I/O, and physical memory utilization. After exhausting these resources, the system became unresponsive and crashed. Subsequent restarts failed to stabilize the system and also quickly exhausted resources.
Why did it crash? Just before the initial crash, a large (355MB) content package was installed. This is a routine process that we have performed numerous times while preparing customer sites in AEM. The bulk of the content consisted of images and documents intended for folders within the digital asset manager (the DAM). Whenever digital assets are uploaded to the DAM, background jobs are kicked off to process them. This processing includes metadata extraction, the creation of smaller versions called renditions, and other actions, and it can be resource intensive. With a bulk upload of such assets, the resource requirements for processing are multiplied. In the past, the team has disabled asset processing workflows during content installation. However, the mechanisms previously used no longer disable these processes. This will be corrected before any further content package installations.
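One way to gauge how much asset processing a package might trigger before installing it is to inspect the package archive. The sketch below is illustrative only and is not part of the team's procedure. It assumes the content package is a ZIP archive whose repository content sits under jcr_root/, with DAM content under jcr_root/content/dam/. The function name summarize_dam_payload is hypothetical. It counts files rather than assets, since one asset may be stored as several files.

```python
import zipfile

# Assumed location of DAM content inside the package archive.
DAM_PREFIX = "jcr_root/content/dam/"


def summarize_dam_payload(package_path):
    """Return (file_count, total_uncompressed_bytes) for entries under DAM_PREFIX."""
    count = 0
    total = 0
    with zipfile.ZipFile(package_path) as pkg:
        for info in pkg.infolist():
            if info.is_dir():
                continue  # skip directory entries, count only files
            if info.filename.startswith(DAM_PREFIX):
                count += 1
                total += info.file_size  # uncompressed size in bytes
    return count, total
```

A large file count or byte total under the DAM path would suggest confirming that asset processing workflows are actually disabled before the package is installed.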
CPU usage was the first resource to escalate. The light blue indicates CPU used directly by the AEM application. The green indicates CPU used for reading from and writing to disk. The dark blue is other system CPU usage. This activity preceded spikes in other resources, such as physical memory. Physical memory was quickly and entirely consumed. The light red indicates that the monitoring software's 'circuit breaker' tripped, a mechanism it uses to reduce its own footprint on the system under test. The lower-right graph shows heavy network utilization well before the interruption, presumably from the upload of the offending package.
Not just batch processing: Although batch processing was the initial trigger, during testing the team observed that the processing triggered by uploading a single 12MB PDF (v1.4) was sufficient to crash the system, even though larger PDFs using more recent versions were processed successfully. There appears to be a relationship between the sub-asset processing of older-format PDFs and the problems experienced.
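To flag candidate files before upload, the PDF version can be read from the file header, which begins with a marker such as %PDF-1.4. The following is a minimal sketch, assuming the header version is a good enough indicator. The function names and the 1.4 threshold are illustrative: the threshold reflects only the single file observed during testing, and the header does not account for a version declared elsewhere in the file.

```python
import re

# A PDF file starts with a header such as b"%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(path):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    with open(path, "rb") as f:
        head = f.read(1024)  # the header sits at the start of the file
    match = _HEADER.search(head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_older_format(path, threshold=(1, 4)):
    """True if the header version is at or below the threshold (an assumed cutoff)."""
    version = pdf_header_version(path)
    return version is not None and version <= threshold
```

Files flagged this way could be held back or tested in a non-production environment until the PDF processing fix has been verified.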
As of 6/10/2016, Adobe had investigated and reproduced the problem. Adobe suggested a long-term fix for PDF processing, which is currently being studied. Given the severity of the problem, the team is being cautious about re-enabling PDF processing until thorough testing indicates the problem has been resolved. Adobe Engineering reported having 'done a few fixes in stability of pdf processing and all are included in SP2.' The suggested fix has been ported to the current version of the LSA AEM system.