This is a 7-day memory graph covering the week of the outages. The vertical red lines indicate when the system was down; at those times, memory usage spiked.

 

Timeline of Maintenance & Outages

2/17/2017 Removed old packages from CRX Package Manager

2/22/2017 OTC prod author (version purge failed)

3/1/2017 Ran :load OAK-5193-fix.groovy with author offline

3/3/2017 AEM 6.1 SP2 + Oak 1.2.23 

3/9/2017 Author went down 

3/15/2017 Long-running query generated excessive logs; restarted author

3/20/2017 Author went down

3/22/2017 Author went down. Traversal limit placed on queries.

We have studied these and similar events closely with Adobe to identify the cause of the outages. Unfortunately, identifying the root cause of each such event can be complicated. In some cases, the logs and other files told us what was happening. The analysis below is drawn from the events for which there was enough information to reach conclusions.

 

Issue: AEM Author was not responding
Why? The CPU was 100% busy.
Why? The memory was exhausted.
Why? For at least two of the outages, queries were consuming excessive resources.
Why? The queries were traversing all of the data nodes.
Why? Either the system did not have the indexes needed to service the queries, or the queries themselves were improper. A hypothetical illustration of both cases follows.
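To make this concrete, here is a simplified illustration using a hypothetical property name ("status"); these are not the actual queries involved in the outages.

    Unindexed or overly broad (Oak falls back to traversing every node):

        SELECT * FROM [nt:base] WHERE [status] = 'pending'

    Indexed and constrained (with a property index on [status], only the matching entries are read):

        SELECT * FROM [cq:Page] AS p
        WHERE p.[status] = 'pending' AND ISDESCENDANTNODE(p, '/content')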

 

Based on this analysis, Adobe recommended two changes to address what we believe are the root causes of the outages.

  1. Adding a limit that prevents improper queries from running away in this manner. That change took effect in production on 3/22 and appears to be effective: since the limit was applied, the author has not crashed or had a significant memory event. (A configuration sketch follows this list.)
  2. For common queries, Adobe suggested a change to an appropriate index. This was applied to our Staging Tier and has been shown to reduce the running time of some common queries; it also reduced the volume of logs generated. (An index-definition sketch also follows this list.)
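For reference, the sketch below shows one common way such a limit is configured, assuming it was applied through Oak's query engine settings; the values are illustrative and are not necessarily the ones applied on 3/22.

    # JVM startup options (illustrative values)
    -Doak.queryLimitInMemory=500000
    -Doak.queryLimitReads=100000

    # Or, depending on the Oak version, via the OSGi configuration for
    # org.apache.jackrabbit.oak.query.QueryEngineSettingsService
    queryLimitInMemory=500000
    queryLimitReads=100000
    queryFailTraversal=true

With limits like these in place, a query that reads or buffers more nodes than allowed is aborted instead of being allowed to exhaust memory.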

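For the index change, the following is a minimal sketch of what defining an Oak property index can look like through the JCR API. The index name and the indexed property ("status", matching the hypothetical query above) are placeholders, not the actual index Adobe proposed; in practice the definition would more likely be deployed as repository content in a package.

    import javax.jcr.Node;
    import javax.jcr.PropertyType;
    import javax.jcr.Session;

    public class CreatePropertyIndex {
        // Defines an Oak property index so queries constrained on the given
        // property read only the matching index entries instead of traversing
        // the whole repository.
        public static void createIndex(Session session, String indexName, String propertyName)
                throws Exception {
            Node indexRoot = session.getNode("/oak:index");
            Node index = indexRoot.addNode(indexName, "oak:QueryIndexDefinition");
            index.setProperty("type", "property");
            index.setProperty("propertyNames", new String[] { propertyName }, PropertyType.NAME);
            index.setProperty("reindex", true); // asks Oak to index existing content
            session.save();
        }
    }

    // Example use: CreatePropertyIndex.createIndex(session, "statusIndex", "status");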
Long-term memory growth & garbage collection

We have observed memory utilization growing steadily over time. This is cause for concern because, unless that memory is reclaimed, the system will eventually run out of memory and crash. The graph below shows usage on the left growing slowly each day until it reached a threshold, at which point the memory was dramatically reclaimed. Provided this reclamation continues to occur, the system should remain reliable.

CMS (concurrent mark sweep) is an algorithm for reclaiming memory through a process known as garbage collection. 'Concurrent' means it attempts to run alongside the application's other work so as not to affect performance. After the Young Generation memory pool undergoes garbage collection, the surviving objects are eventually promoted into the memory pool called the Old Generation. Slow Old Generation growth may indicate that the application has a memory leak: new objects are created, but their memory cannot be reclaimed because existing, and possibly unnecessary, references to those objects remain.
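To make that distinction concrete, here is a small, generic Java illustration of a leak; it is not code from AEM or Oak. Objects that stay reachable through a long-lived reference (here, a static cache) survive every collection cycle and accumulate in the Old Generation.

    import java.util.ArrayList;
    import java.util.List;

    public class LeakIllustration {
        // A long-lived (static) reference: anything added here stays reachable,
        // so the garbage collector can never reclaim it.
        private static final List<byte[]> CACHE = new ArrayList<>();

        public static void handleRequest() {
            byte[] buffer = new byte[1024 * 1024]; // 1 MB of per-request data
            CACHE.add(buffer); // retained forever, so Old Generation keeps growing
        }

        public static void main(String[] args) {
            // Each call leaks about 1 MB; heap usage climbs until the JVM runs
            // out of memory, which is the pattern a true leak produces.
            for (int i = 0; i < 10_000; i++) {
                handleRequest();
            }
        }
    }

By contrast, the healthy pattern on the graph, slow growth followed by a large drop, is what a normal CMS Old Generation cycle looks like; CMS typically starts such a cycle once Old Generation occupancy crosses a configurable threshold (the JVM's -XX:CMSInitiatingOccupancyFraction option).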

 

Conclusion

The limits placed on the production author appear to be allowing the system sustained uptime. We are cautiously watching the memory growth, hopeful that garbage collection continues to occur as a matter of course, as it did on 3/31/2017. Based on these findings, the maintenance activity that preceded these problems does not appear to be related.