How Plumbr will help you
The screenshot above originates from Plumbr, exposing the root cause of a situation where one thread locked access to a particular code block while other threads waited for it to finish. The lock on the ch.qos.logback.core.rolling.RollingFileAppender.subAppend() monitor blocked the second thread's access for 24 seconds. It is also visible that this was not an isolated case: during the last 30 days there were 11,671 other occurrences where the very same lock impacted end users.
The information needed to fix the problem is contained in the following two blocks. The first exposes what the waiting thread was trying to accomplish:
ch.qos.logback.core.rolling.RollingFileAppender.subAppend():177
ch.qos.logback.core.OutputStreamAppender.append():103
...
ch.qos.logback.classic.Logger.info():600
com.ebilling.core.logger.LoggerImpl.info():476
...
com.ebilling.utils.core.managers.UIClientDSManager.getClientConnection():35
com.ebilling.ops.workflow.messages.QueueOps.getQueues():137
The block on the right is a snapshot taken from the thread holding the lock. From it, you can see what the thread was doing while it held the lock:
java.util.regex.Pattern.createGroup():2931
java.util.regex.Pattern.group0():2852
...
ch.qos.logback.core.rolling.helper.FileFilterUtil$3.accept():85
java.io.File.listFiles():1246
...
ch.qos.logback.classic.Logger.callAppenders():260
ch.qos.logback.classic.Logger.buildLoggingEventAndAppend():442
In addition, Plumbr exposes the full view of the transaction in which this particular lock participated. Opening the transaction link in the header reveals the following information:
From the above it is clear that the transaction spent the vast majority of its duration waiting behind the lock; just 251 milliseconds went to other operations.
Checking the source code of logback, where the lock originates, reveals the culprit. Every call to a logging statement includes a check of whether the log file in use should be rolled over. When rolling is necessary, all the required operations (renaming the current log file, creating a new log file, checking the rolled logs against the retention policy, archiving logs that are too old, etc.) are carried out in the rollover() method.
It is a perfect example of a lock with too big a scope. The logging service should indeed make sure that log messages are not lost or written to the wrong file, but instead of blocking access for the entire duration of these operations, it should buffer incoming messages during the rollover, decoupling message arrival from persisting the log entries.
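The buffering idea can be sketched with a bounded queue: caller threads enqueue a message and return immediately, while a single writer thread drains the queue and performs the slow persist/rollover work off the callers' critical path. This is a minimal illustration with made-up names, not logback's actual implementation (logback itself ships an AsyncAppender built on a similar design):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical buffered appender: append() only enqueues, so application
// threads never wait behind slow I/O or a rollover in progress.
public class BufferedAppender {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000);
    private final List<String> persisted = new ArrayList<>();
    private final Thread writer;
    private volatile boolean running = true;

    public BufferedAppender() {
        writer = new Thread(() -> {
            try {
                // Keep draining until shutdown is requested AND the buffer is empty.
                while (running || !buffer.isEmpty()) {
                    String msg = buffer.poll(10, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        persist(msg); // slow work happens here, on the writer thread only
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
    }

    // Called by application threads: blocks only if the buffer is full.
    public void append(String message) throws InterruptedException {
        buffer.put(message);
    }

    private void persist(String message) {
        persisted.add(message); // stands in for writing to the (possibly rolling) file
    }

    public int shutdownAndCount() throws InterruptedException {
        running = false;
        writer.join();
        return persisted.size();
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedAppender appender = new BufferedAppender();
        for (int i = 0; i < 100; i++) {
            appender.append("message " + i);
        }
        System.out.println("persisted " + appender.shutdownAndCount() + " messages");
    }
}
```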
In general, when facing locking issues, the steps to take typically include:
- Making sure the lock protects the data instead of the code. Instead of synchronizing access to an entire code block, sometimes protecting just the data being changed is the solution.
- Reducing lock scope. All the time-consuming operations (such as round-trips to external systems) should be extracted from the locked code block.
- Splitting the locks. When the data you need to protect can be partitioned, you can protect the individual partitions instead of locking the entire data structure.
- Making use of the concurrent data structures in the java.util.concurrent package. Data structures such as ConcurrentHashMap or BlockingQueue are well built, and using them lets you avoid many mistakes that might sneak in with home-grown solutions.
- Using atomic operations. For example, instead of synchronizing access to an Integer counter, you could use AtomicInteger, which is designed for exactly this use case.
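The first two points — guarding the data rather than the code, and keeping time-consuming work out of the critical section — can be sketched as follows. The class and method names are made up for illustration:

```java
// Sketch of narrowing a lock's scope: perform the expensive work outside
// the synchronized block and guard only the shared-state update.
public class ScopedLock {
    private final Object lock = new Object();
    private long total = 0;

    // Too-broad scope: the expensive computation is serialized along with
    // the update, so threads queue up behind each other needlessly.
    public void recordBad(int input) {
        synchronized (lock) {
            total += expensiveComputation(input);
        }
    }

    // Narrow scope: only the mutation of the shared field is guarded.
    public void recordGood(int input) {
        long value = expensiveComputation(input); // slow work outside the lock
        synchronized (lock) {
            total += value;                       // short critical section
        }
    }

    private long expensiveComputation(int input) {
        return (long) input * input; // stands in for e.g. a round-trip to an external system
    }

    public long total() {
        synchronized (lock) {
            return total;
        }
    }
}
```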
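Lock splitting is often implemented as lock striping: the keys are partitioned across several locks, so threads working on unrelated keys never contend. A minimal sketch (the StripedMap class is hypothetical, not from any real library):

```java
import java.util.HashMap;
import java.util.Map;

// Lock striping sketch: one lock per stripe of keys instead of one lock
// over the whole map.
public class StripedMap {
    private static final int STRIPES = 8;
    private final Object[] locks = new Object[STRIPES];
    private final Map<String, Integer>[] stripes;

    @SuppressWarnings("unchecked")
    public StripedMap() {
        stripes = (Map<String, Integer>[]) new HashMap[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
            stripes[i] = new HashMap<>();
        }
    }

    private int stripeFor(String key) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), STRIPES);
    }

    public void put(String key, Integer value) {
        int s = stripeFor(key);
        synchronized (locks[s]) {      // only this stripe is locked
            stripes[s].put(key, value);
        }
    }

    public Integer get(String key) {
        int s = stripeFor(key);
        synchronized (locks[s]) {
            return stripes[s].get(key);
        }
    }
}
```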
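As an example of leaning on java.util.concurrent instead of home-grown locking, ConcurrentHashMap.merge() performs the per-key read-modify-write atomically, so concurrent updaters need no external synchronization:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Two threads bump the same key concurrently; merge() is atomic per key,
// so no increments are lost and no external lock is needed.
public class ConcurrentCountDemo {
    public static int concurrentHits(int perThread) throws InterruptedException {
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        Runnable task = () -> {
            for (int i = 0; i < perThread; i++) {
                counts.merge("hits", 1, Integer::sum); // atomic per-key update
            }
        };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join();
        return counts.get("hits");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("hits = " + concurrentHits(1000)); // prints hits = 2000
    }
}
```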
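Finally, the atomic-counter point: AtomicInteger.incrementAndGet() uses a lock-free compare-and-swap instead of a monitor, so a shared counter can be updated from many threads without synchronized blocks. A small sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Several threads increment one AtomicInteger; no monitor lock involved.
public class AtomicCounterDemo {
    public static int countTo(int perThread, int threads) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.incrementAndGet(); // atomic compare-and-swap
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("count = " + countTo(10_000, 4)); // prints count = 40000
    }
}
```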