
Do It Yourself Java Profiling

This article is a free translation of the Russian original, which is itself a transcript of a video lecture given by Roman Elizarov at the Application Developer Days 2011 conference.

The lecturer talked about profiling Java applications without any standalone tools. Instead, he suggests using built-in JVM features (thread dumps, Java agents, bytecode manipulation) to implement profiling quickly and efficiently. Moreover, this can be applied in Production environments with minimal overhead. The concept is called DIY, or "Do It Yourself". The lecture's text and slides follow below.

Today I'm giving a lecture called "Do It Yourself Java Profiling". It's based on real-life experience gained during more than 10 years of developing high-load financial applications that work with huge amounts of data, millions of currency rate changes per second and thousands of online users. As a result, we constantly have to deal with profiling.
Application profiling is an inevitable step of any optimization. You profile to find the performance bottlenecks, then optimize, then return to profiling. It is a kind of infinite loop.
So what is the reason to "Do It Yourself" when there are plenty of ready-to-use profiling tools?

However, using a third-party profiling tool might be an issue, especially in a Production environment, where it may not be allowed for stability or security reasons. Nevertheless, it often happens that we have to profile not only on Development/Test/Stage environments but on Production ones, because many performance bottlenecks can only be observed and identified under high load and under conditions specific to live systems. It may be impossible to reproduce the situation in a non-Production environment.

Also, profiling tools are not transparent enough. A tool's documentation usually specifies what exactly it measures but not how exactly it calculates the measurements. Even with an open-source profiling tool you would still have to spend a lot of time digging into the details.
What's wrong with learning a tool? Of course it's worth learning a tool if you use it often. For example, you may have extensive knowledge of the IDE you work with every day, and it really saves a lot of time. On the contrary, if you need to do something only once a month, such as profiling, deeply learning a profiling tool may be a real waste of time. Doing profiling yourself lets you reuse your knowledge of the programming language and your own tools.
Why is the lecture about Java? Besides the fact that our company develops in Java, it is the leading programming language of 2011 according to TIOBE and is superbly suited for enterprise applications. Moreover, Java is a managed language running on a virtual machine, which makes profiling in Java really simple.
I'm going to talk about solving profiling problems by just writing some Java code, about specific features of the JVM, and about a technique called bytecode manipulation.
We're going to look at both CPU profiling and memory profiling using different techniques.
CPU profiling. The fastest way is just to code it. If you need to find out how much time something takes and how many times something is being called, then the easiest way is to write a bit of code instead of using tools.

Java provides a truly amazing function, System.currentTimeMillis, that returns the current time. You can measure the time at one place and at another, then just calculate the difference or whatever else you may need. It's DIY in its most primitive form. In spite of its simplicity, this technique works perfectly: it's quick, convenient and efficient.
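The code from the slide isn't reproduced in this text; a minimal sketch of the idea, with purely illustrative class and method names, might look like this:

```java
// A minimal sketch of the primitive DIY timing approach; the class and method
// names are illustrative, not from the lecture.
public class OrderService {
    public void processOrders() {
        long start = System.currentTimeMillis();
        doProcessOrders(); // the big business method being measured
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("processOrders took " + elapsed + " ms");
    }

    private void doProcessOrders() {
        // ... the actual business logic ...
    }
}
```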
When does this technique work best? For big business methods that are not called too often. Moreover, as soon as you write this code, it becomes a part of your application's functionality. In general, any sufficiently big modern application will benefit from performance measurement built in as a part of it.

Actually, it's quite logical to develop an application in such a way that it profiles itself. Moreover, it may be quite useful to expose the performance information to the application's users: for example, how many times each method was called, how much time it took and so on. This is very flexible and allows you to measure whatever you want: number of calls, minimal time, maximal time, average, histograms of execution time distribution, medians, percentiles.

It also allows you to trace different execution paths in different ways, as was done in the example above. Depending on the actual execution path, we log the time of loading an Account object either from the cache or from the database. In the end we'll be able to count how often the cache was used compared to direct database calls.
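Again, the slide's code isn't included here; a sketch along the same lines, with purely illustrative types and names (Account, AccountDao, the cache field), could look like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// A sketch of timing two execution paths separately (cache hit vs. database
// load); Account, AccountDao and all field names are illustrative.
public class AccountService {
    // Minimal illustrative types so the sketch compiles on its own
    public static class Account { /* ... */ }
    public interface AccountDao { Account load(long id); }

    private static final Logger log = Logger.getLogger(AccountService.class.getName());
    private final Map<Long, Account> cache = new ConcurrentHashMap<>();
    private final AccountDao dao; // hypothetical database access object

    public AccountService(AccountDao dao) {
        this.dao = dao;
    }

    public Account loadAccount(long id) {
        long start = System.currentTimeMillis();
        Account account = cache.get(id);
        if (account != null) {
            log.info("account " + id + " loaded from cache in "
                    + (System.currentTimeMillis() - start) + " ms");
            return account;
        }
        account = dao.load(id);
        cache.put(id, account);
        log.info("account " + id + " loaded from database in "
                + (System.currentTimeMillis() - start) + " ms");
        return account;
    }
}
```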
Finally, if something unexpected happens in the working system, you can look up these statistics in the log files and use them for further investigation.

What should we do if methods are shorter and called more often? In this case the previous technique will not work, because System.currentTimeMillis is not that fast and has only millisecond resolution.
If you only need to count the number of calls, it can easily be done with the Java class java.util.concurrent.atomic.AtomicLong (see above). It doesn't affect performance greatly and allows you to count up to tens of thousands of calls per second.
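Since the slide's code isn't reproduced here, a minimal sketch of such a counter, with illustrative names, could look like this:

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal sketch of counting calls to a short, frequently-invoked method;
// QuoteProcessor and processQuote are illustrative names.
public class QuoteProcessor {
    private static final AtomicLong PROCESS_CALLS = new AtomicLong();

    public void processQuote(double price) {
        PROCESS_CALLS.incrementAndGet(); // cheap: a single atomic increment
        // ... the actual short piece of business logic ...
    }

    // read and log this value periodically, e.g. from a statistics thread
    public static long processCallCount() {
        return PROCESS_CALLS.get();
    }
}
```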
What if you still need to measure the execution time of short methods? That is a genuinely hard problem which cannot be solved using standard means, despite the existence of System.nanoTime: that function is too slow for this purpose.

The only possible solution is to use native code. For example, there is an excellent x86 instruction called rdtsc that reads the processor's cycle counter. You cannot use this instruction directly from Java, but it's possible to write a one-line function in C that uses rdtsc, link this native function with Java and call it. Executing the native call takes on the order of a hundred processor cycles. Thus, it's suitable when you need to measure a code snippet that takes at least a thousand processor cycles and you're trying to optimize every single instruction. That's quite a rare case though.
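As a sketch of the Java side only (the class name, the library name and its C one-liner implementation are all assumptions, and the native library has to be built separately):

```java
// Java side of a hypothetical rdtsc binding. The matching C implementation
// (not shown) would be a one-liner around the x86 RDTSC instruction, compiled
// into a shared library named "rdtsc" and placed on java.library.path.
public final class Rdtsc {
    static {
        System.loadLibrary("rdtsc"); // assumed library name
    }

    // Returns the current value of the CPU time-stamp counter.
    public static native long rdtsc();

    private Rdtsc() {
    }
}
```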

When you need to measure a short method that is being called quite often, you should use another technique called "sampling". Instead of getting exact measurements you analyze the program execution periodically, say once per second or ten seconds, and check what the program is doing at those moments.
Just find out where your program is executing most often at the moments you check it. If there is a line of code that consumes most of the running time, e.g. 90% inside some loop as in the example above, then you'll most likely see that line of code when the execution is paused.

Such a point in a program is called a "hot spot". It's always a suitable candidate for optimization. The great thing is the built-in JVM feature called a "thread dump" that dumps the state of all threads. It can be triggered by Ctrl-Break in a Windows console and by sending signal 3 on Linux, which corresponds to the command "kill -3".
In this case the JVM writes to the console a detailed description of all threads' states. So if there is indeed a hot spot in your code, you will most likely see your program at it. That's why there is no need to search for a profiler tool when you face a performance issue. You'd better take a thread dump and investigate it. If there is a single hot spot, you'll find it straight away. Then look at this code, investigate and optimize its performance; if something is still unclear afterwards, further profiling follows.
Modern JVMs also come with a great utility, "jstack". Given a process identifier, it outputs that process's thread dump. Just take several thread dumps if the first one doesn't make things clear. Maybe the hot spot takes only 50% of the program's execution time; having made several thread dumps, you'll definitely observe your program at the hot spot at least once.
This idea can be developed further. You can redirect JVM output to a file and use a script that produces a thread dump every 3 seconds.

This can be safely done on a live system, as producing a thread dump is a quick enough procedure: it takes roughly 100 milliseconds even for a big number of threads. Your system is unlikely to be hard real time anyway; if you write it in Java, you already accept periodic garbage collection and don't care about nanoseconds. Hence, a 100 millisecond delay won't become an issue on a Production system. In addition, the thread dump feature is a very safe and well-tested part of the JVM, and I haven't faced any troubles with it in all my work experience.

Finally, having got a file with thread dumps, you can investigate it manually or write a simple program to analyze it automatically and collect the relevant statistics. For example, it would be useful to count how many times each method occurred there.
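A minimal sketch of such an analyzer (the file name and the output format are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// A minimal sketch: counts how often each "at package.Class.method(...)" frame
// occurs in a file with collected thread dumps. The file name is illustrative.
public class DumpStats {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("threaddumps.log"))) {
            line = line.trim();
            if (line.startsWith("at ")) {
                int paren = line.indexOf('(');
                if (paren > 0) {
                    // strip the "at " prefix and the "(File.java:123)" suffix
                    counts.merge(line.substring(3, paren), 1, Integer::sum);
                }
            }
        }
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```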
Usual profiling tools take the thread state into account (see "RUNNABLE" above), but this state cannot really be trusted, because it hides important details. For example, if your program is working with a database, the network or other external resources, your code may actually be waiting for some data or a response while the JVM still reports it as "RUNNABLE". As a result, the tool won't clarify which methods are actually working and which are waiting. On the other hand, if you analyze a stack trace yourself, you know what your program is actually doing. If it's a database request, you can measure how much time the program spends on database calls. In other words, you may know that a method doesn't actually load the CPU although the JVM says it is "RUNNABLE".
Furthermore, the thread dump feature can be integrated into an application, since Java provides methods to get stack trace information programmatically (e.g. Thread.getAllStackTraces). As a result, you can integrate the profiling as a functional part of your application and use the profiling information to improve it.
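A minimal in-process sampler sketch built on Thread.getAllStackTraces; the class name, the sampling period and the reporting policy are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Once per second this sampler records the top stack frame of every thread
// and counts how often each frame is seen.
public class StackSampler {
    private final Map<String, AtomicLong> hits = new ConcurrentHashMap<>();

    public void start() {
        Executors.newSingleThreadScheduledExecutor()
                .scheduleAtFixedRate(this::sample, 1, 1, TimeUnit.SECONDS);
    }

    private void sample() {
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            if (stack.length > 0) {
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.computeIfAbsent(frame, k -> new AtomicLong()).incrementAndGet();
            }
        }
    }

    public Map<String, AtomicLong> snapshot() {
        return hits; // expose or log this, e.g. on an admin page of the application
    }
}
```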
However, there is an issue. If you ask the JVM to make a thread dump, it won't just pause the process and produce the stack traces immediately. Instead, the JVM raises a flag meaning that every thread needs to pause at the next safe place. Such a safe place is called a "safepoint": a point the compiler marks in the code where it is known exactly what is stored in the registers, and so on. For example, there may be no safepoints at all in a sequential code snippet without any backward branches or loops. Also, method calls can get inlined by the HotSpot optimizer, and then there won't be a safepoint there either.

Therefore, if you see a line in a thread dump, it may be the nearest safepoint rather than the actual hot spot. That's because when you press Ctrl-Break, the JVM notifies all threads to pause at the next safepoint, and only when all threads are paused does the JVM analyze their states.
Now let's turn to memory profiling. Again, there are excellent JVM features for this. There is a great tool, "jmap", that prints a histogram of the heap showing what objects are there and how much memory they consume (jmap -histo <pid>). This amazing tool gives a general overview of what is happening inside the JVM.

If you've never profiled your program before, you'll most likely immediately observe some issues. It's already a good start to optimize the program's memory usage.
One issue is that jmap will show you all objects, including those that are actually garbage. That's why jmap has a special option, "live" (jmap -histo:live <pid>), that triggers a garbage collection before building the histogram. Another issue is that you cannot use this option on a live system that uses gigabytes of memory: a garbage collection over such a big amount of memory may take about 10 seconds, which can be unacceptable if your system works with end users. They will consider the system hung if there is no response for more than 3 seconds; actually, you should not pause a live system with user interactions for more than a second. As a result, we often have to be satisfied with a histogram of all objects in memory, garbage included.
It's also very useful to know some additional JVM options. For instance, with -XX:+PrintClassHistogram you can request the JVM to print a class histogram every time you request a thread dump.

Using "HeapDumpOnOutOfMemory" you can request JVM to write its state on disk when it's out of memory. It's very useful as usually people start optimizing memory consumption only when the memory is out. Nobody does profiling when everything works fine. Therefore, it's useful to have this option always enabled. Then in case of OutOfMemory error, you'll have a resulting binary dump that can be analyzed with any tools later. Well, you can also get the same binary dump on the working JVM using option "dump" but this will also lead to a long delay for JVM that you shouldn't want to do on a live system.

By the way, there is no reason to be afraid of "-XX" JVM options; they are just very specific. It doesn't mean they are experimental; on the contrary, they are stable and can be safely used on live Production systems.
Now let's consider memory allocation profiling. Suppose you allocate memory excessively: memory is allocated, used briefly and forgotten about, and then the garbage collector finds and releases it. If you do this often, your program will work slower, but CPU profiling won't help you localize the issue. That's because memory allocation in the JVM is amazingly fast, faster than in non-managed languages like C/C++, simply because in Java an allocation is just a pointer increment; the allocated memory is already zeroed and ready to use. So you'll never find it as a hot spot in a thread dump or in a profiler tool. Nevertheless, it will slow down your application, because garbage collection takes time to release that memory. Hence, you should pay attention to how much time your application spends on garbage collection.

The solution to this issue is the JVM options -verbose:gc, -XX:+PrintGC and -XX:+PrintGCDetails, which log the time spent by the application on garbage collection (see here for a complete tutorial on tuning garbage collection). If you notice that garbage collection takes a large share of time, it means excessive amounts of memory are being allocated somewhere in your application.
How can you localize the issue? There is a JVM option -Xaprof. It prints a so-called allocation profile when the JVM process terminates. This profile shows statistics of the objects allocated during the application's lifetime: which objects were allocated and how many of them.

If you face this issue, you'll most likely see some problematic class that is being instantiated too often. So the first step is to run with -Xaprof; that may pinpoint your issue immediately.
Unfortunately, it may not help if all you see is a huge number of allocated char arrays, Strings or instances of other standard classes. If you suspect a particular place, just apply the AtomicLong technique mentioned above to count how often an allocation happens there. If you have no idea where to count, you can count memory allocations everywhere. This can easily be done using aspect-oriented programming (AOP) or the bytecode manipulation technique.

The bytecode manipulation technique is well suited for tasks like the following: you need to count how often an array of ints is allocated, everywhere it happens in the code. So you already know which object is being allocated too often, but you still need to find the exact place.
In general, bytecode manipulation allows modifying the code after it has already been compiled. So it's a way to separate profiling from the business logic, the opposite of the initial idea of integrating profiling as a part of the application's functionality. Sometimes it may be crucial not to introduce any additional changes to your code (e.g. writing to a log) but just to solve the issue.
The best way I know is to use ObjectWeb ASM (a list of alternatives is here). It's an extremely fast open-source library that allows bytecode manipulation on the fly, and it barely slows down an application's loading time.

The ASM library is pretty simple to use. There is a class called "ClassReader" that reads *.class files and walks through the bytecode using the Visitor pattern: it reports when it encounters a field or a method and what actually happens inside that method (via a MethodVisitor), and so on. On the other side, there is a "ClassWriter" that encodes classes back into bytecode.
For example, if you want to track all array allocations, you need to write a couple of classes (see above). First, you define your own class visitor that returns a method visitor whenever it encounters a method. The method visitor tracks what happens inside that method in a similar way. When the method visitor is told that there is an array allocation instruction (the NEWARRAY opcode), you can insert any new code you want into the execution flow. Thus you'll track all the places where an array of ints is allocated.
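The slide's code isn't included in this text; a minimal sketch of such a transformer, assuming the ASM 9.x API and a hypothetical ArrayAllocCounter class that simply increments a counter, could look like this:

```java
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// A sketch: inserts a call to ArrayAllocCounter.count() (a hypothetical
// counter class) before every NEWARRAY instruction that allocates an int[].
public class IntArrayAllocTransformer {

    public static byte[] transform(byte[] classFile) {
        ClassReader reader = new ClassReader(classFile);
        ClassWriter writer = new ClassWriter(reader, ClassWriter.COMPUTE_MAXS);
        reader.accept(new ClassVisitor(Opcodes.ASM9, writer) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
                return new MethodVisitor(Opcodes.ASM9, mv) {
                    @Override
                    public void visitIntInsn(int opcode, int operand) {
                        // NEWARRAY with operand T_INT allocates an int[]
                        if (opcode == Opcodes.NEWARRAY && operand == Opcodes.T_INT) {
                            super.visitMethodInsn(Opcodes.INVOKESTATIC,
                                    "profiling/ArrayAllocCounter", // hypothetical counter class
                                    "count", "()V", false);
                        }
                        super.visitIntInsn(opcode, operand);
                    }
                };
            }
        }, 0);
        return writer.toByteArray();
    }
}
```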
So if you have compiled classes, you can simply process them using the library mentioned above. But what if you want to perform the bytecode manipulation on the fly?

Then the right solution is the JVM option "-javaagent". You need to create a special jar file whose manifest has the "Premain-Class" attribute set to your class name. That class has to contain a "premain" method with a specific signature. In it you receive an instance of the "Instrumentation" interface, which allows you to register your own class-file transformer. The transformer will be invoked by the JVM for every class being loaded, so you'll be able to modify classes. It's a really flexible tool for profiling your code.
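A sketch of the agent side, again with assumed names; it reuses the hypothetical IntArrayAllocTransformer from the previous sketch:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// A minimal agent sketch. The jar's manifest must contain the line
// "Premain-Class: ProfilingAgent", and the JVM is started with
// -javaagent:profiling-agent.jar (the jar name is an assumption).
public class ProfilingAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Skip JDK classes; returning null keeps a class unchanged.
                if (className == null || className.startsWith("java/")
                        || className.startsWith("sun/")) {
                    return null;
                }
                // IntArrayAllocTransformer is the hypothetical ASM-based
                // transformer from the previous sketch.
                return IntArrayAllocTransformer.transform(classfileBuffer);
            }
        });
    }
}
```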
In conclusion, you really don't have to learn a dedicated tool to do profiling; it's quite enough to know the bytecode, the JVM options and the standard Java libraries (e.g. the java.lang package). Moreover, this knowledge lets you solve a large variety of specific issues without much overhead.
