
Thursday, November 15, 2012

The useful JVM options

The HotSpot JVM has a lot of options available. Maybe too many. Sometimes we are looking for the right option, or the "magic" one that could give our application a serious boost. Unfortunately, I think it does not exist! However, some options can help you monitor your application or tune some parts of it.

You will find the complete list of options in the globals.hpp file from the OpenJDK sources.
This site, however, can help you navigate through the list.

I have summed up here, in my humble opinion, some of the most useful JVM options.

Heap sizing

You certainly know the -Xms and -Xmx options (which can also be abbreviated as -ms and -mx), but some parts of the Java heap and non-heap memory can also be sized:

  • -XX:NewSize=n Defines the initial size of the Young generation, including Eden & Survivors.
  • -XX:MaxNewSize=n Defines the maximum size of the Young generation, including Eden & Survivors.
  • -XX:SurvivorRatio=n Defines the ratio between the Eden size and the size of one of the 2 Survivor spaces

n without a unit is expressed in bytes; you can also use k, K, m, M, g & G to express kilobytes, megabytes & gigabytes respectively.
If NewSize < MaxNewSize, the Young generation size can be adjusted during the application's life. However, resizing requires a Full GC. To avoid this, set the same value for both options.
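To check what these sizing flags actually resolved to at runtime, you can list the heap memory pools through the standard management API. A minimal sketch; the pool names (like "PS Eden Space" or "PS Survivor Space") depend on the collector in use:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapSizes {
    // Prints the maximum size of every memory pool, so you can verify
    // what -Xmx, -XX:MaxNewSize, etc. actually resolved to.
    public static void printPools() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null) {
                System.out.printf("%-30s max=%d bytes%n", pool.getName(), usage.getMax());
            }
        }
    }

    public static void main(String[] args) {
        printPools();
    }
}
```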

  • -XX:PermSize=n Defines the initial size of the Permanent generation
  • -XX:MaxPermSize=n Defines the maximum size of the Permanent generation

The unit rules for n are the same as above.
If PermSize < MaxPermSize, the Permanent generation size can be adjusted during the application's life. However, resizing requires a Full GC. To avoid this, set the same value for both options.


  • -XX:InitialCodeCacheSize=n Defines the initial size of the Code Cache.
  • -XX:ReservedCodeCacheSize=n Defines the maximum size of the Code Cache.

The unit rules for n are the same as above.
The Code Cache stores JIT-compiled code. This is an off-heap space, so the GC does not reclaim it. If you reach the limit set by ReservedCodeCacheSize, the JIT compiler stops compiling further methods since it cannot store them. So if you have a lot of classes/methods that need to be compiled, be aware of these options.
When you reach the limit, a warning is emitted on the standard output:
Java HotSpot(TM) Server VM warning: CodeCache is full. Compiler has been disabled.
with -XX:+PrintCompilation you will also get:
7383 COMPILE SKIPPED: code cache is full
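To keep an eye on this limit from inside the application, here is a sketch using the standard memory-pool MXBeans. The pool name "Code Cache" is a HotSpot assumption; other or newer JVMs may name the pool differently, in which case this returns -1:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class CodeCacheWatcher {
    // Returns Code Cache usage as a fraction of its maximum size,
    // or -1 if no pool with that name is found.
    public static double codeCacheUsageRatio() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Code Cache")) {
                return (double) pool.getUsage().getUsed() / pool.getUsage().getMax();
            }
        }
        return -1d;
    }

    public static void main(String[] args) {
        double ratio = codeCacheUsageRatio();
        if (ratio > 0.9) {
            System.err.println("WARNING: Code Cache over 90% full, JIT may stop compiling");
        }
        System.out.println("Code Cache usage ratio: " + ratio);
    }
}
```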


GC Settings/Tuning

For configuring GC, there are tons of options, especially for the CMS! But in the end, few are really useful. I will not go into fine tuning of the CMS.

  • -XX:+UseSerialGC Activates the classic single-threaded young GC
  • -XX:+UseParallelGC Activates the multi-threaded young GC
  • -XX:+UseParNewGC Activates another multi-threaded young GC, required for the CMS.
  • -XX:+UseParallelOldGC Activates the multi-threaded Old GC.
  • -XX:+UseConcMarkSweepGC Activates the Concurrent Mark & Sweep GC
  • -XX:+UseG1GC Activates the Garbage First GC
  • -XX:ParallelGCThreads=n Fixes the number of threads used by the Parallel*GCs. Useful with multiple JVMs on the same machine, to prevent the GC threads of one JVM from disturbing the threads of the others.
  • -XX:CMSInitiatingOccupancyFraction=n Fixes the occupancy percentage of the Old generation at which the CMS starts. Useful when the allocation rate is not very predictable.
  • -XX:+UseCMSInitiatingOccupancyOnly The CMS starts based only on the previous parameter, not on its internal heuristics.
  • -XX:+CMSClassUnloadingEnabled Enables unloading of classes from PermGen during a CMS GC. Otherwise you need to wait for a Full GC.
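To verify which collectors an option combination actually selected, you can list the garbage collector MXBeans at runtime. A sketch; the bean names (e.g. "PS Scavenge"/"PS MarkSweep" for the ParallelGC, "ParNew"/"ConcurrentMarkSweep" for the CMS) depend on the JVM:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcInfo {
    public static void main(String[] args) {
        // Each bean corresponds to one collector (young or old generation)
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```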

GC Monitoring 

For any application, it is absolutely required to log GC activity in a separate file in order to diagnose any abnormal behavior of memory management. Also note that activating this logging has virtually no impact on your application, so there is no excuse not to activate these options.

  • -Xloggc:file Indicates where the GC log file will be written
With only this option you will get the following content:
8.567: [GC 16448K->168K(62848K), 0.0014256 secs]
16.979: [GC 16616K->152K(62848K), 0.0007727 secs]
25.157: [GC 16600K->152K(62848K), 0.0007124 secs]
33.881: [GC 16600K->136K(62848K), 0.0003721 secs]
43.032: [GC 16584K->152K(59968K), 0.0004123 secs]
50.795: [GC 16216K->152K(59584K), 0.0009683 secs]
58.584: [GC 15832K->148K(59136K), 0.0008669 secs]
66.430: [GC 15508K->148K(58816K), 0.0003436 secs]
73.766: [GC 15188K->148K(58688K), 0.0002917 secs]
81.082: [GC 14868K->148K(58176K), 0.0003157 secs]
88.851: [GC 14548K->148K(58048K), 0.0003148 secs]
96.379: [GC 14228K->148K(57536K), 0.0003129 secs]
101.618: [GC 13908K->704K(57472K), 0.0032532 secs]
102.684: [GC 3206K->744K(57280K), 0.0034611 secs]
102.687: [Full GC 744K->723K(57280K), 0.0525762 secs]
109.313: [GC 14227K->755K(57344K), 0.0005165 secs]
115.905: [GC 14003K->779K(56768K), 0.0006688 secs]
 
  • -XX:+PrintGCDetails Activates more details in gc logs
With this option you will get the following content:
 9.053: [GC [PSYoungGen: 16448K->168K(19136K)] 16448K->168K(62848K), 0.0014471 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
16.991: [GC [PSYoungGen: 16616K->152K(19136K)] 16616K->152K(62848K), 0.0008244 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
24.965: [GC [PSYoungGen: 16600K->152K(19136K)] 16600K->152K(62848K), 0.0006850 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
32.915: [GC [PSYoungGen: 16600K->152K(19136K)] 16600K->152K(62848K), 0.0007040 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
40.857: [GC [PSYoungGen: 16600K->152K(16256K)] 16600K->152K(59968K), 0.0007618 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
48.653: [GC [PSYoungGen: 16216K->152K(15872K)] 16216K->152K(59584K), 0.0007632 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
53.896: [GC [PSYoungGen: 15832K->192K(15552K)] 15832K->620K(59264K), 0.0131835 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
55.252: [GC [PSYoungGen: 3888K->64K(15424K)] 4317K->716K(59136K), 0.0025925 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
55.255: [Full GC (System) [PSYoungGen: 64K->0K(15424K)] [PSOldGen: 652K->715K(43712K)] 716K->715K(59136K) [PSPermGen: 5319K->5319K(16384K)], 0.0269456 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
62.544: [GC [PSYoungGen: 15360K->32K(15488K)] 16075K->747K(59200K), 0.0006252 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
69.502: [GC [PSYoungGen: 15072K->32K(14784K)] 15787K->747K(58496K), 0.0003713 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 


As you can see, there is a lot more valuable info. You now know the type of GC configured for Young & Old (PSYoungGen & PSOldGen = ParallelScavenge, the ParallelGC), that the Full GC was triggered by an explicit call to System.gc(), & you have the detail of memory usage of the Young gen & Old gen, so you can deduce promotion between the two. Also, with the CMS, all the CMS phases are logged, with the Initial Mark & Remark pause times. This is a MUST HAVE option: no JVM should start without it.
  • -XX:+PrintGCDateStamps Prints date & time along with the seconds elapsed since the JVM's start.
 With this option you will get the following content:
2012-11-08T22:31:12.310+0100: 8.250: [GC [PSYoungGen: 16448K->168K(19136K)] 16448K->168K(62848K), 0.0007445 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:31:20.526+0100: 16.466: [GC [PSYoungGen: 16616K->136K(19136K)] 16616K->136K(62848K), 0.0008255 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:31:28.503+0100: 24.444: [GC [PSYoungGen: 16584K->152K(19136K)] 16584K->152K(62848K), 0.0007196 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2012-11-08T22:31:36.497+0100: 32.439: [GC [PSYoungGen: 16600K->152K(19136K)] 16600K->152K(62848K), 0.0007185 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:31:44.565+0100: 40.507: [GC [PSYoungGen: 16600K->152K(16256K)] 16600K->152K(59968K), 0.0007928 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:31:52.694+0100: 48.637: [GC [PSYoungGen: 16216K->152K(15872K)] 16216K->152K(59584K), 0.0007599 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:31:58.204+0100: 54.147: [GC [PSYoungGen: 15832K->192K(15552K)] 15832K->620K(59264K), 0.0072012 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2012-11-08T22:31:59.999+0100: 55.942: [GC [PSYoungGen: 4932K->64K(15424K)] 5360K->716K(59136K), 0.0015449 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:32:00.000+0100: 55.944: [Full GC (System) [PSYoungGen: 64K->0K(15424K)] [PSOldGen: 652K->716K(43712K)] 716K->716K(59136K) [PSPermGen: 5311K->5311K(16384K)], 0.0273445 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2012-11-08T22:32:07.453+0100: 63.397: [GC [PSYoungGen: 15360K->32K(15488K)] 16076K->748K(59200K), 0.0003439 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-11-08T22:32:14.695+0100: 70.640: [GC [PSYoungGen: 15072K->32K(14784K)] 15788K->748K(58496K), 0.0007333 secs]


To remove the elapsed time since the JVM's start, you can use -XX:-PrintGCTimeStamps, which is activated by default with -Xloggc.
  • -XX:+PrintReferenceGC Prints timings for special reference processing
  With this option you will get the following content:
8.494: [GC8.496: [SoftReference, 0 refs, 0.0000084 secs]8.496: [WeakReference, 3 refs, 0.0000061 secs]8.496: [FinalReference, 6 refs, 0.0000117 secs]8.496: [PhantomReference, 0 refs, 0.0000042 secs]8.496: [JNI Weak Reference, 0.0000034 secs] [PSYoungGen: 16448K->168K(19136K)] 16448K->168K(62848K), 0.0017036 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
16.874: [GC16.874: [SoftReference, 0 refs, 0.0000050 secs]16.874: [WeakReference, 1 refs, 0.0000034 secs]16.874: [FinalReference, 4 refs, 0.0000036 secs]16.874: [PhantomReference, 0 refs, 0.0000028 secs]16.874: [JNI Weak Reference, 0.0000028 secs] [PSYoungGen: 16616K->152K(19136K)] 16616K->152K(62848K), 0.0005199 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
24.889: [GC24.890: [SoftReference, 0 refs, 0.0000078 secs]24.890: [WeakReference, 1 refs, 0.0000053 secs]24.890: [FinalReference, 4 refs, 0.0000056 secs]24.890: [PhantomReference, 0 refs, 0.0000039 secs]24.890: [JNI Weak Reference, 0.0000036 secs] [PSYoungGen: 16600K->152K(19136K)] 16600K->152K(62848K), 0.0008915 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
32.829: [GC32.830: [SoftReference, 0 refs, 0.0000084 secs]32.830: [WeakReference, 1 refs, 0.0000050 secs]32.830: [FinalReference, 4 refs, 0.0000056 secs]32.830: [PhantomReference, 0 refs, 0.0000045 secs]32.830: [JNI Weak Reference, 0.0000045 secs] [PSYoungGen: 16600K->152K(19136K)] 16600K->152K(62848K), 0.0009845 secs]


This helps to troubleshoot issues with special references like Weak, Soft, Phantom & JNI references.
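The cost reported by -XX:+PrintReferenceGC grows with the number of live Reference objects. A small sketch (hypothetical class, for illustration) of the kind of code that produces WeakReference work for the collector on every cycle:

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class RefDemo {
    // Creates n weakly-referenced objects; each one is extra work
    // for the reference-processing phase of every GC cycle.
    public static List<WeakReference<byte[]>> allocate(int n, ReferenceQueue<byte[]> queue) {
        List<WeakReference<byte[]>> refs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            refs.add(new WeakReference<>(new byte[1024], queue));
        }
        return refs;
    }

    public static void main(String[] args) {
        ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
        List<WeakReference<byte[]>> refs = allocate(10000, queue);
        System.gc(); // referents are unreachable, so the refs may be cleared and enqueued
        System.out.println("created " + refs.size() + " weak references");
    }
}
```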
  • -XX:+PrintTenuringDistribution Prints the object age distribution in the Survivors
  With this option you will get the following content:
8.654: [GC 8.655: [ParNew
Desired survivor size 1114112 bytes, new threshold 15 (max 15)
- age   1:     187568 bytes,     187568 total
: 17472K->190K(19648K), 0.0022117 secs] 17472K->190K(63360K), 0.0023006 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
17.087: [GC 17.087: [ParNew
Desired survivor size 1114112 bytes, new threshold 15 (max 15)
- age   1:      80752 bytes,      80752 total
- age   2:     166752 bytes,     247504 total
: 17662K->286K(19648K), 0.0036440 secs] 17662K->286K(63360K), 0.0037619 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
25.491: [GC 25.491: [ParNew
Desired survivor size 1114112 bytes, new threshold 15 (max 15)
- age   1:     104008 bytes,     104008 total
- age   2:      34256 bytes,     138264 total
- age   3:     166752 bytes,     305016 total
: 17758K->321K(19648K), 0.0035331 secs] 17758K->321K(63360K), 0.0036424 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
34.745: [GC 34.746: [ParNew
Desired survivor size 1114112 bytes, new threshold 15 (max 15)
- age   1:      34400 bytes,      34400 total
- age   2:     104008 bytes,     138408 total
- age   3:      34256 bytes,     172664 total
- age   4:     166752 bytes,     339416 total
: 17793K->444K(19648K), 0.0040891 secs] 17793K->444K(63360K), 0.0042000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
43.215: [GC 43.215: [ParNew
Desired survivor size 1114112 bytes, new threshold 15 (max 15)
- age   1:     138808 bytes,     138808 total
- age   2:      34400 bytes,     173208 total
- age   3:      34264 bytes,     207472 total
- age   4:      34256 bytes,     241728 total
- age   5:     166752 bytes,     408480 total
: 17916K->586K(19648K), 0.0048752 secs] 17916K->586K(63360K), 0.0049889 secs]


This is the age distribution of the objects contained in the Survivor spaces, i.e. the number of minor GCs they have survived.
  • -XX:+PrintHeapAtGC Prints the usage of each kind of space.
  With this option you will get the following content:
{Heap before GC invocations=1 (full 0):
 PSYoungGen      total 19136K, used 16448K [0x33f20000, 0x35470000, 0x49470000)
  eden space 16448K, 100% used [0x33f20000,0x34f30000,0x34f30000)
  from space 2688K, 0% used [0x351d0000,0x351d0000,0x35470000)
  to   space 2688K, 0% used [0x34f30000,0x34f30000,0x351d0000)
 PSOldGen        total 43712K, used 0K [0x09470000, 0x0bf20000, 0x33f20000)
  object space 43712K, 0% used [0x09470000,0x09470000,0x0bf20000)
 PSPermGen       total 16384K, used 1715K [0x05470000, 0x06470000, 0x09470000)
  object space 16384K, 10% used [0x05470000,0x0561ccc8,0x06470000)
8.304: [GC [PSYoungGen: 16448K->228K(19136K)] 16448K->228K(62848K), 0.0028543 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
Heap after GC invocations=1 (full 0):
 PSYoungGen      total 19136K, used 228K [0x33f20000, 0x35470000, 0x49470000)
  eden space 16448K, 0% used [0x33f20000,0x33f20000,0x34f30000)
  from space 2688K, 8% used [0x34f30000,0x34f69100,0x351d0000)
  to   space 2688K, 0% used [0x351d0000,0x351d0000,0x35470000)
 PSOldGen        total 43712K, used 0K [0x09470000, 0x0bf20000, 0x33f20000)
  object space 43712K, 0% used [0x09470000,0x09470000,0x0bf20000)
 PSPermGen       total 16384K, used 1715K [0x05470000, 0x06470000, 0x09470000)
  object space 16384K, 10% used [0x05470000,0x0561ccc8,0x06470000)
}
{Heap before GC invocations=2 (full 0):
 PSYoungGen      total 19136K, used 16676K [0x33f20000, 0x35470000, 0x49470000)
  eden space 16448K, 100% used [0x33f20000,0x34f30000,0x34f30000)
  from space 2688K, 8% used [0x34f30000,0x34f69100,0x351d0000)
  to   space 2688K, 0% used [0x351d0000,0x351d0000,0x35470000)
 PSOldGen        total 43712K, used 0K [0x09470000, 0x0bf20000, 0x33f20000)
  object space 43712K, 0% used [0x09470000,0x09470000,0x0bf20000)
 PSPermGen       total 16384K, used 1721K [0x05470000, 0x06470000, 0x09470000)
  object space 16384K, 10% used [0x05470000,0x0561e620,0x06470000)
16.577: [GC [PSYoungGen: 16676K->261K(19136K)] 16676K->261K(62848K), 0.0021589 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
Heap after GC invocations=2 (full 0):
 PSYoungGen      total 19136K, used 261K [0x33f20000, 0x35470000, 0x49470000)
  eden space 16448K, 0% used [0x33f20000,0x33f20000,0x34f30000)
  from space 2688K, 9% used [0x351d0000,0x352115f0,0x35470000)
  to   space 2688K, 0% used [0x34f30000,0x34f30000,0x351d0000)
 PSOldGen        total 43712K, used 0K [0x09470000, 0x0bf20000, 0x33f20000)
  object space 43712K, 0% used [0x09470000,0x09470000,0x0bf20000)
 PSPermGen       total 16384K, used 1721K [0x05470000, 0x06470000, 0x09470000)
  object space 16384K, 10% used [0x05470000,0x0561e620,0x06470000)
}
You see, before & after each minor or Full GC, the usage of each space (Eden, from & to Survivors, Old & Perm).
  • -XX:+PrintSafepointStatistics Prints safepoint statistics
  With this option you will get the following content on stdout when the application terminates:

         vmop                      [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
4.162: EnableBiasedLocking         [       8          0              0]      [0     0     0     0     0]  0   
8.466: ParallelGCFailedAllocation  [       8          0              0]      [0     0     0     0     2]  0   
23.927: ParallelGCFailedAllocation [       8          0              0]      [0     0     0     0     2]  0   
31.039: RevokeBias                 [       9          0              1]      [0     0     0     0     0]  0   
31.377: RevokeBias                 [      10          0              1]      [0     0     0     0     0]  0   
31.388: RevokeBias                 [      11          1              1]      [2     0     2     0     0]  0   
31.390: ParallelGCFailedAllocation [      11          0              0]      [0     0     0     0     4]  0   
31.452: RevokeBias                 [      12          0              1]      [0     0     0     0     0]  0   
31.487: BulkRevokeBias             [      12          0              1]      [0     0     0     0     0]  0   
31.515: RevokeBias                 [      12          0              1]      [0     0     0     0     0]  0   
31.550: BulkRevokeBias             [      12          0              0]      [0     0     0     0     0]  0   
32.572: ThreadDump                 [      12          0              1]      [0     0     0     0     0]  0   
38.572: ThreadDump                 [      13          0              2]      [0     0     0     0     0]  0   
39.042: ParallelGCFailedAllocation [      13          0              0]      [0     0     0     0     6]  0   
39.575: ThreadDump                 [      13          0              1]      [0     0     0     0     0]  0   
39.775: ParallelGCSystemGC         [      13          0              0]      [0     0     0     0    35]  0   
40.574: ThreadDump                 [      13          0              1]      [0     0     0     0     0]  0   
45.519: HeapDumper                 [      13          0              0]      [0     0     0     0    93]  0   
45.626: ThreadDump                 [      13          0              1]      [0     0     0     0     0]  0   
50.575: no vm operation            [      13          0              0]      [0     0     0     0     0]  0   
50.576: ThreadDump                 [      13          0              1]      [0     0     0     0     0]  0   
50.875: ThreadDump                 [      13          0              1]      [0     0     0     0     0]  0   
50.881: GC_HeapInspection          [      13          0              1]      [0     0     0     0     9]  0   
51.577: ThreadDump                 [      13          0              0]      [0     0     0     0     0]  0   
82.281: ParallelGCFailedAllocation [      13          0              0]      [0     0     0     0     4]  0   
82.578: ThreadDump                 [      13          0              1]      [0     0     0     0     0]  0   
88.578: ThreadDump                 [      13          0              0]      [0     0     0     0     0]  0   
88.999: ParallelGCFailedAllocation [      13          0              0]      [0     0     0     0     4]  0   
94.580: ThreadDump                 [      13          0              0]      [0     0     0     0     0]  0   
95.439: ParallelGCFailedAllocation [      13          0              0]      [0     0     0     0     2]  0   
95.579: ThreadDump                 [      13          0              0]      [0     0     0     0     0]  0   
101.579: ThreadDump                [      13          0              0]      [0     0     0     0     0]  0   
102.276: ParallelGCFailedAllocation[      13          0              0]      [0     0     0     0     5]  0   
102.579: ThreadDump                [      13          0              0]      [0     0     0     0     0]  0   
133.776: no vm operation           [      12          0              1]      [0     0     0     0   332]  0   

Polling page always armed
ThreadDump                       103
HeapDumper                         1
GC_HeapInspection                  1
ParallelGCFailedAllocation        16
ParallelGCSystemGC                 1
EnableBiasedLocking                1
RevokeBias                        57
BulkRevokeBias                     2
    0 VM operations coalesced during safepoint
Maximum sync time      2 ms
Maximum vm operation time (except for Exit VM operation)     93 ms

These stats indicate which VM operations were performed and the time they took. During a VM operation, all application threads are stopped (as for a STW GC).
  • -XX:+PrintGCApplicationStoppedTime Prints the time during which all application threads were stopped.
With this option you will get the following content:
Total time for which application threads were stopped: 0.0000257 seconds
6.253: [GC [PSYoungGen: 16448K->565K(19136K)] 16448K->565K(62848K), 0.0042058 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
Total time for which application threads were stopped: 0.0045092 seconds
Total time for which application threads were stopped: 0.0000916 seconds
Total time for which application threads were stopped: 0.0047405 seconds
Total time for which application threads were stopped: 0.0000869 seconds
Total time for which application threads were stopped: 0.0000285 seconds
Total time for which application threads were stopped: 0.0002229 seconds
Total time for which application threads were stopped: 0.0002159 seconds
13.310: [GC [PSYoungGen: 17013K->798K(19136K)] 17013K->798K(62848K), 0.0058890 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
Total time for which application threads were stopped: 0.0064142 seconds
26.148: [GC [PSYoungGen: 11951K->908K(19136K)] 11951K->908K(62848K), 0.0060538 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
26.154: [Full GC (System) [PSYoungGen: 908K->0K(19136K)] [PSOldGen: 0K->876K(43712K)] 908K->876K(62848K) [PSPermGen: 5363K->5363K(16384K)], 0.0385560 secs] [Times: user=0.03 sys=0.00, real=0.04 secs]
Total time for which application threads were stopped: 0.0450568 seconds
39.332: [GC [PSYoungGen: 16448K->150K(16256K)] 17349K->1051K(33664K), 0.0017511 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
Total time for which application threads were stopped: 0.0020282 seconds


This output is very useful since it shows the actual time during which all threads were stopped by VM operations like GC, thread dump, heap dump, bias revocation, deoptimization, etc.
As you can see, GCs may stop threads a little longer than what is printed for just the GC processing: the time needed for each thread to reach a safepoint adds overhead.
  • -XX:+HeapDumpBeforeFullGC Generates a heap dump just before a Full GC. If you do not expect a Full GC to occur during a certain period of time (like during the day) and one occurs anyway, you can take the time to analyze the heap dump to understand why it happened.
  • -XX:+PrintClassHistogramBeforeFullGC Prints a class histogram (number of instances & bytes allocated per class) in the GC log. A light version of a heap dump.
  • -XX:+HeapDumpOnOutOfMemoryError Generates a heap dump when an OutOfMemoryError occurs. Helps to determine which classes/objects caused the OOME.
  • -XX:HeapDumpPath=path Specifies where to put the heap dump (java_pid<pid>.hprof)

Monitoring 

Some options that are not related to GC are useful for monitoring JVM behavior.
  • -XX:+PrintCompilation Prints each method that is compiled by the JIT compiler.
With this option you will get the following content on stdout:
  1       java.lang.Object::<init> (1 bytes)
---   n   java.lang.System::currentTimeMillis (static)
  2       java.util.ArrayList::add (29 bytes)
  3       java.util.ArrayList::ensureCapacity (58 bytes)
---   n   java.lang.Thread::sleep (static)
  1%      com.bempel.sandbox.TestJIT::allocate @ 7 (84 bytes)
  4       com.bempel.sandbox.TestJIT::call (10 bytes)
  5       java.util.ArrayList::contains (14 bytes)
  6       java.util.ArrayList::indexOf (67 bytes)
  4   made not entrant  (2)  com.bempel.sandbox.TestJIT::call (10 bytes)
  7       com.bempel.sandbox.TestJIT::call (10 bytes)
  8       java.util.concurrent.CopyOnWriteArrayList::contains (22 bytes)
  9       java.util.concurrent.CopyOnWriteArrayList::indexOf (63 bytes)
  2%      com.bempel.sandbox.TestJIT::main @ 60 (83 bytes)

  • -XX:+LogCompilation (requires -XX:+UnlockDiagnosticVMOptions) Logs JIT compiler information into hotspot.log in the current directory. The content is in XML:
<task_queued compile_id='4' method='com/bempel/sandbox/TestJIT call (Ljava/util/List;)V' bytes='10' count='5000' backedge_count='1' iicount='10000' stamp='20.696' comment='count' hot_count='10000'/>
<task_queued compile_id='5' method='java/util/ArrayList contains (Ljava/lang/Object;)Z' bytes='14' count='5000' backedge_count='1' iicount='10000' stamp='20.696' comment='count' hot_count='10000'/>
<task_queued compile_id='6' method='java/util/ArrayList indexOf (Ljava/lang/Object;)I' bytes='67' count='5000' backedge_count='1' iicount='10000' stamp='20.696' comment='count' hot_count='10000'/>
<writer thread='2508'/>
<nmethod compile_id='5' compiler='C2' level='2' entry='0x024c8a00' size='788' address='0x024c8908' relocation_offset='208' code_offset='248' stub_offset='472' consts_offset='487' scopes_data_offset='496' scopes_pcs_offset='636' dependencies_offset='748' handler_table_offset='752' nul_chk_table_offset='776' oops_offset='488' method='java/util/ArrayList contains (Ljava/lang/Object;)Z' bytes='14' count='7000' backedge_count='1' iicount='12000' stamp='20.702'/>
<writer thread='1392'/>
<nmethod compile_id='4' compiler='C2' level='2' entry='0x024c7a40' size='776' address='0x024c7948' relocation_offset='208' code_offset='248' stub_offset='568' consts_offset='583' scopes_data_offset='604' scopes_pcs_offset='668' dependencies_offset='748' nul_chk_table_offset='756' oops_offset='584' method='com/bempel/sandbox/TestJIT call (Ljava/util/List;)V' bytes='10' count='7000' backedge_count='1' iicount='12000' stamp='20.703'/>
<writer thread='2508'/>
<nmethod compile_id='6' compiler='C2' level='2' entry='0x024c9700' size='728' address='0x024c9608' relocation_offset='208' code_offset='248' stub_offset='472' consts_offset='487' scopes_data_offset='492' scopes_pcs_offset='576' dependencies_offset='688' handler_table_offset='692' nul_chk_table_offset='716' oops_offset='488' method='java/util/ArrayList indexOf (Ljava/lang/Object;)I' bytes='67' count='7000' backedge_count='1' iicount='12000' stamp='20.707'/>
<writer thread='3508'/>
<uncommon_trap thread='3508' reason='class_check' action='maybe_recompile' compile_id='4' compiler='C2' level='2' stamp='21.700'>
<jvms bci='3' method='com/bempel/sandbox/TestJIT call (Ljava/util/List;)V' bytes='10' count='7000' backedge_count='1' iicount='12000'/>
</uncommon_trap>

This can be useful to understand why some methods are deoptimized.
  • -XX:+LogVMOutput (requires -XX:+UnlockDiagnosticVMOptions) Logs all the VM output (like PrintCompilation) to the default hotspot.log file in the current directory.
  • -XX:LogFile=file (requires -XX:+UnlockDiagnosticVMOptions) Specifies the path and name for hotspot.log
  • -XX:+PrintAssembly (requires -XX:+UnlockDiagnosticVMOptions) Prints the disassembly of methods compiled by the JIT compiler. See my post.
  • -XX:ErrorFile=file Specifies where the hs_err_<pid>.log file will be generated in case of a JVM crash.
  • -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=1234
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.authenticate=false
    Activates JMX remote connections to be able to connect with VisualVM.
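The data VisualVM reads over such a remote JMX connection comes from the platform MBean server, which you can also query in-process. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxLocal {
    public static void main(String[] args) throws Exception {
        // The same MBean server that -Dcom.sun.management.jmxremote exposes
        // remotely is available in-process:
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        Object heap = server.getAttribute(memory, "HeapMemoryUsage");
        System.out.println("HeapMemoryUsage = " + heap);
    }
}
```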

JIT Compiler

These options control the JIT compiler behavior:
  • -XX:CICompilerCount=n Specifies the number of JIT compiler threads used to compile methods
  • -XX:CompileThreshold=n Specifies the number of calls after which a method is considered hot and the JIT compiler compiles it.
  • -Xcomp Compiles methods the first time they are executed. (CompileThreshold=1)
  • -Xbatch Compiles methods in the foreground: execution is blocked until compilation is done.
  • -Xint Interpreted mode only, the JIT compiler is deactivated. (CompileThreshold=0)
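You can observe the effect of these thresholds through the standard CompilationMXBean. A sketch with hypothetical names; the loop count is chosen to exceed the default server-compiler CompileThreshold of 10000:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class WarmUp {
    static int counter;

    static void hotMethod() {
        counter++;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20000; i++) {
            hotMethod(); // after ~10000 calls the JIT should queue this method
        }
        // Total time spent by the JIT compiler threads so far
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        System.out.println(jit.getName() + " total compilation time: "
                + (jit.isCompilationTimeMonitoringSupported()
                   ? jit.getTotalCompilationTime() + " ms" : "n/a"));
    }
}
```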

Optimizations 

Some options enable or disable certain kinds of JIT optimizations:

  • -XX:+UseCompressedStrings Stores String objects with a byte array instead of a char array. Can reduce the consumed heap size. Beware of side effects! Disabled by default.
  • -XX:+UseBiasedLocking Enables Biased Locking. Allows code containing synchronized blocks to execute with virtually no overhead when there is no contention (only one thread executing the block). However, if another thread needs to revoke the bias, this can be costly (a Stop-The-World safepoint during revocation). Enabled by default.
  • -XX:+TraceBiasedLocking Gives extra information about the Biased Locking optimization. Disabled by default
  • -XX:+PrintBiasedLockingStatistics Gives extra information about the Biased Locking optimization. Disabled by default
  • -XX:+Inline Enables call inlining. Can be disabled for inspecting disassembly and virtual calls. Enabled by default.
  • -XX:+AggressiveOpts Enables some new optimizations that are not enabled by default. If the optimizations prove good, they will be enabled by default in a next release of the JDK.

Wednesday, October 24, 2012

Virtual Call 911

Now that we know how to print JIT-compiled code & how far the JIT compiler can optimize bytecode, let's dig into how it optimizes the most common operation in Java: a virtual call.

What is a virtual call?

A classic call is just a jump to a specific address where the code of the method resides, independently of the class of the object.

A virtual call is more complex: the method to be called depends on the class of the object. Each class can have a specific implementation of the method, so the compiler cannot resolve at compile time which implementation it should call. The resolution is based on the actual class of the instance.

Traditional object-oriented languages implement virtual calls by putting in each instance of a class a pointer to the Virtual Method Table (VMT), which contains all the virtual methods defined by the class.


So a virtual call needs to:
  1. Load the instance
  2. Follow the link to the VMT
  3. Jump to the address stored in the corresponding VMT entry.
For an inherited class, usually, a new VMT is created with the inherited methods. For an overridden method, the VMT entry is replaced by the overriding implementation.
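The dispatch mechanism above is what the following Java code triggers at each call site: the implementation executed depends on the runtime class, not on the declared type (Shape/Square are hypothetical classes, for illustration):

```java
public class DispatchDemo {
    static class Shape {
        double area() { return 0.0; }          // entry in Shape's VMT
    }

    static class Square extends Shape {
        final double side;
        Square(double side) { this.side = side; }
        @Override
        double area() { return side * side; }  // replaces the VMT entry in Square's VMT
    }

    // The compiler only knows 's' is a Shape; the actual method is
    // resolved through the object's class at run time.
    static double measure(Shape s) {
        return s.area();
    }

    public static void main(String[] args) {
        System.out.println(measure(new Shape()));      // 0.0
        System.out.println(measure(new Square(3.0)));  // 9.0
    }
}
```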

Virtual call in Java

Some OO languages like C++, C# & Delphi (yeah, I have some legacy with this language...) mix classic and virtual methods by marking them with modifiers like the virtual & override keywords.
However, in Java, there are no such modifiers, so all methods are virtual by default, except of course static methods, private methods and constructors.

Considering what each virtual call implies, we can expect a non-negligible overhead for each call! How can the JIT compiler help in this situation?

Final

The old-fashioned way to optimize this kind of call is to mark the method as final. The JIT compiler is then informed that no class will ever override this method, so no other implementation is expected.

Let's verify that with this example:

package com.bempel.sandbox;

public class TestJIT
{
    public static void main(String[] args) throws Exception
    {
        MyClass obj = new MyClass();        
        for (int i = 0; i < 10000; i++)
        {
            benchVMethod(obj);
        }
        Thread.sleep(1000);
    }
   
    public static void benchVMethod(MyClass obj)
    {
        obj.vmethod();
    }
    
    private static class MyClass 
    {
        public final void vmethod()
        {
            if (System.currentTimeMillis() == 0)
            {
                System.out.println("call to vmethod");
            }
        }
    }
}
Executed with the JVM arguments:
-server -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-Inline


We get:
[Entry Point]
[Disassembling for mach='i386']
[Verified Entry Point]
  # {method} 'benchVMethod' '(Lcom/bempel/sandbox/TestJIT$MyClass;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'com/bempel/sandbox/TestJIT$MyClass'
  #           [sp+0x10]  (sp of caller)
  0x02488fc0: mov    %eax,-0x3000(%esp)
  0x02488fc7: push   %ebp
  0x02488fc8: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@-1 (line 17)
  0x02488fce: test   %ecx,%ecx
  0x02488fd0: je     0x02488fe3
  0x02488fd2: nop
  0x02488fd3: call   0x0246d040         ; OopMap{off=24}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 17)
                                        ;   {optimized virtual_call}
  0x02488fd8: add    $0x8,%esp
  0x02488fdb: pop    %ebp
  0x02488fdc: test   %eax,0x110000      ;   {poll_return}
  0x02488fe2: ret    

Note: I have used -XX:-Inline to avoid the disassembly being cluttered by the inlining optimization.

As you can see, our invokevirtual on vmethod was translated into an "optimized virtual_call" to a fixed address, in other words a classic direct call.
To be sure we really have an optimized virtual call, let's re-activate the inlining optimizations:
[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'benchVMethod' '(Lcom/bempel/sandbox/TestJIT$MyClass;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'com/bempel/sandbox/TestJIT$MyClass'
  #           [sp+0x10]  (sp of caller)
  0x02468000: mov    %eax,-0x3000(%esp)
  0x02468007: push   %ebp
  0x02468008: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@-1 (line 25)
  0x0246800e: test   %ecx,%ecx
  0x02468010: je     0x02468028         ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 25)
  0x02468012: call   0x6ddfe3a0         ;*invokestatic currentTimeMillis
                                        ; - com.bempel.sandbox.TestJIT$MyClass::vmethod@0 (line 32)
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 25)
                                        ;   {runtime_call}
  0x02468017: mov    %eax,%ebx
  0x02468019: or     %edx,%ebx
  0x0246801b: je     0x02468035         ;*ifne
                                        ; - com.bempel.sandbox.TestJIT$MyClass::vmethod@5 (line 32)
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 25)
  0x0246801d: add    $0x8,%esp
  0x02468020: pop    %ebp
  0x02468021: test   %eax,0x110000      ;   {poll_return}
  0x02468027: ret    
You can see a call to System.currentTimeMillis inside benchVMethod, so vmethod was effectively inlined into benchVMethod. Inlining cannot be performed on regular virtual calls, only on classic direct calls.
What about removing the final modifier? Well, the result is strictly the same! Of course, we have only one class and no hierarchy, so let's add more classes:

package com.bempel.sandbox;

public class TestJIT
{
    public static void main(String[] args) throws Exception
    {
        MyClass obj = new MyClass();
        MyClass obj2 = new MyClass2();
        MyClass obj3 = new MyClass3();
        MyClass obj4 = new MyClass4();
        MyClass obj5 = new MyClass5();
        for (int i = 0; i < 10000; i++)
        {
            benchVMethod(obj);
            benchVMethod(obj2);
            benchVMethod(obj3);
            benchVMethod(obj4);
            benchVMethod(obj5);
        }
        Thread.sleep(1000);
    }

    public static void benchVMethod(MyClass obj)
    {
        obj.vmethod();
    }

    private static class MyClass
    {
        public void vmethod()
        {
            if (System.currentTimeMillis() == 0)
            {
                System.out.println("call to MyClass.vmethod");
            }
        }
    }

    private static class MyClass2 extends MyClass
    {
        @Override
        public void vmethod()
        {
            if (System.currentTimeMillis() == 0)
            {
                System.out.println("call to MyClass2.vmethod");
            }
        }
    }

    private static class MyClass3 extends MyClass
    {
        @Override
        public void vmethod()
        {
            if (System.currentTimeMillis() == 0)
            {
                System.out.println("call to MyClass3.vmethod");
            }
        }
    }

    private static class MyClass4 extends MyClass
    {
        @Override
        public void vmethod()
        {
            if (System.currentTimeMillis() == 0)
            {
                System.out.println("call to MyClass4.vmethod");
            }
        }
    }

    private static class MyClass5 extends MyClass
    {
        @Override
        public void vmethod()
        {
            if (System.currentTimeMillis() == 0)
            {
                System.out.println("call to MyClass5.vmethod");
            }
        }
    }
}

[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'benchVMethod' '(Lcom/bempel/sandbox/TestJIT$MyClass;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'com/bempel/sandbox/TestJIT$MyClass'
  #           [sp+0x10]  (sp of caller)
  0x023f8a80: mov    %eax,-0x3000(%esp)
  0x023f8a87: push   %ebp
  0x023f8a88: sub    $0x8,%esp
  0x023f8a8e: mov    $0xffffffff,%eax   ;   {oop(NULL)}
  0x023f8a93: call   0x023dd240         ; OopMap{off=24}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 25)
                                        ;   {virtual_call}
  0x023f8a98: add    $0x8,%esp
  0x023f8a9b: pop    %ebp
  0x023f8a9c: test   %eax,0x180000      ;   {poll_return}
  0x023f8aa2: ret                       ;*invokevirtual vmethod

Even with the inlining optimization on, we get the same result. However, there is still a call to a fixed address, so I suspect there is a stub here before the real VMT resolution. We have to rely on the comments and on inlining to verify virtual call optimizations.

Class Hierarchy Analysis

However, reducing the number of classes overriding vmethod to 2 gives us optimized virtual calls again (here, still without inlining):
[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'benchVMethod' '(Lcom/bempel/sandbox/TestJIT$MyClass;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'com/bempel/sandbox/TestJIT$MyClass'
  #           [sp+0x10]  (sp of caller)
  0x026b7fe0: mov    %eax,-0x3000(%esp)
  0x026b7fe7: push   %ebp
  0x026b7fe8: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@-1 (line 19)
  0x026b7fee: mov    0x4(%ecx),%ebx     ; implicit exception: dispatches to 0x026b8038
  0x026b7ff1: cmp    $0x581d158,%ebx    ;   {oop('com/bempel/sandbox/TestJIT$MyClass2')}
  0x026b7ff7: je     0x026b800a
  0x026b7ff9: cmp    $0x581cc18,%ebx    ;   {oop('com/bempel/sandbox/TestJIT$MyClass')}
  0x026b7fff: jne    0x026b801b
  0x026b8001: xchg   %ax,%ax
  0x026b8003: call   0x0269d040         ; OopMap{off=40}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 19)
                                        ;   {optimized virtual_call}
  0x026b8008: jmp    0x026b8010
  0x026b800a: nop
  0x026b800b: call   0x0269d040         ; OopMap{off=48}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 19)
                                        ;   {optimized virtual_call}
  0x026b8010: add    $0x8,%esp
  0x026b8013: pop    %ebp
  0x026b8014: test   %eax,0x180000      ;   {poll_return}
  0x026b801a: ret    
  0x026b801b: mov    %ecx,%ebp
  0x026b801d: mov    $0xffffffc6,%ecx
  0x026b8022: nop
  0x026b8023: call   0x0269c700         ; OopMap{ebp=Oop off=72}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 19)
                                        ;   {runtime_call}
  0x026b8028: int3   
  0x026b8029: mov    %eax,%ecx
  0x026b802b: jmp    0x026b802f
  0x026b802d: mov    %eax,%ecx
  0x026b802f: add    $0x8,%esp
  0x026b8032: pop    %ebp
  0x026b8033: jmp    0x026b86c0         ;   {runtime_call}
  0x026b8038: mov    $0xfffffff6,%ecx
  0x026b803d: xchg   %ax,%ax
  0x026b803f: call   0x0269c700         ; OopMap{off=100}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 19)
                                        ;   {runtime_call}
  0x026b8044: int3                      ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 19)

What is interesting here is that we have multiple optimized calls. Before each of them, the type of the instance is tested to see whether it is MyClass or MyClass2, and depending on the result, the corresponding method is called. We can see that the call address is the same in both cases; however, the comments say OopMap{off=40} and OopMap{off=48}, so we can assume this is a stub that dispatches to the real method.

This is what we call a bimorphic call. A monomorphic call is the first case, with only one class. A virtual call without any class test is a megamorphic call.

Since only 2 classes are involved here, and based on its analysis of the class hierarchy, the JIT compiler is able to optimize the virtual call as a special two-class case with just a test on the type.

Profiling

Another remark about the previous example: if the class is not one of those tested, we fall back to a virtual call, but there is an int3 instruction just after it. This instruction is used as a trap by the JIT compiler to trigger a deoptimization. The JIT compiler tries to perform optimistic optimizations: during the interpreted phase, the JVM collects statistics about execution and calls. When the number of invocations of a method reaches CompileThreshold, the JIT compiler kicks in and produces native code with optimizations based on the statistics gathered. So if the calls executed so far involved only one or two types, virtual calls are optimized accordingly. But if another class shows up later, those optimizations are no longer valid: the JIT compiler deoptimizes the previous code and produces new native code, usually with less aggressive optimizations, based on the new information.

So let's verify it: the code is modified to warm up the call with only one type, while 2 classes are used over the whole execution:

    public static void main(String[] args) throws Exception
    {
        MyClass obj = new MyClass();
        MyClass obj2 = new MyClass2();
        for (int i = 0; i < 10000; i++)
        {
            benchVMethod(obj);
        }
        Thread.sleep(1000);
        for (int i = 0; i < 10000; i++)
        {
            benchVMethod(obj2);
        }
        Thread.sleep(1000);
    }

When executed with only these options: -server -XX:+PrintCompilation
We see:

---   n   java.lang.System::currentTimeMillis (static)
  1       com.bempel.sandbox.TestJIT::benchVMethod (5 bytes)
  2       com.bempel.sandbox.TestJIT$MyClass::vmethod (17 bytes)
  1   made not entrant  (2)  com.bempel.sandbox.TestJIT::benchVMethod (5 bytes)
  3       com.bempel.sandbox.TestJIT::benchVMethod (5 bytes)
  4       com.bempel.sandbox.TestJIT$MyClass2::vmethod (17 bytes)

benchVMethod is compiled once (index 1) and then 'made not entrant', which means deoptimized. The JIT recompiles it (index 3). With -XX:+PrintAssembly we can now see the 2 versions of benchVMethod:

[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'benchVMethod' '(Lcom/bempel/sandbox/TestJIT$MyClass;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'com/bempel/sandbox/TestJIT$MyClass'
  #           [sp+0x10]  (sp of caller)
  0x024b8d00: mov    %eax,-0x3000(%esp)
  0x024b8d07: push   %ebp
  0x024b8d08: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@-1 (line 23)
  0x024b8d0e: mov    0x4(%ecx),%ebx     ; implicit exception: dispatches to 0x024b8d44
  0x024b8d11: cmp    $0x56dcc38,%ebx    ;   {oop('com/bempel/sandbox/TestJIT$MyClass')}
  0x024b8d17: jne    0x024b8d2b
  0x024b8d19: xchg   %ax,%ax
  0x024b8d1b: call   0x0249d040         ; OopMap{off=32}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 23)
                                        ;   {optimized virtual_call}
  0x024b8d20: add    $0x8,%esp
  0x024b8d23: pop    %ebp
  0x024b8d24: test   %eax,0x180000      ;   {poll_return}
  0x024b8d2a: ret    

[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'benchVMethod' '(Lcom/bempel/sandbox/TestJIT$MyClass;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'com/bempel/sandbox/TestJIT$MyClass'
  #           [sp+0x10]  (sp of caller)
  0x024b8f60: mov    %eax,-0x3000(%esp)
  0x024b8f67: push   %ebp
  0x024b8f68: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@-1 (line 23)
  0x024b8f6e: mov    0x4(%ecx),%ebx     ; implicit exception: dispatches to 0x024b8fb8
  0x024b8f71: cmp    $0x56dcc38,%ebx    ;   {oop('com/bempel/sandbox/TestJIT$MyClass')}
  0x024b8f77: je     0x024b8f8a
  0x024b8f79: cmp    $0x56dd178,%ebx    ;   {oop('com/bempel/sandbox/TestJIT$MyClass2')}
  0x024b8f7f: jne    0x024b8f9b
  0x024b8f81: xchg   %ax,%ax
  0x024b8f83: call   0x0249d040         ; OopMap{off=40}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 23)
                                        ;   {optimized virtual_call}
  0x024b8f88: jmp    0x024b8f90
  0x024b8f8a: nop
  0x024b8f8b: call   0x0249d040         ; OopMap{off=48}
                                        ;*invokevirtual vmethod
                                        ; - com.bempel.sandbox.TestJIT::benchVMethod@1 (line 23)
                                        ;   {optimized virtual_call}
  0x024b8f90: add    $0x8,%esp
  0x024b8f93: pop    %ebp
  0x024b8f94: test   %eax,0x180000      ;   {poll_return}
  0x024b8f9a: ret    

So in the first version we have a single test against one class, and in the second one, 2 tests for 2 classes. This is why it is so important to execute all code paths during warmup: it gives the JIT compiler all the information required to compile optimally the first time.
We can also note that since 2 classes are loaded, even the first version contains a test on the class, unlike our very first example where only one class was loaded: there, no subclass existed, so Class Hierarchy Analysis allowed the JIT to skip the test entirely.
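A common warmup pattern follows from this. The sketch below (hypothetical names, not the article's benchmark) interleaves all expected receiver types before the call site gets hot, so the profile seen by the JIT already covers every class and no later deoptimization is needed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class WarmupDemo {
    static long sink;  // consumed result, prevents the calls from being optimized away

    // The call site we want compiled with a complete type profile.
    static void bench(List<String> list) {
        sink += list.size();
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>();
        List<String> b = new CopyOnWriteArrayList<>();
        a.add("foo");
        b.add("foo");
        @SuppressWarnings("unchecked")
        List<String>[] lists = new List[] { a, b };
        // Interleave both implementations during warmup so that, when the
        // JIT compiles bench, its profile already contains both receiver
        // classes and the bimorphic dispatch is emitted on the first compile.
        for (int i = 0; i < 20000; i++) {
            bench(lists[i % lists.length]);
        }
        System.out.println(sink);  // 20000: each list holds one element
    }
}
```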

Interfaces


What about interfaces? The bytecode instruction is invokeinterface instead of invokevirtual; is it really different?

Let's verify it with a classic List:

package com.bempel.sandbox;

import java.util.ArrayList;
import java.util.List;

public class TestJIT
{
    private static void call(List list)
    {
        list.add("foo");
    }
    
    public static void main(String[] args) throws Exception
    {
        List[] lists = new List[] { new ArrayList()};
        for (int i = 0; i < 10000; i++)
            call(lists[i % lists.length]);
        Thread.sleep(1000);
    }
}

Disassembly code:

[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'call' '(Ljava/util/List;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'java/util/List'
  #           [sp+0x10]  (sp of caller)
  0x024c9080: mov    %eax,-0x3000(%esp)
  0x024c9087: push   %ebp
  0x024c9088: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::call@-1 (line 10)
  0x024c908e: mov    0x4(%ecx),%ebp     ; implicit exception: dispatches to 0x024c90c8
  0x024c9091: cmp    $0x568f6a0,%ebp    ;   {oop('java/util/ArrayList')}
  0x024c9097: jne    0x024c90af
  0x024c9099: mov    $0x56ecb40,%edx    ;   {oop("foo")}
  0x024c909e: nop
  0x024c909f: call   0x024ad040         ; OopMap{off=36}
                                        ;*invokeinterface add
                                        ; - com.bempel.sandbox.TestJIT::call@3 (line 10)
                                        ;   {optimized virtual_call}
  0x024c90a4: add    $0x8,%esp
  0x024c90a7: pop    %ebp
  0x024c90a8: test   %eax,0x110000      ;   {poll_return}
  0x024c90ae: ret    

It looks very similar to invokevirtual. Since multiple implementations of List are loaded into the JVM, there is a test against the ArrayList class. Let's add another implementation to our test:

package com.bempel.sandbox;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class TestJIT
{
    private static void call(List list)
    {
        list.add("foo");
    }
    
    public static void main(String[] args) throws Exception
    {
        List[] lists = new List[] { new ArrayList(), new CopyOnWriteArrayList()};
        for (int i = 0; i < 12000; i++)
            call(lists[i % lists.length]);
        Thread.sleep(1000);
    }
}

Note: I do not know yet why we need to increase to 12,000 calls to get the call method compiled...

And now the disassembly code:

[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'call' '(Ljava/util/List;)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = 'java/util/List'
  #           [sp+0x10]  (sp of caller)
  0x02507fe0: mov    %eax,-0x3000(%esp)
  0x02507fe7: push   %ebp
  0x02507fe8: sub    $0x8,%esp          ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::call@-1 (line 11)
  0x02507fee: mov    0x4(%ecx),%ebx     ; implicit exception: dispatches to 0x0250803c
  0x02507ff1: mov    $0x5730520,%edx    ;   {oop("foo")}
  0x02507ff6: cmp    $0x572fe18,%ebx    ;   {oop('java/util/concurrent/CopyOnWriteArrayList')}
  0x02507ffc: je     0x0250800e
  0x02507ffe: cmp    $0x56cf6a0,%ebx    ;   {oop('java/util/ArrayList')}
  0x02508004: jne    0x0250801f
  0x02508006: nop
  0x02508007: call   0x024ed040         ; OopMap{off=44}
                                        ;*invokeinterface add
                                        ; - com.bempel.sandbox.TestJIT::call@3 (line 11)
                                        ;   {optimized virtual_call}
  0x0250800c: jmp    0x02508014
  0x0250800e: nop
  0x0250800f: call   0x024ed040         ; OopMap{off=52}
                                        ;*invokeinterface add
                                        ; - com.bempel.sandbox.TestJIT::call@3 (line 11)
                                        ;   {optimized virtual_call}
  0x02508014: add    $0x8,%esp
  0x02508017: pop    %ebp
  0x02508018: test   %eax,0x180000      ;   {poll_return}
  0x0250801e: ret    

See? A bimorphic call on the add method, with 2 tests against the CopyOnWriteArrayList and ArrayList classes.

Conclusion

Unlike this failed attempt, we have demonstrated here that abstraction does not slow you down! You can call your methods through interfaces or abstract classes without any doubt about performance. So use the List or Map interfaces instead of concrete types for variables and fields; the JIT compiler will optimize the calls for you.
Also, object getters and setters are largely optimized through monomorphic calls followed by inlining!
Finally, the final keyword is no longer required to help the JIT compiler, so let's drop it, since in some cases it prevents you from overriding methods!
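For instance, with a plain getter like the sketch below (hypothetical class), a hot call site that only ever sees one receiver class becomes a monomorphic call that the JIT can devirtualize and inline down to a simple field load:

```java
public class Point {
    private final int x;

    public Point(int x) { this.x = x; }

    // A hot, monomorphic call to getX() is typically devirtualized and
    // inlined by the JIT, leaving just the field load in the caller.
    public int getX() { return x; }

    public static void main(String[] args) {
        long sum = 0;
        Point p = new Point(3);
        for (int i = 0; i < 10000; i++) {
            sum += p.getX();  // after inlining: effectively sum += p.x
        }
        System.out.println(sum);  // 30000
    }
}
```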

Tuesday, October 16, 2012

How to print disassembly from JIT code ?

In my previous post, I showed you some disassembly of JIT-generated native code. How can you do the same at home?

I found the relevant information in this wiki page.

First, you need to get the disassembler plugin, hsdis (hsdis-i386.dll on 32-bit Windows).
Put this disassembler into the bin directory of your JRE: C:\Program Files\Java\jre6\bin
or of your JDK: C:\Program Files\Java\jdk1.6.0_35\jre\bin

Then launch your java command with the following options:
-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

Then you will get:
"C:\Program Files (x86)\Java\jdk1.6.0_23\jre\bin\java" -server
-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly com.bempel.sandbox.TestJIT
Java HotSpot(TM) Server VM warning: PrintAssembly is enabled; turning on
DebugNonSafepoints to gain additional output
Loaded disassembler from hsdis-i386.dll
Decoding compiled method 0x02548708:
Code:
[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'run' '()V' in 'com/bempel/sandbox/TestJIT$C'
[...]
Enjoy !

Update: to get Intel syntax instead of AT&T, use -XX:PrintAssemblyOptions=intel

Tuesday, October 9, 2012

Volatile

The volatile Java keyword is already explained in many places around the web (like in What volatile Means in Java, for example). So in this post I will not explain it once again; instead, I will focus on what it means in terms of the native code produced by the JIT compiler.

Safety tips: Do try this at home ! ;-)

So let's take a classic example of boolean variable used to go out of a thread loop:
 
public class TestJIT
{
    public static void main(String[] args) throws Exception
    {
        C c = new C();
        c.start();
        Thread.sleep(1*1000);
        c.stopNow();
        c.join();
    }
   
    private static class C extends Thread
    {
        private boolean running = true;
        private int i;
       
        @Override
        public void run()
        {
            while (running)
            {
                i++;
            }
            if (System.currentTimeMillis() == 0)
            {
                System.out.println(i);
            }
        }
       
        public void stopNow()
        {
            running = false;
        }
    }
}
Basically, in this code, I start a thread, wait for 1 second, and then notify the thread to stop by setting the running boolean to false.

If you compile and execute this code (with -server on Windows x86 in my case), you will observe that it never ends!

Easy, you think! It's because the boolean is not volatile: the loop thread does not see the variable changing, because it is never re-read from main memory.

OK, makes sense. Let's see for ourselves by disassembling the JIT code (I will explain how to do this in a future post):
[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'run' '()V' in 'TestJIT$C'
  0x02608ec0: int3  
  0x02608ec1: xchg   %ax,%ax
  0x02608ec4: mov    %eax,0xffffd000(%esp)
  0x02608ecb: push   %ebp
  0x02608ecc: sub    $0x18,%esp
  0x02608ed2: mov    (%ecx),%ebp
  0x02608ed4: mov    %ecx,(%esp)
  0x02608ed7: call   0x6dbadeb0         ;   {runtime_call}
  0x02608edc: mov    0x4(%ebp),%ecx     ; implicit exception: disp. to x02608f1d
  0x02608edf: cmp    $0x576e470,%ecx    ;   {oop('TestJIT$C')}
  0x02608ee5: jne    0x02608f11         ;*aload_0
                                        ; - TestJIT$C::run@3 (line 24)
  0x02608ee7: movzbl 0x70(%ebp),%ecx    ;*getfield running
                                        ; - TestJIT$C::run@14 (line 22)
  0x02608eeb: mov    $0x1,%edi
  0x02608ef0: add    0x6c(%ebp),%edi    ;*iadd
                                        ; - TestJIT$C::run@9 (line 24)
  0x02608ef3: mov    %edi,0x6c(%ebp)    ;*putfield i
                                        ; - TestJIT$C::run@10 (line 24)
  0x02608ef6: test   %ecx,%ecx
  0x02608ef8: je     0x02608f06         ; OopMap{ebp=Oop off=58}
                                        ;*ifne
                                        ; - TestJIT$C::run@17 (line 22)
  0x02608efa: test   %edi,0x110000      ;*ifne
                                        ; - TestJIT$C::run@17 (line 22)
                                        ;   {poll}
  0x02608f00: inc    %edi               ;*iadd
                                        ; - TestJIT$C::run@9 (line 24)
  0x02608f01: mov    %edi,0x6c(%ebp)    ;*putfield i
                                        ; - TestJIT$C::run@10 (line 24)
  0x02608f04: jmp    0x02608efa
  0x02608f06: mov    $0xffffff86,%ecx
  0x02608f0b: call   0x025ec700         ; OopMap{ebp=Oop off=80}
                                        ;*aload_0
                                        ; - TestJIT$C::run@13 (line 22)
                                        ;   {runtime_call}
  0x02608f10: int3  
  0x02608f11: mov    $0xffffffad,%ecx
  0x02608f16: nop   
  0x02608f17: call   0x025ec700         ; OopMap{ebp=Oop off=92}
                                        ;*aload_0
                                        ; - TestJIT$C::run@3 (line 24)
                                        ;   {runtime_call}
  0x02608f1c: int3                      ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)
  0x02608f1d: mov    $0xfffffff6,%ecx
  0x02608f22: nop   
  0x02608f23: call   0x025ec700         ; OopMap{off=104}
                                        ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)
                                        ;   {runtime_call}
  0x02608f28: int3                      ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)
Look at the loop at addresses 0x02608efa to 0x02608f04: it ends with an unconditional jump (jmp instruction) back to its first instruction. It means that the JIT in fact produced an infinite loop, equivalent to while (true) i++;

There is also no emitted code for the rest of the run method, meaning that the test on System.currentTimeMillis and the System.out.println lines were optimized away, since they are now unreachable.
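At the source level, the change under discussion is a single modifier on the field. A minimal standalone sketch (hypothetical class name) of the corrected stop-flag pattern:

```java
// Minimal sketch: the same stop-flag pattern with the field marked
// volatile, which forbids the JIT from hoisting the read out of the loop.
public class StopFlag extends Thread {
    private volatile boolean running = true;
    private int i;

    @Override
    public void run() {
        while (running) {   // re-read from memory on every iteration
            i++;
        }
    }

    public void stopNow() {
        running = false;
    }

    public static void main(String[] args) throws Exception {
        StopFlag t = new StopFlag();
        t.start();
        Thread.sleep(100);
        t.stopNow();
        t.join();  // terminates, because the loop sees the updated flag
        System.out.println("stopped");
    }
}
```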

So let's add the volatile keyword to our running boolean and see what happens. Firstly, the code now terminates. Secondly, look at the generated code:
[Disassembling for mach='i386']
[Entry Point]
[Verified Entry Point]
  # {method} 'run' '()V' in 'TestJIT$C'
  0x026d8f00: int3  
  0x026d8f01: xchg   %ax,%ax
  0x026d8f04: mov    %eax,0xffffd000(%esp)
  0x026d8f0b: push   %ebp
  0x026d8f0c: sub    $0x18,%esp
  0x026d8f12: mov    (%ecx),%ebp
  0x026d8f14: mov    %ecx,(%esp)
  0x026d8f17: call   0x6dbadeb0         ;   {runtime_call}
  0x026d8f1c: mov    0x4(%ebp),%ecx     ; implicit exception: disp. to 0x026d8f5d
  0x026d8f1f: cmp    $0x583e470,%ecx    ;   {oop('TestJIT$C')}
  0x026d8f25: jne    0x026d8f51         ;*aload_0
                                        ; - TestJIT$C::run@3 (line 24)
  0x026d8f27: incl   0x6c(%ebp)         ;*putfield i
                                        ; - TestJIT$C::run@10 (line 24)
  0x026d8f2a: jmp    0x026d8f37
  0x026d8f2c: nopl   0x0(%eax)
  0x026d8f30: mov    0x6c(%ebp),%ecx    ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)
  0x026d8f33: inc    %ecx
  0x026d8f34: mov    %ecx,0x6c(%ebp)    ;*aload_0
                                        ; - TestJIT$C::run@13 (line 22)
  0x026d8f37: movzbl 0x70(%ebp),%ebx    ; OopMap{ebp=Oop off=59}
                                        ;*ifne
                                        ; - TestJIT$C::run@17 (line 22)
  0x026d8f3b: test   %edi,0x180000      ;   {poll}
  0x026d8f41: test   %ebx,%ebx
  0x026d8f43: jne    0x026d8f30         ;*ifne
                                        ; - TestJIT$C::run@17 (line 22)
  0x026d8f45: mov    $0x18,%ecx
  0x026d8f4a: nop   
  0x026d8f4b: call   0x026bc700         ; OopMap{ebp=Oop off=80}
                                        ;*invokestatic currentTimeMillis
                                        ; - TestJIT$C::run@20 (line 26)
                                        ;   {runtime_call}
  0x026d8f50: int3  
  0x026d8f51: mov    $0xffffffad,%ecx
  0x026d8f56: nop   
  0x026d8f57: call   0x026bc700         ; OopMap{ebp=Oop off=92}
                                        ;*aload_0
                                        ; - TestJIT$C::run@3 (line 24)
                                        ;   {runtime_call}
  0x026d8f5c: int3                      ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)
  0x026d8f5d: mov    $0xfffffff6,%ecx
  0x026d8f62: nop   
  0x026d8f63: call   0x026bc700         ; OopMap{off=104}
                                        ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)
                                        ;   {runtime_call}
  0x026d8f68: int3                      ;*getfield i
                                        ; - TestJIT$C::run@5 (line 24)

Now we have something very different in the loop: we load the field i from memory into the ecx register, increment ecx, store the result back to its memory location, then load the running field from memory into the ebx register and test whether it equals 0; if not, we jump back to the first instruction of the loop.

The comments mention a call to currentTimeMillis; in fact, the 2 next calls target the same address. Since my loop was compiled with OSR (On-Stack Replacement) optimization (you can check with -XX:+PrintCompilation), this is a call to a C2I (Compiled-to-Interpreter) adapter, emitted to go back to interpreted mode for the rest of the method, since the whole method is not yet considered a hot spot.

Note: movzbl is basically a move with zero extension.

So, in conclusion, is there a bug in the server JIT (C2)? I do not think so! Without the volatile keyword, C2 is free to optimize the code optimistically (aggressively?), since you put no hard guarantee on your boolean, as specified in the Memory Model of the JLS.


Wednesday, October 3, 2012

Introduction

The idea of this blog came from multiple discussions I had with my colleagues about how the HotSpot JIT compiler is able to optimize some Java code, and how good the produced native code is. My findings may be worth sharing, so here they are!

I will also post about any performance topics on Java, C#/.NET, OS or hardware.

Let's rock !