Template Version: @(#)sac_nextcase 1.70 05/10/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.

1. Introduction
    1.1. Project/Component Working Name:
         Performance Improvements for libmtmalloc
    1.2. Name of Document Author/Supplier:
         Author: Rick Weisner
    1.3. Date of This Document:
         18 June, 2010

4. Technical Description

SUMMARY

libmtmalloc has shown poor scalability in the following two
situations:

1. When there are large numbers of allocating threads (see CR6922229).
2. When the allocation size is larger than 64 KB (see CR6555149).

We will remedy these scalability issues by:

1) Using atomic operations to eliminate the cache lock in libmtmalloc.
2) Providing a mechanism whereby the parent lock can also be
   eliminated for threads whose ID is less than 2 * the number of CPUs.
3) Making the maximum cacheable request size tunable via an
   environment variable.

BACKGROUND

libmtmalloc organizes available address space into buckets. Each
thread that calls malloc is assigned a bucket based upon its thread
ID. The per-bucket parent lock controls the use of each bucket. Each
bucket is a list of caches based on size. Each list is protected by a
cache lock. Applications with a large number of allocating threads may
have their performance limited by contention for these locks.
Applications of this sort are not unusual in the Telco space.

Larger allocation sizes are also becoming more common. With 64-bit
applications, terabytes of memory, and hundreds of threads, it is
advantageous to be able to adjust the maximum cacheable request size
to better suit the needs of the application.

PROBLEM

A customer's application did not perform as needed on a Netra 5440.
DTrace indicated lock contention relating to memory allocation in
libmtmalloc. The customer supplied code that yielded dramatic
performance increases by eliminating the "cache" locks and "parent"
locks from libmtmalloc and replacing them with atomic operations.
The customer's code was not threadsafe in general but was promising.

In a different case the customer states:

    We observed that db is hitting oversize_lock mutex due to the
    memory needed to be allocated is more than MAX_CACHED. Sometimes
    acquiring the oversize_lock mutex is taking more than 2sec,
    causing the db performance to degrade. (see 6555149)

PROPOSAL

1) Eliminate the cache lock by using atomic operations.

2) Add a new option to mallocctl(3MALLOC) that activates the use of
   exclusive buckets for threads whose ID is < 2 * the number of CPUs.
   The value argument associated with the mallocctl option is ignored.
   Once the option has been set there is no facility to 'unset' it.

   A new environment variable, MTMALLOC_OPTIONS, will be introduced.
   It will consist of a comma-separated list of options in the style
   of umem_debug. We will support the following options:

       MTEXCLUSIVE=Y or y or yes or Yes, or anything that starts
           with y, to enable the use of exclusive buckets,
       MTMAXCACHE=16,17,18,19,20 or 21 (see below), and/or
       MTCHUNKSIZE=xx where xx is a number used to size the
           buckets. The default is 9.

   Invalid options are silently ignored.

   This feature is needed for situations where the source code is
   unavailable. It will also assist in performance analysis.

3) Introduce the option MTMAXCACHE in the environment variable
   MTMALLOC_OPTIONS, which will set the maximum request size that is
   cached. It will take values from 16 to 21. The default is 16, which
   means that requests smaller than 2^16 bytes are cached. With the
   value 21 we can support request sizes of up to 2 MB (2^21) in
   cache. If MTMAXCACHE is set to a value outside this range, then
   either 16 or 21 is used (whichever bound the value has broken).

   It is necessary to use an environment variable instead of a
   mallocctl interface because MTMAXCACHE must be determined before
   malloc_init calls setup_caches.
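
The option handling described in the proposal could be sketched
roughly as follows. This is an illustration only, not the libmtmalloc
implementation: the names mt_options_t, mt_parse_options, mt_name_is
and mt_parse_num are hypothetical, and treating non-numeric values or
a non-positive MTCHUNKSIZE as invalid (and therefore ignored) is an
assumption drawn from "Invalid options are silently ignored".

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Bounds and defaults as stated in the proposal text. */
#define MT_MAXCACHE_MIN   16
#define MT_MAXCACHE_MAX   21
#define MT_CHUNKSIZE_DEF  9

/* Hypothetical container for the parsed settings. */
typedef struct mt_options {
    int  exclusive;  /* nonzero when MTEXCLUSIVE starts with y or Y */
    int  maxcache;   /* exponent n: requests below 2^n are cached */
    long chunksize;  /* number used to size the buckets */
} mt_options_t;

/* True when the len-byte name at tok equals the option name. */
static int
mt_name_is(const char *tok, size_t len, const char *name)
{
    return (strlen(name) == len && strncmp(tok, name, len) == 0);
}

/* Parse a decimal value that must fill the token up to tok_end. */
static int
mt_parse_num(const char *val, const char *tok_end, long *out)
{
    char *end;
    long v = strtol(val, &end, 10);

    if (end == val || end != tok_end)
        return (0);      /* empty or trailing junk: treat as invalid */
    *out = v;
    return (1);
}

/* Walk a comma-separated list such as "MTEXCLUSIVE=Y,MTMAXCACHE=17". */
void
mt_parse_options(const char *env, mt_options_t *opts)
{
    opts->exclusive = 0;
    opts->maxcache = MT_MAXCACHE_MIN;
    opts->chunksize = MT_CHUNKSIZE_DEF;

    if (env == NULL)
        return;

    const char *tok = env;
    while (*tok != '\0') {
        const char *tok_end = strchr(tok, ',');
        if (tok_end == NULL)
            tok_end = tok + strlen(tok);

        const char *eq = memchr(tok, '=', (size_t)(tok_end - tok));
        if (eq != NULL) {
            size_t name_len = (size_t)(eq - tok);
            const char *val = eq + 1;
            long v;

            if (mt_name_is(tok, name_len, "MTEXCLUSIVE")) {
                if (*val == 'y' || *val == 'Y')
                    opts->exclusive = 1;
            } else if (mt_name_is(tok, name_len, "MTMAXCACHE")) {
                if (mt_parse_num(val, tok_end, &v)) {
                    /* Out-of-range values snap to the broken bound. */
                    if (v < MT_MAXCACHE_MIN)
                        v = MT_MAXCACHE_MIN;
                    if (v > MT_MAXCACHE_MAX)
                        v = MT_MAXCACHE_MAX;
                    opts->maxcache = (int)v;
                }
            } else if (mt_name_is(tok, name_len, "MTCHUNKSIZE")) {
                /* Assumption: only positive sizes are meaningful. */
                if (mt_parse_num(val, tok_end, &v) && v > 0)
                    opts->chunksize = v;
            }
            /* Unknown option names fall through and are ignored. */
        }
        tok = (*tok_end == ',') ? tok_end + 1 : tok_end;
    }
}
```

In this sketch a caller would pass getenv("MTMALLOC_OPTIONS") before
the caches are set up. A setting such as MTMAXCACHE=99 ends up as 21
and MTMAXCACHE=3 ends up as 16, matching the clamping rule above.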

Here is an example MTMALLOC_OPTIONS:

    export MTMALLOC_OPTIONS="MTEXCLUSIVE=Y,MTMAXCACHE=17,MTCHUNKSIZE=64"

DETAILS

The code has been developed and tested in 64-bit mode on Solaris 10 u6
on a Netra T5440. The test harness uses a configurable number of
allocation threads, a configurable sample count, and a configurable
"maximum" allocation size. Each allocation thread performs a
configurable number of random or fixed-size allocations, with sizes
between 8 and the requested "max" allocation size + 1/2 the "max"
allocation size. A freeing thread then releases the allocations while
the allocating thread performs a fresh set of allocations.

In initial testing with "stock" libmtmalloc it was possible to do 6300
64-bit operations per second on the N5440. With the "atomic" library
this increases to 15000.

COMMENTS

Exported Interfaces:

    MTEXCLUSIVE         Stable   Option for mallocctl(3MALLOC).
    MTMALLOC_OPTIONS    Stable   Shell environment variable supporting
                                 the following options:
                                   MTEXCLUSIVE=Y
                                   MTMAXCACHE=16,17,18,19,20 or 21
                                   MTCHUNKSIZE=xx where xx is a number

Reference:

    6922229 libmtmalloc would benefit from atomic operations
    6555149 poor performance with libmtmalloc compared to libc
    6956786 Provide a tunable to tweak the MAX_CACHED threshold in libmtmalloc

DELIVERY VEHICLE
    Solaris
RELEASE
    Patch
COMMITMENT LEVEL

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack