Template Version: @(#)sac_nextcase 1.70 05/10/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
1. Introduction
    1.1. Project/Component Working Name:
	 Performance Improvements for libmtmalloc
    1.2. Name of Document Author/Supplier:
	 Author:  Rick Weisner
    1.3  Date of This Document:
	18 June, 2010
4. Technical Description

    SUMMARY

        libmtmalloc has shown poor scalability in the following
        two situations:

	1. When there are large numbers of allocating threads.
           (see CR6922229)

           and 

        2. When the allocation size is larger than 64 KB.
           (see CR6555149)

        We will remedy the above scalability issues by:

        1) Using atomic operations to eliminate the cache lock in 
        libmtmalloc.

        2) Providing a mechanism whereby the parent lock can also
        be eliminated for threads whose ID is less than 2 * the number
        of CPUs.

        3) Making the maximum cacheable request size tunable via an
        environment variable.

    BACKGROUND
        libmtmalloc organizes available address space into buckets.
        Each thread which calls malloc is assigned a bucket based
        upon its thread id. The per bucket parent lock controls 
        the use of each bucket. Each bucket is a list of caches
        based on size. Each list is protected by a cache lock. 
        Applications with a large number of allocating threads may
        have their performance limited by contention for these locks.
        Applications of this sort are not unusual in the Telco space.

        Larger allocation sizes are also becoming more common. With
        64-bit applications, terabytes of memory, and hundreds of
        threads, it is advantageous to be able to adjust the
        maximum cacheable request size to better suit the needs
        of the application.

    PROBLEM
        A customer's application did not perform as needed on a
        Netra T5440. DTrace indicated lock contention relating to
        memory allocation in libmtmalloc. The customer supplied
        code that gave dramatic performance increases by
        eliminating the "cache" locks and "parent" locks from
        libmtmalloc and replacing them with atomic operations.
        The customer's code was not thread-safe in general, but the
        approach was promising.
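
        As an illustration of replacing a lock with an atomic
        operation, the sketch below claims a free block from a
        bitmask with a compare-and-swap loop. It is only a sketch
        under stated assumptions: the structure cache_sketch_t, its
        free_mask field, and both function names are hypothetical
        and are not taken from libmtmalloc. It uses C11
        <stdatomic.h> for portability; Solaris code would more
        likely use atomic_cas_32() from atomic_ops(3C).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical cache header: one bit per block, 1 = free. */
typedef struct {
	_Atomic uint32_t free_mask;
} cache_sketch_t;

/*
 * Claim one free block without taking a lock.
 * Returns the block index (0..31), or -1 if no block is free.
 */
static int
cache_claim_block(cache_sketch_t *cp)
{
	uint32_t old = atomic_load(&cp->free_mask);

	while (old != 0) {
		uint32_t bit = old & (~old + 1u);	/* lowest set bit */

		/* On failure, old is refreshed with the current mask. */
		if (atomic_compare_exchange_weak(&cp->free_mask, &old,
		    old & ~bit)) {
			int idx = 0;

			while ((bit >>= 1) != 0)
				idx++;
			return (idx);
		}
	}
	return (-1);
}

/* Return a block to the cache by setting its bit again. */
static void
cache_release_block(cache_sketch_t *cp, int idx)
{
	atomic_fetch_or(&cp->free_mask, (uint32_t)1 << idx);
}
```

        Two threads racing for the same bit cannot both succeed:
        the loser's compare-and-swap fails and it retries against
        the updated mask.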

        In a different case the customer states:

	We observed that db is hitting oversize_lock mutex due to the
	memory needed to be allocated is more than MAX_CACHED.
	Sometimes acquiring the oversize_lock mutex is taking more 
	than 2sec, causing the db performance to degrade. (see 6555149)
	

    PROPOSAL

	1) Eliminate the cache lock by using atomic operations.

	2) Add a new option to mallocctl(3MALLOC) that activates
	the use of exclusive buckets for threads whose ID is < 2 *
	the number of CPUs. 

	The value argument associated with the mallocctl option is
	ignored. Once the option has been called there is no facility
	to 'unset' it.  

        A new environment variable will be introduced, MTMALLOC_OPTIONS.
        It will consist of a comma separated list of options in the
        style of umem_debug. We will support the following options:
        MTEXCLUSIVE=Y, y, yes, Yes, or any value that starts with
        y or Y, to enable the use of exclusive buckets;
        MTMAXCACHE=16, 17, 18, 19, 20, or 21 (see below); and/or
        MTCHUNKSIZE=xx, where xx is a number used to size the buckets.
        The default is 9.
        Invalid options are silently ignored.

        This feature is needed for situations where the source code is
        unavailable. This feature will also assist in performance
        analysis.

        3) Introduce the option MTMAXCACHE in the environment variable
        MTMALLOC_OPTIONS, which sets the maximum request size that
        is cached. It accepts values from 16 to 21. The default is
        16, which means that requests smaller than 2^16 bytes are
        cached. With the maximum value of 21, request sizes of up to
        2 MB (2^21) can be cached.

        If MTMAXCACHE is set to a value outside this range, the
        nearest bound is used instead: 16 for values below the range
        and 21 for values above it.

        It is necessary to use an environment variable instead of
        a mallocctl interface because the MTMAXCACHE must be determined
        before malloc_init calls setup_caches.
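
        The clamping rule and the resulting byte threshold can be
        sketched as follows. The macro and function names are
        hypothetical and chosen for illustration only; they are not
        the names used in libmtmalloc.

```c
#include <stddef.h>

#define	MTMAXCACHE_MIN	16	/* default: 2^16 = 64 KB */
#define	MTMAXCACHE_MAX	21	/* upper bound: 2^21 = 2 MB */

/* Force an out-of-range exponent to the bound it has broken. */
static int
clamp_maxcache(long v)
{
	if (v < MTMAXCACHE_MIN)
		return (MTMAXCACHE_MIN);
	if (v > MTMAXCACHE_MAX)
		return (MTMAXCACHE_MAX);
	return ((int)v);
}

/* Convert the (clamped) exponent into a size in bytes. */
static size_t
max_cached_bytes(long v)
{
	return ((size_t)1 << clamp_maxcache(v));
}
```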

        Here is an example MTMALLOC_OPTIONS:
	export MTMALLOC_OPTIONS="MTEXCLUSIVE=Y,MTMAXCACHE=17,MTCHUNKSIZE=64"
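
        A minimal sketch of how such a string might be parsed is
        shown below. The type mt_opts_t and the function
        parse_mtmalloc_options() are hypothetical, as are the
        256-byte buffer limit and the rejection of non-positive
        MTCHUNKSIZE values. Unknown or malformed options are
        skipped, matching the "silently ignored" rule above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed settings. */
typedef struct {
	int	exclusive;	/* MTEXCLUSIVE: 1 if value starts with y/Y */
	long	maxcache;	/* MTMAXCACHE: exponent, clamped to 16..21 */
	long	chunksize;	/* MTCHUNKSIZE: default 9 */
} mt_opts_t;

static void
parse_mtmalloc_options(const char *env, mt_opts_t *o)
{
	char buf[256];		/* assumed limit for this sketch */
	char *p, *next, *val, *end;
	long v;

	o->exclusive = 0;
	o->maxcache = 16;
	o->chunksize = 9;
	if (env == NULL)
		return;

	/* Work on a private copy, since parsing writes NULs into it. */
	strncpy(buf, env, sizeof (buf) - 1);
	buf[sizeof (buf) - 1] = '\0';

	for (p = buf; p != NULL && *p != '\0'; p = next) {
		next = strchr(p, ',');
		if (next != NULL)
			*next++ = '\0';
		val = strchr(p, '=');
		if (val == NULL)
			continue;	/* no value: ignore silently */
		*val++ = '\0';

		if (strcmp(p, "MTEXCLUSIVE") == 0) {
			if (*val == 'y' || *val == 'Y')
				o->exclusive = 1;
		} else if (strcmp(p, "MTMAXCACHE") == 0) {
			v = strtol(val, &end, 10);
			if (end != val)
				o->maxcache = v < 16 ? 16 : (v > 21 ? 21 : v);
		} else if (strcmp(p, "MTCHUNKSIZE") == 0) {
			v = strtol(val, &end, 10);
			if (end != val && v > 0)	/* assumption */
				o->chunksize = v;
		}
		/* Any other option name falls through and is ignored. */
	}
}
```

        Because MTMAXCACHE must be known before malloc_init calls
        setup_caches, a routine like this would have to run on the
        value of getenv("MTMALLOC_OPTIONS") early in initialization.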

    DETAILS

        The code has been developed and tested in 64-bit mode on
        Solaris 10 u6 on a Netra T5440. The test harness uses a
        configurable number of allocation threads, a configurable
        sample count, and a configurable "maximum" allocation size.
        Each allocation thread performs a configurable number of
        random or fixed-size allocations, with sizes between 8 and
        the requested "max" allocation size plus half the "max"
        allocation size.

        A freeing thread then releases the allocations while the 
        allocating thread performs a fresh set of allocations.

        In initial testing with "stock" libmtmalloc it was possible to do
        6300 64-bit operations per second on the T5440. With the "atomic"
        library this increases to 15000.

    COMMENTS
	Exported Interfaces:

	MTEXCLUSIVE	  Stable	option for mallocctl(3MALLOC).
        

	MTMALLOC_OPTIONS  Stable	Shell environment variable supporting
					the following options:
					MTEXCLUSIVE=Y
					MTMAXCACHE=16,17,18,19,20 or 21
                                        MTCHUNKSIZE=xx where xx is a number

	Reference:
	6922229 libmtmalloc would benefit from atomic operations
	6555149 poor performance with libmtmalloc compared to libc
	6956786 Provide a tunable to tweak the MAX_CACHED threshold
		in libmtmalloc

    DELIVERY VEHICLE

	Solaris

    RELEASE

	Patch

    COMMITMENT LEVEL

6. Resources and Schedule
    6.4. Steering Committee requested information
   	6.4.1. Consolidation C-team Name:
		ON
    6.5. ARC review type: FastTrack