The release binding is Patch/Micro. All file names are relative to the
materials directory in the case.

SUMMARY
=======

This document explains the architecture of the observability and control
tools needed for Memory Placement Optimization (MPO). It proposes new
tools, additions to existing tools, and corrections to previously
proposed but not yet integrated tools.

The new tools are:

- lgrpinfo(1) for displaying the lgroup hierarchy (see lgrpinfo.1.txt
  and lgrpinfo_psarc.txt).

- plgrp(1) for observing and affecting lgroup affinities for specified
  threads (see plgrp.1.txt).

- A new Lgrp Perl module as a Perl interface to liblgrp(3LIB). This is
  used by lgrpinfo(1), which is a Perl script. (See Lgrp.1.txt and
  lgrp_mod_psarc.txt.)

The additions to existing tools are:

- New flags for the existing ps(1) and prstat(1M) commands for
  displaying the home lgroup of all or active processes, or for listing
  the processes or threads in a given lgroup. (See ps.diffs and
  prstat.diffs.)

- A new pr_lgrp value added to proc(4) and the dtrace proc provider.
  (See proc.diffs and lwpsinfo_psarc.txt.)

The corrections are:

- Some minor corrections to the output format of pmadvise(1), the -L
  option to pmap(1), and the syntax of the -A option to pmap(1) (which
  were originally introduced in PSARC 2004/484 and 2004/485 but have not
  been integrated into Solaris yet). (See pmadvise.1.txt, pmap.1.diffs,
  and pmap_psarc.txt.)

These new tools and changes are discussed below along with our
previously proposed tools as part of the MPO observability and control
tools architecture, and in separate supporting documents as needed to
explain each tool in more detail.

BACKGROUND
==========

As part of the Memory Placement Optimization feature in Solaris, we have
added a "locality group" (lgroup) abstraction to tell what resources are
near each other on a NUMA machine and a framework to optimize for
performance through locality.
Locality groups represent the set of CPU-like and memory-like hardware
resources that are at most some latency apart from each other. A Uniform
Memory Access (UMA) machine is represented by only one lgroup (the root
lgroup). A Non-Uniform Memory Access (NUMA) machine is represented by a
hierarchy of lgroups reflecting the corresponding levels of locality.

The lgroup hierarchy is organized to facilitate finding the nearest
resources. Each parent lgroup in the hierarchy contains the resources of
its children plus the next nearest resources.

Upon creation, each thread in the system is assigned a "home" lgroup
where the operating system will try to run the thread and allocate its
memory and other resources to improve its performance through locality.
If the desired resources aren't available in the thread's home lgroup,
the operating system traverses the lgroup hierarchy from the thread's
home lgroup to find the nearest available resources.

MOTIVATION
==========

MPO tries to provide good performance by default. This is expected to
suffice for the majority of applications, but a small minority of
applications may need more. Tools can make it easier to understand what
is happening and to tune performance beyond what is provided by default.
Specifically, tools are needed to facilitate observability,
diagnosability, and control of the Solaris lgroup framework and its
optimizations for locality on NUMA machines.

So far, the lgroup framework and APIs have allowed some observability
and control, but few tools have been provided. Basic tools are needed to
at least display the lgroup hierarchy, its contents, and its
characteristics, and to observe and affect thread and memory placement
among lgroups, since placement is essential to locality.

USERS
=====

Most users should not need to know about the tools. The intended
consumers of the tools are system administrators, developers,
performance engineers, systems programmers, and support engineers.
These consumers may be interested in knowing more about the system, the
application(s), or both. We believe that most of the questions these
consumers have usually boil down to one of the following:

- What is the system configuration?

- Are the system or application resources balanced or placed well among
  lgroups?

- Is MPO successful?

- Why did that happen?

TOOLS
=====

Basic observability and control tools are essential to addressing these
fundamental questions of system configuration, balance or placement,
success, and diagnosability. These tools mostly help answer questions
about system configuration and balance or placement, but they also
provide the basic information and mechanisms needed to determine whether
MPO is successful and to diagnose problems related to MPO.

To answer the question of whether MPO is successful, profiling and
statistics would seem most helpful. However, it is important to know a
thread's affinities for lgroups (such as its home lgroup) and where its
memory is allocated to determine whether MPO *should* be successful in
providing good locality and hence good performance. In addition, a tool
that profiles where a given thread runs and which memory it accesses
most (relative to lgroups) would be useful for determining whether MPO
is *really* successful.

For diagnosability, one has to understand what happened before one can
understand why it happened. We have found that the observability tools
at least help to see what is happening, and the tools for affecting
thread and memory placement provide a way to gain a deeper understanding
of what an application is doing or needs through experimentation,
especially when the source isn't available. To really find out why
something happened (such as a thread not running in its home lgroup or
not allocating local memory), we believe that dtrace(1M) and potentially
some more instrumentation in the kernel will be needed.
In this PSARC case, we propose the basic observability and control tools
needed for MPO. As explained above, these tools are essential to
observability, control, performance analysis, and diagnosability. While
they don't completely address the areas of performance analysis and
diagnosability, they provide what is needed to start and should be very
useful now. Moreover, we believe that the additional tools needed for
performance analysis and diagnosability probably won't overlap the
proposed tools much, if at all, because they require different
mechanisms.

The following table shows which question/area is addressed by which
tool(s) for observability and control:

------------------------------------------------------------------------------
                        OBSERVE                         CONTROL
------------------------------------------------------------------------------
CONFIGURATION           lgrpinfo(1)

PLACEMENT
- THREAD                plgrp(1), ps(1), prstat(1M)     plgrp(1)
- MEMORY                pmap(1)                         pmadvise(1),
                                                        madv.so.1(1)
------------------------------------------------------------------------------

LGROUP HIERARCHY
================

The lgrpinfo(1) utility can be used to display the lgroup hierarchy, its
contents, and its characteristics, and to easily determine the
following:

- Whether the system is a UMA or NUMA machine

- Which CPUs are near each other, which have memory near them, and how
  much

- What the relative latencies are between the CPUs and different memory

- How the operating system has organized these CPU and memory resources
  into a hierarchy to facilitate finding the nearest resources quickly

- How each lgroup relates to the other lgroups

- Lgroup thread and memory loads (e.g., load average and the amount of
  memory in use and free)

It can be useful for the following:

- Observing and verifying the lgroup hierarchy

- Understanding the context in which the operating system is trying to
  optimize applications for locality

- Observing whether system (CPU and memory) resources are well balanced
  or placed across lgroups

Overall, the tool has been very helpful in understanding the system
better and in recognizing and diagnosing some problems at the system
level.

Please see lgrpinfo_psarc.txt for more discussion of lgrpinfo(1), the
lgrpinfo(1) man page for its specification, lgrp_mod_psarc.txt for a
discussion of the supporting Lgrp Perl module, and the Lgrp(1) man page
for its specification.

PLACEMENT
=========

Thread and memory placement among lgroups is essential to optimizing for
locality. Thus, the ability to observe and affect how threads and memory
are placed among lgroups is important for understanding and affecting
the performance of the system and applications on NUMA machines.

THREAD
------

The following tools are for observing and affecting the placement of
threads among lgroups:

- ps(1) for observing the home lgroup of every user process or thread in
  the system

- prstat(1M) for observing the home lgroup of the active processes or
  threads in the system

- plgrp(1) for observing and affecting thread placement among lgroups

To provide a system view of how all user processes and threads are
placed among lgroups, a new -H option is proposed for prstat(1M) to
display the home lgroup of active user processes and threads, and for
ps(1) to show the home lgroup of all user processes and threads.
Furthermore, a new -h option is proposed for ps(1) and prstat(1M) to see
all user processes or threads that have a specified lgroup as their
home. A new "lgrp" format specifier is proposed for ps(1) to allow for
custom output formatting.

The new plgrp(1) tool is for observing and affecting the placement of
threads among lgroups.
It can get and set the home lgroup and lgroup affinities of a given set
of threads, either by using /proc to get information that /proc already
has or by using the /proc agent LWP to make calls from within the target
process on the tool's behalf.

To facilitate observing the home lgroup of a thread in a live process or
core file, a new pr_lgrp field has been added to lwpsinfo_t in /proc.
This structure is documented in proc(4) to contain the home lgroup of
the corresponding thread. Similarly, this change was made to the dtrace
proc provider so that its lwpsinfo_t includes the new pr_lgrp field.

Please see lwpsinfo_psarc.txt for more details on the changes to
lwpsinfo_t, and the proc(4), ps(1), prstat(1M), and plgrp(1) man pages
for the specifications of these tools and/or the changes to them.

MEMORY
------

The tools for observing and affecting the placement of memory among
lgroups are the following:

- pmap(1) for observing memory placement among lgroups (PSARC 2004/485)

- pmadvise(1) for applying advice to virtual memory ranges, offering
  fine-grained control of memory placement among lgroups through
  madvise(MADV_ACCESS_*) (PSARC 2004/484)

- madv.so.1(1) for applying advice to all kinds of memory (e.g., heap,
  stack, private, shared, mapped, and anonymous memory), offering
  coarse-grained control of memory placement among lgroups through
  madvise(MADV_ACCESS_*) (PSARC 2002/030)

When the -L option is given, pmap(1) displays the lgroup that directly
contains the physical memory backing the given virtual memory. In
addition, a new -A option was proposed in PSARC 2004/485 to make it
possible to specify a virtual address range of interest, since using the
-L option can result in one line per page when contiguous physical pages
don't back a given portion of the virtual address space.

The pmadvise(1) tool is for affecting how memory is placed among
lgroups. It uses a /proc agent LWP to make calls to madvise(3C) with the
MADV_ACCESS_* flags in the target process.
The madvise(MADV_ACCESS_*) calls give the operating system a hint about
how the application will access a specified virtual address range. On
NUMA machines, the operating system uses this hint to determine how to
allocate memory for the specified range.

Besides pmadvise(1), madv.so.1(1) can also be used to affect how memory
is placed among lgroups, but it uses a different mechanism to do so.
Instead of using /proc like pmadvise(1), madv.so.1(1) is an LD_PRELOAD
library that interposes on the system calls for allocating virtual
memory (e.g., brk(2), mmap(2), and shmat(2)) and calls madvise(3C) on
the newly allocated memory after making the system call.

Please see pmap_psarc.txt for an explanation of the changes needed to
the previously proposed but not yet integrated -L and -A options to
pmap(1), and the pmap(1) and pmadvise(1) man pages for their
specifications.

ISSUES
======

Overall, the biggest issue for the tools is virtualization (e.g., Xen
and the sun4v hypervisor, aka LDOMs). Virtualization can make it hard or
impossible to determine which hardware resources are near each other in
a NUMA machine. It can change virtual hardware resources out from under
a guest OS after the guest OS *thinks* that it knows how the hardware
resources relate to each other.

Currently, there is no lgroup platform support for either Xen or the
sun4v hypervisor (LDOMs), so only one lgroup containing all the CPU and
memory resources is created. Consequently, the lgroup tools and
liblgrp(3LIB) APIs will only export a single lgroup to applications and
users, which makes the machine appear to have Uniform Memory Access
(UMA) even though it is NUMA. This keeps virtualization from confusing
anything or anyone trying to understand or optimize for NUMA using
lgroups.

In the future, we anticipate that the guest OS will need to become
virtualization aware and/or the virtualization will need to become NUMA
aware.
Some cooperation between the guest OS and the hypervisor will probably
need to occur to provide very good performance on NUMA machines. When
this happens, we may need to revisit how virtualization affects lgroups,
their APIs, and the tools, but it should be possible to export a
reasonable lgroup abstraction or to fall back to exporting a single
lgroup as is done now.

CONCLUSION
==========

The text above explained the architecture of the observability and
control tools needed for MPO and refers to additional documentation for
the individual tools as needed.

All of the proposed tools and changes to existing tools that export
lgroups have a stability level of Unstable and a release binding of
Patch/Micro. This seems like the safe thing to do given some of the
issues above and given that virtualization needs to be developed more to
fully understand its ramifications for lgroups and NUMA.