Swap resource control; locked memory RM improvements

Steve Lawrence

SUMMARY:

This case enhances Solaris Zones[1] and builds upon recent work to
improve the integration between Zones and Solaris Resource
Management[2]. The case addresses an existing RFE[6], which requests a
mechanism to limit system swap reserved by a zone. The case also
proposes extensions to [2], which will make swap reservation and locked
memory resource controls easy to configure on a zone via zonecfg(1M).

1. This case proposes adding the following resource control:

        INTERFACE                       COMMITMENT      BINDING
        "zone.max-swap"                 Committed       Patch

This control will limit the swap reserved by processes and tmpfs
mounts within the global zone and non-global zones. This resource
control serves to address the referenced RFE[6].

2. To simplify the configuration of memory-related resource controls on
zones, this case proposes adding the following properties to
zonecfg(1M):

        INTERFACE                       COMMITMENT      BINDING
        "swap" zonecfg property         Committed       Patch
        "locked" zonecfg property       Committed       Patch

These properties will be added to the zonecfg "capped-memory" resource
introduced by [2].

3. For observability of zone resource utilization and limits, this case
proposes the addition of the following kstats:

        INTERFACE                                   COMMITMENT   BINDING
        caps:{zoneid}:swapresv_zone_{zoneid}        Uncommitted  Patch
        caps:{zoneid}:lockedmem_zone_{zoneid}       Uncommitted  Patch

To observe project resource utilization, this case also proposes the
following kstat:

        INTERFACE                                   COMMITMENT   BINDING
        caps:{zoneid}:lockedmem_project_{projid}    Uncommitted  Patch

The projid cannot be used as the instance number, as each zone has a
unique project namespace. This means project 0 in the global zone is
different from project 0 in each non-global zone.

The global zone will see kstats for all zones, while non-global zones
will only see kstats with matching zoneid.

4. prstat(1M) output changes to report swap reserved.
        INTERFACE                       COMMITMENT      BINDING
        prstat(1M) output               Uncommitted     Patch

This case proposes changing the "SIZE" column of "prstat -Z" zone
output lines to "SWAP". The swap reported will be the total swap
consumed by the zone's processes and tmpfs mounts. This value will
assist administrators in monitoring the swap reserved by each zone,
allowing them to choose a reasonable "zone.max-swap" setting.

DETAIL:

1. "zone.max-swap" resource control.

Limits swap consumed by user process address space mappings and tmpfs
mounts within a zone.

Currently a global or non-global zone can consume all swap resources
available on the system, limiting the usefulness of zones as an
application container. zone.max-swap provides a mechanism to limit
swap consumption per zone. This will protect other zones from runaway
memory leakers/consumers and/or tmpfs writers in a zone with
zone.max-swap configured.

Another solution to this problem would be a "swap set" [5] feature,
which would allow the reservation of swap devices into sets to which
zones could be bound. While "swap sets" would be useful, zone.max-swap
provides a simple solution which is easier to administer, as it does
not require the configuration of pools and swap devices/files.

"zone.max-swap" is not incompatible with swap sets. In fact, a future
addition of swap sets could be used in combination with zone.max-swap.
For instance, several zones could be bound to the same set of swap
devices, each with its own individual zone.max-swap configured as a
cap within that set. The implementation of "zone.max-swap" is also
much less risky to make available via patch.

zone.max-swap will be configurable on both the global zone and
non-global zones. The effect on processes in a zone reaching its
zone.max-swap limit is the same as if all system swap is reserved.
Callers of mmap(2) and sbrk(2) will receive EAGAIN. Writes to tmpfs
will return ENOSPC, which is the same errno returned when a tmpfs
mount reaches its "size" mount option.
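
As a rough illustration of how an application might tell these two
failure modes apart, the following Python sketch maps the errno values
named above to a short label. The function names and the returned
labels are hypothetical and are not part of this case; the sketch only
assumes that the failure surfaces as an OSError carrying the errno.

```python
import errno
import os


def classify_swap_error(exc: OSError) -> str:
    """Return an illustrative label for the failure modes described above.

    The labels are hypothetical; this case does not define them.
    """
    if exc.errno == errno.EAGAIN:
        # mmap(2)/sbrk(2): swap could not be reserved, either because the
        # zone reached zone.max-swap or because all system swap is reserved.
        return "swap-reservation-failed"
    if exc.errno == errno.ENOSPC:
        # Write to tmpfs: zone.max-swap was reached, or the mount hit its
        # "size" mount option. The errno alone cannot tell these apart.
        return "tmpfs-out-of-space"
    if exc.errno is None:
        return "other"
    return "other: " + os.strerror(exc.errno)


def append_to_file(path: str, data: bytes) -> str:
    """Append data to a file (for example on a tmpfs mount) and report
    either "ok" or the label for the error that occurred."""
    try:
        with open(path, "ab") as f:
            f.write(data)
        return "ok"
    except OSError as exc:
        return classify_swap_error(exc)
```

Because both limits produce the same errno, a caller that needs to know
which one was reached would have to consult the zone's configured cap
and the mount's options separately.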
The "size" mount option limits the quantity of swap that a tmpfs mount
can reserve.

While a low zone.max-swap setting for the global zone can lead to a
difficult-to-administer global zone, the same problem exists today
when configuring the zone.max-lwps resource control on the global
zone, or when all system swap is reserved. The zonecfg(1M)
enhancements detailed below will help administrators configure
zone.max-swap safely.

2. "swap" and "locked" properties for zonecfg(1M) "capped-memory"
resource.

[2] added a new "capped-memory" resource to zonecfg. This resource
groups the properties used when capping memory for the zone. It
currently has the "physical" property, which specifies the physical
memory cap for the zone.

We will add two new properties, "swap" and "locked", to the
"capped-memory" resource. These properties will be added by using the
rctl alias mechanism, which is also described in [2].

        swap:   An unsigned decimal number with a required k, m, g,
                or t modifier. A value of '10m' means ten megabytes.
                This will be used to configure the zone.max-swap
                resource control, which limits swap consumed by
                processes and tmpfs mounts within a zone.

        locked: An unsigned decimal number with a required k, m, g,
                or t modifier. A value of '10m' means ten megabytes.
                This will be used to configure the
                zone.max-locked-memory[3,4] resource control, which
                limits physical memory locked (made non-pageable) by
                processes within a zone.

Low swap settings for the global zone can impact system availability.
Because of this, if zone.max-swap is configured (via zonecfg(1M)) on
the global zone, a verbose warning will be printed:

        zonecfg:global> add capped-memory
        WARNING: Setting a global zone memory cap too strictly could
        deny service to even the root user; this could render the
        system impossible to administer. Please use caution.
        zonecfg:global:capped-memory>

Similar warnings will be printed for setting other rctls on the global
zone which can affect availability, such as zone.max-lwps.

3.
For observability of zone resource utilization and limits, this case
proposes the addition of the following kstats:

        INTERFACE                                   COMMITMENT   BINDING
        caps:{zoneid}:swapresv_zone_{zoneid}        Uncommitted  Patch
        caps:{zoneid}:lockedmem_zone_{zoneid}       Uncommitted  Patch

To observe project resource utilization, this case also proposes the
following kstat:

        INTERFACE                                   COMMITMENT   BINDING
        caps:{zoneid}:lockedmem_project_{projid}    Uncommitted  Patch

The projid cannot be used as the instance number, as each zone has a
unique project namespace. This means project 0 in the global zone is
different from project 0 in each non-global zone.

The global zone will see kstats for all zones, while non-global zones
will only see kstats with matching zoneid.

Each kstat will have the statistics:

        usage:    The current quantity of resource consumed.
        value:    The current enforced cap.
        zonename: The name of the zone. A zone may change zoneid each
                  time it boots, so this statistic helps to match the
                  kstat to the zone.

These kstats can be consumed by higher-level tools and scripts to
provide information about zone memory usage.

Each kstat's instance number matches the zoneid of the zone it
represents. Non-global zones will only be able to read the kstat with
matching zoneid. The global zone will be able to read all kstats.

Additional kstats will be added in the future to report usage and cap
for other rctls. Addressing existing rctls is outside the scope of
this case.

4. prstat(1M) output changes to report swap reserved.

        INTERFACE                       COMMITMENT      BINDING
        prstat(1M) output               Uncommitted     Patch

This case proposes changing the "SIZE" column of "prstat -Z" zone
output lines to "SWAP". The swap reported will be the total swap
consumed by the zone's processes and tmpfs mounts. This value will
assist administrators in monitoring the swap reserved by each zone,
allowing them to choose a reasonable "zone.max-swap" setting.

The "SIZE" column will also be changed to "SWAP" for prstat options
a, T, and J, for users, tasks, and projects.
The current "SIZE" column arbitrarily sums the address spaces of the
processes in each zone. This sum includes device mappings, but does
not include NORESERVE segments. This sum does not map to real system
resources, and therefore provides no meaningful information when
summed across all processes belonging to a zone, project, task, or
user.

For the default prstat process listing, "SIZE" will not be changed to
"SWAP", as the virtual address space size for each process is a useful
number. Detailed per-process memory consumption reporting is outside
the scope of this case, and would be better addressed by a case
proposing a solution for 6487372[7]:

        RFE: prstat -x: Providing VSZ/RSS/ANON/LOCK Memory & CPU Usage

This RFE requests displaying detailed memory usage per process. "SWAP"
reservation certainly falls into this category.

REFERENCES:

[1] PSARC/2002/174 Virtualization and Namespace Isolation in Solaris
    http://sac.sfbay.sun.com/PSARC/2002/174
    http://www.opensolaris.org/os/community/arc/caselog/2002/174/

[2] PSARC/2006/496 Improved Zones/RM Integration
    http://sac.sfbay.sun.com/PSARC/2006/496/
    http://www.opensolaris.org/os/community/arc/caselog/2006/496/

[3] PSARC/2006/463 Amendment to zone/project.max-locked-memory
    Resource Controls
    http://sac.sfbay.sun.com/PSARC/2006/463/
    http://www.opensolaris.org/os/community/arc/caselog/2006/463/

[4] PSARC/2004/580 zone/project.max-locked-memory Resource Controls
    http://sac.sfbay.sun.com/PSARC/2004/580/
    http://www.opensolaris.org/os/community/arc/caselog/2004/580/

[5] PSARC/2002/181 Swap Sets
    http://sac.sfbay.sun.com/PSARC/2002/181/
    http://www.opensolaris.org/os/community/arc/caselog/2002/181/

[6] 5103071 RFE: local zones can run the global zone out of swap
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5103071

[7] 6487372 RFE: prstat -x: Providing VSZ/RSS/ANON/LOCK Memory & CPU
    Usage
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6487372
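
EXAMPLE (illustrative only):

As a sketch of how a higher-level script might consume the kstats
proposed in DETAIL item 3, the Python below parses text in the
"module:instance:name:statistic<TAB>value" form printed by "kstat -p"
and reports swap usage and cap per zone. The sample values, the zone
name "web", and the function names are hypothetical; only the kstat
names and the usage, value, and zonename statistics come from this
case.

```python
def parse_caps_kstats(text):
    """Group "kstat -p"-style lines from the caps module.

    Returns {(instance, name): {statistic: value_string}}.
    Lines from other modules or with an unexpected shape are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then the value
        if len(parts) != 2:
            continue
        key_fields = parts[0].split(":", 3)
        if len(key_fields) != 4 or key_fields[0] != "caps":
            continue
        _module, instance, name, statistic = key_fields
        entry = stats.setdefault((int(instance), name), {})
        entry[statistic] = parts[1].strip()
    return stats


def swap_usage_by_zone(stats):
    """Return {zonename: (usage, cap)} from the swapresv_zone kstats."""
    result = {}
    for (zoneid, name), fields in stats.items():
        # The instance number matches the zoneid embedded in the name.
        if name != "swapresv_zone_%d" % zoneid:
            continue
        result[fields["zonename"]] = (int(fields["usage"]),
                                      int(fields["value"]))
    return result


# Hypothetical sample input; real values would come from "kstat -p caps".
SAMPLE = (
    "caps:0:swapresv_zone_0:usage\t1073741824\n"
    "caps:0:swapresv_zone_0:value\t2147483648\n"
    "caps:0:swapresv_zone_0:zonename\tglobal\n"
    "caps:3:swapresv_zone_3:usage\t268435456\n"
    "caps:3:swapresv_zone_3:value\t536870912\n"
    "caps:3:swapresv_zone_3:zonename\tweb\n"
    "caps:3:lockedmem_zone_3:usage\t0\n"
)

if __name__ == "__main__":
    for zone, (usage, cap) in sorted(swap_usage_by_zone(
            parse_caps_kstats(SAMPLE)).items()):
        print("%-8s %d of %d bytes (%.1f%%)"
              % (zone, usage, cap, 100.0 * usage / cap))
```

Keying the result on zonename rather than zoneid follows the note above
that a zone may change zoneid each time it boots.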