SUMMARY:

This project enhances zones[1], pools[2-4] and resource caps[5,6] to improve the integration of zones with resource management (RM). It addresses existing RFEs[7-12] in this area and lays the groundwork for simplified, coherent management of the various RM features exposed through zones.

These enhancements are targeted at "less experienced" users who are unfamiliar with all of the details and capabilities of Solaris RM. Users who need more complex RM configurations must still use the existing commands and procedures. However, we feel that these enhancements will meet the needs of a large percentage of zones users.

We will integrate some basic pool configuration with zones, implement the concept of "temporary pools" that are dynamically created/destroyed when a zone boots/halts, and simplify the setting of resource controls within zonecfg. We will enhance rcapd so that it can cap a zone's memory while rcapd is running in the global zone. We will enable persistent RM configuration for the global zone. We will also make a few other changes to provide a better overall experience when using zones with RM.

Patch binding is requested for these new interfaces and the stability of most of these interfaces is "Committed" (see the interface table for the complete list).

PROBLEM:

Although zones are fairly easy to configure and install, many users have difficulty setting up a good RM configuration to accompany their zone configuration. Understanding RM involves many new terms and concepts, along with a lot of documentation. As a result, many customers either do not configure RM with their zones, or configure it incorrectly, and are then disappointed when zones, by themselves, do not provide all of the containment that they expect. This problem will only get worse in the near future with the additional RM features that are coming, such as cpu-caps[14], memory sets[15] and swap sets[16].
PROPOSAL:

There is a set of relatively independent enhancements outlined below which will, taken together, address these problems and provide a simple, tightly integrated experience for configuring containers (zones with RM).

This proposal is complicated by the fact that it tries to unify the RM concepts within zones, but not all of these RM features are available yet. Specifically, cpu-caps[14], memory sets[15], swap sets[16], and new rctls [17, 18] are under development but not yet integrated. This proposal describes a framework for how current and future RM features will be integrated with zones. However, we do not want to make this project dependent on those future projects, since there is a lot of value to be added in the meantime. Instead, the future enhancements described here will be dependent upon the integration of the underlying RM features. Once those have integrated, we will submit fast-tracks to update the zones/RM integration as described below. Items related to these future enhancements are noted within the body of the proposal. The future enhancements are also noted in the interface table for follow-on phases of this overall project.

1) "Hard" vs. "Soft" RM configuration within zonecfg

We will enhance zonecfg(1M) so that the user can configure basic RM capabilities in a structured way. Various existing and upcoming RM features can be broken down into "hard" vs. "soft" partitioning of the system's resources. With "hard" partitioning, resources are dedicated to the zone using processor sets (psets) and memory sets (msets). With "soft" partitioning, resources are shared, but capped, with an upper limit on their use by the zone.
Technologies used to provide partitioning:

               Hard  | Soft
      ---------------------------
      cpu    | psets | cpu-caps
      memory | msets | rcapd

Within zonecfg we will organize these various RM features into four basic zonecfg resources so that it is simple for a user to understand and configure the RM features that are to be used with their zone. Note that zonecfg "resources" are not the same as the system's cpu & memory resources or "resource management". Within zonecfg, a "resource" is just the name of a top-level property group for the zone (see zonecfg(1M) for more information). The four new zonecfg resources are:

    dedicated-cpu
    capped-cpu        (future, after cpu-caps are integrated)
    dedicated-memory  (future, after memory sets are integrated)
    capped-memory

Each of these zonecfg resources will have properties that are appropriate to the RM capabilities associated with that resource. Zonecfg will only allow one instance of each of these resources to be configured and it will not allow conflicting resources to be added (e.g. dedicated-cpu and capped-cpu are mutually exclusive). The mapping of these new zonecfg resources to the underlying RM features is:

    dedicated-cpu    -> temporary pset
    dedicated-memory -> temporary mset
    capped-cpu       -> cpu-cap rctl [14]
    capped-memory    -> rcapd running in global zone

Temporary psets and msets are described below, in section 2. Rcapd enhancements for running in the global zone are described below in section 4.

The valid properties for each of these new zonecfg resources will be:

    dedicated-cpu
        ncpus
        importance
    capped-cpu
        ncpus
    dedicated-memory
        physical
        virtual
        importance
    capped-memory
        physical
        virtual

The meaning of each of these properties is as follows:

dedicated-cpu

    ncpus: This can be a positive integer or range. A value of '2' means
    two cpus; a value of '2-4' means a range of two to four cpus. This
    sets the 'pset.min' and 'pset.max' properties on the temporary pset.

    importance: This property is optional. It can be a positive integer.
    It sets the 'pool.importance' property on the temporary pool.

capped-cpu

    This resource group and its property will not be delivered as part
    of this project since cpu-caps are still under development. However,
    our thinking on this is described here for completeness.

    ncpus: This is a positive decimal with two digits to the right of
    the decimal point. The 'ncpus' property actually maps to the
    zone.cpu-cap rctl. This property will be implemented as a special
    case of the new zones rctl aliases which are described below in
    section 3. The special case handling of this property will normalize
    the value so that it corresponds to units of cpus and is similar to
    the 'ncpus' property under the dedicated-cpu resource group.
    However, it won't accept a range and it will accept a decimal
    number. For example, when using 'ncpus' in the dedicated-cpu
    resource group, a value of 1 means one dedicated cpu. When using
    'ncpus' in the capped-cpu resource group, a value of 1 means the cap
    is set to 100% of a cpu. A value of 1.25 means 125%, since 100%
    corresponds to one full cpu on the system when using cpu caps. The
    idea here is to align the 'ncpus' units as closely as possible in
    these two cases (dedicated-cpu vs. capped-cpu), given the
    limitations and capabilities of the two underlying mechanisms (pset
    vs. rctl). The 'ncpus' rctl alias is described further in section 3
    below.

dedicated-memory

    This resource group and its properties are tentative at this point
    since msets are still under development. The properties will be
    finalized once msets [15] and swap sets [16] are completed. This
    resource group and its properties will not be delivered as part of
    this project. However, our thinking on this is described here for
    completeness.

    physical: A positive decimal number or a range with a required k, m,
    g, or t modifier. This will set the 'mset.min' and 'mset.max'
    properties on the temporary mset. A value of '10m' means ten
    megabytes.
    A value of '.5g-1.5g' means a range of 500 megabytes up to 1.5
    gigabytes.

    virtual: This accepts the same numbers as the 'physical' property.
    This will set the 'mset.minswap' and 'mset.maxswap' properties on
    the temporary mset. The name 'virtual' is not necessarily the final
    name. Either 'physical' or 'virtual' may be omitted, but at least
    one must be specified.

    importance: This property is optional. It can be a positive integer.
    It sets the 'pool.importance' property on the temporary pool. The
    underlying code in zonecfg will refer to the same piece of data for
    importance in both the dedicated-cpu and dedicated-memory cases.
    Thus, you can have a temporary pool with a temporary pset, a
    temporary mset, or both, but there is only one value for the
    importance of the temporary pool.

capped-memory

    physical: A positive decimal number with a required k, m, g, or t
    modifier. A value of '10m' means ten megabytes. This will be used by
    rcapd as the max-rss for the zone. The rcapd enhancement for capping
    zones is described below in section 4.

    virtual: This property is tentative at this point and will not be
    delivered as part of this project. However, our thinking on this is
    described here for completeness. In the future we would like to
    deliver a new rctl which would cap the virtual memory consumption of
    the zone. The name 'virtual' is not necessarily the final name.

Zonecfg will be enhanced to check for invalid combinations. This means it will disallow a dedicated-cpu resource and the zone.cpu-shares rctl being defined at the same time. It also means that explicitly specifying a pool name via the 'pool' resource, along with either a 'dedicated-cpu' or 'dedicated-memory' resource, is an invalid combination.

These new zonecfg resource names (dedicated-cpu, capped-cpu, dedicated-memory & capped-memory) were chosen to make the objective reasonably clear, even though they do not exactly align with our existing underlying (and inconsistent) RM naming schemes.
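To illustrate, a zonecfg session configuring the new dedicated-cpu resource might look as follows (a sketch only; the zone name 'myzone' and the property values are hypothetical examples, using the 'ncpus' and 'importance' properties described above):

```
# zonecfg -z myzone
zonecfg:myzone> add dedicated-cpu
zonecfg:myzone:dedicated-cpu> set ncpus=2-4
zonecfg:myzone:dedicated-cpu> set importance=10
zonecfg:myzone:dedicated-cpu> end
zonecfg:myzone> exit
```

When the zone boots, these settings would translate into a temporary pset with 'pset.min' of 2 and 'pset.max' of 4, as described in section 2 below.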
2) Temporary Pools

We will implement the concept of "temporary pools" within the pools framework. To improve the integration of zones and pools we are allowing the configuration of some basic pool attributes within zonecfg, as described above in section 1. However, we do not want to extend zonecfg to completely and directly manage standard pool configurations. That would lead to confusion and inconsistency regarding which tool to use and where configuration data is stored. Temporary pools sidestep this problem and allow zones to dynamically create a simple pool/pset configuration for the basic case where a sysadmin just wants a specified number of processors dedicated to the zone (and eventually a dedicated amount of memory). We believe that the ability to simply specify a fixed number of cpus (and eventually a mset size) meets the needs of a large percentage of zones users who need "hard" partitioning (e.g. to meet licensing restrictions).

If a dedicated-cpu (and/or eventually a dedicated-memory) resource is configured for the zone, as described in section 1, then when the zone boots zoneadmd will enable pools if necessary and create a temporary pool dedicated to the zone's use. Zoneadmd will dynamically create a pool & pset (and/or eventually a mset) and assign the number of cpus specified in zonecfg to that pset. The temporary pool & pset will be named 'SUNWtmp_{zonename}'. Zonecfg validation will disallow an explicit 'pool' property name beginning with 'SUNWtmp'. Zoneadmd will set the 'pset.min' and 'pset.max' pset properties, as well as the 'pool.importance' pool property, based on the values specified for dedicated-cpu's 'ncpus' and 'importance' properties in zonecfg, as described above in section 1.

If the cpu (or memory) resources needed to create the temporary pool are unavailable, zoneadmd will issue an error and the zone won't boot. When the zone is halted, the temporary pool & pset will be destroyed.
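For example, the temporary pool's lifecycle could be observed with the existing pools commands (a hypothetical session; the zone name is illustrative and the pooladm output is abbreviated):

```
# zoneadm -z myzone boot
# pooladm | grep SUNWtmp
        pool SUNWtmp_myzone
        pset SUNWtmp_myzone
# zoneadm -z myzone halt
# pooladm | grep SUNWtmp
#
```

The temporary pool and pset exist in the dynamic configuration only while the zone is running; after the halt they are gone.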
We will add a new boolean libpool(3LIB) property ('temporary') that can exist on pools and any pool resource set. The 'temporary' property indicates that the pool or resource set should never be committed to a static configuration (e.g. pooladm -s) and that it should never be destroyed when updating the dynamic configuration from a static configuration (e.g. pooladm -c). These temporary pools/resources can only be managed in the dynamic configuration (the existing commands already support that). Support for temporary pools will be implemented within libpool(3LIB) using the two new consolidation private functions listed in the interface table below. It is our expectation that most users will never need to manage temporary pools through the existing poolcfg(1M) commands.

For users who need more sophisticated pool configuration and management, the existing 'pool' resource within zonecfg should be used and users should manually create a permanent pool using the existing mechanisms.

3) Resource controls in zonecfg will be simplified [8]

Within zonecfg, rctls take a 3-tuple value where only a single component is usually of interest (the 'limit'). The other two components of the value (the 'priv' and 'action') are not normally changed, but users can be confused if they don't understand what the other components mean or what values should be specified. Here is a zonecfg example:

    zonecfg:myzone> add rctl
    zonecfg:myzone:rctl> set name=zone.cpu-shares
    zonecfg:myzone:rctl> add value (priv=privileged,limit=5,action=none)
    zonecfg:myzone:rctl> end

Within zonecfg we will introduce the idea of rctl aliases. The alias is a simplified name and template for the existing rctls. Behind the scenes we continue to store the data using the existing rctl entries in the XML file. Thus, the alias always refers to the same underlying piece of data as the full rctl. The purpose of the rctl alias is to provide a simplified name and mechanism to set the rctl 'limit'.
For each rctl/alias pair we will "know" the expected values for the 'priv' and 'action' components of the rctl value. If an rctl is already defined that does not match this "knowledge" (e.g. it has a non-standard 'action' or there are multiple values defined for the rctl), then the user will not be allowed to use an alias for that rctl.

Here are the aliases we will define for the rctls:

    alias          rctl              priv        action
    -----          ----              ----        ------
    max-lwps       zone.max-lwps     privileged  deny
    cpu-shares     zone.cpu-shares   privileged  none

Here are the aliases coming in the near future, once the associated projects integrate [14, 17, 18]:

    alias              rctl                    priv        action
    -----              ----                    ----        ------
    cpu-cap            zone.cpu-cap            privileged  deny
    max-locked-memory  zone.max-locked-memory  privileged  deny
    max-shm-memory     zone.max-shm-memory     privileged  deny
    max-shm-ids        zone.max-shm-ids        privileged  deny
    max-msg-ids        zone.max-msg-ids        privileged  deny
    max-sem-ids        zone.max-sem-ids        privileged  deny

Here is an example of the max-lwps alias usage within zonecfg:

    zonecfg:myzone> set max-lwps=500
    zonecfg:myzone> info
    ...
    [max-lwps: 500]
    ...
    rctl:
        name: zone.max-lwps
        value: (priv=privileged,limit=500,action=deny)

In the example, you can see the use of the alias when setting the value and you can also see the full rctl output within the 'info' command. The alias is "flagged" in the output with brackets as a visual indicator that the property corresponds to the full rctl definition printed later in the output. If you update the rctl value through the 'rctl' resource, the corresponding value in the aliased property is also updated, since both the rctl and its alias refer to the same piece of data. If an rctl is already defined that does not match the expected value (e.g. it has 'action=none' or multiple values), then the alias will be disabled.
An attempt to set the limit via a disabled alias will print the following error:

    An incompatible rctl already exists for this property

This rctl alias enhancement is fully backward compatible with the existing rctl syntax. That is, zonecfg output will continue to display rctl settings in the current format (in addition to the new aliased format) and zonecfg will continue to accept the existing input syntax for setting rctls. This ensures full backward compatibility for any existing tools/scripts that parse zonecfg output or configure zones. Also, the rctl data will continue to be printed in the output from the 'export' subcommand using the existing syntax. Future rctls added to zonecfg will also provide aliases following the pattern described here (e.g. [17, 18]).

In section 1 we described the special case 'ncpus' rctl alias as a property under the capped-cpu resource group. This property is really just another rctl alias for the zone.cpu-cap rctl, with one exception: the limit value is scaled up by 100 so that the value can be specified in cpu units and aligned with the 'ncpus' property under the dedicated-cpu resource group. Thus, a value of 2 will actually set the zone.cpu-cap rctl limit to 200, which means the cpu cap is 200%. This alias is described here but will not actually be delivered in the first phase of this project since cpu-caps [14] are not yet completed.

As part of this rctl syntax simplification we also need to simplify the syntax for clearing the value of an rctl. In fact, this is a general problem in zonecfg [12]. The 'remove' syntax in zonecfg is currently defined as:

    Global Scope
        remove resource-type property-name=property-value [,...]

    Resource Scope
        remove property-name property-value

That is, from the top level in zonecfg, there is currently no way to clear a simple, top-level property and, to clear a resource, it must be qualified with one or more property name/value pairs.
To address this problem, we will add a new 'clear' command so that you can clear a top-level property. For example, 'clear pool' will clear the value for the pool property. You could clear a 'max-lwps' rctl alias using 'clear max-lwps'.

We will also eliminate the requirement to qualify resources on the 'remove' command. So, instead of saying 'remove net physical=bge0', you could just say 'remove net'. If there is only a single 'net' resource defined, it will be removed. If there are multiple 'net' resources, you will be prompted to confirm that all of them should be removed:

    Are you sure you want to remove ALL 'net' resources (y/[n])?

We will add a '-F' option to the 'remove' command so that you can force the removal of resources when running on the CLI. For example:

    # zonecfg -z foo remove -F net

The existing syntax is still fully supported, so you can continue to qualify removal of a single instance of a resource.

4) Enable rcapd to limit zone memory while running in the global zone [9]

Currently, to use rcapd(1M) to limit zone memory consumption, the rcapd process must be run within the zone. While useful in some configurations, this is ineffective in situations where the zone administrator is untrusted, since the zone administrator could simply change the rcapd limit. We will enhance rcapd so that it can limit each zone's memory consumption while it is running in the global zone. This closes the rcapd loophole described above and allows the global zone administrator to set memory caps that can be enforced by a single, trusted process. This enhancement will also enable physical memory capping for branded zones [19] that are running some other OS user-land (e.g. Linux) where rcapd cannot run. The rcapd limit for a zone will be configured using the new 'capped-memory' resource and 'physical' property within zonecfg, as described in section 1.
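For example, a global zone administrator could configure a physical memory cap for a zone as follows (a sketch; the zone name and the 512m value are hypothetical, and 'capped-memory'/'physical' are the resource and property described in section 1):

```
# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set physical=512m
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit
```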
When a zone with 'capped-memory' boots, zoneadmd will automatically enable the rcap service in the global zone, if necessary. Capping of the zone's physical memory consumption (rss) will be enforced by rcapd. In the future, we also plan on adding a virtual memory cap, which would be implemented as a new resource control in the kernel. This would be specified using the 'virtual' property within the capped-memory resource group, as described above in section 1. However, that is a future enhancement so it is not described here.

As part of this overall project, we will be enhancing the internal rcapd rss accounting so that rcapd will have a more accurate measurement of the overall rss for each zone, particularly when accounting for shared-memory pages. This will address the primary issue in bug [13].

We will add a new '-R' option to rcapadm(1M) which will cause it to refresh the in-kernel max-rss settings for all of the running zones. We will also add two new options, -p & -z, to rcapstat(1), to specify whether it should show zones or projects:

    % rcapstat [ -z | -p ]

For compatibility, -p is the default if neither -p nor -z is specified. The -p option corresponds to the current behavior of reporting on capped projects. The -z option will report information on capped zones.

5) Use FSS when zone.cpu-shares is set [8]

Although the zone.cpu-shares rctl can be set on a zone, the Fair Share Scheduler (FSS) is not the default scheduling class, so this rctl has no effect unless the user also sets FSS as the default scheduling class or changes the zone's processes to use FSS with the priocntl(1M) command. This means that users can easily think they have configured their zone for a behavior that they are not actually getting. We will enhance zoneadmd so that if the zone.cpu-shares rctl is set and FSS is not already the scheduling class for the zone, zoneadmd will set the scheduling class to FSS for processes in the zone.
We will also print a warning if FSS is not the default scheduling class, so that the sysadmin will know that the full FSS behavior is not configured.

6) Add a scheduling-class property for zones

As described in section 5 above, when the zone.cpu-shares rctl is set, zoneadmd will now configure FSS as the scheduling class for the zone. However, if FSS is not the default scheduling class and the zone is not using the zone.cpu-shares rctl, then the zone administrator will not be able to set up projects within the zone so that the project.cpu-shares rctl takes effect. This is because the non-global zone administrator does not have the privileges needed to set the scheduling class for the zone.

We will add an optional 'scheduling-class' property to zonecfg. When the zone boots, if this property is set, zoneadmd will set the specified scheduling class for processes in the zone. If the zone.cpu-shares rctl is set and the scheduling-class property is set to something other than FSS, a warning will be issued.

7) Enable zones to manage RM for the global zone

There are several issues with RM in the global zone. Most RM features cannot be persistently configured for the global zone as a whole. Instead, RM is usually configured on a per-project basis in the global zone:

    project rctls (global zone-wide rctls don't make sense here)
    project memory cap
    project pool

It is possible to configure a global zone rctl, but it is not persistent across reboots [11] and the syntax is complex. For example:

    # prctl -s -n zone.max-lwps -t priv -v 1000 -i zoneid global

Although the global zone defaults to 1 cpu-share when using the FSS scheduling class, and you can dynamically change the global zone's cpu-share using prctl, there is no way for this to be persistently specified [11]. As a result, various ad-hoc solutions have been developed by our users. There is currently no way to cap global zone memory consumption as a whole.
There is currently no way to persistently specify a pool for the global zone. The poolbind command can be used to bind the global zone to a pool, but that does not persist across reboots. Overall, we have the same issues with RM complexity in the global zone as we have with non-global zones.

To address this, we will enable zonecfg to be used to configure the RM settings for the global zone. Currently, running 'zonecfg -z global' is an error. We will enhance zonecfg so that it is legal to use it on the global zone, but only the RM-related subset of the zonecfg properties/resources will be visible and allowed:

    pool
    rctls
    new RM-related resources (includes temporary pools & rcapd setting)

The 'info' subcommand will only display this subset of the properties/resources, along with the zonename ('global').

The global zone does not boot like a non-global zone and there is no zoneadmd managing the global zone. Instead, we will use SMF to apply the global zone RM settings. None of the existing SMF services are a fit for applying all of the global zone RM settings. We will add a new SMF service (svc:/system/resource-mgmt) which will apply the global zone RM configuration when this service starts. This service will have the following dependencies:

    require_all/none   svc:/system/filesystem/minimal
    optional_all/none  svc:/system/scheduler
    optional_all/none  svc:/system/pools

and this dependent:

    optional_all/none  svc:/system/rcap

This service will set the global zone's pool, rctls, any tmp pool configuration and the rcapd setting. If a tmp pool is configured but cannot be created, a warning will be issued but the global zone will still boot (as opposed to non-global zones, which will not boot if their tmp pool cannot be created when the zone attempts to boot). In this case, the system/resource-mgmt service will go into maintenance so that the sysadmin can more easily begin to root-cause the problem.
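As a sketch of how this would look, persistently setting the global zone's cpu shares might use the cpu-shares rctl alias from section 3 (the value 2 is a hypothetical example):

```
# zonecfg -z global
zonecfg:global> set cpu-shares=2
zonecfg:global> exit
```

The setting would then be applied by the svc:/system/resource-mgmt service at the next boot, rather than by zoneadmd.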
There are several tricky interactions with updating the global zone RM settings while there are running projects. This service will not provide a 'refresh' method in the first phase. Instead, the sysadmin would have to reboot to apply the settings, or manually update the global zone RM settings using the existing commands (poolcfg, poolbind, prctl & rcapadm). A 'refresh' method is a future enhancement under consideration.

8) Add RM templates for zone creation

Zonecfg already supports templates on the 'create' subcommand using the '-t' option. We will update the documentation, which currently states that a template must be the name of an existing zone. We already deliver two templates (SUNWblank and SUNWdefault). Providing zone configuration templates with some basic RM settings will make it even easier to set up a good zone/RM combination. We will eventually deliver at least four new templates that configure reasonable default properties for the four primary combinations of the new resources in zonecfg:

    dedicated-cpu & dedicated-memory
    dedicated-cpu & capped-memory
    capped-cpu & dedicated-memory
    capped-cpu & capped-memory

We may also deliver other templates that only pre-configure one of the new resources (e.g. only dedicated-cpu, leaving memory with the default handling). We will enhance the 'create' help command to briefly describe the templates and why you would use one vs. another. The names of all new templates will begin with SUNW. This namespace was already reserved by [1]. As we add templates we will file "closed approved automatic" fast-tracks to register their names. This change will primarily impact the documentation.

9) Pools system objective defaults to weighted load (wt-load) [4]

Currently pools are delivered with no system objective set. This usually means that if you enable the poold(1M) service, nothing will actually happen on your system.
As part of this project, we will set weighted load (system.poold.objectives=wt-load) as the default objective. Delivering this objective as the default does not impact systems out of the box since poold is disabled by default.

EXPORTED INTERFACES

[The proposed interfaces for future phases of this project are listed separately after this interface table. Those are the interfaces that are mentioned above as futures but that won't be part of the first phase of this project.]

    New zonecfg resource & property names
        dedicated-cpu                           Committed
            ncpus                               Committed
            importance                          Committed
        capped-memory                           Committed
            physical                            Committed
        scheduling-class                        Committed

    New zonecfg rctl alias names
        max-lwps                                Committed
        cpu-shares                              Committed
        (Future rctl aliases will follow this pattern, with the
        zone.{name} rctl name being shortened to {name}.)

    zonecfg subcommand changes
        clear                                   Committed
        remove behavior                         Committed
        remove -F                               Committed

    Temporary pool & resource names
        SUNWtmp_{zonename}                      Committed

    New libpool(3LIB) pool & resource boolean properties
        'pool.temporary'                        Committed
        'pset.temporary'                        Committed

    New libpool(3LIB) functions
        pool_set_temporary                      Consolidation Private
        pool_rename_temporary                   Consolidation Private

    Ability to use rcapd to cap zones'
    physical memory                             Committed
    New rcapadm -R option                       Committed
    New rcapstat -p & -z options                Committed
    Use of zonecfg to configure global
    zone RM properties                          Committed
    New service svc:/system/resource-mgmt       Committed
    wt-load as default                          Committed

IMPORTED INTERFACES

    libpool(3LIB)    Unstable    PSARC 2000/136 & libpool(3LIB)

PLANNED FUTURE EXPORTED INTERFACES (informational only at this time)

    New zonecfg resource & property names
        capped-cpu                              Committed
            ncpus                               Committed
        capped-memory
            virtual                             Committed
        dedicated-memory                        Committed
            physical                            Committed
            virtual                             Committed
            importance                          Committed

The capped-cpu and dedicated-memory resource names are in anticipation of the future integration of the cpu-caps[14] and memory sets[15] projects.
The name 'virtual' is not necessarily the final property name we will use.

    New libpool(3LIB) resource boolean property
        'mset.temporary'                        Committed

    svc:/system/resource-mgmt refresh method    Committed

REFERENCES

 1. PSARC 2002/174 Virtualization and Namespace Isolation in Solaris
 2. PSARC 2000/136 Administrative support for processor sets and extensions
 3. PSARC 1999/119 Tasks, Sessions, Projects and Accounting
 4. PSARC 2002/287 Dynamic Resource Pools
 5. PSARC 2002/519 rcapd(1MSRM): resource capping daemon
 6. PSARC 2003/155 rcapd(1M) sedimentation
 7. 6421202 RFE: simplify and improve zones/pool integration
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6421202
 8. 6222025 RFE: simplify rctl syntax and improve cpu-shares/FSS interaction
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6222025
 9. 5026227 RFE: ability to rcap zones from global zone
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5026227
10. 6409152 RFE: template support for better RM integration
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6409152
11. 4970603 RFE: should be able to persistently specify global zone's cpu shares
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4970603
12. 6442252 zonecfg's "unset" syntax is not documented and confusing
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6442252
13. 4754856 *prstat* prstat -atJTZ should count shared segments only once
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4754856
14. PSARC 2004/402 CPU Caps
15. PSARC 2000/350 Physical Memory Control
16. PSARC 2002/181 Swap Sets
17. PSARC 2004/580 zone/project.max-locked-memory Resource Controls
18. PSARC 2006/451 System V resource controls for Zones
19. PSARC 2005/471 BrandZ: Support for non-native zones