1. Introduction

1.1. Project/Component Working Name: Fast Crash Dump
1.2. Name of Document Author/Supplier: Dave Plauger, Steve Sistare
1.3. Date of This Document: August 11, 2009
1.3.1. Date this project was conceived: August 2008
1.4. Name of Major Document Customer(s)/Consumer(s):
1.4.1. The PAC or CPT you expect to review your project: Solaris PAC
1.4.2. The ARC(s) you expect to review your project: PSARC
1.4.3. The Director/VP who is "Sponsoring" this project: Mike Sanfratello
1.4.4. The name of your business unit: Solaris Platform Software/CSW
1.5. Email Aliases:
1.5.1. Responsible Manager: Scott.Michael@Sun.COM
1.5.2. Responsible Engineer: Dave.Plauger@Sun.COM
1.5.3. Marketing Manager: Dan Roberts
1.5.4. Interest List: Scalability team (scalability-proj-interest@Sun.COM)

2. Project Summary

2.1. Project Description:

Newer systems have ever-larger memories; some are now 1TB and larger. One consequence is that saving a crash image takes much longer and requires more disk space to record. This affects availability: should a panic occur, the system spends more time saving the memory dump before it restarts. It also affects serviceability, for two reasons: 1) larger core files require more reserved disk space, and 2) core files take longer to transfer over the network for off-site analysis. The second reason is the more important, because most core file analysis takes place off site, especially for Sun customers.

This project attacks the problem in three dimensions.

First, dump time is reduced by introducing parallelism. In the existing implementation, all CPUs except the one processing the panic are idled during a crash, which leaves a huge reserve of untapped processing power. These otherwise idle processors are put to work during the memory compression phase of a core dump.

Second, the time to save a crash dump is limited by the I/O rate.
This I/O time is reduced by increasing the compression factor: fewer bytes to record means less time to save them.

Third, crash dumps are recorded in a compressed format. Compressed dumps require less disk space and can be transmitted over the network in less time.

There is always a price: the better compression algorithms require much more compute time and much more memory for buffering. These requirements are balanced by taking advantage of the idle processors. In addition, much memory is usually not dumped during a crash, and that memory is available for extra buffering.

This project scales from the smallest to the largest systems. When ample memory and many processors are available, the project takes advantage of them; on smaller systems, the minimum level of parallelism is used.

2.2. Risks and Assumptions:

Greater compression requires much more CPU time and memory buffering, so there is a risk of longer dump times in some cases. Such regressions, if they occur, should cost a few minutes, not longer. The risk is minimized by testing compression a priori under a range of conditions and using those results at crash time to select the algorithm expected to perform best.

3. Business Summary

3.1. Problem Area: Availability and serviceability of large-memory systems after a crash.
3.2. Market/Requester: Data center and mid-range servers.
3.3. Business Justification: This project improves both serviceability and availability for all Sun platforms (SPARC and x86). The time to take a crash dump should improve by 2 to 10 times, depending on the platform, and the disk space required to save crash dumps will shrink by the same factor. More importantly, compressed crash dumps can be moved over the network more quickly, helping Sun personnel better serve its customers. Saving compressed crash dumps instead of full dumps also requires less reserved disk space.
3.4. Competitive Analysis: Not available.
3.5. Opportunity Window/Exposure: PTL#6696: snv_125 and S10U9_02. This enhancement is desired for an S10 update because large-system customers are more conservative and will stay on S10 for years. S10U9 may be the last S10 update, so that is the target.
3.6. How will you know when you are done?: Reboot timings show a >2X improvement.

4. Technical Description:

4.1. Details:

New command-line flags to dumpadm(1M) control how core files are saved; the setting persists across reboots in /etc/dumpadm.conf. The new default behavior is to save files in compressed format rather than always uncompressing them, as is done currently. When saving compressed, savecore(1M) copies the core file from the dump device to vmdump.N, where N is the usual dump integer. Copying the core file is much faster than uncompressing it into unix.N and vmcore.N images, and the copy takes up much less disk space. On systems that dump to the swap area, there is also less risk that the core image will be overwritten by swap activity before it can be extracted. In addition, savecore(1M) performance has been improved by reading the dump file with fread(3C) instead of pread(2).

The compressed core dump format is largely unchanged: the dump header and the dump version number are the same. However, the way memory pages are saved within the compressed file has changed. A compressed dump file must first be uncompressed, by manually running savecore(1M) a second time, before it can be used by other tools; the changes in compression methods therefore have no impact on mdb(1). Once the core dump has been uncompressed, the resulting unix.N and vmcore.N files are in the same format as before.

This project introduces a bzip2 compression library [2] into Solaris common code, where it is shared by savecore(1M) and the kernel. The current lzjb compression algorithm is much faster but also much weaker; the bzip2 library requires much more memory and compute resources.
If these resources are available during panic, the kernel saves memory pages with bzip2 instead of lzjb, and savecore(1M) uses the same bzip2 library to uncompress the pages.

The kernel function dumpsys() does most of the work of creating core dump images. The section that saves memory pages has been expanded to support parallelism. Most kernel services are not available during panic; instead, CPUs spinning in panic_idle() call into dumpsys() and coordinate through memory. These helper CPUs copy pages, compress them, and produce streams of compressed data that savecore(1M) can uncompress. The panic CPU acts as the master and performs all page mapping and I/O operations.

There are two compression modes in this implementation. The older method, lzjb, is the default on smaller systems; with up to 4 CPUs it usually speeds up the dump by 2-4 times. The new bzip2 library is employed on large systems with many spare CPUs and plenty of memory, and can speed up dumps by 4-10 times. The mode is chosen at crash time based on processor type, number of CPUs, and available free memory for buffers.

The existing savecore -L (live dump) option creates a dump image on a running system. This option is available only when there is a dedicated dump device; in this case the dump helpers in the kernel run as system tasks.

file(1) can detect the new compressed format. For example:

    # file vmcore.0
    vmcore.0: SunOS 5.11 snv_81 64-bit SPARC crash dump from 'oaf415'
    # file vmdump.0
    vmdump.0: SunOS 5.11 snv_81 64-bit SPARC compressed crash dump from 'oaf415'

For more information, see the project page [1] and the RFE [3].

4.2. Bug/RFE Number(s): RFE 6828976 Fast Crash Dump
4.3. In Scope: The scope of this project is limited to the time it takes to produce a crash dump and the disk space needed to preserve crash dumps. There are also new man pages for dumpadm(1M) and savecore(1M).
4.4. Out of Scope: Not applicable.
4.5.
Interfaces: New command-line flags to dumpadm(1M), and additions to the meaning of existing flags to savecore(1M). Micro/patch binding requested.

    INTERFACE          COMMITMENT LEVEL    COMMENT
    dumpadm -z (1M)    Committed           Enables or disables saving dumps compressed.

4.6. Doc Impact: New man pages for dumpadm(1M) and savecore(1M). See the attached dumpadm.1m.diffmk.txt and savecore.1m.diffmk.txt.
4.7. Admin/Config Impact: New options for crash dump.
4.8. HA Impact: None.
4.9. I18N/L10N Impact: None.
4.10. Packaging & Delivery: Small modification of existing kernel packages.
4.11. Security Impact: None.
4.12. Dependencies: None.

5. Reference Documents:

[1] Project Twiki Page
    http://agares.central.sun.com/twiki/bin/view/Scalability/FastCrashDump
[2] bzip2 and libbzip2
    http://www.bzip.org/
[3] 6828976 Fast Crash Dump
    http://monaco.sfbay/detail.jsf?cr=6828976

6. Resources and Schedule:

6.1. Projected Availability: snv_125 and S10U9_02.
6.2. Cost of Effort: This project is estimated to require one software engineer for 8 months of development and testing.
6.3. Cost of Capital Resources: Will use existing shared lab machines.
6.4. Product Approval Committee requested information:
6.4.1. Consolidation or Component Name: OS/Net
6.4.3. Type of CPT Review and Approval expected: N/A
6.4.4. Project Boundary Conditions: No critical bugs. No regressions on existing platforms.
6.4.5. Is this a necessary project for OEM agreements: No.
6.4.6. Notes: None.
6.4.7. Target RTI Date/Release: Putback to snv_125 and S10U9_02.
6.4.8. Target Code Design Review Date: TBD.
6.4.9. Update approval addition: None.
6.5. ARC review type: PSARC
6.6. ARC Exposure: Open.

7. Prototype Availability:

7.1. Prototype Availability: Available now.
7.2. Prototype Cost: 4 engineer-months of development time.