1. Introduction

1.1. Project/Component Working Name: Fast Crash Dump
1.2. Name of Document Author/Supplier: Dave Plauger, Steve Sistare
1.3. Date of This Document: August 11, 2009
1.3.1. Date this project was conceived: August 2008
1.4. Name of Major Document Customer(s)/Consumer(s):
1.4.1. The PAC or CPT you expect to review your project: Solaris PAC
1.4.2. The ARC(s) you expect to review your project: PSARC
1.4.3. The Director/VP who is "Sponsoring" this project: Mike Sanfratello
1.4.4. The name of your business unit: Solaris Platform Software/CSW
1.5. Email Aliases:
1.5.1. Responsible Manager: Scott.Michael@Sun.COM
1.5.2. Responsible Engineer: Dave.Plauger@Sun.COM
1.5.3. Marketing Manager: Dan Roberts
1.5.4. Interest List: Scalability team (scalability-proj-interest@Sun.COM)

2. Project Summary

2.1. Project Description:

Newer systems have ever-larger memories; some are now 1TB and larger. One consequence is that saving a crash image takes much longer and requires more disk space to record. This affects availability: should a panic occur, the system spends more time saving the memory dump before it restarts. It also affects serviceability, for two reasons: 1) larger core files require more reserved disk space, and 2) core files take longer to transfer over the network for off-site analysis. The second reason is the more important, because most core file analysis takes place off site, especially for Sun customers.

This project attacks the problem in three dimensions.

First, dump time is reduced by introducing parallelism. In the existing implementation, all CPUs except the one processing the panic are idled during a crash, which leaves a huge reserve of untapped processing power. These otherwise idle processors are put to work during the memory compression phase of a core dump.

Second, the time to save a crash dump is limited by the I/O rate.
This I/O time is reduced by increasing the compression factor: fewer bytes to record means less time to save them.

Third, crash dumps are recorded in a compressed format. Compressed dumps require less disk space and can be transmitted over the network in less time.

There is always a price: the better compression algorithms require much more compute time and much more memory for buffering. These requirements are balanced by taking advantage of the idle processors. In addition, much memory is usually not dumped during a crash, and that memory is available for extra buffering.

This project scales from the smallest to the largest systems. When ample memory and many processors are available, the project takes advantage of them; on smaller systems, the minimum level of parallelism is used.

2.2. Risks and Assumptions:

Greater compression requires much more CPU time and memory buffering, so there is a risk of longer dump times in some cases. Such regressions, if they occur, should cost a few minutes, not longer. The risk is minimized by testing compression a priori under a range of conditions and using those results at crash time to select the algorithm expected to perform best.

3. Business Summary

3.1. Problem Area: Availability and serviceability of large-memory systems after a crash.
3.2. Market/Requester: Data center and mid-range servers.
3.3. Business Justification: This project improves both serviceability and availability for all Sun platforms (SPARC and x86). The time to take a crash dump should improve by 2 to 10 times, depending on the platform, and the disk space required to save crash dumps will shrink by the same factor. More importantly, compressed crash dumps can be moved over the network more quickly, helping Sun personnel better serve its customers. Saving compressed crash dumps instead of full dumps also requires less reserved disk space.
3.4. Competitive Analysis: Not available.
3.5. Opportunity Window/Exposure: PTL#6696: snv_125 and S10U9_02. This enhancement is desired for an S10 update because large-system customers are more conservative and will stay on S10 for years. S10U9 may be the last S10 update, so that is the target.
3.6. How will you know when you are done?: Reboot timings show a >2X improvement.

4. Technical Description:

4.1. Details:

New command-line flags to dumpadm(1M) control how core files are saved; the setting persists across reboots in /etc/dumpadm.conf. The new default behavior is to save files in compressed format rather than always uncompressing them, as is done currently. When saving compressed, savecore(1M) copies the core file from the dump device to vmdump.N, where N is the usual dump integer. Copying the core file is much faster than uncompressing it into unix.N and vmcore.N images, and the copy takes up much less disk space. On systems that dump to the swap area, there is also less risk that the core image will be overwritten by swap activity before it can be extracted. In addition, savecore(1M) performance has been improved by reading the dump file with fread(3C) instead of pread(2).

The compressed core dump format is largely unchanged: the dump header and the dump version number are the same. However, the way memory pages are saved within the compressed file has changed. A compressed dump file must first be uncompressed, by manually running savecore(1M) a second time, before it can be used by other tools; the changes in compression methods therefore have no impact on mdb(1). Once the core dump has been uncompressed, the resulting unix.N and vmcore.N files are in the same format as before.

This project introduces a bzip2 compression library [2] into Solaris common code, where it is shared by savecore(1M) and the kernel. The current lzjb compression algorithm is much faster but also much weaker; the bzip2 library requires much more memory and compute resources.
If these resources are available during panic, the kernel saves memory pages with bzip2 instead of lzjb, and savecore(1M) uses the same bzip2 library to uncompress the pages.

The kernel function dumpsys() does most of the work of creating core dump images. The section that saves memory pages has been expanded to support parallelism. Most kernel services are not available during panic; instead, CPUs spinning in panic_idle() call into dumpsys() and coordinate through memory. These helper CPUs copy pages, compress them, and produce streams of compressed data that savecore(1M) can uncompress. The panic CPU acts as the master and performs all page mapping and I/O operations.

There are two compression modes in this implementation. The older method, lzjb, is the default on smaller systems; with up to 4 CPUs it usually speeds up the dump by 2-4 times. The new bzip2 library is employed on large systems with many spare CPUs and plenty of memory, and can speed up dumps by 4-10 times. The mode is chosen at crash time based on processor type, number of CPUs, and available free memory for buffers.

The existing savecore -L (live dump) option creates a dump image on a running system. This option is available only when there is a dedicated dump device; in this case the dump helpers in the kernel run as system tasks.

file(1) can detect the new compressed format. For example:

    # file vmcore.0
    vmcore.0: SunOS 5.11 snv_81 64-bit SPARC crash dump from 'oaf415'
    # file vmdump.0
    vmdump.0: SunOS 5.11 snv_81 64-bit SPARC compressed crash dump from 'oaf415'

For more information, see the project page [1] and the RFE [3].

4.2. Bug/RFE Number(s): RFE 6828976 Fast Crash Dump
4.3. In Scope: The scope of this project is limited to the time it takes to produce a crash dump and the disk space needed to preserve crash dumps. There are also new man pages for dumpadm(1M) and savecore(1M).
4.4. Out of Scope: Not applicable.
4.5.
Interfaces: New command-line flags to dumpadm(1M), and additions to the meaning of existing flags to savecore(1M). Micro/patch binding requested.

    INTERFACE          COMMITMENT LEVEL    COMMENT
    dumpadm -z (1M)    Committed           Enables or disables saving dumps compressed.

4.6. Doc Impact: New man pages for dumpadm(1M) and savecore(1M). See the attached dumpadm.1m.diffmk.txt and savecore.1m.diffmk.txt.
4.7. Admin/Config Impact: New options for crash dump.
4.8. HA Impact: None.
4.9. I18N/L10N Impact: None.
4.10. Packaging & Delivery: Small modification of existing kernel packages.
4.11. Security Impact: None.
4.12. Dependencies: None.

5. Reference Documents:

[1] Project Twiki Page
    http://agares.central.sun.com/twiki/bin/view/Scalability/FastCrashDump
[2] bzip2 and libbzip2
    http://www.bzip.org/
[3] 6828976 Fast Crash Dump
    http://monaco.sfbay/detail.jsf?cr=6828976

6. Resources and Schedule:

6.1. Projected Availability: snv_125 and S10U9_02.
6.2. Cost of Effort: This project is estimated to require one software engineer for 8 months of development and testing.
6.3. Cost of Capital Resources: Will use existing shared lab machines.
6.4. Product Approval Committee requested information:
6.4.1. Consolidation or Component Name: OS/Net
6.4.3. Type of CPT Review and Approval expected: N/A
6.4.4. Project Boundary Conditions: No critical bugs. No regressions on existing platforms.
6.4.5. Is this a necessary project for OEM agreements: No.
6.4.6. Notes: None.
6.4.7. Target RTI Date/Release: Putback to snv_125 and S10U9_02.
6.4.8. Target Code Design Review Date: TBD.
6.4.9. Update approval addition: None.
6.5. ARC review type: PSARC
6.6. ARC Exposure: Open.

7. Prototype Availability:

7.1. Prototype Availability: Available now.
7.2. Prototype Cost: 4 engineer-months of development time.