1. Introduction

1.1. Project/Component Working Name: Fast Crash Dump
1.2. Name of Document Author/Supplier: Dave Plauger, Steve Sistare
1.3. Date of This Document: August 11, 2009

2. Reference Documents:

[1] Project Twiki Page
    http://agares.central.sun.com/twiki/bin/view/Scalability/FastCrashDump
[2] bzip2 and libbzip2
    http://www.bzip.org/
[3] 6828976 Fast Crash Dump
    http://monaco.sfbay/detail.jsf?cr=6828976
[4] bzip2 Open Source Review
    https://opensourcereview.central.sun.com/app?action=ViewReq&traq_num=11163
[5] Comparison of compression ratios
    http://agares.central.sun.com/twiki/pub/Scalability/FastCrashDump/cores.ratios.ods
[6] Fast Crash scalability model
    http://agares.central/twiki/pub/Scalability/FastCrashDump/crash_model.ods

3. Overview

This document describes the design of Fast Crash Dump. The goal is to decrease the time it takes to save memory dumps during panic, and to reduce the amount of disk space required to save them. This improves Solaris availability, because the system recovers faster after a crash. It also improves serviceability, because smaller crash files can be moved over the network more quickly for off-site analysis, and less reserved disk space is needed for saving crash files.

Crash dump time is dominated by disk write time. To reduce this, the stronger compression method bzip2 is applied to reduce the dump size and hence reduce I/O time. However, bzip2 is much more computationally expensive than the existing lzjb algorithm, so to avoid increasing compression time, CPUs that are otherwise idle during panic are employed to parallelize the compression task. Many helper CPUs are needed to keep bzip2 from becoming a bottleneck, and on systems with too few CPUs, the lzjb algorithm is parallelized instead. Lastly, I/O and compression are performed by different CPUs, and are hence overlapped in time, unlike the current code.
These optimizations reduce crash dump time on large machines by 3X to 10X, depending on processor speed, CPU count, and data compression.

After the system reboots, savecore(1M) runs and the crash dump is saved in a compressed format in the file system. Copying a compressed image from the dump device to the file system is much faster than uncompressing the image right away. A compressed dump image can be copied over the network and uncompressed at a later time for analysis. If crash dumps are saved in the swap area, then copying them quickly means there is less risk of the dump image being corrupted by paging and swapping.

New command line flags have been added to dumpadm(1M) to control the new feature, with additions to /etc/dumpadm.conf to save the settings across reboots. Also, savecore(1M) can copy the compressed image from the dump device to the file system; at a later time, savecore(1M) can be invoked by a user to uncompress the image. No new savecore(1M) flags are needed; the existing flags are sufficient.

During panic, all but one of the CPUs are idle. These CPUs are used as helpers working in parallel to copy and compress memory pages. During a panic, however, these processors cannot call kernel services, because mutexes become no-ops during panic and cross-call interrupts are inhibited. Therefore, during a panic dump the helper CPUs communicate with the panic CPU using memory variables. All memory mapping and I/O is performed by the panic CPU.

Several compression methods were compared before bzip2 was chosen. With enough CPUs to parallelize the compression, bzip2 is a win: the problem becomes I/O bound, and the higher compression achieved by bzip2 reduces the amount of data written to disk, giving a speedup. See the first page of this spreadsheet for graphs of parallel lzjb vs. bzip2 vs. NCPU on various platforms [6].

4. User commands

4.1 dumpadm(1M)

New options to dumpadm(1M) allow administrators to control the new features. The additional flags are:

    dumpadm [-z y|n]

4.1.1 dumpadm -z y

Save the dump compressed. Modify the dump configuration such that on the next live dump or reboot after a crash, savecore(1M) will save a compressed dump image to savecore-dir/vmdump.N. This is the new default setting. Creates an entry in /etc/dumpadm.conf:

    DUMPADM_CSAVE=y

Note: savecore-dir is the value of DUMPADM_SAVDIR stored in /etc/dumpadm.conf. It is set via dumpadm -s.

4.1.2 dumpadm -z n

Save the dump uncompressed. Modify the dump configuration such that on the next live dump or reboot after a crash, savecore(1M) will save an uncompressed dump image to savecore-dir/vmcore.N and savecore-dir/unix.N. This is the old behavior. Creates an entry in /etc/dumpadm.conf:

    DUMPADM_CSAVE=n

4.2 mdb(1)

There are no new options to mdb(1). However, mdb tries to give helpful messages should a user try to analyze a compressed dump by mistake.

4.2.1 mdb /var/crash/hostname/vmdump.N

This gives a usage message stating that savecore -f must be run first in order to decompress the dump file. For example:

    # mdb /crash/vmdump.0
    mdb: cannot open compressed dump; use savecore -f /crash/vmdump.0

mdb "N" is also legal; it means load unix.N and vmcore.N. If vmdump.N exists and the uncompressed files do not, then mdb prints an error message. For example:

    # mdb 0
    mdb: cannot open compressed dump; use savecore -f vmdump.0

4.2.2 Could mdb be enhanced to uncompress the dump on the fly?

That idea was considered and discarded. Decompression is slow, and there would be a large delay as the first mdb commands are issued, even if we tried to be clever and compartmentalize the compressed sections so we could decompress more selectively.
Also, common mdb commands such as walking all threads and stacks access many pages throughout the system, causing most of the file to be decompressed. That would take a long time and manifest itself to the user as a "slow mdb command". The decompression overhead would also be incurred every time mdb is invoked. It is better to simply decompress the whole file once, save the result to a file, and be done.

4.3 savecore(1M)

There are no new options to savecore(1M). The existing options are used to control the new features.

4.3.1 savecore savecore-dir

Invoked by the system start-up script svc_dumpadm, this saves the crash dump, if any. It is called twice upon start-up: the first time it checks for a crash dump in the swap area; the second time it checks the configured dump device.

Without a -f option, savecore opens /dev/dump and learns the name of the dump device via ioctl DIOCGETDEV. It reads /etc/dumpadm.conf and checks DUMPADM_CSAVE.

If DUMPADM_CSAVE=y, savecore copies the dump device to a file named savecore-dir/vmdump.N. N is the usual bounds number, a small integer that is incremented after every dump is saved. savecore sends a console message explaining that savecore -f vmdump.N must be run.

If DUMPADM_CSAVE=n, savecore decompresses the dump into savecore-dir, making vmcore.N and unix.N. This is the old behavior.

4.3.2 savecore -f vmdump.N [destination-directory]

The default destination directory is DUMPADM_SAVDIR (savecore-dir), which is set in /etc/dumpadm.conf. A user runs this to decompress vmdump.N and create unix.N and vmcore.N in the destination directory.

4.4 file(1)

There are no new options to file(1), but it has been modified to distinguish between the compressed and uncompressed formats.

4.4.1 file vmcore.0

No change to this command. It gives information about an uncompressed core file.
For example:

    vmcore.0: SunOS 5.11 snv_81 64-bit SPARC crash dump from 'oaf415'

4.4.2 file vmdump.0

If the new flag DF_COMPRESSED is set in the dump header, then the string "compressed" is added to the output. For example:

    vmdump.0: SunOS 5.11 snv_81 64-bit SPARC compressed crash dump from 'oaf415'

5. Kernel interfaces

5.1 dumphdr_t

A new bit is defined in the dump_flags member:

    DF_COMPRESSED=0x00000008

This marks the dump image as being in a compressed format. It is set initially when a dump file is created on the dump device. It remains set in vmdump.N and is cleared in vmcore.N. The file(1) and mdb(1) commands access this flag in order to determine the dump type.

5.2 dumpdatahdr_t

This new header follows the dump header at the end of the dump file. It only exists in a compressed dump file. It is used to communicate compression information from dumpsys() in the kernel to savecore. The header is removed by savecore when it creates vmcore.N.

    /*
     * The dump data header is placed after the dumphdr in the compressed
     * image. It is not needed after savecore runs and the data pages have
     * been decompressed.
     */
    typedef struct dumpdatahdr {
            uint32_t dump_datahdr_magic;    /* data header presence */
            uint32_t dump_datahdr_version;  /* data header version */
            uint64_t dump_data_csize;       /* compressed data size */
            uint32_t dump_maxcsize;         /* compressed data max block size */
            uint32_t dump_maxrange;         /* max number of pages per range */
            uint16_t dump_nstreams;         /* number of compression streams */
            uint16_t dump_clevel;           /* compression level (0-9) */
    } dumpdatahdr_t;

    #define DUMP_DATAHDR_MAGIC      0x64686472U
    #define DUMP_DATAHDR_VERSION    1
    #define DUMP_CLEVEL_LZJB        1       /* parallel lzjb compression */
    #define DUMP_CLEVEL_BZIP2       2       /* parallel bzip2 level 1 */

An alternative design was considered and discarded: adding the above members to the dump header itself. However, that would require changing DUMP_VERSION, and doing so breaks backward compatibility.
With some code changes, newer mdb versions could read older dump versions, but otherwise compatible dumps made by a newer kernel would not be readable by an older mdb. The current design maintains dump compatibility across mdb versions: creating an old-format dump with the new kernel is still an option, and the new savecore can still read the old dump file format. The other reason to create a separate dump data header is that its information is needed only by savecore, not by mdb. It will be easier to extend the kernel/savecore interface to add new compression methods without impacting mdb.

5.3 dumpcsize_t

In the previous lzjb format, each compressed data page is encoded in the dump file as a 32-bit csize followed by 'csize' bytes of data. The value of csize is between 1 and pagesize.

In the new format, CPUs run in parallel compressing groups of pages. The output from each CPU is not synchronized in any way. Rather, the output of each CPU is divided into blocks, and the blocks are interleaved in the dump file. The compression data from a CPU is called a "stream". A stream is identified with a unique number called a "tag".

dumpcsize_t extends the meaning of the 32-bit csize by defining higher-order bits. It is an encoding of the csize word used to provide meta-information between dumpsys and savecore.

    typedef struct dumpcsize {
            uint32_t tag:12, size:20;
    } dumpcsize_t;

    tag     size               meaning
    1-4095  1..dump_maxcsize   stream block
    0       1..pagesize        one lzjb page
    0       0                  marks end of data

The maximum number of streams is 4095 (DUMP_MAX_TAG); each helper CPU has its own stream. The maximum size is 2^20-1 (DUMP_MAX_CSIZE); this is the number of bytes in the stream block that follows. The format is described in section 6.

5.4 dumpstreamhdr_t

A range of pages is defined by the pair (starting pfn, number of pages):

    typedef struct dumpstreamhdr {
            char    stream_magic[8];    /* "StrmHdr" */
            pfn_t   stream_pagenum;     /* starting pfn */
            pgcnt_t stream_npages;      /* number of pages */
    } dumpstreamhdr_t;

Stream headers are included in a compressed stream just before each new range of pages. The magic number is included in order to detect corruption of the dump file.

5.5 bzip2 library

This is bzip2/libbzip2 version 1.0.5 of 10 December 2007, downloaded from www.bzip.org (see [2]). Minor modifications were made to eliminate compilation and lint errors. The library is in usr/src/uts/common/bzip2. This is third-party code, and it has been approved for inclusion in Solaris; see the Open Source Review case at [4].

The following new interfaces were added for use by the kernel function dumpsys() and the user command savecore(1M):

5.5.1 BZ_EXTERN const char * BZ_API(BZ2_bzErrorString)(int error_code)

Gets an error message string given a libbzip2 error code.

5.5.2 int BZ_API(BZ2_bzCompressReset)(bz_stream *strm)

A variant of BZ2_bzCompressInit. It resets internal state without allocating memory, and is used as a replacement for the sequence BZ2_bzCompressEnd / BZ2_bzCompressInit. Called before compressing a new stream.

5.5.3 int BZ_API(BZ2_bzDecompressReset)(bz_stream *strm)

Resets internal state without allocating memory. Called before decompressing a new stream.

5.5.4 int BZ_API(BZ2_bzCompressInitSize)(int blockSize100k)

Returns the size of memory that will be allocated when BZ2_bzCompressInit is called. This is needed for sizing memory before actual allocation is done. The function adds a rounding factor to the size of each data structure, allowing enough extra space so that the dumpbzalloc() callback during BZ2_bzCompressInit can align each item. The rounding factor defined for this purpose is another new interface:

    #define BZ2_BZALLOC_ALIGN (64)

5.6 dump ioctl

Ioctl commands for /dev/dump are defined in common/sys/dumpadm.h.
5.6.1 DIOCSETCONF

Modified slightly to allow setting the new dumpadm configuration flag DUMP_PARALLEL, which controls parallel dumps.

    #define DUMP_PARALLEL 0x00000002

5.7 hat layer extensions

A new function is provided for helper CPUs to flush translations. Each helper CPU takes a 4MB range (mapped with 2MB or 4MB translations) and looks for pages to be dumped within the range. These mappings are reused for different groups of 4MB ranges. In order to reuse a mapping for a different physical translation, it is necessary to invalidate any previous translations in that range.

Most kernel services are not available to helper CPUs during panic. This function is safe to call during panic from any CPU. It is implemented in i86pc/vm/hat_i86.c and sfmmu/vm/hat_sfmmu.c.

    void hat_flush_range(hat, addr, size)

Invalidates a virtual address translation for the local CPU.

5.8 panic_idle(void)

When the system panics, all CPUs except the panic CPU spin-wait in panic_idle(). This spin loop has been changed to call into dumpsys in order to help with dump parallelism. There are two new globals, a flag and an entry point:

    extern volatile int dumpsys_helpers_wanted;
    extern void dumpsys_helper(void);

This is implemented in sun4u/os/mach_cpu_states.c and sun4v/os/mach_cpu_states.c:

    for (;;) {
            if (dumpsys_helpers_wanted)
                    dumpsys_helper();
    }

The implementation is slightly different in i86pc/os/machdep.c:

    for (;;) {
            if (dumpsys_helpers_wanted)
                    dumpsys_helper();
    #ifndef __xpv
            else
                    i86_halt();
    #endif
    }

The flag dumpsys_helpers_wanted is set true initially and is set false before dumpsys() exits. Therefore, i86_halt() will not be entered until dumpsys() completes.

5.9 dumpsys()

This function is called during panic to produce a crash dump. It is largely unchanged, except for the section that saves data pages, which is very different.
Dumping data pages previously had this basic flow:

    create a bitmap of pages to dump
    for each bit set in the bitmap
        map the page
        copy it, trapping Uncorrectable Errors
        unmap the page
        compress the copy with lzjb
        write the 32-bit size (csize)
        write csize bytes

This has been restructured to support parallelism. The master thread does the following:

    create a bitmap of pages to dump
    initialize data structures: helpers and queues
    set dumpsys_helpers_ready to enable helper CPUs
    if "live" dump
        create system tasks to run the helpers
    while not done
        map some pages and pass to a helper
        get blocks of compressed pages from a helper
        write compressed pages
        unmap pages
    call platform hook dump_plat_data, which always uses lzjb
    write end marker for all streams
    write dump data header
    write initial and terminal dump headers

The helper threads do the following:

    while not done
        get input pages to be compressed from master
        get an output buffer from master
        copy and compress pages, trapping Uncorrectable Errors
        send compressed output buffer to master

5.9.1 new data structures

helper_t helpers: contains the context for a stream. Each has a unique tag, and includes bzip context for compressing streams.

cbuf_t buffer controllers: used for both input and output. The buffer state indicates how it is being used:

    CBUF_FREEMAP: 4MB virtual address range is available for mapping input pages.
    CBUF_INREADY: 4MB of input pages are mapped and ready for compression by a helper.
    CBUF_USEDMAP: 4MB mapping has been consumed by a helper. Needs unmap.
    CBUF_FREEBUF: 128KB output buffer, which is available.
    CBUF_WRITE:   128KB block of compressed pages from a helper, ready to write out.
    CBUF_ERRMSG:  128KB block of error messages from a helper (reports UE errors).

cqueue_t queues: a uni-directional channel for communication from the master to helper tasks, or vice versa, using put and get primitives. Both mappings and data buffers are passed via queues. Producers close a queue when done.
The number of active producers is reference counted so the consumer can detect end of data. Concurrent access is mediated by atomic operations for a panic dump, or by mutex/cv for a live dump. There are four queues, used as follows:

    Queue     Dataflow          NewState
    -------------------------------------------------------------------
    mainq     master -> master  FREEMAP   master has initialized or
                                          unmapped an input buffer
    helperq   master -> helper  INREADY   master has mapped input for
                                          use by helper
    mainq     master <- helper  USEDMAP   helper is done with input
    freebufq  master -> helper  FREEBUF   master has initialized or
                                          written an output buffer
    mainq     master <- helper  WRITE     block of compressed pages
                                          from a helper
    mainq     master <- helper  ERRMSG    error messages from a helper
                                          (memory error case)
    writerq   master <- master  WRITE     non-blocking queue of blocks
                                          to write
    -------------------------------------------------------------------

5.9.2 new tasks

Helper tasks block in three places: first to get input, second to get an output buffer, and third (rarely) to get an error message buffer. Deadlock is avoided because the number of buffers is a multiple of the number of helpers: there are at least 1X buffers for input and 2X buffers for output.

Helper CPUs do not have access to the console log. Therefore, any error messages are accumulated in a cbuf_t buffer and queued for writing by the main task as type CBUF_ERRMSG. Error messages can occur while copying pages. The page copy function uses on_trap() to capture any uncorrectable memory errors (UE). If a UE is caught, the bad bytes are replaced by the pattern (0xbadecc), and an error message is generated in a buffer.

The main task maps 4MB at a time and adds the mappings to the helperq.
However, not all pages within the 4MB page are expected to be dumped. Therefore, each helper task accesses the bit mask of pages, and only copies and compresses pages set in the bitmap.

Main task:

    bitnum = 0
    loop
        report progress (x% done message on console)
        while mainq is empty and writerq not empty
            get cbuf_t from writerq
            write to dump file
            put cbuf_t on freebufq
        get cbuf_t from mainq (block here)
        switch on buffer state:
        case CBUF_FREEMAP:
            if bitnum >= bitmap size
                break (drops maps from the queue)
            advance bitnum to next bit
            if bitnum >= bitmap size
                close helperq
                break (drops the map)
            derive pfn from bitnum
            map the containing 4MB page
            advance bitnum by 4MB in pages
            open helperq
            if dump level 0
                for each bit set within 4MB page
                    copy page (trap UE)
                    compress with lzjb
                    write 32-bit size
                    write compressed bytes
                put mapping on mainq
                break
            put mapping on helperq
            if bitnum >= bitmap size
                close helperq (doing the last page)
            break
        case CBUF_USEDMAP:
            unmap the 4MB page
            put map VA on mainq as CBUF_FREEMAP
            close mainq
            count number of pages done
            break
        case CBUF_WRITE:
            move to writerq
            break
        case CBUF_ERRMSG:
            write error message on console
            put cbuf_t on freebufq
            break
    end loop

Helper task for parallel lzjb:

    outbuf = NULL
    open mainq
    loop
        get mapping from helperq (block here)
        if NULL
            break (helperq is closed)
        for each bit set within 4MB page
            copy page (trap UE)
            compress with lzjb
            if not enough room left in outbuf
                put outbuf on mainq as CBUF_WRITE
                outbuf = NULL
            if outbuf is NULL
                get outbuf from freebufq (block here)
                set block tag in outbuf
            append 32-bit size to outbuf
            append compressed bytes to outbuf
        end for each
    end loop
    put any partial outbuf on mainq as CBUF_WRITE
    close mainq

Helper task for parallel bzip2:

    stream pagenum = -1
    open mainq
    loop
        get mapping from helperq (block here)
        if NULL
            break (helperq is closed)
        if pfn in mapping != stream pagenum
            create stream header
            compress stream header with bzip2
        for each bit set within 4MB page
            copy page (trap UE)
            compress with bzip2
    end loop
    finish the bzip2 stream
    close mainq

6.0 Dump file format

A crash image is formed on the system dump device in several sections. The dump header is located at a fixed offset from the end of the dump device (end - DUMP_OFFSET=65536). The dump header has offsets and sizes for each of the sections (except for the total size of the compressed data pages, unfortunately). The new data header was introduced in order to pass information about the compression method and number of streams; it also includes the size of the compressed data pages, which is not included in the dump header. The dump header contains a count of the number of data pages saved; this is how savecore knows when all data pages have been uncompressed.

Savecore reads the dump header (F) and the new data header (G) in order to find all of the sections (see A-G below). When saving a compressed copy, savecore reads each section and writes it to the file vmdump.N. It concatenates each section, updating the offsets in the dump header and core header as it goes. Thus, vmdump.N is a copy of the dump image with empty spaces removed; it otherwise has the same layout as the dump device. This file is usually much smaller than a 'dd' image of the dump device.

A crash image contains two copies of the dump header: (F) and (A). The reason for this duplication is that often the swap area is also the dump device. The old behavior was that savecore would uncompress the dump at start-up. This can take a long time, so there is a risk that the image can be overwritten due to paging and swapping before savecore finishes. Savecore compares the two headers when it is done to see whether an overwrite has occurred. In the new method, savecore simply copies the dump image into a file without uncompressing it. This takes much less time, which lowers the risk of the image being corrupted before it is saved. Savecore still compares the headers afterward, checking for corruption.
The dump file sections are:

    A: core header (dumphdr_t)
    B: compressed symbol table
    C: pfn table
    D: virtual to physical dump map
    E: compressed data pages
    F: dump header (dumphdr_t)
    G: data header (dumpdatahdr_t)

A: core header

The core header is located at the front of the dump image (lowest offset). It is located at offset 0 in the uncompressed file (vmcore.N). A dumphdr_t records offsets and sizes for each of the sections, as well as information about the kernel.

The core header A is initially a copy of the dump header F when dumpsys() creates the dump file on the dump device. The core header and dump header are compared when the dump file is saved. This is done for the case where the dump file is saved in the swap area: should swapping or paging occur while savecore is uncompressing the dump, it is possible that the core header could be overwritten.

The new default is to save the dump file in compressed format. In this case, the core header is updated with offsets into the compressed file (vmdump.N) as the sections are copied from the dump device to a file. This updated core header is written to the beginning and end of the compressed dump file, so that the compressed dump file format matches the dump device format. When uncompressed (unix.N, vmcore.N), the core header has offsets to the uncompressed sections. mdb(1) and file(1) access the core header at offset 0 in vmcore.N.

B: compressed symbol table

The kernel symbol table (namelist) is uncompressed with lzjb and written to unix.N.

C: pfn table

This is an array of physical page numbers (pfn), one pfn_t for each page dumped. The array corresponds with the bitmap that dumpsys() constructs.

D: virtual to physical dump map

This is an array of virtual-to-physical translations. savecore looks up the pfn in the pfn table and creates a hash map of translations.

E: compressed data pages

The older format, now called single-threaded lzjb, is simple:

    32-bit size0 (PFN 0)
    size0 bytes
    32-bit size1 (PFN 1)
    size1 bytes
    ...
    32-bit sizeN (PFN N-1)
    sizeN bytes

PFN is actually an index into the pfn table C. The size is in the range 1..pagesize, where pagesize is 4K on x86 and 8K on SPARC. The lzjb compression method is used on one page at a time, and the compressed size is guaranteed to be no larger than the input size. The number of pages N comes from the dump header F. Each group of bytes is decompressed with lzjb; the decompressed result must total pagesize bytes (otherwise it is an error).

This has been extended in order to support parallel lzjb or parallel bzip2. Either compression format can be used (but not a mixture), as indicated in the dump data header G. CPUs run in parallel at dump time; each CPU creates a single stream of compression data. Stream data is divided into 128KB blocks. The blocks are written in order within a stream, but blocks from multiple streams can be interleaved. Each stream is identified by a unique tag.

In the new format, the 32-bit size can have a value greater than pagesize. If so, it is interpreted as 2 fields:

    typedef struct dumpcsize {
            uint32_t tag:12, size:20;
    } dumpcsize_t;

The new format is an array of stream blocks. A stream block always has a non-zero value for the tag field; a tag has a value in the range 1..4095. Therefore, a stream block is always distinguishable from a single lzjb page, because the 32-bit value is always greater than pagesize.

    tag=Tx  size=N            one block of N bytes for stream Tx
    tag=Ty  size=N            one block of N bytes for stream Ty
    ...
    tag=Tx  size=N            additional block of N bytes for stream Tx
    ...
    tag=0   size=1..pagesize  optional platform data: lzjb data for one page
    ...
    tag=0   size=0            end of data marker; nothing follows

The purpose is to encode multiple streams in parallel, such that each stream can be reassembled and decompressed. The logical form of a stream is a series of objects. There are 2 types of objects: a stream header, and individual pages.
    header for range of pages:
    compressed data for pages in the range
    ...

The above is repeated within each stream. Each range of pages is disjoint over all streams. The above may fit within a single "block of N bytes" as shown previously, or it may be split across several blocks. The representation within each block is different for each compression method.

Parallel lzjb blocks hold a series of objects:

    dumpstreamhdr_t:
    32-bit size = 1..pagesize
    lzjb data for PFN+0
    32-bit size = 1..pagesize
    lzjb data for PFN+1
    ...
    32-bit size = 1..pagesize
    lzjb data for PFN+N-1

Objects are not split over block boundaries.

Parallel bzip2 blocks have a very different format:

    bzip2 data

The stream header and the data pages in the range are compressed together. As the bzip2 data is decompressed, the stream header is extracted first, followed by the pages. Once a bzip2 stream is decompressed, it produces these objects:

    dumpstreamhdr_t:
    pagesize bytes: PFN+0
    pagesize bytes: PFN+1
    ...
    pagesize bytes: PFN+N-1

Block boundaries can occur anywhere.

F: dump header (dumphdr_t)

The dump header is always located 64K (DUMP_OFFSET) from the end of the dump device, or from the end of the compressed copy (vmdump.N). It has an offset to the beginning core header.

G: data header (dumpdatahdr_t)

The data header is located after the dump header. It has compression information that savecore needs in order to decompress the file. (See 5.2 above.)

7.0 Memory space requirements

Memory usage varies with each compression method. However, each method performs the same basic steps, which are:

    A: Map a page to be dumped, according to the dump bitmap.
    B: Copy the mapped page to another buffer, removing UEs. Report UEs on
       the console.
    C: Compress the copy into an output buffer.
    D: Unmap the original page.
    E: Write the compressed page to the dump file.

These virtual and physical memory resources are allocated when dumpadm(1M) runs.
dumpadm runs automatically at start-up; it can also be run manually by the system administrator. Pre-allocation is done for two reasons. First, when the system is crashing it is probably unreliable, so pre-reserving resources for use later at dump time is more robust. Second, even if the system were reliable, it is desirable to preserve the kernel state as much as possible; allocating resources at dump time would affect the kernel state, perhaps obscuring the original fault.

The new parallel dump methods require much more VA range and memory buffering, because each CPU runs in parallel and must have its own resources. In this case, there is one master and multiple slaves. The master CPU is the one chosen to process the crash, and the slaves (helpers) are chosen from among the CPUs spinning in panic_idle. The above five steps are modified for the parallel case as follows:

    A: Master: map 4MB ranges, each of which contains one or more pages to be
       dumped, according to the dump bitmap.
    B: Helpers: copy the individual pages from a 4MB range that must be dumped,
       according to the dump bitmap. Filter any UEs and record diagnostics in
       an error message buffer.
    C: Helpers: compress the pages and concatenate them into an output buffer.
    D: Master: unmap the 4MB ranges. Helpers: invalidate the 4MB ranges.
    E: Master: write output to the dump file. Report any UEs on the console
       log from error message buffers.

The strategy is to pre-allocate a minimum amount of VA range and memory initially. Very small systems will still do the old-style dump. Most systems will allocate enough memory to do a parallel lzjb dump; this ensures a successful dump no matter the conditions when a crash occurs. However, there are usually many non-kernel pages that will not be dumped, and these pages are found at crash time and harvested to create additional buffers. Should a sufficient amount of buffering be available for bzip2, dumpsys will reconfigure itself to do the optimum dump.
7.1 single-threaded lzjb

This is the old method. It has minimal requirements; one CPU processes the dump file at crash time. The memory required is:

    A: 1 small page for mapping.
    B: 1 pagesize copy buffer for UE detection.
    C: 1 pagesize buffer for compressed output.

7.2 parallel lzjb

This is the old compression method, but with N helpers running in parallel. N=4 is often optimum for this method.

    A: N*2 4MB VA ranges for mapping, that is, 32MB of VA range. Doubling the
       map buffers allows the master to keep ahead of the helpers.
    B: N pagesize copy buffers for UE detection.
    C: N*4 128KB buffers used for output or for error messages, that is, 2MB
       of buffering for output. UEs are rare, therefore error messages are
       rare; most of this buffering is used for output.

7.3 parallel bzip2

This is the new compression method. It requires a large amount of state and is much slower, but gives much better compression. There are N helpers. The optimal value for N varies with processor speed but is large; N=100 is used for the calculations below.

    A: N*2 4MB VA ranges for mapping, that is, 200 * 4MB VA ranges. Doubling
       the map buffers allows the master to keep ahead of the helpers.
    B: N pagesize copy buffers for UE detection.
    C: N*4 128KB buffers for output or for error messages.

In addition, N*800KB is needed for bzip2 state. Thus the total for buffers and state for N=100 is about 128 MB.

8.0 Compression methods

There are two parallel compression methods in this implementation. The method is chosen at dump time based upon available memory, CPU architecture, and the number of available CPUs. See [5] for a comparison of compression ratios.

    Architecture  NCPU   Algorithm
    sun4u         < 24   parallel lzjb
    sun4u         >= 24  parallel bzip2(*)
    sun4v         < 64   parallel lzjb
    sun4v         >= 64  parallel bzip2(*)
    x86           < 16   parallel lzjb
    x86           >= 16  parallel bzip2(*)
    32-bit        N/A    single-threaded lzjb

    (*) bzip2 is only chosen if there is sufficient available memory for
        buffers at dump time.