1. Introduction

1.1. Project/Component Working Name: Fast Crash Dump
1.2. Name of Document Author/Supplier: Dave Plauger, Steve Sistare
1.3. Date of This Document: August 11, 2009

2. Reference Documents:

[1] Project Twiki Page
    http://agares.central.sun.com/twiki/bin/view/Scalability/FastCrashDump
[2] bzip2 and libbzip2
    http://www.bzip.org/
[3] 6828976 Fast Crash Dump
    http://monaco.sfbay/detail.jsf?cr=6828976
[4] bzip2 Open Source Review
    https://opensourcereview.central.sun.com/app?action=ViewReq&traq_num=11163
[5] Comparison of compression ratios
    http://agares.central.sun.com/twiki/pub/Scalability/FastCrashDump/cores.ratios.ods
[6] Fast Crash scalability model
    http://agares.central/twiki/pub/Scalability/FastCrashDump/crash_model.ods

3. Overview

This document describes the design of Fast Crash Dump. The goal is to decrease the time it takes to save memory dumps during panic, and to reduce the amount of disk space required to save them. This improves Solaris availability, because the system recovers faster after a crash. It also improves serviceability, because smaller crash files can be moved over the network more quickly for off-site analysis, and less reserved disk space is needed for saving crash files.

Crash dump time is dominated by disk write time. To reduce this, the stronger compression method bzip2 is applied to reduce the dump size and hence reduce I/O time. However, bzip2 is much more computationally expensive than the existing lzjb algorithm, so to avoid increasing compression time, CPUs that are otherwise idle during panic are employed to parallelize the compression task. Many helper CPUs are needed to keep bzip2 from becoming a bottleneck, and on systems with too few CPUs, the lzjb algorithm is parallelized instead. Lastly, I/O and compression are performed by different CPUs, and are hence overlapped in time, unlike the current code.
These optimizations reduce crash dump time on large machines by 3X to 10X, depending on processor speed, CPU count, and data compression.

After the system reboots, savecore(1M) runs and the crash dump is saved in a compressed format in the file system. Copying a compressed image from the dump device to the file system is much faster than uncompressing the image right away. A compressed dump image can be copied over the network and uncompressed at a later time for analysis. If crash dumps are saved in the swap area, then copying them quickly means there is less risk of the dump image being corrupted by paging and swapping.

New command line flags have been added to dumpadm(1M) to control the new feature, with additions to /etc/dumpadm.conf to save the settings across reboots. Also, savecore(1M) can copy the compressed image from the dump device to the file system; at a later time, savecore(1M) can be invoked by a user to uncompress the image. No new savecore(1M) flags are needed; the existing flags are sufficient.

During panic, all but one of the CPUs are idle. These CPUs are used as helpers working in parallel to copy and compress memory pages. During a panic, however, these processors cannot call kernel services, because mutexes become no-ops during panic and cross-call interrupts are inhibited. Therefore, during a panic dump the helper CPUs communicate with the panic CPU using memory variables. All memory mapping and I/O is performed by the panic CPU.

Several compression methods were compared before bzip2 was chosen. With enough CPUs to parallelize the compression, bzip2 is a win: the problem becomes I/O bound, and the higher compression achieved by bzip2 reduces the amount of data written to disk, giving a speedup. See the first page of this spreadsheet for graphs of parallel lzjb vs. bzip2 vs. NCPU on various platforms [6].

4. User commands

4.1 dumpadm(1M)

New options to dumpadm(1M) allow administrators to control the new features. The additional flags are:

    dumpadm [-z y|n]

4.1.1 dumpadm -z y

Save the dump compressed. Modify the dump configuration such that on the next live dump or reboot after a crash, savecore(1M) will save a compressed dump image to savecore-dir/vmdump.N. This is the new default setting. Creates an entry in /etc/dumpadm.conf:

    DUMPADM_CSAVE=y

Note: savecore-dir is the value of DUMPADM_SAVDIR stored in /etc/dumpadm.conf. It is set via dumpadm -s.

4.1.2 dumpadm -z n

Save the dump uncompressed. Modify the dump configuration such that on the next live dump or reboot after a crash, savecore(1M) will save an uncompressed dump image to savecore-dir/vmcore.N and savecore-dir/unix.N. This is the old behavior. Creates an entry in /etc/dumpadm.conf:

    DUMPADM_CSAVE=n

4.2 mdb(1)

There are no new options to mdb(1). However, mdb tries to give helpful messages should a user try to analyze a compressed dump by mistake.

4.2.1 mdb /var/crash/hostname/vmdump.N

This gives a usage message stating that savecore -f must be run first in order to decompress the dump file. For example:

    # mdb /crash/vmdump.0
    mdb: cannot open compressed dump; use savecore -f /crash/vmdump.0

mdb "N" is also legal; it means load unix.N and vmcore.N. If vmdump.N exists and the uncompressed files do not, then mdb prints an error message. For example:

    # mdb 0
    mdb: cannot open compressed dump; use savecore -f vmdump.0

4.2.2 Could mdb be enhanced to uncompress the dump on the fly?

That idea was considered and discarded. Decompression is slow, and there would be a large delay as the first mdb commands are issued, even if we tried to be clever and compartmentalize the compressed sections so we could decompress more selectively.
Also, common mdb commands such as walking all threads and stacks access many pages throughout the system, causing most of the file to be decompressed. That would take a long time and manifest itself to the user as a "slow mdb command". The decompression overhead would also be incurred every time mdb is invoked. It is better to simply decompress the whole file once, save the result to a file, and be done.

4.3 savecore(1M)

There are no new options to savecore(1M). The existing options are used to control the new features.

4.3.1 savecore savecore-dir

Invoked by the system start-up script svc_dumpadm, this saves the crash dump, if any. It is called twice upon start-up: the first time it checks for a crash dump in the swap area; the second time it checks the configured dump device.

Without a -f option, savecore opens /dev/dump and learns the name of the dump device via ioctl DIOCGETDEV. It reads /etc/dumpadm.conf and checks DUMPADM_CSAVE.

If DUMPADM_CSAVE=y, savecore copies the dump device to a file named savecore-dir/vmdump.N. N is the usual bounds number, a small integer that is incremented after every dump is saved. savecore sends a console message explaining that savecore -f vmdump.N must be run.

If DUMPADM_CSAVE=n, savecore decompresses the dump into savecore-dir, making vmcore.N and unix.N. This is the old behavior.

4.3.2 savecore -f vmdump.N [destination-directory]

The default destination directory is DUMPADM_SAVDIR (savecore-dir), which is set in /etc/dumpadm.conf. A user runs this to decompress vmdump.N and create unix.N and vmcore.N in the destination directory.

4.4 file(1)

There are no new options to file(1), but it has been modified to distinguish between the compressed and uncompressed formats.

4.4.1 file vmcore.0

No change to this command. It gives information about an uncompressed core file.
For example:

    vmcore.0: SunOS 5.11 snv_81 64-bit SPARC crash dump from 'oaf415'

4.4.2 file vmdump.0

If the new flag DF_COMPRESSED is set in the dump header, then the string "compressed" is added to the output. For example:

    vmdump.0: SunOS 5.11 snv_81 64-bit SPARC compressed crash dump from 'oaf415'

5. Kernel interfaces

5.1 dumphdr_t

A new bit is defined in the dump_flags member:

    DF_COMPRESSED=0x00000008

This marks the dump image as being in a compressed format. It is set initially when a dump file is created on the dump device. It remains set in vmdump.N and is cleared in vmcore.N. The file(1) and mdb(1) commands access this flag in order to determine the dump type.

5.2 dumpdatahdr_t

This new header follows the dump header at the end of the dump file. It only exists in a compressed dump file. It is used to communicate compression information from dumpsys() in the kernel to savecore. The header is removed by savecore when it creates vmcore.N.

    /*
     * The dump data header is placed after the dumphdr in the compressed
     * image. It is not needed after savecore runs and the data pages have
     * been decompressed.
     */
    typedef struct dumpdatahdr {
            uint32_t dump_datahdr_magic;    /* data header presence */
            uint32_t dump_datahdr_version;  /* data header version */
            uint64_t dump_data_csize;       /* compressed data size */
            uint32_t dump_maxcsize;         /* compressed data max block size */
            uint32_t dump_maxrange;         /* max number of pages per range */
            uint16_t dump_nstreams;         /* number of compression streams */
            uint16_t dump_clevel;           /* compression level (0-9) */
    } dumpdatahdr_t;

    #define DUMP_DATAHDR_MAGIC      0x64686472U
    #define DUMP_DATAHDR_VERSION    1
    #define DUMP_CLEVEL_LZJB        1       /* parallel lzjb compression */
    #define DUMP_CLEVEL_BZIP2       2       /* parallel bzip2 level 1 */

An alternative design was considered and discarded: adding the above members to the dump header itself. However, that would require changing DUMP_VERSION, and doing so breaks backward compatibility.
With some code changes, newer mdb versions could read older dump versions, but otherwise compatible dumps made by a newer kernel would not be readable by an older mdb. The current design maintains dump compatibility across mdb versions: creating an old-format dump with the new kernel is still an option, and the new savecore can still read the old dump file format. The other reason to create a separate dump data header is that its information is needed only by savecore, not by mdb. It will be easier to extend the kernel/savecore interface to add new compression methods without impacting mdb.

5.3 dumpcsize_t

In the previous lzjb format, each compressed data page is encoded in the dump file as a 32-bit csize followed by 'csize' bytes of data. The value of csize is between 1 and pagesize.

In the new format, CPUs run in parallel compressing groups of pages. The output from each CPU is not synchronized in any way. Rather, the output of each CPU is divided into blocks, and the blocks are interleaved in the dump file. The compression data from a CPU is called a "stream". A stream is identified with a unique number called a "tag".

dumpcsize_t extends the meaning of the 32-bit csize by defining higher-order bits. It is an encoding of the csize word used to provide meta-information between dumpsys and savecore.

    typedef struct dumpcsize {
            uint32_t tag:12, size:20;
    } dumpcsize_t;

    tag     size               meaning
    1-4095  1..dump_maxcsize   stream block
    0       1..pagesize        one lzjb page
    0       0                  marks end of data

The maximum number of streams is 4095 (DUMP_MAX_TAG); each helper CPU has its own stream. The maximum size is 2^20-1 (DUMP_MAX_CSIZE); this is the number of bytes in the stream block that follows. The format is described in section 6.

5.4 dumpstreamhdr_t

A range of pages is defined by the pair (starting pfn, number of pages):

    typedef struct dumpstreamhdr {
            char    stream_magic[8];    /* "StrmHdr" */
            pfn_t   stream_pagenum;     /* starting pfn */
            pgcnt_t stream_npages;      /* number of pages */
    } dumpstreamhdr_t;

Stream headers are included in a compressed stream just before each new range of pages. The magic number is included in order to detect corruption of the dump file.

5.5 bzip2 library

This is bzip2/libbzip2 version 1.0.5 of 10 December 2007, downloaded from www.bzip.org (see [2]). Minor modifications were made to eliminate compilation and lint errors. The library is in usr/src/uts/common/bzip2. This is third-party code, and it has been approved for inclusion in Solaris; see the Open Source Review case at [4].

The following new interfaces were added for use by the kernel function dumpsys() and the user command savecore(1M):

5.5.1 BZ_EXTERN const char * BZ_API(BZ2_bzErrorString)(int error_code)

Gets an error message string given a libbzip2 error code.

5.5.2 int BZ_API(BZ2_bzCompressReset)(bz_stream *strm)

A variant of BZ2_bzCompressInit. It resets internal state without allocating memory, and is used as a replacement for the sequence BZ2_bzCompressEnd / BZ2_bzCompressInit. Called before compressing a new stream.

5.5.3 int BZ_API(BZ2_bzDecompressReset)(bz_stream *strm)

Resets internal state without allocating memory. Called before decompressing a new stream.

5.5.4 int BZ_API(BZ2_bzCompressInitSize)(int blockSize100k)

Returns the size of memory that will be allocated when BZ2_bzCompressInit is called. This is needed for sizing memory before actual allocation is done. The function adds a rounding factor to the size of each data structure, allowing enough extra space so that the dumpbzalloc() callback during BZ2_bzCompressInit can align each item. The rounding factor defined for this purpose is another new interface:

    #define BZ2_BZALLOC_ALIGN (64)

5.6 dump ioctl

Ioctl commands for /dev/dump are defined in common/sys/dumpadm.h.
5.6.1 DIOCSETCONF

Modified slightly to allow setting the new dumpadm configuration flag DUMP_PARALLEL, which controls parallel dumps.

    #define DUMP_PARALLEL 0x00000002

5.7 hat layer extensions

A new function is provided for helper CPUs to flush translations. Each helper CPU takes a 4MB range (mapped with 2MB or 4MB translations) and looks for pages to be dumped within the range. These mappings are reused for different groups of 4MB ranges. In order to reuse a mapping for a different physical translation, it is necessary to invalidate any previous translations in that range.

Most kernel services are not available to helper CPUs during panic. This function is safe to call during panic from any CPU. It is implemented in i86pc/vm/hat_i86.c and sfmmu/vm/hat_sfmmu.c.

    void hat_flush_range(hat, addr, size)

Invalidates a virtual address translation for the local CPU.

5.8 panic_idle(void)

When the system panics, all CPUs except the panic CPU spin-wait in panic_idle(). This spin loop has been changed to call into dumpsys in order to help with dump parallelism. There are two new globals, a flag and an entry point:

    extern volatile int dumpsys_helpers_wanted;
    extern void dumpsys_helper(void);

This is implemented in sun4u/os/mach_cpu_states.c and sun4v/os/mach_cpu_states.c:

    for (;;) {
            if (dumpsys_helpers_wanted)
                    dumpsys_helper();
    }

The implementation is slightly different in i86pc/os/machdep.c:

    for (;;) {
            if (dumpsys_helpers_wanted)
                    dumpsys_helper();
    #ifndef __xpv
            else
                    i86_halt();
    #endif
    }

The flag dumpsys_helpers_wanted is set true initially and is set false before dumpsys() exits. Therefore, i86_halt() will not be entered until dumpsys() completes.

5.9 dumpsys()

This function is called during panic to produce a crash dump. It is largely unchanged, except for the section that saves data pages, which is very different.
Dumping data pages previously had this basic flow:

    create a bitmap of pages to dump
    for each bit set in the bitmap
        map the page
        copy it, trapping Uncorrectable Errors
        unmap the page
        compress the copy with lzjb
        write the 32-bit size (csize)
        write csize bytes

This has been restructured to support parallelism. The master thread does the following:

    create a bitmap of pages to dump
    initialize data structures: helpers and queues
    set dumpsys_helpers_ready to enable helper CPUs
    if "live" dump
        create system tasks to run the helpers
    while not done
        map some pages and pass to a helper
        get blocks of compressed pages from a helper
        write compressed pages
        unmap pages
    call platform hook dump_plat_data, which always uses lzjb
    write end marker for all streams
    write dump data header
    write initial and terminal dump headers

The helper threads do the following:

    while not done
        get input pages to be compressed from master
        get an output buffer from master
        copy and compress pages, trapping Uncorrectable Errors
        send compressed output buffer to master

5.9.1 new data structures

helper_t helpers: contains the context for a stream. Each has a unique tag, and includes bzip context for compressing streams.

cbuf_t buffer controllers: used for both input and output. The buffer state indicates how it is being used:

    CBUF_FREEMAP: 4MB virtual address range is available for mapping input pages.
    CBUF_INREADY: 4MB of input pages are mapped and ready for compression by a helper.
    CBUF_USEDMAP: 4MB mapping has been consumed by a helper. Needs unmap.
    CBUF_FREEBUF: 128KB output buffer, which is available.
    CBUF_WRITE:   128KB block of compressed pages from a helper, ready to write out.
    CBUF_ERRMSG:  128KB block of error messages from a helper (reports UE errors).

cqueue_t queues: a uni-directional channel for communication from the master to helper tasks, or vice versa, using put and get primitives. Both mappings and data buffers are passed via queues. Producers close a queue when done.
The number of active producers is reference counted so the consumer can detect end of data. Concurrent access is mediated by atomic operations for a panic dump, or by mutex/cv for a live dump. There are four queues, used as follows:

    Queue     Dataflow          NewState
    -------------------------------------------------------------------
    mainq     master -> master  FREEMAP   master has initialized or
                                          unmapped an input buffer
    helperq   master -> helper  INREADY   master has mapped input for
                                          use by helper
    mainq     master <- helper  USEDMAP   helper is done with input
    freebufq  master -> helper  FREEBUF   master has initialized or
                                          written an output buffer
    mainq     master <- helper  WRITE     block of compressed pages
                                          from a helper
    mainq     master <- helper  ERRMSG    error messages from a helper
                                          (memory error case)
    writerq   master <- master  WRITE     non-blocking queue of blocks
                                          to write
    -------------------------------------------------------------------

5.9.2 new tasks

Helper tasks block in three places: first to get input, second to get an output buffer, and third (rarely) to get an error message buffer. Deadlock is avoided because the number of buffers is a multiple of the number of helpers: there are at least 1X buffers for input and 2X buffers for output.

Helper CPUs do not have access to the console log. Therefore, any error messages are accumulated in a cbuf_t buffer and queued for writing by the main task as type CBUF_ERRMSG. Error messages can occur while copying pages. The page copy function uses on_trap() to capture any uncorrectable memory errors (UE). If a UE is caught, the bad bytes are replaced by the pattern (0xbadecc), and an error message is generated in a buffer.

The main task maps 4MB at a time and adds the mappings to the helperq.
However, not all pages within the 4MB page are expected to be dumped. Therefore, each helper task accesses the bit mask of pages, and only copies and compresses pages set in the bitmap.

Main task:

    bitnum = 0
    loop
        report progress (x% done message on console)
        while mainq is empty and writerq not empty
            get cbuf_t from writerq
            write to dump file
            put cbuf_t on freebufq
        get cbuf_t from mainq (block here)
        switch on buffer state:
        case CBUF_FREEMAP:
            if bitnum >= bitmap size
                break (drops maps from the queue)
            advance bitnum to next bit
            if bitnum >= bitmap size
                close helperq
                break (drops the map)
            derive pfn from bitnum
            map the containing 4MB page
            advance bitnum by 4MB in pages
            open helperq
            if dump level 0
                for each bit set within 4MB page
                    copy page (trap UE)
                    compress with lzjb
                    write 32-bit size
                    write compressed bytes
                put mapping on mainq
                break
            put mapping on helperq
            if bitnum >= bitmap size
                close helperq (doing the last page)
            break
        case CBUF_USEDMAP:
            unmap the 4MB page
            put map VA on mainq as CBUF_FREEMAP
            close mainq
            count number of pages done
            break
        case CBUF_WRITE:
            move to writerq
            break
        case CBUF_ERRMSG:
            write error message on console
            put cbuf_t on freebufq
            break
    end loop

Helper task for parallel lzjb:

    outbuf = NULL
    open mainq
    loop
        get mapping from helperq (block here)
        if NULL
            break (helperq is closed)
        for each bit set within 4MB page
            copy page (trap UE)
            compress with lzjb
            if not enough room left in outbuf
                put outbuf on mainq as CBUF_WRITE
                outbuf = NULL
            if outbuf is NULL
                get outbuf from freebufq (block here)
                set block tag in outbuf
            append 32-bit size to outbuf
            append compressed bytes to outbuf
        end for each
    end loop
    put any partial outbuf on mainq as CBUF_WRITE
    close mainq

Helper task for parallel bzip2:

    stream pagenum = -1
    open mainq
    loop
        get mapping from helperq (block here)
        if NULL
            break (helperq is closed)
        if pfn in mapping != stream pagenum
            create stream header
            compress stream header with bzip2
        for each bit set within 4MB page
            copy page (trap UE)
            compress with bzip2
    end loop
    finish the bzip2 stream
    close mainq

6.0 Dump file format

A crash image is formed on the system dump device in several sections. The dump header is located at a fixed offset from the end of the dump device (end - DUMP_OFFSET=65536). The dump header has offsets and sizes for each of the sections (except for the total size of the compressed data pages, unfortunately). The new data header was introduced in order to pass information about the compression method and number of streams; it also includes the size of the compressed data pages, which is not included in the dump header. The dump header contains a count of the number of data pages saved; this is how savecore knows when all data pages have been uncompressed.

Savecore reads the dump header (F) and the new data header (G) in order to find all of the sections (see A-G below). When saving a compressed copy, savecore reads each section and writes it to the file vmdump.N. It concatenates each section, updating the offsets in the dump header and core header as it goes. Thus, vmdump.N is a copy of the dump image with empty spaces removed; it otherwise has the same layout as the dump device. This file is usually much smaller than a 'dd' image of the dump device.

A crash image contains two copies of the dump header: (F) and (A). The reason for this duplication is that often the swap area is also the dump device. The old behavior was that savecore would uncompress the dump at start-up. This can take a long time, so there is a risk that the image can be overwritten due to paging and swapping before savecore finishes. Savecore compares the two headers when it is done to see whether an overwrite has occurred. In the new method, savecore simply copies the dump image into a file without uncompressing it. This takes much less time, which lowers the risk of the image being corrupted before it is saved. Savecore still compares the headers afterward, checking for corruption.
The dump file sections are:

    A: core header (dumphdr_t)
    B: compressed symbol table
    C: pfn table
    D: virtual to physical dump map
    E: compressed data pages
    F: dump header (dumphdr_t)
    G: data header (dumpdatahdr_t)

A: core header

The core header is located at the front of the dump image (lowest offset). It is located at offset 0 in the uncompressed file (vmcore.N). A dumphdr_t records offsets and sizes for each of the sections, as well as information about the kernel.

The core header A is initially a copy of the dump header F when dumpsys() creates the dump file on the dump device. The core header and dump header are compared when the dump file is saved. This is done for the case where the dump file is saved in the swap area: should swapping or paging occur while savecore is uncompressing the dump, it is possible that the core header could be overwritten.

The new default is to save the dump file in compressed format. In this case, the core header is updated with offsets into the compressed file (vmdump.N) as the sections are copied from the dump device to a file. This updated core header is written to the beginning and end of the compressed dump file, so that the compressed dump file format matches the dump device format. When uncompressed (unix.N, vmcore.N), the core header has offsets to the uncompressed sections. mdb(1) and file(1) access the core header at offset 0 in vmcore.N.

B: compressed symbol table

The kernel symbol table (namelist) is uncompressed with lzjb and written to unix.N.

C: pfn table

This is an array of physical page numbers (pfn), one pfn_t for each page dumped. The array corresponds with the bitmap that dumpsys() constructs.

D: virtual to physical dump map

This is an array of virtual-to-physical translations. savecore looks up the pfn in the pfn table and creates a hash map of translations.

E: compressed data pages

The older format, now called single-threaded lzjb, is simple:

    32-bit size0 (PFN 0)
    size0 bytes
    32-bit size1 (PFN 1)
    size1 bytes
    ...
    32-bit sizeN (PFN N-1)
    sizeN bytes

PFN is actually an index into the pfn table C. The size is in the range 1..pagesize, where pagesize is 4K on x86 and 8K on SPARC. The lzjb compression method is used on one page at a time, and the compressed size is guaranteed to be no larger than the input size. The number of pages N comes from the dump header F. Each group of bytes is decompressed with lzjb; the decompressed result must total pagesize bytes (otherwise it is an error).

This has been extended in order to support parallel lzjb or parallel bzip2. Either compression format can be used (but not a mixture), as indicated in the dump data header G. CPUs run in parallel at dump time; each CPU creates a single stream of compression data. Stream data is divided into 128KB blocks. The blocks are written in order within a stream, but blocks from multiple streams can be interleaved. Each stream is identified by a unique tag.

In the new format, the 32-bit size can have a value greater than pagesize. If so, it is interpreted as 2 fields:

    typedef struct dumpcsize {
            uint32_t tag:12, size:20;
    } dumpcsize_t;

The new format is an array of stream blocks. A stream block always has a non-zero value for the tag field; a tag has a value in the range 1..4095. Therefore, a stream block is always distinguishable from a single lzjb page, because the 32-bit value is always greater than pagesize.

    tag=Tx  size=N            one block of N bytes for stream Tx
    tag=Ty  size=N            one block of N bytes for stream Ty
    ...
    tag=Tx  size=N            additional block of N bytes for stream Tx
    ...
    tag=0   size=1..pagesize  optional platform data: lzjb data for one page
    ...
    tag=0   size=0            end of data marker; nothing follows

The purpose is to encode multiple streams in parallel, such that each stream can be reassembled and decompressed. The logical form of a stream is a series of objects. There are 2 types of objects: a stream header, and individual pages.
    header for range of pages:
    compressed data for pages in the range
    ...

The above is repeated within each stream. Each range of pages is disjoint over all streams. The above may fit within a single "block of N bytes" as shown previously, or it may be split across several blocks. The representation within each block is different for each compression method.

Parallel lzjb blocks hold a series of objects:

    dumpstreamhdr_t:
    32-bit size = 1..pagesize
    lzjb data for PFN+0
    32-bit size = 1..pagesize
    lzjb data for PFN+1
    ...
    32-bit size = 1..pagesize
    lzjb data for PFN+N-1

Objects are not split over block boundaries.

Parallel bzip2 blocks have a very different format:

    bzip2 data

The stream header and the data pages in the range are compressed together. As the bzip2 data is decompressed, the stream header is extracted first, followed by the pages. Once a bzip2 stream is decompressed, it produces these objects:

    dumpstreamhdr_t:
    pagesize bytes: PFN+0
    pagesize bytes: PFN+1
    ...
    pagesize bytes: PFN+N-1

Block boundaries can occur anywhere.

F: dump header (dumphdr_t)

The dump header is always located 64K (DUMP_OFFSET) from the end of the dump device, or from the end of the compressed copy (vmdump.N). It has an offset to the beginning core header.

G: data header (dumpdatahdr_t)

The data header is located after the dump header. It has compression information that savecore needs in order to decompress the file. (See 5.2 above.)

7.0 Memory space requirements

Memory usage varies with each compression method. However, each method performs the same basic steps, which are:

    A: Map a page to be dumped, according to the dump bitmap.
    B: Copy the mapped page to another buffer, removing UEs. Report UEs on
       the console.
    C: Compress the copy into an output buffer.
    D: Unmap the original page.
    E: Write the compressed page to the dump file.

These virtual and physical memory resources are allocated when dumpadm(1M) runs.
dumpadm runs automatically at start-up; it can also be run manually by the system administrator. Pre-allocation is done for two reasons. First, when the system is crashing it is probably unreliable, so pre-reserving resources for use later at dump time is more robust. Second, even if the system were reliable, it is desirable to preserve the kernel state as much as possible; allocating resources at dump time would affect the kernel state, perhaps obscuring the original fault.

The new parallel dump methods require much more VA range and memory buffering, because each CPU runs in parallel and must have its own resources. In this case, there is one master and multiple slaves. The master CPU is the one chosen to process the crash, and the slaves (helpers) are chosen from among the CPUs spinning in panic_idle. The above five steps are modified for the parallel case as follows:

    A: Master: map 4MB ranges, each of which contains one or more pages to be
       dumped, according to the dump bitmap.
    B: Helpers: copy the individual pages from a 4MB range that must be dumped,
       according to the dump bitmap. Filter any UEs and record diagnostics in
       an error message buffer.
    C: Helpers: compress the pages and concatenate them into an output buffer.
    D: Master: unmap the 4MB ranges. Helpers: invalidate the 4MB ranges.
    E: Master: write output to the dump file. Report any UEs on the console
       log from error message buffers.

The strategy is to pre-allocate a minimum amount of VA range and memory initially. Very small systems will still do the old-style dump. Most systems will allocate enough memory to do a parallel lzjb dump; this ensures a successful dump no matter the conditions when a crash occurs. However, there are usually many non-kernel pages that will not be dumped, and these pages are found at crash time and harvested to create additional buffers. Should a sufficient amount of buffering be available for bzip2, dumpsys will reconfigure itself to do the optimum dump.
7.1 single-threaded lzjb

This is the old method. It has minimal requirements; one CPU processes the dump file at crash time. The memory required is:

    A: 1 small page for mapping.
    B: 1 pagesize copy buffer for UE detection.
    C: 1 pagesize buffer for compressed output.

7.2 parallel lzjb

This is the old compression method, but with N helpers running in parallel. N=4 is often optimum for this method.

    A: N*2 4MB VA ranges for mapping, that is, 32MB of VA range. Doubling the
       map buffers allows the master to keep ahead of the helpers.
    B: N pagesize copy buffers for UE detection.
    C: N*4 128KB buffers used for output or for error messages, that is, 2MB
       of buffering for output. UEs are rare, therefore error messages are
       rare; most of this buffering is used for output.

7.3 parallel bzip2

This is the new compression method. It requires a large amount of state and is much slower, but gives much better compression. There are N helpers. The optimal value for N varies with processor speed but is large; N=100 is used for the calculations below.

    A: N*2 4MB VA ranges for mapping, that is, 200 * 4MB VA ranges. Doubling
       the map buffers allows the master to keep ahead of the helpers.
    B: N pagesize copy buffers for UE detection.
    C: N*4 128KB buffers for output or for error messages.

In addition, N*800KB is needed for bzip2 state. Thus the total for buffers and state for N=100 is about 128 MB.

8.0 Compression methods

There are two parallel compression methods in this implementation. The method is chosen at dump time based upon available memory, CPU architecture, and the number of available CPUs. See [5] for a comparison of compression ratios.

    Architecture  NCPU   Algorithm
    sun4u         < 24   parallel lzjb
    sun4u         >= 24  parallel bzip2(*)
    sun4v         < 64   parallel lzjb
    sun4v         >= 64  parallel bzip2(*)
    x86           < 16   parallel lzjb
    x86           >= 16  parallel bzip2(*)
    32-bit        N/A    single-threaded lzjb

    (*) bzip2 is only chosen if there is sufficient available memory for
        buffers at dump time.