Solaris Fast Reboot
(Version @(#)fastboot.txt 1.28 08/06/17)

1. Introduction

Solaris has always strived to be the most reliable and available operating
system. Many technologies have been invented to achieve this goal; notable
ones include Dynamic Reconfiguration (DR), Fault Management Architecture
(FMA), SMF and ZFS, to name a few. The objective of all these projects is
to keep systems up and running correctly in the face of unexpected hardware
and software failures, with as little down time as possible.

System boot/reboot time is considered system down time. The less time a
system spends in the boot phase, the more useful work it can do. High
availability is extremely important to most, if not all, of our customers.
Shorter reboot time also reduces test turnaround time and thus improves
developers' productivity.

1.1 Background

The Solaris boot and reboot path involves the following basic steps:

On x86 systems:
    (Hardware reset) -> BIOS -> grub -> dboot -> kernel

On SPARC systems:
    (Hardware reset) -> POST -> OBP -> dboot -> kernel

As computer systems become more complex, the time they spend in the
BIOS/POST phase to test and initialize hardware gets longer. In the next 12
months we expect to see x86 systems with 1TB of memory. Memory
initialization alone will take over 1/2 hour. It becomes more and more
desirable to short circuit the reboot path so that the firmware and
bootloaders can be bypassed.

1.2 Enabling Technology

The key responsibilities of the firmware are to test and initialize
hardware components. If an operating system can be made to perform the
same tasks, then the firmware can be bypassed. In Solaris, there are
several such enabling technologies:

- Dynamic device discovery: newly added devices can be discovered by the
  OS without going through BIOS.
- Fault Management Architecture: faulty components can be identified and
  offlined while the OS is running.
- Flexible memory and disk scrubbing: low-priority threads scrub memory
  and disk in the background to reduce the impact of memory and disk
  errors.

The recent unification of the boot architecture for SPARC and x86 provided
a common framework to further enable direct loading of a new Solaris image,
bypassing the firmware.

1.3 Competitive Analysis

Linux has a feature called "kexec" where the running kernel loads a new
kernel and jumps to it directly. To avoid potential litigation, no "kexec"
source code has been studied. The observed limitations of "kexec" are:

- The kernel has to be completely functional.
- It can only support same-mode transitions (32-bit -> 32-bit or
  64-bit -> 64-bit).
- It cannot handle systems where the physical memory address can't fit in
  an unsigned long, which limits the maximum amount of memory it can
  support to 4GB.
- Many device drivers can't restart correctly.

Due to these limitations, the tool has never been truly adopted. The issue
is not so much technical; rather, the engineers who are capable of fixing
the kernel code and drivers work with the distros and are not directly
involved in qualifying platforms for third party system vendors.

On the other hand, Sun-qualified hardware and Sun-supported Solaris are all
developed by a single company, which puts us at a competitive advantage:
all the parties required to make this work are motivated by the same
bottom line. Even in the era of OpenSolaris, 3rd-party driver developers
tend to be heavily influenced by the high engineering standards held by
Sun developers, and are therefore more likely to produce compliant
software.

1.4 Project Objectives

The goal of the Solaris Fast Reboot project is to get to the login prompt
from "rebooting..." within seconds (assuming the boot archive has been
updated). The Solaris implementation will support systems with an
arbitrarily large amount of memory, and provide the flexibility to reboot
to 32-bit or 64-bit kernels.
In phase 1 of the project, we will support fast reboot to any kernel on
x86 systems released by Sun. In the second phase, fast reboot post panic
will be supported on x86. The subsequent phase will extend the support to
SPARC systems and cold boot.

In the case of panic reboot or cold boot, the system will be booted with a
minimum amount of clean memory. The rest of the memory will be tested
first and then dynamically configured in.

1.5 Dependencies

The Fast Reboot project has strong dependencies on drivers being able to
quiesce on the down path to stop DMA and interrupt generation, and to
re-attach without going through BIOS reset and initialization. Drivers
must implement the DDI quiesce(9E) entry point to quiesce hardware (see
6.1 for details).

2. x86 Fast Reboot

On x86 systems, upon startup or reset, the BIOS code performs hardware
testing and initialization, then jumps to "grub", the boot loader. Grub
loads dboot, unix text, data and the boot archive into memory, then calls
dboot. dboot does the necessary initialization, such as building the
initial page tables or loading kernel text and data to a different
location, then jumps to the kernel.

The fast reboot code will act as an in-kernel boot loader that loads the
kernel into memory and switches to it. The new kernel in the context of
this write-up includes the dboot that gets tacked on at build time.

2.1 Software Components

2.1.1 Data Structures

The new kernel will be represented by the fastboot_info_t structure.

    /* One loadable section of a file. */
    typedef struct fastboot_section {
            uint32_t        fb_sec_offset;
            uint32_t        fb_sec_paddr;
            uint32_t        fb_sec_size;
            uint32_t        fb_sec_bss_size;
    } fastboot_section_t;

    /* One file to be loaded. */
    typedef struct _fastboot_file {
            uintptr_t       fb_va;
            x86pte_t        *fb_pte_list_va;
            paddr_t         fb_pte_list_pa;
            uintptr_t       fb_dest_pa;
            size_t          fb_size;
            uintptr_t       fb_next_pa;
            fastboot_section_t fb_sections[MAX_ELF32_LOAD_SECTIONS];
            int             fb_sectcnt;
    } fastboot_file_t;

    /* The new kernel as a whole. */
    typedef struct _fastboot_info {
            uint32_t        fi_magic;       /* set to "FAST", see 2.1.3.4 */
            fastboot_file_t fi_files[FASTBOOT_MAX_FILES_TOTAL];
            int             fi_has_pae;
            uintptr_t       fi_pagetable_va;
            paddr_t         fi_pagetable_pa;
            paddr_t         fi_last_table_pa;
            paddr_t         fi_new_mbi_pa;
            int             fi_valid;       /* valid flag, see 2.1.3.4 */
            uintptr_t       fi_next_table_va;
            paddr_t         fi_next_table_pa;
            uint_t          *fi_shift_amt;
            uint_t          fi_ptes_per_table;
            uint_t          fi_lpagesize;
            int             fi_top_level;
    } fastboot_info_t;

2.1.2 New "-f" and "-e" options to reboot(1M)

An additional flag "-f" is added to the reboot(1M) command. When the "-f"
flag is specified, the fast reboot code will be executed. For example,

    # reboot -f -- '/platform/i86pc/mykernel/amd64/unix -k'

to boot to the new kernel "mykernel", or

    # reboot -f

to fast reboot with the same boot arguments. The boot archive location and
name are derived from the kernel name. In case of any failure in the fast
reboot path, such as out of memory, the normal reset path will be taken.

To facilitate the "-f" flag, a new uadmin function AD_FASTREBOOT is added
to the current function list. It is recognized by commands using these
function numbers. For example,

    # uadmin 2 8

will reset the system using the current boot arguments via the fast reboot
path.

In addition, rebooting to a different UFS boot disk or ZFS root pool will
be supported by fast reboot only:

    # reboot -f -- '/dev/dsk/c0t0s3 /platform/i86pc/mykernel/amd64/unix -k'
    # reboot -f -- 'rootpool/root1 /platform/i86pc/mykernel/amd64/unix -k'

The "-e" option can be used to specify a different boot environment (BE):

    # reboot -f -e second_be

The "-e" flag, which indicates an alternate BE, is only valid when "-f" is
also specified.
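
The option rules above (fast reboot selected by "-f", and "-e" rejected
unless "-f" is present) can be sketched as a small selection routine. This
is only an illustration, not the reboot(1M) source: reboot_opts_t and
reboot_select_fcn() are hypothetical names, the value 8 for AD_FASTREBOOT
is taken from the "uadmin 2 8" example, and the value 1 for the regular
AD_BOOT function is an assumption. On a real system these constants would
come from <sys/uadmin.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the <sys/uadmin.h> function numbers. */
#define	AD_BOOT		1	/* regular reboot (assumed value) */
#define	AD_FASTREBOOT	8	/* fast reboot, as in "uadmin 2 8" */

/* Hypothetical parsed form of the reboot(1M) command line. */
typedef struct reboot_opts {
	int		fast;		/* non-zero if -f was given */
	const char	*be_name;	/* argument of -e, or NULL */
} reboot_opts_t;

/*
 * Return the uadmin function number for the given options, or -1 if
 * the combination is invalid (-e without -f).
 */
int
reboot_select_fcn(const reboot_opts_t *opts)
{
	assert(opts != NULL);

	if (opts->be_name != NULL && !opts->fast)
		return (-1);

	return (opts->fast ? AD_FASTREBOOT : AD_BOOT);
}
```

With this sketch, "reboot -f -e second_be" would select AD_FASTREBOOT,
while "reboot -e second_be" would be rejected before any uadmin call is
made.
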
If a second disk or root pool has been mounted, it can be specified as
follows:

    # reboot -f -- '/mnt/platform/i86pc/kernel/amd64/unix -k'

2.1.3 In-kernel image loader

The in-kernel image loader is invoked by mdpreboot() when fast reboot is
requested.

    if (fcn == AD_FASTREBOOT && fastreboot_capable) {
            load_kernel(mdep);
    }

"fastreboot_capable" is set to 0 on xVM and in non-global zones. It will
also be used to disable fast reboot if drivers are identified as not
supporting the quiesce() DDI function.

The following sub-sections cover the main steps performed by
load_kernel().

2.1.3.1 Process reboot arguments

Process the new kernel name if provided. If the request is to reboot to a
64-bit kernel, this step also validates that we are running on a 64-bit
capable system.

2.1.3.2 Load new kernel and boot_archive

Once the new kernel and boot archive have been processed to obtain their
sizes, kernel memory will be allocated using contig_alloc(). Setting
dma_attr_sgllen to the right size allows the allocation to use
non-contiguous physical memory that fits within the lower and upper memory
bounds.

    dma_attr.dma_attr_sgllen = (fsize / PAGESIZE) +
        (((fsize % PAGESIZE) == 0) ? 0 : 1);

The calculation of dma_attr_sgllen is essential for removing the
restriction that the physical memory be contiguous, a restriction that is
nearly impossible to satisfy on a system that has been up and running for
a while. The allocation must also land in high physical memory so that it
does not overlap with the final target address range for the kernel and
boot archive.

Once memory has been successfully allocated, the files will be read into
memory. The "Copy and Switch" routine, which is described in 2.1.4, is
copied to low memory at address 0x5000. The address is selected following
the convention set by the implementation of do_bios_call().

2.1.3.3 Construct physical page list

The next step is to construct the list of physical pages that the new
kernel and boot archive span.
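
As a rough illustration of the page arithmetic in 2.1.3.2 and of the page
walk just described, the sketch below rounds a file size up to whole pages
and records one physical address per page. It is not the project's code:
SK_PAGESIZE, sk_xlate_fn, sk_fake_xlate() and sk_build_page_list() are
hypothetical names, the 4K page size is an assumption, the buffer is
assumed to start on a page boundary, and the fake translation simply maps
consecutive virtual pages to every other physical page to mimic
non-contiguous memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_PAGESIZE	4096u	/* assumed page size for this example */

/* Pages needed to hold fsize bytes; same form as the sgllen expression. */
size_t
sk_pages_spanned(size_t fsize)
{
	return ((fsize / SK_PAGESIZE) +
	    (((fsize % SK_PAGESIZE) == 0) ? 0 : 1));
}

/* Hypothetical VA-to-PA lookup; a kernel would use its own mechanism. */
typedef uint64_t (*sk_xlate_fn)(uintptr_t va);

/* Fake lookup: virtual page n lives at physical page 2n above 4GB. */
uint64_t
sk_fake_xlate(uintptr_t va)
{
	return (0x100000000ULL +
	    (uint64_t)(va / SK_PAGESIZE) * 2 * SK_PAGESIZE);
}

/*
 * Record the physical address of each page backing [va, va + fsize).
 * va is assumed to be page aligned. Returns the number of entries
 * written, or 0 if the caller's list is too small.
 */
size_t
sk_build_page_list(uintptr_t va, size_t fsize, sk_xlate_fn xlate,
    uint64_t *list, size_t list_len)
{
	size_t npages = sk_pages_spanned(fsize);
	size_t i;

	assert(xlate != NULL && list != NULL);

	if (npages > list_len)
		return (0);

	for (i = 0; i < npages; i++)
		list[i] = xlate(va + i * SK_PAGESIZE);

	return (npages);
}
```

A file of 4097 bytes spans two pages in this sketch, so its list has two
entries even though only one byte of the second page is used.
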
The physical page list will later be used to build virtual-to-physical
page translations if the system supports PAE, or used directly as source
addresses otherwise.

2.1.3.4 Construct multiboot_info_t

Before passing control to the switch routine, a new multiboot_info
structure is constructed to reflect the new location of the boot archive.

Once this step is completed, the valid flag in the data structure is set.
The magic field will also be set to "FAST", which the new kernel can use
if it needs to determine whether this is a boot following a fast reboot.

2.1.4 Mode switcher

In mdboot(), if the AD_FASTREBOOT flag is set and the new kernel is valid,
fast_reboot() will be invoked. fast_reboot() immediately jumps to the
switch code fb_swtch() with a pointer to the new kernel data structure as
its argument.

fb_swtch() resides in low memory at 0x5000 and must be compiled with
absolute addresses so that it will not access text or data beyond the
function. It performs the following tasks:

- Sets up control registers.
- Transitions to 32-bit protected mode.
- If the system supports PAE, enables PAE and turns on paging, then copies
  the new unix and boot archive from high memory to low memory using
  virtual addresses. If the system does not support PAE, the new unix and
  boot archive are copied using physical addresses directly.
- If booting 32-bit, loads the kernel text and data to the final target
  address.
- Jumps to dboot with a pointer to the new multiboot_info data structure
  as its argument.

2.1.5 Re-entering dboot and kernel

dboot is called with the new boot arguments, and eventually switches to
the new kernel.

3. x86 Fast Core Dump and Panic Reboot (Phase II.0)

If the system panics, we will switch to a previously loaded good kernel.
The core dump will be performed by the newly loaded kernel.

To support fast panic reboot, the system must have sufficient memory to
keep a good copy of the kernel loaded at all times.
In addition, it must still be able to call drivers' quiesce(9E) entry
points. As such, drivers must implement quiesce(9E) in such a way that it
is safe to call in panic context.

Since we can't be sure of the reason for the panic, it might be necessary
to boot the new kernel with the bare minimum amount of known clean memory.
The new kernel will test the rest of the memory and dynamically configure
it in.

4. x86 Fast Cold Boot (Phase II.1)

For Cold Boot to be fast, the BIOS must recognize that it is booting a
Fast Boot capable kernel, and initialize only the bare minimum amount of
memory and devices. The new kernel will be booted with this small amount
of clean memory. It will then perform the rest of the memory testing, and
dynamically configure the rest of the memory into the system.

5. SPARC fast boot/reboot (Phase II.2)

Extending the support to SPARC is a natural next step.

6. Device Quiescence

With the current reboot implementation, where pc_reset() is used to reset
the system, devices are typically not shut down. Solaris relies on the
firmware to reset and reinitialize these devices before the next OS
instance runs. In addition, most drivers are written expecting a pristine
hardware state set by the BIOS. They might not work correctly when the
starting hardware state is not the post-BIOS state.

For Fast Reboot to work, devices that don't get shut down and could
continue to generate interrupts or access memory must be quiesced.

6.1 Options explored

During the design and prototyping of this project, the following options
were explored.

6.1.1 Leverage DDI_DETACH

If all the drivers could cleanly detach, calling detach(9E) with
DDI_DETACH on the down path would be a clean solution. There would be
absolutely no additional work for the drivers, and the path would be well
tested and supported.

However, there are numerous ways for driver detach to fail.
For example, as Thirumalai pointed out, with a STREAMS-based DLPI driver
or a GLDv3 driver, the driver detach would fail if there are upstream
clients. I have not found a way to make detach work on the down path.
There are simply too many failure conditions in detach() for DDI_DETACH.

6.1.2 Leverage DDI_SUSPEND

Leveraging DDI_SUSPEND would pose no additional fast reboot requirements
or work for the drivers. Since "Suspend to RAM" is a committed deliverable
for the Solaris organization, the driver support is more likely to be
there.

However, once a device has been suspended, it expects to be resumed, not
attached. This requires the currently running kernel to communicate to
the next kernel what all the suspended devices are, and the next kernel
must use DDI_RESUME for this set of devices. If the next running kernel
does not understand this requirement (for example, it is a
pre-fast-reboot kernel), it will simply use DDI_ATTACH to configure the
devices, and the system will most likely hard hang when the device does
not respond to PCI config cycles.

When I experimented with this option using DDI_ATTACH instead of
DDI_RESUME to configure the devices, all test systems hard hung during
boot. I also experimented with using DDI_RESUME, but the implementation
proved to be very messy because there is no single place for invoking
device attach. Some devices are attached in post_startup() via
ndi_devi_online() if forceattach is set in /etc/system, some are attached
in main() via i_ddi_forceattach_drivers() if the forceattach property is
set, and some are attached by devfs via devi_attach_node().

The deal-breaker for this implementation is that I don't want to have
such dependencies on the next kernel. I am certain that somebody will not
read whatever flag day or release notes we put out there to warn them not
to fast reboot to a certain kernel. When they hard hang the system, it
will still be a service call to Sun.
I would rather fail the fast reboot attempt than risk the next boot
hanging.

6.1.3 Disable memory, IO and bus master bits in PCI_CONF_COMM register

If we could turn off Master Enable and Memory Access Enable for the
devices in the PCI config space, it would be the ideal solution. There
would be no additional requirements or work for drivers. It is simple,
clean and elegant. In my opinion it's the perfect solution.

However, I tried turning off various combinations of the ME, MAE and IO
bits in the PCI_CONF_COMM register, and the systems all hard hang when
booting the next kernel. I will need to do a lot more research to
understand the implications of turning off these bits in the PCI config
space while the devices are active. It is not clear to me what happens if
a device is doing DMA and, in the middle of it, the PCI memory access bit
is turned off.

I gave up on this option mainly due to my lack of knowledge about the PCI
devices and my inability to debug hard hangs in the early boot stage. If
someone can help debug the hangs, I still think this would be the best
solution.

6.1.4 Resurrecting reset(9E)

There already exists a call today named reset_leaves() in the reboot
path, which calls the driver's reset() entry point if it exists. Even
though according to the "Device Driver's Guide" the reset() DDI interface
is obsolete, many drivers, such as uhci and ehci, have to implement this
interface to correctly reset hardware for warm reboot to work.

The advantages of this implementation are:

- The mechanism to invoke the interface already exists.
- We can deterministically identify all the drivers that should implement
  this interface, and therefore fall back to regular reboot if not all
  drivers have implemented reset().
- The implementation is largely the same as suspend() minus the actual
  powering off of the device.
- There are no dependencies on the next kernel to be loaded.

The biggest disadvantage of this implementation is that reset() would be
an additional interface that drivers have to support just for Fast
Reboot, which means that the political, customer and managerial pressure
for such support will be far less than that for "suspend to RAM". Another
concern is that if a 3rd-party driver implements a bogus reset() that
does not quiesce the device correctly, the next running kernel could be
corrupted. But this is arguably a bug and should be fixed.

So far this is the only option I have gotten to work, and I have rebooted
over 7000 times on 10 types of test system (x4100, x4200, x4500, x4600,
v65x, Ultra 20, Intel Clovertown, Dell PowerEdge, AMD anaheim white box,
Andromeda blade).

6.1.5 Introducing quiesce(9E)

This option is largely the same as 6.1.4, but instead of resurrecting
reset(9E), we will add a new interface called quiesce(9E). For non-pseudo
devices, i.e., devices with real hardware associated with them, the
device drivers must implement quiesce() in such a way that no more
interrupts will be generated and no more memory accesses will occur after
quiesce() returns. Drivers for devices that do not have a DMA engine and
do not generate interrupts can set quiesce() to ddi_quiesce_not_needed(),
which returns DDI_SUCCESS, as an indication to the framework that the
devices don't need to be quiesced.

6.1.6 Introducing DDI_QUIESCE and DDI_QUIESCE_CAPABLE to detach(9E)

This implementation has the advantage of not revving the devops data
structure. It would follow the tradition of grouping similar
functionality under the same devops entry, such as DDI_ATTACH and
DDI_RESUME, or DDI_DETACH and DDI_SUSPEND. The driver implementation
would be the same as with quiesce(9E).

After discussions with various DDI maintainers, the Power Management
team, driver developers and Solaris generalists, we have decided that
while there are precedents, the precedents are bad practices because they
complicate the interfaces and debugging.
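
The quiesce() contract described in 6.1.5 can be illustrated against a
simulated device. This is a minimal sketch under stated assumptions, not a
real driver: sk_device_t, its control-register bits, sk_quiesce() and
sk_quiesce_not_needed() are hypothetical, the SK_DDI_* values are local
stand-ins for DDI_SUCCESS and DDI_FAILURE, and a real entry point would
take a dev_info_t pointer and program actual hardware registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for DDI_SUCCESS and DDI_FAILURE. */
#define	SK_DDI_SUCCESS	0
#define	SK_DDI_FAILURE	(-1)

/* Hypothetical control-register bits of a simulated device. */
#define	SK_CTL_INTR_ENABLE	0x1u
#define	SK_CTL_DMA_ENABLE	0x2u

typedef struct sk_device {
	uint32_t	ctl;	/* simulated control register */
} sk_device_t;

/*
 * quiesce-style routine for a device with real hardware behind it.
 * After a successful return the device raises no interrupts and
 * performs no DMA. It only clears register bits, so it does not
 * block and can be called from panic context.
 */
int
sk_quiesce(sk_device_t *dev)
{
	if (dev == NULL)
		return (SK_DDI_FAILURE);

	dev->ctl &= ~(SK_CTL_INTR_ENABLE | SK_CTL_DMA_ENABLE);
	return (SK_DDI_SUCCESS);
}

/*
 * Stand-in for ddi_quiesce_not_needed(): the device has no DMA engine
 * and generates no interrupts, so there is nothing to stop.
 */
int
sk_quiesce_not_needed(sk_device_t *dev)
{
	(void) dev;
	return (SK_DDI_SUCCESS);
}
```

Note that the sketch leaves unrelated control bits untouched; quiescing is
about stopping interrupts and memory access, not about returning the
device to its post-BIOS state.
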

6.2 Plan of record

After discussions with various DDI maintainers, the Power Management
team, driver developers and Solaris generalists, we have decided that
introducing quiesce(9E) (6.1.5) is the cleanest way to go.

Once the PSARC case is approved, the following steps will need to be
taken to implement this interface:

- For drivers in ON, we (mostly I) will manually go through all the
  source code. If a driver has DMA mappings or generates interrupts, I
  will contact the driver owners to add a quiesce() implementation. If we
  don't have the resources to implement it yet, we will set the quiesce()
  entry point to ddi_quiesce_not_supported() to indicate that the driver
  should eventually support quiesce. For drivers that don't need to be
  quiesced, their devo_quiesce entry will be set to
  ddi_quiesce_not_needed().

- In place of reset_leaves(), we will implement a routine called
  quiesce_reset_devices() that does the following:

  * if the driver is down-rev'ed, or quiesce() is NULL, nodev or
    ddi_quiesce_not_supported() for a non-pseudo driver, set
    fastreboot_capable to 0 and continue as a regular reboot;
  * if a driver has quiesce(), call quiesce();
  * if fastreboot_capable is 0 and a driver has no quiesce() but has
    reset(), call reset().

The proposed man page for quiesce(9E) is at
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/quiesce.man.txt

7. Existing Issues

Drivers need to implement quiesce(9E).
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/todo.txt

8. Misc

One pager is at
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/onepager.txt

Man pages are at
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/reboot.man.txt
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/uadmin-2.man.txt
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/quiesce.man.txt
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/dev_ops.man.txt

Project gate is at
    /net/girltalk.sfbay/export/fastboot/fastboot-gate

Preliminary webrev is at
    http://jurassic-x4600.sfbay/~sherrym/fastboot-0520-webrev

All tests that must be performed are listed at
    http://jurassic-x4600.sfbay/~sherrym/projects/fastboot/testing.txt