Table of Contents

1 General description
    1.1 Background Diagram
    1.2 Traditional 512 byte sector size disk
    1.3 Proposal on large sector size disk
    1.4 Proposal on emulation mode
2 Technical description
    2.1 The designs are based on the following assumptions
    2.2 Change summary
3 Proposed solution in detail
    3.1 Disk label
        3.1.1 EFI label
            3.1.1.1 Label format
            3.1.1.2 Implementation
        3.1.2 VTOC label
            3.1.2.1 Label format
            3.1.2.2 MBR
    3.2 Utilities
        3.2.1 Format
            3.2.1.1 Description
            3.2.1.2 Data and Structure Definition
            3.2.1.3 Solution
        3.2.2 Fdisk
            3.2.2.1 Description
            3.2.2.2 Implementation
    3.3 File System
        3.3.1 ZFS
        3.3.2 UFS
    3.4 The sd driver
        3.4.1 Proposal
        3.4.2 Implementation
    3.5 Xen
    3.6 RMW (Read Modify Write)
        3.6.1 Description
        3.6.2 Design Specification
            3.6.2.1 Basic operation
            3.6.2.2 Configuration in sd.conf
            3.6.2.3 Console message
        3.6.3 Removable media

1. General description

This project enables Solaris to support disks whose sector size is a
power of 2 from 512 bytes up to 4096 bytes, i.e. 512 bytes, 1 KB, 2 KB
or 4 KB. Disks with a 4 KB sector size will be released soon and Solaris
needs to support them.

The project team would like to deliver Solaris large disk sector support
in two phases:

Phase I
    > Correctly label large sector size disks.
    > Can do I/O (raw & block).
    > ZFS support as non root disk.

Phase II
    > Support UFS.
    > Can install and boot on large sector size disk.

1.1 Background Diagram

+--------------+         +--------+           +--------+   +-------+
| Application/ |         |   sd   |    T10    |  disk  |   |       |
| filesystem   |---buf---| driver |---SCSI----| device | ~~| media |
+--------------+         +--------+    CDB    +--------+   +-------+

Application/filesystem
    This includes user applications and filesystems that generate I/O
    requests, for example dd(1M), specfs, UFS(7FS), ZFS(1M), etc.

buf(9S)
    This is the block I/O data transfer structure. All I/O requests
    generated by applications and filesystems are carried by a buf
    structure and sent to the disk driver.
    For the raw interface of block disk devices, the I/O requests are
    eventually translated into buf structures through physio(9F). The
    buf(9S) structure has a field that identifies the logical block on
    the device and a field that specifies the number of bytes to be
    transferred.

sd(7D) driver
    This is the SCSI disk device driver. sdstrategy(9E) is the entry
    point for block I/O requests, and sd(a)read and sd(a)write are the
    entry points for the raw I/O interfaces. sdstrategy(9E) takes a
    buf(9S) as input and is responsible for building the SCSI CDB
    according to the buf(9S) structure and transporting the packet to
    the disk drive. For the raw I/O interfaces, sd(7D) calls physio(9F),
    which in turn calls sdstrategy(9E).

T10 SCSI CDB
    This is the SCSI Command Descriptor Block. On the I/O path, we care
    about the READ and WRITE commands, including READ(6), WRITE(6),
    READ(10), WRITE(10), etc. A READ or WRITE CDB contains an LBA and a
    TRANSFER LENGTH. The LBA field specifies the logical block address
    accessed by the command. The TRANSFER LENGTH field specifies the
    number of contiguous logical blocks of data to be transferred,
    starting with the LBA. sd(7D) converts the b_blkno/b_lblkno block
    number on the device and the b_bcount number of bytes to transfer in
    the buf(9S) structure to the LBA and TRANSFER LENGTH in the CDB.

disk device and media
    The I/O request is eventually carried by the CDB and transferred to
    the disk device. The data is written to or read from the media. The
    media has its physical block size, and the disk device can export a
    different logical block size to the initiator. There are 3 kinds of
    disk device and media nowadays:
    a) Disk with 512 byte sector size.
       pbs (physical block size) = 512B; lbs (logical block size) = 512B.
    b) Disk with large sector size (1KB, 2KB, 4KB).
       pbs = 1KB, 2KB, 4KB; lbs = 1KB, 2KB, 4KB.
    c) Disk with large sector size (1KB, 2KB, 4KB) running in emulation
       mode, which exports a 512 byte logical block size to the
       initiator.
       pbs = 1KB, 2KB, 4KB; lbs = 512B.

1.2. Traditional 512 byte sector size disk

+--------------+         +--------+           +--------+   +-------+
| Application/ |         |   sd   |    T10    |  disk  |   | media |
| filesystem   |---buf---| driver |---SCSI----| device | ~~| (512B)|
+--------------+ (512B)  +--------+    CDB    +--------+   +-------+
                                     (512B)

Currently, except for some removable disks and CD-ROM devices, Solaris
can only support disks with a 512 byte sector size. This size is hard
coded as the global macro DEV_BSIZE.

The I/O flow on a 512 byte sector size disk:

Applications and filesystems generate I/O requests and send them to
sd(7D) through the buf(9S) structure. The b_blkno/b_lblkno in buf(9S)
represents the block number on the device. The block size is 512 bytes.
sd(7D) converts the block number to the absolute block address on disk,
which also uses 512 byte blocks, and builds the CDB based on the block
number.

In this situation, the layout of the disk labels (fdisk table, VTOC
label and EFI label) is based on 512 bytes per block, and applications
and filesystems can only send I/O requests that are aligned to 512
bytes. Otherwise the I/O will fail.

1.3. Proposal on large sector size disk

+--------------+         +--------+           +--------+   +-------+
| Application/ |         |   sd   |    T10    |  disk  |   | media |
| filesystem   |---buf---| driver |---SCSI----| device | ~~| (4KB) |
+--------------+ (512B)  +--------+    CDB    +--------+   +-------+
                                      (4KB)

IDEMA and its members have advocated a new industry standard for
large-block sectors. Disks with a large sector size report the larger
LOGICAL BLOCK LENGTH IN BYTES to the initiator through the READ CAPACITY
command instead of the traditional 512 bytes.

Our proposal to let Solaris support large sector size disks:

On a large sector size disk, the layout of the disk label is changed to
accommodate the large sector size. Refer to the disk label section for
details.

In the buf(9S) structure, b_blkno/b_lblkno still represents the block
number on the device and the block size is still 512 bytes.
However, this block size is no longer equal to the sector size on disk;
it is a logical DEV_BSIZE (512 byte) concept presented to applications
and filesystems. sd(7D) is responsible for converting the logical 512
byte block address to the physical large sector address. Refer to the sd
section for details.

In this situation, applications and filesystems should know the physical
sector size, because any I/O that is not aligned with the physical
sector size will be rejected by sd. They can obtain it by issuing
DKIOCGMEDIAINFO dkio(7I).

For backward compatibility, sd(7D) will do RMW (Read Modify Write) to
support non-aligned I/O, but the performance will be very low. Refer to
the RMW section for details.

1.4. Proposal on emulation mode.

+--------------+         +--------+           +--------+   +-------+
| Application/ |         |   sd   |    T10    |  disk  |   | media |
| filesystem   |---buf---| driver |---SCSI----| device | ~~| (4KB) |
+--------------+ (512B)  +--------+    CDB    +--------+   +-------+
                                     (512B)

Vendors have different implementations for large physical sector sizes.
For example, a disk with a native sector size of 4096 bytes may present
the traditional logical sector size of 512 bytes to its attached OS.

Basically, nothing needs to change before Solaris can run on a large
sector size disk in emulation mode. However, if applications or
filesystems keep sending I/O requests that are not aligned with the
physical sector size, the disk firmware has to do a lot of RMW, which
heavily impacts performance.

The newest SBC-3 spec (sbc3r17) extended the READ CAPACITY (16) command
to report to the initiator not only LOGICAL BLOCK LENGTH IN BYTES but
also LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT, the exponent n such
that one physical block contains 2^n logical blocks.

So the proposal here is to extend the DKIOCGMEDIAINFO dkio(7I) to report
the physical block size.
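
As a small illustration of the relationship between the two READ
CAPACITY (16) fields, the physical block size can be derived as the
logical block length shifted left by the exponent. The sketch below is
not taken from sd; the function name physical_block_size and its
parameters are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of sd(7D).
 *   lbsize   - LOGICAL BLOCK LENGTH IN BYTES
 *   exponent - LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT (n in 2^n)
 * Returns the physical block size in bytes.
 */
uint32_t
physical_block_size(uint32_t lbsize, uint8_t exponent)
{
    /* Assumption of this sketch: keep the shift well inside 32 bits. */
    assert(exponent < 16);

    /* One physical block holds 2^exponent logical blocks. */
    return (lbsize << exponent);
}
```

For example, a disk in emulation mode that reports a 512 byte logical
block length and an exponent of 3 would yield a 4096 byte physical
block size, and an exponent of 0 means the two sizes are equal.
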
Applications and filesystems can determine the physical block size by
issuing DKIOCGMEDIAINFO and can then send I/O requests that are aligned
with the physical block size to improve I/O performance.

Change to the argument to DKIOCGMEDIAINFO

    /*
     * Used for media info or profile info
     */
    struct dk_minfo {
        uint_t      dki_media_type; /* Media type or profile info */
        uint_t      dki_lbsize;     /* Logical blocksize of media */
        diskaddr_t  dki_capacity;   /* Capacity as # of dki_lbsize blks */
  |     uint_t      dki_pbsize;     /* Physical blocksize of media */
    };

For disks that don't support reporting LOGICAL BLOCKS PER PHYSICAL BLOCK
EXPONENT, customers can edit [s]sd-config-list in sd.conf to configure
the logical blocks per physical block exponent so that sd reports the
configured value through DKIOCGMEDIAINFO. This configuration is per
device type. Refer to PSARC 2008/465 for [s]sd-config-list.

    physical-block-size    Committed    UINT32

2. Technical description

2.1 The designs are based on the following assumptions

* To operate correctly on large sector size disks, applications may need
  to query the physical sector size by issuing DKIOCGMEDIAINFO dkio(7I).

* Applications and kernel modules that are sensitive to the disk's
  physical sector size should be changed to account for large sector
  size disk drives.

* buf(9S) is the I/O interface that sd(7D) exports. Due to the extensive
  usage of buf(9S) in many Solaris modules, the buf(9S) structure is
  kept intact. sd(7D) exports a logical block size of 512 bytes in
  buf(9S) and is responsible for translating the logical start block
  address and size in the I/O request carried by buf(9S) to the physical
  disk's block address and size.

2.2 Change summary

* Disk label changes
    > EFI: PMBR takes 1 sector and the GPT Header takes 1 sector. The 9
      partitions are put into one sector; there are still 128 partition
      entries, of which the others are reserved.
    > VTOC: The label is expanded from 512 bytes to the large sector
      size. On x86 the MBR is expanded from 512 bytes to the large
      sector size.
* I/O path
    Convert blkno in buf(9S) from 512 byte units to physical sector
    size units. This conversion happens in xbuf in the sd(7D) driver.

        xp->xb_blkno = xp->xp_blkno * 512 / sector_size

    If sd is configured not to support RMW:
    > raw: Add an alignment check in sdread, sdwrite, sdaread and
      sdawrite and return an error if the I/O request is not sector
      size aligned.
    > block: Add an alignment check in sdstrategy.
    Otherwise, do RMW (Read Modify Write).

* ZFS
    ZFS can automatically run on larger sector size disks after the
    label change and I/O path change.

* UFS
    UFS is tightly tied to disk geometry, so even after the label and
    I/O path are changed, there is still a lot of work to do to support
    UFS on large sector size disks. Although a prototype has shown that
    a UFS can be mounted on a 4K disk, the task itself is huge and
    risky. So the proposal is to defer UFS support on large sector size
    disks to phase II. A check will be made the first time the user
    tries to build a UFS. If the physical disk sector size is not 512
    bytes, the build fails.

* SD
    > dkio(7I) DKIOCGMEDIAINFO returns the physical sector size in the
      dk_minfo structure. If the disk drive is in emulation mode and
      supports the newest SBC spec (sbc3r17), DKIOCGMEDIAINFO will
      return the logical sector size of the disk device and the
      physical sector size of the media. Please refer to the diagram
      doc for details.
    > Add an alignment check for incoming I/O requests against the
      physical sector size.
    > Convert the I/O requests' block address and size to the physical
      block address and block count.
    > RMW (Read Modify Write) is supported for backward compatibility,
      but the performance will be very low for misaligned I/O. Users
      can configure sd(7D) to not support RMW, or to support RMW
      without a warning message. Refer to the RMW section for details.

* CMLB
    > Changed to recognize, read and write the new label on large
      sector size disks.
* Disk utilities (format, fdisk, fmthard, prtvtoc, diskscan)
    > Fully support the new disk label and MBR on large sector size
      disks, including label read, write, modify and convert.
    > Change format(1M) to use the real disk sector size when doing I/O
      instead of 512 bytes.

* dd(1M)
    The dd(1M) command can send either raw I/O or block I/O. Raw I/O
    requires that the specified block size is aligned with the physical
    disk sector size. Otherwise, the operation fails. Block I/O has no
    such restriction. On a large sector size disk dd(1M) keeps the same
    behavior, and the default block size of 512 bytes is kept intact.
    The only change is that users need to specify a block size that is
    aligned with the disk's sector size to do raw I/O correctly.

* Xen
    Phase I, a large sector size disk can be attached to DomU as a data
    disk.
    Phase II, DomU can be installed on and boot from a large sector
    size disk.

* LDom
    Phase I, a large sector size disk can be exported to a guest domain
    as a data disk.
    Phase II, a guest domain can be installed on and boot from a large
    sector size disk.

* Others
    Find out which other modules send I/O requests to sd(7D). Modules
    that are not sensitive to the disk's sector size should work on
    large sector size disks after the change to the label and I/O path.
    Modules that are sensitive to the disk's sector size should either
    be changed to work on large sector size disks, or be prevented from
    sending I/O to large sector size disks in phase I and be supported
    in phase II. The modules that have been investigated so far are
    listed below.

        sharefs  (no need to change)
        mntfs    (no need to change)
        hsfs     (no need to change)
        nfs      (no need to change)
        specfs   (no need to change)
        devfs    (no need to change)
        tunefs   (no need to change)
        ctfs     (no need to change)
        lofi     (no need to change)
        dump     (support at phase I)
        lvm      (support at phase II)
        cluster  (support at phase II)
        pcfs     (support at phase II)

3. Proposed solution in detail

3.1 Disk label

3.1.1 EFI label

3.1.1.1 Label format

The EFI label format will be compatible with the UEFI (Unified
Extensible Firmware Interface) Specification and is the same on x86 and
sparc.

The first sector (LBA 0) is still the Legacy Master Boot Record (MBR).
The signature 0xaa55 is at byte offset 510 (byte 510 contains 0x55 and
byte 511 contains 0xaa). Byte offsets 512 to (sector size - 1) are
reserved.

LBA 1 is still the GPT Partition Table Header. Byte offsets 92 to
(sector size - 1), i.e. 4095 on a 4096 byte sector disk, are reserved.

Starting from LBA 2 (or from the LBA that is indicated by
PartitionEntryLBA in the GPT Header) is the GUID Partition Entry Array.
Each entry takes 128 bytes. Solaris creates 9 partition entries and
reserves another 119 entries, 128 entries in total. For a 512 byte
sector size, those 128 entries are stored in 32 sectors. For a large
sector size, those 128 entries are stored in
(128 entries) / ((sector size) / (128 bytes)) sectors.

Partition 8 is the reserved partition, which takes 8M bytes. For a 512
byte sector size, the reserved partition is stored in 16384 sectors,
while on a large sector size system it is stored in
(8M bytes) / (sector size) sectors right before the backup Partition
Table.

The layout of the backup Partition Table is similar to that of the
primary table.

The following tables show the layout of the primary EFI label on a 512
byte sector size system and on a large sector size system (4096 bytes
for example).

  Byte                                    Sector
     0  ____________________________
       |                            |
       |            PMBR            |       0
   512 |____________________________|
       |                            |
       |         GPT Header         |       1
  1024 |____________________________|
       |                            |
       |   Partition Entry 0 - 3    |       2
  1536 |____________________________|
       |                            |
       |   Partition Entry 4 - 7    |       3
  2048 |____________________________|
       |     Partition Entry 8      |
       |   Partition Entry 9 - 11   |       4
  2560 |____________________________|
       |                            |
       |  Partition Entry 12 - 15   |       5
  3072 |____________________________|
       |                            |
       |   .....................    |
 16896 |____________________________|
       |                            |
       |  Partition Entry 124-127   |      33
 17408 |____________________________|

            512-byte block size disk

  Byte                                    Sector
     0  ____________________________
       |                            |
       |            PMBR            |       0
  4096 |____________________________|
       |                            |
       |         GPT Header         |       1
  8192 |____________________________|
       |   Partition Entry 0 - 8    |
       |   Partition Entry 9 - 31   |       2
 12288 |____________________________|
       |                            |
       |  Partition Entry 32 - 63   |       3
 16384 |____________________________|
       |                            |
       |  Partition Entry 64 - 95   |       4
 20480 |____________________________|
       |                            |
       |  Partition Entry 96 - 127  |       5
 24576 |____________________________|

               4k block size disk

Partition entries 9 to 127 are empty and reserved.

3.1.1.2 Implementation

The macro EFI_LABEL_SIZE remains 512, thus the size of struct efi_gpt
remains 512. The reason is that we want to support variable sector
sizes instead of one fixed sector size. As a consequence, we have to
use the real LBA size read from the EFI label or obtained from
DKIOCGMEDIAINFO, instead of sizeof(efi_gpt_t), when we locate the
physical location of a partition table entry on the disk.

3.1.2 VTOC label

3.1.2.1 Label format

Our proposal has little influence on the vtoc label, on either vtoc8 or
vtoc16. The vtoc label, which is represented by struct dk_label, is 512
bytes. This struct and its size are not changed. On a 512 byte system,
the vtoc label is stored in one sector, while on a large sector system
the vtoc label is stored in the first 512 bytes of one sector. The
other parts of the sector are reserved and are filled with zero.

3.1.2.2 MBR

On the x86 platform, which has an MBR (fdisk table), a large sector
disk doesn't change anything in the first 512 bytes, in which the MBR
signature is still located at byte offset 510 (byte 510 is 55h and byte
511 is AAh). The other parts of the MBR sector are reserved and are
filled with zero.

3.2 Utilities

3.2.1 Format

3.2.1.1 Description

format(1M) is changed to handle disks with a sector size other than 512
bytes.
The disk I/O can be implemented through uscsi(7I); the type of the disk
can be defined; the disk can be labeled as VTOC or EFI and can be
converted between the two label types; and the partition table can be
correctly described and modified with the right capacity. After the
modification, the format utility can identify both the 512B block size
case and the non-512B case.

Most of the format sub functions that are disk block size aware need to
be modified. From the users' view, however, these changes may not be
visible, which means there might be no new user interface provided.
format will handle the difference between disk block sizes.

3.2.1.2 Data and Structure Definition

    struct disk_type {
        char        *dtype_asciilabel;
        int         dtype_flags;
        ...
  |     uint_t      lbasize;
        uint64_t    capacity;
    };

    int block_size;

3.2.1.3 Solution

When format(1M) starts to probe existing disks, the sector size is
queried from the disk by DKIOCGMEDIAINFO and is stored in the global
variable block_size and in disk_type. format(1M) can use the global
variable and disk_type to represent the disk's sector size afterwards,
where block_size can be used to replace the macros DEV_BSIZE, SECSIZE,
etc.

format(1M) will not check the sector size and alert users to change the
sector size to 512 bytes on a non-512 disk. Instead, format will use
block_size to build the geometry and label. Please refer to the disk
label section for what the label looks like on a large sector disk.

Another thing to note is low-level format. Since it does not depend
entirely on format, if the hardware supports block size conversion,
this sub command can provide the functionality of switching the block
size between 512 and non-512.

format(1M) uses uscsi(7I) to perform disk I/O for SCSI disks and
read(2)/write(2) for others. The read/write interfaces are described in
another chapter.
For uscsi, block_size is used to calculate the bufaddr and buflen
instead of the macro SECSIZE.

Partition capacity, say MB, was calculated by the following statement

    ((n / 1024.0) * DEV_BSIZE)

Now it is replaced by

    ((n / 1024.0) * block_size)

where n is the number of blocks.

3.2.2 Fdisk

3.2.2.1 Description

fdisk(1M) is used to create and modify an fdisk partition table as well
as to install the master boot record that is put in the first sector of
the fixed disk, on x86 systems only. fdisk(1M) is changed to handle
disks with a sector size other than 512 bytes. After the modification,
fdisk can create and modify fdisk partitions on both 512B and non-512B
sector size disks.

3.2.2.2 Implementation

fdisk(1M) is changed to use the disk block size queried from the disk
by DKIOCGMEDIAINFO rather than the hard coded 512B, to read and write
the master boot record with the sector size, and to do computation with
the sector size rather than DEV_BSIZE, SECSIZE, etc.

The cmlb driver needs corresponding changes to read and write the
master boot record with the buf length set to the disk block size.

3.3 File System

3.3.1 ZFS

ZFS can automatically run on larger sector size disks after the label
change and I/O path change.

3.3.2 UFS

3.3.2.1 Phase I

UFS is not supported on large sector size disks.

3.3.2.2 Phase II

UFS is fully supported on large sector size disks.

In order to mount ufs on a large sector size disk, the ufs related
utilities like newfs, mkfs, fstyp et al. will be changed to know the
disk's physical sector size. The main idea is to change the currently
hard-coded DEV_BSIZE to the real disk sector size. After this
modification, the position and number of the super block, cylinder
blocks et al. remain the same in block view, but the size of a block
changes to comply with the large sector size. Since the file system
block size is 8K now, if the disk block size switches to a large sector
size, say 4K, the minimal frag size of ufs should be 4K, which is at
least equal to the physical disk sector size.
Accordingly, utilities like fsck and fstyp should also be changed.

3.4 The sd driver

3.4.1 Proposal

sd is the component that knows the target block size, and the only I/O
interface sd provides is buf (raw I/O eventually calls sdstrategy).
Now, the blkno in the buf structure represents the start block address
in units of 512 bytes per block; bcount represents the number of bytes
to transfer. So both raw I/O and block I/O can be treated with the
following method. First, blkno is multiplied by 512 and then divided by
the target block size, so that the start address and the r/w block
count are based on the target block size. This simple approach is safe
and efficient. But be aware that b_bcount and b_blkno should both be
aligned to the target block size. Otherwise EIO is returned. An
alignment check will also be made in the raw I/O interfaces sdread,
sdwrite, sdaread and sdawrite.

3.4.2 Implementation

The implementation currently touches only the sd driver. Two members of
the buf structure need converting here: b_blkno, which represents the
block number on the device, and b_bcount, which stands for the transfer
count.

    typedef struct buf {
        ...
        lldaddr_t   _b_blkno;   /* block # on device (union) */
        size_t      b_bcount;   /* transfer count */
        ...
    } buf_t;

sd_xbuf is defined for transfer purposes. In sd_xbuf_init, b_blkno is
assigned to the xbuf. Then, before transferring, b_blkno is assigned to
the scsi_pkt and b_bcount is converted to a block count according to
the target block size. The converted block number and block count are
filled into LBA and TRANSFER LENGTH when sd builds the SCSI CDB.

    ----------------
    |              |
    |  sdstrategy  |
    |              |
    ----------------
           |
           V
    ----------------
    |              |
    | sd_xbuf_init |   LBA = (b_blkno * 512) / tgt_blocksize
    |              |
    ----------------
           |
           V
    ----------------
    |              |
    | sd_start_cmds|   TRANSFER LENGTH = b_bcount / tgt_blocksize
    |              |
    ----------------

3.5 Xen

Previously, the sector size of Xen was hard-coded to 512 bytes,
considering the fact that the backend of Xen can be either a file or a
real disk.
Now that non 512 byte sector size disks are introduced, the design
changes so that the physical sector size is known when a real disk is
attached.

First, xdf (Xen front end) will make use of the "sector-size" property
provided by xdb (Xen backend), which was previously hard coded to 512B.
If the backend is a file, "sector-size" defaults to 512B, and if it is
a real disk, the physical sector size is used. That is to say, xdf will
have the real disk capacity and sector size information.

Second, since the interface between xdb and sd is bdev_strategy, it
should follow our assumption that the block size of each block I/O is
512B when it goes through sdstrategy. The interface will not be
changed.

Third, xdf and xdb handle the R/W block address conversion to satisfy
the sd driver's large sector size aware design.

The above design is for Solaris DomU and Solaris Dom0. Other pairs,
like Linux DomU with Solaris Dom0 or Windows Dom0 with Solaris DomU,
are under investigation. Since Linux doesn't support large sector size
disks, the cases related to Linux may not be a problem. For Windows,
since it is said to need third party participation to support 4K sector
size disks, we may not consider supporting such cases now.

3.6 RMW (Read Modify Write)

3.6.1 Description

By default, if I/O requests that are aligned with the physical sector
size of the disk are sent to sd(7D), they are transmitted directly. If
they are not aligned, sd(7D) does Read Modify Write (RMW) with a
warning message. However, customers can configure sd to not do RMW, or
to do RMW without reporting a warning message.

3.6.2 Design Specification

There are two key elements that describe each I/O. One is the size of
the I/O, the other is the starting block number. If the I/O size is not
divisible by the physical sector size, or the start block number is not
divisible by the quotient of the physical sector size and the logical
block size, we call it misaligned I/O.
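
The definition above, together with the conversion from section 3.4.2,
can be sketched in C as follows. This is only an illustration under
stated assumptions and not the sd implementation: the function names
and the LOGICAL_BSIZE macro (standing in for DEV_BSIZE) are
hypothetical, and the physical sector size is assumed to be a multiple
of 512 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for DEV_BSIZE: logical block size carried in buf(9S). */
#define LOGICAL_BSIZE 512

/* True if the request is misaligned per the definition in 3.6.2. */
bool
io_is_misaligned(uint64_t blkno, size_t bcount, uint32_t pbsize)
{
    uint32_t blks_per_sector;

    /* Assumption of this sketch: pbsize is a multiple of 512. */
    assert(pbsize >= LOGICAL_BSIZE && pbsize % LOGICAL_BSIZE == 0);
    blks_per_sector = pbsize / LOGICAL_BSIZE;

    return (bcount % pbsize != 0 || blkno % blks_per_sector != 0);
}

/* LBA = (b_blkno * 512) / tgt_blocksize, for an aligned request. */
uint64_t
to_target_lba(uint64_t blkno, uint32_t pbsize)
{
    return (blkno * LOGICAL_BSIZE / pbsize);
}

/* TRANSFER LENGTH = b_bcount / tgt_blocksize, for an aligned request. */
uint64_t
to_transfer_length(size_t bcount, uint32_t pbsize)
{
    return (bcount / pbsize);
}
```

On a 4096 byte sector disk, a request starting at 512 byte block 16
with a length of 8192 bytes is aligned and maps to LBA 2 with a
transfer length of 2, while the same length starting at block 1 is
misaligned.
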

3.6.2.1 Basic operation

If I/O requests that are aligned with the physical sector size of the
disk are sent to sd(7D), they are transmitted directly. If they are
misaligned, they are transmitted through RMW with a warning message on
the console. The warning message looks like

    "I/O request is not aligned with %d disk sector size.
    It is transmitted through Read Modify Write but the
    performance is very low.\n"

3.6.2.2 Configuration in sd.conf

Customers can edit [s]sd-config-list in sd.conf to configure the
behavior of a given device when handling misaligned I/O.

    rmw-type    Committed    UINT32

    #define SD_RMW_TYPE_DEFAULT       0 /* do rmw with warning message */
    #define SD_RMW_TYPE_NO_WARNING    1 /* do rmw without warning message */
    #define SD_RMW_TYPE_RETURN_ERROR  2 /* do not do rmw and return error */

The default value is 0, which means do RMW with a warning message.
EINVAL is chosen as the return value when rmw-type is configured as 2.
This value is consistent with the return value of sd(a)read/sd(a)write
when the I/O is misaligned.

3.6.2.3 Console message

If every RMWed I/O caused by a misaligned request triggered one console
message, there could be so many of them that the console is flooded. An
event driven, timer controlled console message mechanism is proposed.
The first misaligned I/O triggers a console message and starts a timer
(say 10 seconds). Succeeding misaligned I/O does not trigger a console
message until the timer expires. When the console message is logged, a
new timer is started. The console message now contains the number of
RMWed I/Os, say

    "%d I/O requests are not aligned with %d disk sector size.
    They are transmitted through Read Modify Write but the
    performance is very low.\n"

3.6.3 Removable media

Removable media includes CDROM, floppy, flash card reader, USB disk,
etc. PSARC 2000/278 and PSARC 2005/731 have covered support for
non-512-byte block size removable devices.
This project doesn't change the existing support for non-512-byte block
size removable devices, so the rmw-type property above applies only to
fixed disks. For non-512-byte removable media, the default behavior is
to do RMW without a warning message, and there is currently no tunable
for that behavior.
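
To summarize how the rmw-type setting from section 3.6.2.2 and the
removable media rule combine, the decision can be sketched as one small
C function. This is a hedged illustration, not code from sd: the
rmw_action enum, the rmw_decide function and its parameters are
hypothetical names introduced for this example, and only the three
SD_RMW_TYPE_* values are taken from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define SD_RMW_TYPE_DEFAULT       0 /* do rmw with warning message */
#define SD_RMW_TYPE_NO_WARNING    1 /* do rmw without warning message */
#define SD_RMW_TYPE_RETURN_ERROR  2 /* do not do rmw and return error */

/* Hypothetical outcomes, used only by this sketch. */
enum rmw_action {
    RMW_PASS_THROUGH,   /* aligned I/O: transmit directly */
    RMW_WITH_WARNING,   /* do RMW and log a console message */
    RMW_SILENT,         /* do RMW without a console message */
    RMW_REJECT          /* do not do RMW, return an error */
};

enum rmw_action
rmw_decide(bool misaligned, bool removable, unsigned int rmw_type)
{
    /* Only the three values defined in 3.6.2.2 are expected. */
    assert(rmw_type <= SD_RMW_TYPE_RETURN_ERROR);

    if (!misaligned)
        return (RMW_PASS_THROUGH);

    /* rmw-type applies to fixed disks only (section 3.6.3). */
    if (removable)
        return (RMW_SILENT);

    switch (rmw_type) {
    case SD_RMW_TYPE_NO_WARNING:
        return (RMW_SILENT);
    case SD_RMW_TYPE_RETURN_ERROR:
        return (RMW_REJECT);
    case SD_RMW_TYPE_DEFAULT:
    default:
        return (RMW_WITH_WARNING);
    }
}
```

With this shape, a misaligned request on a fixed disk configured with
rmw-type 2 is rejected, while the same request on removable media is
always handled by RMW without a warning.
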