From blake@Central Tue Jun 25 18:16:52 1991
To: Bill.S.@Eng
Subject: Re: 64-bit file offsets

I'll bet you are tired of looking at it by now, but here it is:

Introduction
------------

5.0 requires changes in order to support file systems larger than two
gigabytes. This size limit arises because the kernel represents file
offsets as signed 32-bit integers. The DDI/DKI provides limited support
for 64-bit file offsets, but further steps are needed before this
feature will be of any use. The proposed changes are in the following
areas:

1) extended kernel support for 64-bit offsets;
2) enabling device drivers to use these offsets;
3) a new llseek system call to provide user-level access to the
   extended offsets;
4) changes to file system utilities to support large file systems.

In addition, there are several open bugs concerning integer overflow
when the size of a file system or individual file is too large. These
should be fixed as part of the effort outlined here.

The primary goal of this proposal is to allow file systems as large as
one terabyte to be created and used with no special restrictions. This
feature is important for Sun's server business, as the current 2GB
limit is becoming increasingly restrictive in large configurations.

A secondary goal is to make the necessary changes in a way that is
compatible with the expected evolution of SunOS towards more general
use of 64-bit interfaces. It would be a mistake, for instance, to
imitate the method used by Medusa in 4.1.1, since that implementation
represents a blind alley that would hinder future development.

It is not a goal of this proposal to make the use of 64-bit file
offsets universal either at user level or within kernel data
structures. Both the level of effort and the risk of such extensive
changes are significantly greater than what is reasonable in the 5.0
time-frame.
Our concern here is just to support large file systems, for which
purpose it suffices to make 64-bit file offsets usable with the raw
disk interface.

Current 64-bit support in 5.0
-----------------------------

With a view towards future support of large devices, the DDI/DKI
defines two 64-bit types, lloff_t and lldaddr_t, to represent file
offsets and disk addresses. The uio_loffset member in struct uio is of
type lloff_t (uio_offset is an alias for the lower 32 bits of this
extended offset), and lldaddr_t appears in struct buf and struct uio.
However, at present these quantities are in effect 32-bit values with
an extra 32 bits of space reserved for future use. There are three
obstacles to making use of the full 64 bits:

1) The kernel code has been written to ignore the upper 32 bits of all
   64-bit quantities.

2) Related quantities in other structures have not been widened to 64
   bits. For instance, file offsets in the "file" and "page" structures
   remain at 32 bits, and struct buf uses both lldaddr_t and the 32-bit
   daddr_t.

3) No mechanism has been established for device drivers to indicate
   whether or not they understand 64-bit offsets and disk addresses.

Proposed kernel changes
-----------------------

1) To enable use of 64-bit file offsets, we must expand the f_offset
   member of struct file to 64 bits. There are other 32-bit offsets
   that are candidates for widening to 64 bits, particularly the
   l_start and l_len members of struct flock and the vnode offsets used
   in the VM system. However, we consider these changes to be beyond
   the scope of the present proposal.

2) The offset parameter of the vop_seek vnode operation must also be
   widened to 64 bits. For consistency, we will widen the offset
   parameters of other vnode operations, namely vop_putpage,
   vop_getpage, vop_addmap, vop_delmap, vop_map, vop_close, vop_frlock,
   and vop_space. We will ignore the upper 32 bits of these parameters
   for now.
   This requires changing the vnode operations for every file system
   type, but the changes are localized and largely mechanical. Since
   the read, write, and seek operations guarantee that the offset
   remains in the range 0 <= offset < 2GB, the other operations can
   immediately convert the offset to a 32-bit type on entry.

3) The vop_seek, vop_read, and vop_write operations for all file system
   types except specfs will check for and reject offsets of 2GB or
   more. These file system types will continue to use 32-bit offsets
   for other vnode operations. Assertions will check that the offset is
   always in the legal range. The vnode operations for specfs will be
   rewritten as described in (6) below.

4) The generic read/write/lseek code will use the 64-bit f_offset file
   offset. lseek will return the least significant 32 bits of the
   offset.

5) The kernel must be able to decide on a per-device basis whether
   offsets and block numbers greater than 2GB are allowed or cause an
   error, using information supplied by the device driver. Drivers that
   expect 64-bit quantities will set a bit in the cb_flag member of
   their cb_ops structure. This bit will be defined in
   common/sys/conf.h, as follows:

      #define D_64BIT 0x200

   When specfs creates a "common" snode, it will check whether the
   driver's cb_ops structure has this bit set. If so, it will set a
   newly defined bit, SLOFFSET (0x40), in the s_flags member of the
   snode.

6) The spec_seek, spec_read, and spec_write specfs operations will call
   a new function,

      offset_t spec_maxoffset(struct vnode *vp);

   to determine the maximum legal file offset for the vnode on which
   the operation is to be performed. The return value is determined as
   follows:

      a) If vp is a streams device, returns -1
      b) Else if the SLOFFSET bit is set, returns 1TB - 1
      c) Else returns 2GB - 1

   When -1 is returned, any offset is allowed; otherwise, offsets are
   required to remain positive and less than or equal to the returned
   maximum.
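The selection rule in (5) and (6) can be sketched in a few lines of C.
This is only an illustration that compiles at user level, not kernel
code: struct snode_sketch, its is_stream field, and the three *_sketch
functions are hypothetical stand-ins, and "positive" is assumed here to
include offset zero. Only D_64BIT, SLOFFSET, and the three return
values come from the text above.

```c
#include <assert.h>
#include <stddef.h>

typedef long long offset_t;                  /* 64-bit offset type */

#define D_64BIT    0x200                     /* cb_flag bit, from item (5) */
#define SLOFFSET   0x40                      /* s_flags bit, from item (5) */

#define MAXOFF_2GB ((offset_t)0x7fffffffLL)  /* 2GB - 1 */
#define MAXOFF_1TB (((offset_t)1 << 40) - 1) /* 1TB - 1 */

/* Hypothetical stand-in for the snode fields this check needs. */
struct snode_sketch {
    int is_stream;                           /* assumed: nonzero for streams */
    int s_flags;
};

/* Record the driver's capability when the "common" snode is created. */
void snode_init_sketch(struct snode_sketch *sp, int is_stream, int cb_flag)
{
    assert(sp != NULL);
    sp->is_stream = is_stream;
    sp->s_flags = (cb_flag & D_64BIT) ? SLOFFSET : 0;
}

/* Cases (a), (b), (c) of item (6), in that order. */
offset_t spec_maxoffset_sketch(const struct snode_sketch *sp)
{
    if (sp->is_stream)
        return -1;
    if (sp->s_flags & SLOFFSET)
        return MAXOFF_1TB;
    return MAXOFF_2GB;
}

/* Returns 1 if off is a legal offset for this device, 0 otherwise. */
int spec_offset_ok_sketch(const struct snode_sketch *sp, offset_t off)
{
    offset_t max = spec_maxoffset_sketch(sp);

    if (max == -1)                           /* any offset is allowed */
        return 1;
    return off >= 0 && off <= max;
}
```

Under these assumptions, a driver that does not set D_64BIT is never
handed an offset above 2GB - 1, which is the property that lets
existing 32-bit drivers run unchanged.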
   (The current specfs implementation puts no restrictions on offset
   values; the 4.1 policy was to disallow negative offsets except for
   streams devices and the memory device. A consequence of this change
   is that the memory device must use 64-bit offsets.)

7) physio() will convert 64-bit uio offsets into 64-bit block numbers.
   File systems that use only 32 bits of the uio offset must leave the
   upper 32 bits set to zero. Since no current devices use addresses
   larger than 32 bits, the upper 32 bits of the block number should
   always be zero. The 1TB - 1 limit imposed by spec_maxoffset()
   ensures that this condition holds true for all device I/O.

8) All standard disk and tape drivers will set the D_64BIT bit in their
   cb_ops structures and use 64-bit offsets and block numbers. The
   memory device driver will do so also.

Since the 5.0 C compiler will support a built-in 64-bit "long long"
type, all 64-bit operations will be implemented using long long.

Device driver interface
-----------------------

It is important that we continue to support device drivers written with
32-bit file offsets in mind. We cannot safely pass 64-bit offsets to
such drivers since they will behave incorrectly when the offset exceeds
2GB. Thus, drivers need a method of indicating that they support
extended offsets. This will be done by setting a bit, D_64BIT (0x200),
in the cb_flag member of the per-driver cb_ops structure.

Setting D_64BIT indicates that the driver expects to receive 64-bit uio
offsets in struct uio and 64-bit block numbers in struct buf (however,
in the current implementation, the block number will never exceed 2G).
In the default case, i.e., when the bit is not set, the kernel will not
pass values too large to represent in 32 bits to the driver. Thus
existing 32-bit drivers require no changes.

specfs will check whether the driver's D_64BIT bit is set whenever it
creates a "common" device snode.
The result is recorded in the s_flags member of the snode using the
newly defined SLOFFSET bit. Subsequent operations on the device use
this bit to determine the maximum valid file offset for the device.

User-level interface
--------------------

A new system call, llseek, will supply the user-level interface to the
extended file offsets. llseek is identical in usage and semantics to
lseek, with the exceptions that it takes a 64-bit offset parameter
instead of a 32-bit one and returns a 64-bit result. llseek will reside
in libc along with the other system calls.

File system utilities that use the raw device interface to read and
write file system structures and data must be rewritten to use the new
llseek interface. The following utilities require change: dump, fsck,
mkfs, clri, fsirand, edquota, quot, tunefs, fsdb, ff, fstyp, and
labelit. llseek replaces lseek, and some variables correspondingly
change type from off_t to the 64-bit offset_t. Since the utilities
represent disk locations in terms of block numbers rather than byte
offsets, the changes are localized and straightforward.
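As a rough illustration of the kind of change the utilities need, the
sketch below computes a byte offset from a block number in 64 bits
before seeking. The helper names block_to_offset, needs_llseek, and
low32 are hypothetical and appear nowhere in the proposal, and offset_t
is assumed to be long long. The arithmetic shows why a 32-bit off_t
fails: with 512-byte blocks, block 4194304 already sits at byte offset
2GB.

```c
#include <assert.h>

typedef long long offset_t;       /* 64-bit offset type (assumed long long) */

/* Byte offset of block blkno, computed in 64 bits to avoid overflow. */
offset_t block_to_offset(long blkno, long bsize)
{
    assert(blkno >= 0 && bsize > 0);
    return (offset_t)blkno * (offset_t)bsize;
}

/* Nonzero if off cannot be represented as a signed 32-bit offset. */
int needs_llseek(offset_t off)
{
    return off > 0x7fffffffLL;
}

/* Least significant 32 bits of a 64-bit offset. */
unsigned long low32(offset_t off)
{
    return (unsigned long)(off & 0xffffffffLL);
}
```

A converted utility would then replace a call such as
lseek(fd, blkno * bsize, 0) with
llseek(fd, block_to_offset(blkno, bsize), 0), assuming llseek takes the
same arguments as lseek apart from the wider offset, as described
above. low32 shows what remains of an offset once it is truncated to
32 bits.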