Copyright 2006 Sun Microsystems, Inc.

THANKS
======

First, a note of thanks to Don Cragun, Casper Dik, and Mike Shapiro for
their guidance, reviews, ideas, and baby sitting (of me). I could not
have done this without them. Hopefully the 4th try will be the charm.

1. Introduction
    1.1. Project/Component Working Name:
         Extended FILE space for 32-bit Solaris processes
    1.2. Name of Document Author/Supplier:
         Author: Craig Mohrman
    1.3. Date of This Document:
         8 March, 2006

4. Technical Description

NOTE
====

Please read the entire proposal before commenting. Much detail and many
answers lie beneath. Thank you.

Summary
=======

32-bit Solaris processes will no longer be limited to 256 FILE's, and
there is LITTLE chance of silent data corruption. Silent data
corruption can only occur if:

    1. the process uses FILE->_file directly,
    2. the process uses the fileno() macro (the fileno() macro was
       removed from our headers in Solaris release 2.7) rather than the
       fileno() function, or
    3. the process truncates the value returned by the fileno()
       function.

This is a runtime change with no limit on the age of the software. Even
pre-2.7 software will work.

Background
==========

It has been a historical limitation of UNIX implementations that an
unsigned char is used to store the value of the file descriptor (fd)
associated with the stream (FILE). The effect of this is that the
maximum value such a file descriptor could take was 255.

This limitation has become burdensome for a number of customers. See,
for example:

    1085341 _file in stdio.h should be an unsigned long
    1147964 NIS+ servers start repeatedly doing FULL RESYNCS ... runs out of fd's
    1173088 request to remove the stdio streams file pointer fopen limit of 256.
    1253283 gethostbyname fails if more than 255 file descriptors are already open
    1266821 lpsched goes into an infinite loop hanging.
    4009255 getnettype limits # of fd's to 255
    4031927 the use of fopen() in cdbopen() of libcdb fails when open files greater than 255
    4043375 socketpair() returns EAGAIN.
    4050822 dlopen() core dumps when no fds available (was ypmatch dumps core).
    4081318 in.named: fdopen: Error 0
    4101908 SUBSCRIBE/UNSUBSCRIBE fail if more than 256 files are open ...
    4131997 pam_start() fails when the application has more than 255 open file ..
    4152876 getspnam_r() fails due to use of fopen() in libnsl.so in
    4156580 getnetlist uses fopen, limiting RPC to 256 descriptors
    4166190 FNS uses fopen(), this is bad.
    4239433 Request to remove the stdio streams file pointer fopen limit of 256.
    4275900 dbx fails to open a file if there are more than 255 lwps running
    4292123 Error in libsecurity: cannot open /etc/netconfig
    4353836 if more than 255 file descriptors are already open then gethostbyname
    4400631 "getgrent" calls "fopen" instead of "open"
    4459640 Java crash with a load of more than 20 users, assertion on
    4472001 gethostbyname failed with NO_RECOVERY when there are more ..
    4472174 use of fopen in zip utils implementation
    4472176 use of fopen in java sound implementation
    4472179 use of fopen in J2SE networking classes implementation
    4472183 Solaris: use of fopen() in timezone classes implementation
    4472199 use of fopen() in java runtime profiling support
    4472643 fopen and fdopen fail in less than FOPEN_MAX or STREAM_MAX calls
    4481723 yacc generated code does not handle error returned by fdopen
    etc.

Additionally, if application code uses a lot of file descriptors but
also takes care to avoid stdio, it will still find that certain
standard library functions break when fds 0...255 are all in use.
(e.g., socket, nsl, nss and nameserver libraries alone contain nearly
100 calls to stdio functions returning a FILE *)

This has led to various bugs, many of which in effect require
reimplementing parts of stdio inside applications or libraries, or
rewriting an application to use file descriptors above 255 instead of
using FILE * more naturally. This is a continuous maintenance burden,
especially where additional code gets added to reimplement bits of
stdio to work with fds > 255.

While this limit is now lifted for 64-bit applications, we are still
faced with this issue for our own 32-bit runtime, 32-bit JVM, and a
variety of 32-bit applications.

For binary compatibility reasons it is not possible to simply change
the _file field of the FILE structure from an unsigned char to an int:
doing so would change the size of the structure and would therefore
break binary compatibility constraints.

This proposal draws on experience from, and supersedes, the following
cases:

    PSARC/1996/413 - Lots of File Descriptors
        current status: closed superseded 1998/109
        While discussing the case Steve Chessin proposed a reserved fd
        open against a device which renders the fd unusable.

    PSARC/1998/109 - Extended Stdio
        current status: closed withdrawn
        Joe Townsend and Mike Shapiro propose a new fcntl() command
        which will invalidate an fd that will be stored in the _file
        field.

    PSARC/2001/680 - Large fd stdio
        current status: discussing
        Casper Dik proposes a new "F" flag for internal usage of
        extended FILEs.

If this case is approved, PSARC/2001/680 will be marked closed
withdrawn.

Overview of Changes
===================

These changes will only affect 32-bit applications. The 64-bit
environment is unchanged except for the fcntl(F_BADFD) system call,
which will fail.

The default 32-bit case will work as it does today. The user must take
action(s) in order to enable this feature (i.e., opt in).
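
The binary compatibility constraint described in the Background can be
illustrated with a small sketch. The two structures below are
hypothetical stand-ins, not the real Solaris FILE layout; they use
fixed-width fields only so that the sizes do not depend on the host's
pointer width. The function names are likewise invented for
illustration. The sketch shows that widening the descriptor field
changes the structure size, and that storing a descriptor above 255 in
an 8-bit field silently truncates it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit FILE with an 8-bit descriptor. */
struct narrow_file {
    int32_t  cnt;
    uint32_t ptr;    /* 32-bit pointers modeled as integers */
    uint32_t base;
    uint8_t  flag;
    uint8_t  file;   /* plays the role of _file */
};

/* The same layout with the descriptor field widened to a full int. */
struct wide_file {
    int32_t  cnt;
    uint32_t ptr;
    uint32_t base;
    uint8_t  flag;
    int32_t  file;
};

/* Returns what the 8-bit descriptor field holds after storing fd. */
int
stored_fd(int fd)
{
    struct narrow_file f = { 0 };

    assert(fd >= 0);
    f.file = (uint8_t)fd;   /* values above 255 lose their high bits */
    return (f.file);
}

/* Returns how many bytes the structure grows when the field is widened. */
size_t
growth_from_widening(void)
{
    return (sizeof (struct wide_file) - sizeof (struct narrow_file));
}
```

For example, stored_fd(300) yields 44, a different and possibly open
descriptor, while growth_from_widening() is nonzero, which is the size
change that existing binaries could not tolerate.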

Applications that enable this feature will be able to associate any
valid file descriptor (fd) with a STDIO stream. This feature can be
enabled using a preloadable library or a new
enable_extended_FILE_stdio(3C) function.

When this feature is enabled, larger fd's (>255) will be stored in an
auxiliary location and a bogus (bad) fd will be stored in the
FILE->_file field. Improper accesses by the application to the
FILE->_file field will either yield the proper underlying fd for those
fd's <= 255 or the unavailable (bad) fd. Accessing the (bad) fd will
yield errors. There is very little chance of silent data corruption.

There are 2 programmatic interfaces allowing access to the larger than
256-fd FILE pool, provided that the RLIMIT_NOFILE resource limit has
been raised. The default RLIMIT_NOFILE is 256 fd's.

The first interface adds a flag, "F", to the existing mode string of
the standard I/O open calls, e.g. fopen(), fdopen(), popen(). This is
intended to be used by library routines (e.g. gethostent()) but
applications may use it as well. This interface provides no safeguards
against misuse. The FILE pointer returned by these stdio open routines
must never be passed to any uninspected code; data corruption may occur
if uninspected code misuses a FILE pointer opened in this manner.

The second interface, enable_extended_FILE_stdio(), is more generalized
and does provide protection to software with unknown behaviors such as
3rd party libraries without source code.

To aid in "cleaning up" source code which contains FILE->_file accesses
we propose to rename FILE->_file to FILE->__file (additional "_"). This
will forcibly break the compilation of any source containing references
to FILE->_file. Obviously this will grab the attention of developers
who will then hopefully recode FILE->_file to the more appropriate
fileno(FILE). Sun code needs this as much as 3rd party code.
This will very quickly clean up Solaris to be FILE->_file clean and
guarantee that using extended FILE will work throughout Solaris.
Without this it will be very difficult to verify the readiness of
Solaris and other software products.

If it is known that the application accesses the FILE->_file field
directly then this feature should not be used. Even though we go to
great lengths to handle all possible cases, there can always be another
case we did not think of. For well-behaved applications (no direct
FILE->_file accesses) this works wonderfully.

Details of Changes
==================

1. A new bitfield will be added to the FILE struct, __extendedfd. When
   this bitfield is set to 1, the file descriptor associated with the
   stream is stored as auxiliary data; not in _file. How _file is set
   in this case depends on other issues specified below.

   A second bitfield is added to the FILE struct,
   __extendedfd_nocheck. This bitfield is used to control error
   checking.

2. We add a new fcntl() command.

        marked_fd = fcntl(low_fd, F_BADFD, action)

   This fcntl() command will:

   A. mark the first currently closed file descriptor greater than or
      equal to low_fd to be unavailable for assignment as a file
      descriptor for any following system call that allocates a file
      descriptor,

   B. set the action to be taken by the kernel when marked_fd is
      passed in as an fd in subsequent system calls. If action is 0,
      any reference to marked_fd will return an error with errno set
      to EBADF; otherwise, any reference to marked_fd will send an
      "action" signal to the process and return an error with errno
      set to EBADF. The "action" may be any valid signal number. The
      "action" signal will not be sent if the application attempts
      close(marked_fd). We believe that many applications exist that
      attempt to close() fd's they did not open.

   This new fcntl() command will return the file descriptor that has
   been marked as unavailable.
   If no file descriptor in the range 3 <= low_fd <= marked_fd <= 255
   can be marked, the fcntl() will fail and return -1. If the call
   fails, errno will be set as follows:

        EAGAIN  All file descriptors in the inclusive range low_fd
                through 255 refer to files that are currently open in
                the process.

        EBADF   low_fd is less than 3 or greater than 255.

        EEXIST  A file descriptor has already been marked by an
                earlier call to fcntl().

        EINVAL  action is not a valid signal number.

   This fcntl() may only be called successfully once. Subsequent calls
   will return an error.

   An unallocatable file descriptor set by the fcntl() F_BADFD command
   will be inherited by a child across fork operations and cleared in
   the new process image during exec and spawn operations.

   This new fcntl(F_BADFD) command should not be used directly by
   applications. Applications should use enable_extended_FILE_stdio()
   (see #6 below) instead. Calling fcntl(F_BADFD) directly serves no
   purpose and would cause a subsequent call to
   enable_extended_FILE_stdio() to fail.

3. An additional flag "F" is added to the mode argument in our
   implementation of fopen(), fdopen(), and popen(). This flag will
   allow fopen() and fdopen() to assign/specify a file descriptor
   larger than 255. popen() will also recognize "F" in its mode string
   because it passes its mode string through to fdopen(). The "F" flag
   will only be recognized if it is the last character of the mode
   argument.

   When fopen() is called it will use the lowest available file
   descriptor.

   The "F" flag, when used by Sun's engineers, will be used by library
   routines only if that use of fopen() is to create a stream that
   will not be returned to application code for any use.

   If the file descriptor allocated (after F flag processing) is less
   than 256, __extendedfd will be set to 0 and _file will be set to
   the new file descriptor. Otherwise the action depends on the
   current value of __STDIO_bad_fd and the presence of the F flag.
   If __STDIO_bad_fd is less than 0 and the F flag is not present, the
   fopen() or fdopen() will fail; otherwise __extendedfd will be set
   to 1, _file will be set to (unsigned char)__STDIO_bad_fd, and the
   file descriptor will be stored in an auxiliary location. Another
   new bitfield, FILE->__extendedfd_nocheck, will be set to 1 to help
   us manage this internally requested extended FILE.

   These changes only apply to the 32-bit code path.

   The following pseudo code demonstrates what the new 32-bit fopen()
   might look like:

   libc will initialize:

        __STDIO_bad_fd = -1;    /* runtime is signaling us to use */
                                /* extended FILEs */
                                /* default < 0; no extended FILEs */

        fopen(filename, mode)
        {
                fd = open(filename, ...)
                if (fd < 0)
                        return (NULL)

                if (fd < 256) {
                        FILE->_file = fd
                        FILE->__extendedfd = 0
                        FILE->__extendedfd_nocheck = 1
                } else if ("F"_is_the_last_character_in_mode) {
                        FILE->__extendedfd = 1
                        FILE->__extendedfd_nocheck = 1
                        fd is stored in an auxiliary location
                } else if (__STDIO_bad_fd >= 0) {
                        FILE->_file = (unsigned char)__STDIO_bad_fd
                        FILE->__extendedfd = 1
                        FILE->__extendedfd_nocheck = 0
                        fd is stored in an auxiliary location
                } else {
                        close(fd)
                        errno = EMFILE
                        return (NULL)
                }
        }

   The possible scenarios are thus:

   A) User does nothing and runs applications:
      Everything works or fails as before.

   B) User raises RLIMIT_NOFILE:
      Normal stdio calls are still limited to 256 FILE's but any usage
      of the "F" flag will allow RLIMIT_NOFILE FILE's.

   C) User raises RLIMIT_NOFILE and pre-loads the extended FILE
      library:
      Application gets RLIMIT_NOFILE FILE's. Direct accesses to
      FILE->_file are preserved for the fd's <= 255. fd's > 255 will
      receive errors or a signal. User's choice. Solaris internal
      libraries ("F" usage) get the same as B above.

        $ ulimit -n 1000
        $ LD_PRELOAD_32=/usr/lib/extendedFILE.so.1 \
              application [args...]

4. It has been noted that private copies of STDIO code exist to
   locally alleviate the 256 fd limitation.
   This code will be identified and subsequently converted to use the
   new STDIO and "F" flag. These changes are not part of this case but
   are being duly noted for future action.

   [Note: Casper has alerted me to one such case in networking code.
   I can find no other instances.]

   Upon integration, Sun Engineering will be notified of the
   availability of this change.

5. We add a preloadable library that can be used to invoke the new
   enable_extended_FILE_stdio(low_fd, action) function before entering
   main(). If the user sets the environment variable STDIO_BADFD to a
   value between 3 and 255 inclusive when loading the preloadable
   library, low_fd will be set to that value. Otherwise, low_fd will
   be given the default value 196.

   The "action" parameter can also be user specified. If the "action"
   parameter is not specified, the default "action" will be to return
   errors and set errno as indicated above. The user can also choose
   to have a signal sent by setting the environment variable
   STDIO_BADFD_SIGNAL to the desired signal number or name before
   launching their application.

        $ ulimit -n 1000
        $ LD_PRELOAD_32=/usr/lib/extendedFILE.so.1 STDIO_BADFD=100 \
              STDIO_BADFD_SIGNAL=SIGABRT application [args...]

6. A second programming interface,
   enable_extended_FILE_stdio(int fd, int action), is added to allow
   3rd parties and end users access to the interface used in the
   preloadable library (#5 above).

   The fd argument is the starting point to begin searching for an
   available fd to be marked as the unallocatable file descriptor. Its
   value must be between 3 and 255 inclusive. If an available fd
   cannot be found an error is returned. A value of -1 for fd will use
   a built in scheme for searching for an available fd.

   The action argument is the requested signal to be sent to the
   application in the event that an error in the usage of this
   mechanism or an improper access to FILE->_file is detected. A value
   of -1 yields the default SIGABRT signal. A value of 0 disables the
   sending of a signal.

7.
   Rename FILE->_file to FILE->__file (additional "_"). This will aid
   all of Sun and 3rd parties in cleaning up source code that directly
   accesses _file. This is needed to ensure the safety of using the
   extended FILE mechanism. This change will not affect running code
   in any way; it only generates errors at compile time.

   References to _file in the remainder of this proposal refer to
   application code that has not yet been converted to use public
   interfaces.

The four programming interfaces described here provide similar
features, but are intended for different audiences.

The fcntl(low_fd, F_BADFD, action) function should never be called
directly by applications. It is only intended to be used by
enable_extended_FILE_stdio() which performs the additional "paperwork"
needed to fully enable this feature.

The "F" mode flag to fdopen(), fopen(), and popen() is intended for use
in applications and libraries in cases where it is known that the FILE
pointer returned will never be passed to any code that may use the FILE
pointer to directly access the _file field. This flag allows large file
descriptors to be used in cases where it is known that the FILE pointer
will only be used in safe ways. Fewer checks for invalid file
descriptors are performed since all uses of the "F" mode flag can be
assumed to have been inspected. The "F" flag must never be used in any
case where the FILE pointer will be given to uninspected code that
might directly access _file through that FILE pointer. Passing a FILE
pointer opened using the "F" flag to uninspected code could lead to
data corruption. The "F" flag's primary intention is to allow the
Solaris libraries to open configuration files with stdio functions when
the file descriptors below 256 have been depleted.

The enable_extended_FILE_stdio() function and the
/usr/lib/extendedFILE.so.1 preloadable library can be used by any
application that needs to use extended file descriptors with STDIO
routines.
They fully enable the extended file descriptors for STDIO features. New
applications and applications that can be easily modified are expected
to use the enable_extended_FILE_stdio() function; the preloadable
library provides the same features when an existing application can't
be modified to call enable_extended_FILE_stdio(). As long as the
application checks for error returns and doesn't change the file
descriptor being used by an STDIO stream by directly changing _file,
using extended file descriptors will not increase the likelihood of
data corruption. This is intended for 3rd party use.

There is one theoretical case where this fails:

        (void) close(FILE->_file);
        FILE->_file = myfd;

Assume FILE->_file is filled with the marked badfd file descriptor
because we are using extended FILEs. The close() function will fail,
but since this usage ignores the return status the application proceeds
to perform low level I/O on FILE->_file while calls to STDIO functions
would continue to use the original, extended file descriptor. If the
application continues using STDIO functions after changing FILE->_file,
silent data corruption could occur because the application thinks it
has changed fd's via the above assignment but the actual STDIO fd is
stored in the auxiliary location. The chances for corruption are even
higher if myfd has a value greater than 255 and is truncated by the
assignment to the 8-bit _file.

If the application does use _file directly (including using the
fileno() macro that was provided in Solaris 2.0 through 2.7), it should
not use this feature.

PERFORMANCE
===========

The default, extended FILE's not enabled, will maintain current
performance. With extended FILE's enabled, access to fd's <= 255 will
also maintain current performance. fd's >= 256 will take a performance
hit due to storage/retrieval of the file descriptor in an auxiliary
location.

fopen() with fds < 256 is as fast as before. fopen() is 5-10% slower
with fd's >= 256 than with fd's < 256.
Only 32-bit applications are affected. There is a small penalty in the
remainder of the stdio code where file descriptors are used; but since
stdio tries hard to avoid using file descriptors, and the use of file
descriptors causes much more expensive system calls, this disappears in
the noise.

DOCUMENTATION
=============

The fopen(3C), fdopen(3C), and popen(3C) man pages will be updated to
describe this new behavior. A new man page for extendedFILE.so.1 will
be created to explain in great detail how to use the feature, and
various scenarios will be presented. A new man page for
enable_extended_FILE_stdio() will be created to explain the
programmatic interface.

FAILING CASE
============

We have added code to detect the theoretical failing case described
above. If this failing case is detected then a simple message will be
printed to stderr and abort() will be called.

Exported Interfaces
===================

Interface                                   Stability
---------                                   ---------
FILE->__extendedfd                          project private
FILE->__extendedfd_nocheck                  project private
"F" flag to fopen()                         evolving
"F" flag in fdopen()                        evolving
"F" flag in popen()                         evolving
F_BADFD fcntl() command                     project private
enable_extended_FILE_stdio()                evolving
/usr/lib/extendedFILE.so.1                  stable
STDIO_BADFD env. variable                   stable
STDIO_BADFD_SIGNAL env. variable            stable
STDIO_BADFD_SIGNAL_MESSAGE env. variable    stable

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
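
The 32-bit fopen() decision described under item 3 of Details of
Changes can also be modeled in compilable form. What follows is only a
sketch: struct model_file, model_assign(), and model_fileno() are
hypothetical names, the auxiliary location is modeled as a plain struct
member, and nothing here is the libc implementation. One assumption
goes beyond the pseudo code: in the "F" branch the model also stores
(unsigned char)bad_fd in _file, following the prose of item 3.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the FILE fields this proposal touches. */
struct model_file {
    unsigned char _file;                  /* legacy 8-bit descriptor slot */
    unsigned int  extendedfd : 1;         /* models __extendedfd */
    unsigned int  extendedfd_nocheck : 1; /* models __extendedfd_nocheck */
    int           aux_fd;                 /* stands in for the auxiliary location */
};

/*
 * Models the choice made after open() has returned fd.
 * bad_fd models __STDIO_bad_fd (negative means not enabled); has_F says
 * whether "F" was the last character of the mode string.
 * Returns 0 on success, or -1 with errno set to EMFILE. On failure the
 * real code would also close(fd); the model does not own the descriptor.
 */
int
model_assign(struct model_file *fp, int fd, int has_F, int bad_fd)
{
    assert(fp != NULL && fd >= 0);

    if (fd < 256) {
        fp->_file = (unsigned char)fd;
        fp->extendedfd = 0;
        fp->extendedfd_nocheck = 1;   /* as in the pseudo code */
        fp->aux_fd = -1;
        return (0);
    }
    if (!has_F && bad_fd < 0) {
        errno = EMFILE;               /* extended FILEs not enabled */
        return (-1);
    }
    /* Assumption: _file holds the marked fd in both extended branches. */
    fp->_file = (unsigned char)bad_fd;
    fp->extendedfd = 1;
    fp->extendedfd_nocheck = has_F ? 1 : 0;
    fp->aux_fd = fd;
    return (0);
}

/* Models the fileno() function: consult the auxiliary location if set. */
int
model_fileno(const struct model_file *fp)
{
    return (fp->extendedfd ? fp->aux_fd : (int)fp->_file);
}
```

With bad_fd set to 196, model_assign(&f, 300, 0, 196) leaves 196 in
f._file while model_fileno(&f) returns 300. This is why code that reads
_file directly sees only the marked descriptor, whereas code that calls
the fileno() function sees the real one.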