Copyright 2006 Sun Microsystems, Inc.

THANKS
======
Firstly a note of thanks to Don Cragun, Casper Dik, and Mike Shapiro
for their guidance, reviews, ideas, and baby sitting (of me).
I could not have done this without them.

Hopefully the 4th try will be the charm.


1. Introduction
    1.1. Project/Component Working Name:
	 Extended FILE space for 32-bit Solaris processes
    1.2. Name of Document Author/Supplier:
	 Author: Craig Mohrman
    1.3. Date of This Document:
	 8 March, 2006

4. Technical Description
	NOTE
	====
	Please read the entire proposal before commenting.  Much detail
	and many answers lie beneath.  Thank you.


	Summary
	=======
	32-bit Solaris processes will no longer be limited to 256 FILE's,
	and the change carries LITTLE chance of silent data corruption.
	Silent data corruption can only occur if:
		   1.	the process uses FILE->_file directly,
		   2.	the process uses the fileno() macro (the
			fileno() macro was removed from our headers in
			Solaris release 2.7) rather than the fileno()
			function, or
		   3.	the process truncates the value returned by the
			fileno() function.
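
	Case 3 can be illustrated with a small C sketch.  It is not
	part of the proposal, and the helper name is hypothetical; it
	only models an application that keeps the result of fileno()
	in an 8-bit variable, so that a large descriptor silently
	aliases a low-numbered one.

```c
/*
 * Hypothetical helper: models code that stores a file descriptor
 * in an unsigned char, keeping only the low 8 bits.
 */
static int
truncated_fd(int fd)
{
	unsigned char narrow = (unsigned char)fd;

	return (narrow);
}
```

	With this model, descriptor 256 is seen as 0 (stdin) and
	descriptor 300 as 44, so later I/O would go to the wrong file.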

	This is a runtime change with no limit on the age of the software.
	Even pre-2.7 software will work.


	Background
	==========
	It has been a historical limitation of UNIX implementations
	that an unsigned char is used to store the value of the file
	descriptor (fd) associated with the stream (FILE).  The effect
	of this is that the maximum value such a file descriptor could
	take was 255. This limitation has become burdensome for a
	number of customers.

	See, for example:

	1085341 _file in stdio.h should be an unsigned long
	1147964 NIS+ servers start repeatedly doing FULL RESYNCS ...
		runs out of fd's
	1173088 request to remove the stdio streams file pointer fopen
		limit of 256.
	1253283 gethostbyname fails if more than 255 file descriptors
		are already open
	1266821 lpsched goes into an infinite loop hanging.
	4009255 getnettype limits # of fd's to 255
	4031927 the use of fopen() in cdbopen() of libcdb fails when
		open files greater than 255
	4043375 socketpair() returns EAGAIN.
	4050822 dlopen() core dumps when no fds available
		(was ypmatch dumps core).
	4081318 in.named: fdopen: Error 0
	4101908 SUBSCRIBE/UNSUBSCRIBE fail if more than 256 files are open ...
	4131997 pam_start() fails when the application has more
		than 255 open file ..
	4152876 getspnam_r() fails due to use of fopen() in libnsl.so in
	4156580 getnetlist uses fopen, limiting RPC to 256 descriptors
	4166190 FNS uses fopen(), this is bad.
	4239433 Request to remove the stdio streams file pointer
		fopen limit of 256.
	4275900 dbx fails to open a file if there are more than 255 lwps running
	4292123 Error in libsecurity: cannot open /etc/netconfig
	4353836 if more than 255 file descriptors are already open
		then gethostbyname
	4400631 "getgrent" calls "fopen" instead of "open"
	4459640 Java crash with a load of more than 20 users, assertion on
	4472001 gethostbyname failed with NO_RECOVERY when there are more ..
	4472174 use of fopen in zip utils implementation
	4472176 use of fopen in java sound implementation
	4472179 use of fopen in J2SE networking classes implementation
	4472183 Solaris: use of fopen() in timezone classes implementation
	4472199 use of fopen() in java runtime profiling support
	4472643 fopen and fdopen fail in less than FOPEN_MAX or STREAM_MAX calls
	4481723 yacc generated code does not handle error returned by fdopen

	etc.


	Additionally, if application code uses a lot of file
	descriptors but also takes care to avoid stdio, it will still
	find that certain standard library functions break when fds
	0...255 are all in use (e.g., the socket, nsl, nss and nameserver
	libraries alone contain nearly 100 calls to stdio functions
	returning a FILE *).  This has led to various bugs, many of
	which in effect require reimplementing parts of stdio inside
	applications or libraries, or rewriting an application to use
	file descriptors above 255 instead of using FILE * more
	naturally.

	This is a continuous maintenance burden, especially where
	additional code gets added to reimplement bits of stdio to work
	with fds > 255.

	While this limit is now lifted for 64-bit applications, we are
	still faced with this issue for our own 32-bit runtime, 32-bit
	JVM, and a variety of 32-bit applications.

	It is not possible for binary compatibility reasons to simply
	change the _file field of the FILE structure from an unsigned char
	to an int because this would cause the size of the structure to
	change and would therefore break binary compatibility constraints.
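
	The size problem can be sketched in C.  The two structures
	below are a simplified, hypothetical model of a 32-bit FILE
	layout (the field names and order are assumptions, and
	pointers are modelled as 32-bit integers so the result does
	not depend on the host).  Widening the descriptor field
	changes both the structure size and the offsets of later
	fields, which already-compiled code has built in.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of today's layout: 8-bit descriptor. */
struct file_narrow {
	int32_t		cnt;
	uint32_t	ptr;	/* stands in for a 32-bit pointer */
	uint32_t	base;	/* stands in for a 32-bit pointer */
	uint8_t		flag;
	uint8_t		file;	/* 8-bit file descriptor */
	uint16_t	bits;
};

/* The same model with the descriptor widened to 32 bits. */
struct file_wide {
	int32_t		cnt;
	uint32_t	ptr;
	uint32_t	base;
	uint8_t		flag;
	int32_t		file;	/* widened file descriptor */
	uint16_t	bits;
};

static size_t
narrow_size(void)
{
	return (sizeof (struct file_narrow));
}

static size_t
wide_size(void)
{
	return (sizeof (struct file_wide));
}
```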

	This proposal draws on experiences and supersedes the following
	cases:

	PSARC/1996/413 - Lots of File Descriptors
	current status: closed superseded 1998/109
		While discussing the case Steve Chessin proposed a
		reserved fd open against a device which renders the
		fd unusable.

	PSARC/1998/109 - Extended Stdio
	current status: closed withdrawn
		Joe Townsend and Mike Shapiro proposed a new fcntl()
		command which will invalidate an fd that will be
		stored in the _file field.

	PSARC/2001/680 - Large fd stdio
	current status: discussing
		Casper Dik proposes a new "F" flag for internal
		usage of extended FILEs.
	
	If this case is approved, PSARC/2001/680 will be marked closed
	withdrawn.


	Overview of Changes
	===================

	These changes will only affect 32-bit applications.  The 64-bit
	environment is unchanged, except that the fcntl(F_BADFD) system
	call will fail there.

	The default 32-bit case will work as it does today.  The user
	must take action(s) in order to enable this feature.  (I.e.,
	opt in.)

	Applications that enable this feature will be able to associate
	any valid file descriptor (fd) with a STDIO stream.  This
	feature can be enabled using a preloadable library or a new
	enable_extended_FILE_stdio(3C) function.

	When this feature is enabled, larger fd's (>255) will be stored
	in an auxiliary location and a bogus (bad) fd will be stored in
	the FILE->_file field.  Improper accesses by the application to
	the FILE->_file field will either yield the proper underlying
	fd for those fd's <= 255 or the unavailable (bad) fd.
	Accessing the (bad) fd will yield errors.  There is very little
	chance of silent data corruption.

	There are 2 programmatic interfaces allowing access to the
	larger than 256-fd FILE pool provided that the RLIMIT_NOFILE
	resource limit has been raised.  The default RLIMIT_NOFILE is
	256 fd's.  The first interface is adding a flag, "F", to the
	existing mode string of standard I/O open calls, e.g. fopen(),
	fdopen(), popen().  This is intended to be used by library
	routines (e.g. gethostent()) but applications may use it as
	well.  This interface provides no safeguards against misuse.
	The FILE pointer returned by these stdio open routines must
	never be passed to any uninspected code; data corruption may
	occur if uninspected code misuses a FILE pointer opened in this
	manner.  The second interface, enable_extended_FILE_stdio(), is
	more generalized and does provide protection to software with
	unknown behaviors such as 3rd party libraries without source
	code.
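
	Both interfaces assume that RLIMIT_NOFILE has been raised.
	Besides ulimit in the shell, a process can do this itself
	with the standard getrlimit()/setrlimit() calls.  The helper
	below is an illustrative sketch only; its name is
	hypothetical and it is not part of this proposal.

```c
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE limit to `want`, or to the hard limit
 * if that is lower.  Never lowers the current limit.  Returns 0 on
 * success, -1 on failure (errno is set by getrlimit()/setrlimit()).
 */
static int
raise_nofile_limit(rlim_t want)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return (-1);
	if (rl.rlim_cur >= want)
		return (0);		/* already high enough */
	if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
		want = rl.rlim_max;	/* cannot exceed the hard limit */
	rl.rlim_cur = want;
	return (setrlimit(RLIMIT_NOFILE, &rl));
}
```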

	To aid in "cleaning up" source code which contains FILE->_file
	access we propose to rename FILE->_file to FILE->__file
	(additional "_").  This will forcibly break the compilation of
	any source containing references to FILE->_file.  Obviously
	this will grab the attention of developers who will then
	hopefully recode FILE->_file to the more appropriate
	fileno(FILE).  Sun code needs this as much as 3rd party code.
	This will very quickly clean up Solaris to be FILE->_file clean
	and guarantee that using extended FILE will work throughout
	Solaris.  Without this it will be very difficult to verify
	the readiness of Solaris and other software products.
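
	The recoding this rename is meant to encourage is usually
	mechanical.  A minimal sketch, with a hypothetical wrapper
	name, of replacing a direct fp->_file read with the fileno()
	function:

```c
#define	_POSIX_C_SOURCE 200809L	/* declares fileno() in strict ISO C modes */
#include <stdio.h>

/* Replacement for code that used to read fp->_file directly. */
static int
stream_fd(FILE *fp)
{
	return (fileno(fp));
}
```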

	If it is known that the application accesses the FILE->_file
	field directly then this feature should not be used.  Even
	though we go to great lengths to handle all possible cases,
	there can always be another case we did not think of.  For
	well-behaved applications (no direct FILE->_file accesses)
	this works wonderfully.


	Details of Changes
	==================

	1.  A new bitfield will be added to the FILE struct,
	    __extendedfd.  When this bitfield is set to 1, the file
	    descriptor associated with the stream is stored as
	    auxiliary data; not in _file.  How _file is set in this
	    case depends on other issues specified below.  A second
	    bitfield is added to the FILE struct,
	    __extendedfd_nocheck.  This bitfield is used to control
	    error checking.

	2.  We add a new fcntl() command.
		marked_fd = fcntl(low_fd, F_BADFD, action)
    	    This fcntl() command will:
    	    A.  mark the first currently closed file descriptor greater
		than or equal to low_fd to be unavailable for assignment
		as a file descriptor for any following system call that
		allocates a file descriptor,
	    B.  set the action to be taken by the kernel when marked_fd
		is passed in as an fd in subsequent system calls.  If
		action is 0, any reference to marked_fd will return an
		error with errno set to EBADF; otherwise, any reference
		to marked_fd will send an "action" signal to the
		process and return an error with errno set to EBADF.
		The "action" may be any valid signal number.  The
		"action" signal will not be sent if the application
		attempts close(marked_fd).  We believe that many
		applications exist that attempt to close() fd's they
		did not open.
	    This new fcntl() command will return the file descriptor
	    that has been marked as unavailable.  If no file descriptor
	    in the range 3 <= low_fd <= marked_fd <= 255 can be marked,
	    the fcntl() will fail and return -1.  If the call fails,
	    errno will be set as follows:

	    EAGAIN	All file descriptors in the inclusive range
			low_fd through 255 refer to files that are
			currently open in the process.
	    EBADF	low_fd is less than 3 or greater than 255.
	    EEXIST	A file descriptor has already been marked by an
			earlier call to fcntl().
	    EINVAL	action is not a valid signal number.

	    This fcntl() may only be called successfully once.
	    Subsequent calls will return an error.  An unallocatable
	    file descriptor set by the fcntl() F_BADFD command will be
	    inherited by a child across fork operations and cleared in
	    the new process image during exec and spawn operations.

	    This new fcntl(F_BADFD) command should not be used directly
	    by applications.  Applications should use
	    enable_extended_FILE_stdio() (see #6 below) instead.
	    Calling fcntl(F_BADFD) directly serves no purpose and would
	    cause a subsequent call to enable_extended_FILE_stdio() to
	    fail.
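
	The selection and error rules above can be modelled in user
	space.  The sketch below is not the kernel implementation;
	the structure, the function names, the upper bound on signal
	numbers, and the order in which the errors are checked are
	all assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

#define	MODEL_NFDS	256	/* descriptors 0..255 */
#define	MODEL_MAXSIG	64	/* assumed upper bound on signal numbers */

struct fd_model {
	bool	open[MODEL_NFDS];	/* true if the fd is in use */
	int	marked;			/* marked fd, or -1 if none yet */
	int	action;			/* 0 = EBADF only, else a signal */
};

static void
model_init(struct fd_model *m)
{
	int fd;

	for (fd = 0; fd < MODEL_NFDS; fd++)
		m->open[fd] = false;
	m->marked = -1;
	m->action = 0;
}

/*
 * Model of fcntl(low_fd, F_BADFD, action): mark the first closed fd
 * in low_fd..255.  Returns the marked fd, or -1 with errno set.
 */
static int
model_badfd(struct fd_model *m, int low_fd, int action)
{
	int fd;

	if (low_fd < 3 || low_fd > 255) {
		errno = EBADF;
		return (-1);
	}
	if (action < 0 || action > MODEL_MAXSIG) {
		errno = EINVAL;
		return (-1);
	}
	if (m->marked >= 0) {
		errno = EEXIST;		/* may only succeed once */
		return (-1);
	}
	for (fd = low_fd; fd < MODEL_NFDS; fd++) {
		if (!m->open[fd]) {
			m->marked = fd;
			m->action = action;
			return (fd);
		}
	}
	errno = EAGAIN;			/* low_fd..255 are all open */
	return (-1);
}
```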

	3.  An additional flag "F" is added to the mode argument in our
	    implementation of fopen(), fdopen(), and popen().  This
	    flag will allow fopen() and fdopen() to assign/specify a
	    file descriptor larger than 255.  popen() will also
	    recognize "F" in its mode string because it passes its mode
	    string through to fdopen().

	    The "F" flag will only be recognized if it is the last
	    character of the mode argument.  When fopen() is called it
	    will use the lowest available file descriptor.

	    The "F" flag, when used by Sun's engineers, will be used by
	    library routines only if that use of fopen() is to create a
	    stream that will not be returned to application code for
	    any use.

	    If the file descriptor allocated (after F flag processing)
	    is less than 256, __extendedfd will be set to 0 and _file
	    will be set to the new file descriptor.  Otherwise the
	    action depends on the current value of __STDIO_bad_fd and
	    the presence of the F flag.  If __STDIO_bad_fd is less than
	    0 and the F flag is not present, the fopen() or fdopen()
	    will fail; otherwise __extendedfd will be set to 1, _file
	    will be set to (unsigned char)__STDIO_bad_fd, and the file
	    descriptor will be stored in an auxiliary location.
	    Another new bitfield, FILE->__extendedfd_nocheck, will be
	    set to 1 to help us manage this internally requested
	    extended FILE.

	    These changes only apply to the 32-bit code path.

	    The following pseudo code demonstrates what the new 32-bit
	    fopen() might look like:

		libc will initialize:

		__STDIO_bad_fd = -1; /* default < 0: no extended FILEs;     */
				     /* set to >= 0 when the runtime is    */
				     /* signaling us to use extended FILEs */


		fopen(filename, mode)
		{
			fd = open(filename, ...)
			if (fd < 0) return (NULL)
			if (fd < 256) {
				FILE->_file = fd
				FILE->__extendedfd = 0
				FILE->__extendedfd_nocheck = 1
			} else if ("F"_is_the_last_character_in_mode) {
				FILE->__extendedfd = 1
				FILE->__extendedfd_nocheck = 1
				fd is stored in an auxiliary location
			} else if (__STDIO_bad_fd >= 0) {
				FILE->_file = (unsigned char)__STDIO_bad_fd
				FILE->__extendedfd = 1
				FILE->__extendedfd_nocheck = 0
				fd is stored in an auxiliary location
			} else {
				close(fd)
				errno = EMFILE
				return (NULL)
			}
			return (FILE)
		}


		The possible scenarios are thus:

		A)  User does nothing and runs applications:
	    	    Everything works or fails as before.

		B)  User raises RLIMIT_NOFILE:
		    Normal stdio calls still limited to 256 FILE's but
		    any usage of the "F" flag will allow RLIMIT_NOFILE
		    FILE's.

		C)  User raises RLIMIT_NOFILE and pre-loads extended
		    FILE library:
	    	    Application gets RLIMIT_NOFILE FILE's.
		    Direct accesses to FILE->_file are preserved for
		    fd's <= 255.
		    For fd's > 255 they yield errors or a signal, at
		    the user's choice.
		    Solaris internal libraries ("F" usage) get the
		    same as B above.

	    	    $ ulimit -n 1000
	    	    $ LD_PRELOAD_32=/usr/lib/extendedFILE.so.1 \
			application [args...]
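
	    The decision made for each newly opened descriptor in
	    item 3 can also be written as a small, testable C
	    function.  This is a sketch of the rules as stated above,
	    not the libc implementation; the enum and function names
	    are hypothetical, and stdio_bad_fd stands in for
	    __STDIO_bad_fd.

```c
#include <stdbool.h>
#include <string.h>

enum fd_outcome {
	FD_PLAIN,		/* fd < 256: stored directly in _file */
	FD_EXTENDED_NOCHECK,	/* "F" flag: extended, nocheck = 1 */
	FD_EXTENDED_CHECKED,	/* feature enabled: extended, nocheck = 0 */
	FD_FAIL_EMFILE		/* fd cannot be represented: open fails */
};

/* The "F" flag counts only as the last character of the mode. */
static bool
mode_ends_with_F(const char *mode)
{
	size_t len = strlen(mode);

	return (len > 0 && mode[len - 1] == 'F');
}

static enum fd_outcome
classify_fd(int fd, const char *mode, int stdio_bad_fd)
{
	if (fd < 256)
		return (FD_PLAIN);
	if (mode_ends_with_F(mode))
		return (FD_EXTENDED_NOCHECK);
	if (stdio_bad_fd >= 0)
		return (FD_EXTENDED_CHECKED);
	return (FD_FAIL_EMFILE);
}
```

	    In terms of the scenarios, A and B correspond to
	    stdio_bad_fd < 0 (B differs only in that descriptors of
	    256 and above can actually be allocated), and C
	    corresponds to stdio_bad_fd >= 0.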

	4.  It has been noted that private copies of STDIO code exist
	    to locally alleviate the 256 fd limitation.  Such code
	    will be identified and subsequently converted to use the
	    new STDIO and "F" flag.  These changes are not part of this
	    case but are being duly noted for future action.  [Note:
	    Casper has alerted me to one such case in networking code.
	    I can find no other instances.]

	    Upon integration, Sun Engineering will be notified of the
	    availability of this change.

	5.  We add a preloadable library that can be used to invoke the
	    new enable_extended_FILE_stdio(low_fd, action) function
	    before entering main().  If the user sets the environment
	    variable STDIO_BADFD to a value between 3 and 255 inclusive
	    when loading the preloadable library, low_fd will be set to
	    that value.  Otherwise, low_fd will be given the default
	    value 196.

	    The "action" parameter can also be user specified.  If the
	    "action" parameter is not specified the default "action"
	    will be to return errors and set errno as indicated above.
	    The user can also choose to have a signal sent by setting the
	    environment variable STDIO_BADFD_SIGNAL to the desired
	    signal number or name before launching their application.

	    $ ulimit -n 1000
	    $ LD_PRELOAD_32=/usr/lib/extendedFILE.so.1 STDIO_BADFD=100 \
		STDIO_BADFD_SIGNAL=SIGABRT application [args...]
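
	    How the preloadable library might derive low_fd from
	    STDIO_BADFD can be sketched as follows.  The function
	    name and the parsing details (decimal only, no trailing
	    characters) are assumptions; only the 3..255 range and
	    the default of 196 come from this proposal.

```c
#include <errno.h>
#include <stdlib.h>

#define	DEFAULT_LOW_FD	196	/* default stated in this proposal */

/*
 * Return the starting fd for a given STDIO_BADFD value (NULL if the
 * variable is unset).  Anything that is not an integer in 3..255
 * falls back to the default.
 */
static int
low_fd_from_env(const char *value)
{
	char *end;
	long n;

	if (value == NULL || *value == '\0')
		return (DEFAULT_LOW_FD);
	errno = 0;
	n = strtol(value, &end, 10);
	if (errno != 0 || *end != '\0' || n < 3 || n > 255)
		return (DEFAULT_LOW_FD);
	return ((int)n);
}
```

	    A caller would pass getenv("STDIO_BADFD") as the argument.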

	6.  A second programming interface,
	    enable_extended_FILE_stdio(int fd, int action), is added to
	    allow 3rd parties and end users access to the interface
	    used in the preloadable library (#5 above).  The fd
	    argument is the starting point to begin searching for an
	    available fd to be marked as the unallocatable file
	    descriptor.  Its value must be between 3 and 255
	    inclusive.  If an available fd cannot be found an error is
	    returned.  A value of -1 for fd will use a built-in scheme
	    for searching for an available fd.  The action argument is
	    the signal to be sent to the application when an error in
	    the usage of this mechanism or an improper access to
	    FILE->_file is detected.  A value of -1 yields the default
	    SIGABRT signal.  A value of 0 disables the sending of a
	    signal.
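
	    The special argument values in item 6 can be summarized in
	    a short sketch.  The helper names are hypothetical and
	    this is not the library's code; it only restates the
	    argument rules given above.

```c
#include <signal.h>
#include <stdbool.h>

/* fd must be -1 (use the built-in search) or lie in 3..255. */
static bool
start_fd_is_valid(int fd)
{
	return (fd == -1 || (fd >= 3 && fd <= 255));
}

/* -1 selects the default SIGABRT; 0 means no signal is sent. */
static int
effective_signal(int action)
{
	if (action == -1)
		return (SIGABRT);
	return (action);
}
```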

	7.  Rename FILE->_file to FILE->__file (additional "_").  This
	    will aid all of Sun and 3rd parties in cleaning up source
	    code that directly accesses _file.  This is needed to
	    ensure the safety of using the extended FILE mechanism.

	    This change will not affect running code in any way; it
	    only generates errors at compile time.  References to _file
	    in the remainder of this proposal refer to code that exists
	    in applications that has not yet been converted to use
	    public interfaces.

	The four programming interfaces described here provide similar
	features, but are intended for different audiences.

	The fcntl(low_fd, F_BADFD, action) function should never be
	called directly by applications.  It is only intended to be
	used by enable_extended_FILE_stdio() which performs the
	additional "paperwork" needed to fully enable this feature.

	The "F" mode flag to fdopen(), fopen(), and popen() is intended
	for use in applications and libraries in cases where it is
	known that the FILE pointer returned will never be passed to
	any code that may use the FILE pointer to directly access the
	_file field.  This flag allows large file descriptors to be
	used in cases where it is known that the FILE pointer will only
	be used in safe ways.  Fewer checks for invalid file
	descriptors are performed since all uses of the "F" mode flag
	can be assumed to have been inspected.  The "F" flag must never
	be used in any case where the FILE pointer will be given to
	uninspected code that might directly access _file through that
	FILE pointer.  Passing a FILE pointer opened using the "F" flag
	to uninspected code could lead to data corruption.  The "F"
	flag's primary intention is to allow the Solaris libraries to
	open configuration files with stdio functions when the file
	descriptors below 256 have been depleted.
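
	The intended pattern for the "F" flag is a routine that opens,
	uses, and closes the stream itself, so the FILE pointer never
	reaches uninspected code.  The sketch below is illustrative
	only: the function and macro names are hypothetical, and the
	"rF" mode is selected only on Solaris, on the assumption that
	libc there implements this proposal.

```c
#include <stdio.h>
#include <string.h>

#if defined(__sun)
#define	PRIVATE_READ_MODE	"rF"	/* assumes libc with this proposal */
#else
#define	PRIVATE_READ_MODE	"r"	/* "F" is not portable elsewhere */
#endif

/*
 * Read the first line of `path` into buf, without the newline.
 * Returns 0 on success, -1 on failure.  The FILE pointer is opened,
 * used, and closed here; it is never handed to any other code.
 */
static int
read_first_line(const char *path, char *buf, size_t len)
{
	FILE *fp;
	int rc = -1;

	if (len == 0)
		return (-1);
	if ((fp = fopen(path, PRIVATE_READ_MODE)) == NULL)
		return (-1);
	if (fgets(buf, (int)len, fp) != NULL) {
		buf[strcspn(buf, "\n")] = '\0';	/* strip the newline */
		rc = 0;
	}
	(void) fclose(fp);
	return (rc);
}
```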

	The enable_extended_FILE_stdio() function and the
	/usr/lib/extendedFILE.so.1 preloadable library can be used by
	any application that needs to use extended file descriptors
	with STDIO routines.  They fully enable the extended file
	descriptors for STDIO features.  New applications and
	applications that can be easily modified are expected to use
	the enable_extended_FILE_stdio() function; the preloadable
	library provides the same features when an existing application
	can't be modified to call enable_extended_FILE_stdio().  As
	long as the application checks for error returns and doesn't
	change the file descriptor being used by an STDIO stream by
	directly changing _file, using extended file descriptors will
	not increase the likelihood of data corruption.  This is
	intended for 3rd party use.

	There is one theoretical case where this fails:

		(void) close(FILE->_file);
		FILE->_file = myfd;

	Assume FILE->_file is filled with the marked badfd file
	descriptor because we are using extended FILEs.  The close()
	function will fail, but since this usage ignores the return
	status, the application proceeds to perform low-level I/O on
	FILE->_file while calls to STDIO functions continue to use
	the original, extended file descriptor.  If the application
	continues using STDIO functions after changing FILE->_file,
	silent data corruption could occur because the application
	thinks it has changed fd's via the above assignment but the
	actual STDIO fd is stored in the auxiliary location.  The
	chances for corruption are even higher if myfd has a value
	greater than 255 and is truncated by the assignment to 8-bit
	_file.
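
	The divergence can be made concrete with a small model.  The
	structure below is a hypothetical stand-in for an extended
	FILE (it is not the real FILE layout): stdio keeps using the
	auxiliary descriptor, while the application sees only the
	8-bit field it has overwritten.

```c
struct model_stream {
	unsigned char	file;		/* stands in for FILE->_file */
	unsigned	extendedfd:1;	/* stands in for __extendedfd */
	int		aux_fd;		/* stands in for the auxiliary fd */
};

/* The descriptor stdio itself would use for I/O on this stream. */
static int
stdio_fd(const struct model_stream *s)
{
	return (s->extendedfd ? s->aux_fd : s->file);
}

/* What the application sees after "FILE->_file = myfd". */
static int
app_view_after_assign(struct model_stream *s, int myfd)
{
	s->file = (unsigned char)myfd;	/* truncated to 8 bits */
	return (s->file);
}
```

	For a stream whose real descriptor is 300, assigning 7 leaves
	stdio on 300 while the application uses 7; assigning 300
	leaves the application with 44.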

	If the application does use _file directly (including using the
	fileno() macro that was provided in <stdio.h> in Solaris 2.0
	through 2.7), it should not use this feature.


	PERFORMANCE
	===========

	The default case (extended FILE's not enabled) will maintain
	current performance.

	With extended FILE's enabled, access to fd's <= 255 will also
	maintain current performance.

	fd's >= 256 will take a performance hit due to
	storage/retrieval of the file descriptor in an auxiliary
	location.

	fopen() with fds < 256 is as fast as before.
	fopen() is 5-10% slower with fd's >= 256 than with fd's < 256.
	Only 32-bit applications are affected.

	There is a small penalty in the remainder of the stdio code
	where file descriptors are used, but since stdio tries hard to
	avoid using file descriptors, and any use of a file descriptor
	involves a much more expensive system call, this disappears
	in the noise.


	DOCUMENTATION
	=============

	The fopen(3C), fdopen(3C), and popen(3C) man pages will be
	adjusted to this new behavior.

	A new man page for extendedFILE.so.1 will be created to explain
	in great detail how to use the feature, and will present
	various scenarios.

	A new man page for enable_extended_FILE_stdio() will be created
	to explain the programmatic interface.


	FAILING CASE
	============

	We have added code to detect the theoretical failing case
	described above.

	If this failing case is detected then a simple message will be
	printed to stderr and abort() will be called.


	Exported Interfaces
	===================

	Interface               			Stability
	---------               			---------
	FILE->__extendedfd				project private
	FILE->__extendedfd_nocheck			project private

	"F" flag to fopen()				evolving
	"F" flag in fdopen()				evolving
	"F" flag in popen()				evolving

	F_BADFD fcntl() command				project private
	enable_extended_FILE_stdio()			evolving

	/usr/lib/extendedFILE.so.1			stable
	STDIO_BADFD env. variable			stable
	STDIO_BADFD_SIGNAL env. variable		stable
	STDIO_BADFD_SIGNAL_MESSAGE env. variable	stable

6. Resources and Schedule
    6.4. Steering Committee requested information
	6.4.1. Consolidation C-team Name: ON
    6.5. ARC review type: FastTrack