4/29/2010
ienup.sung@oracle.com

                        Iconv Localization Guide
                        ------------------------

1. Introduction

This memo describes how to create and supply iconv shared objects that
can be called and used by iconv_open(3C), iconv(3C), iconv_close(3C),
iconvctl(3C), and iconvstr(3C), and thus also by iconv(1) and anything
else that uses iconv code conversion.

This memo also describes which iconv shared objects should be developed
and delivered to support code conversions between the so-called "char"
and "wchar_t" codeset names and any other codesets.

For geniconvtbl binary tables, refer to the geniconvtbl(1) and
geniconvtbl(4) man pages.


2. Exported methods from the iconv shared object

Any qualified iconv shared object must have the following three
functions as entry points:

        iconv_t _icv_open(const char *strp);
        size_t _icv_iconv(iconv_t cd, const char **inbuf,
            size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
        void _icv_close(iconv_t cd);

Additionally, any qualified iconv shared object that supports the
features specified in [7] must have the following three functions as
additional entry points:

        iconv_t _icv_open_attr(int flag, void *reserved);
        int _icv_iconvctl(iconv_t cd, int request, void *arg);
        size_t _icv_iconvstr(char *inarray, size_t *inlen, char *outarray,
            size_t *outlen, int flag);

The following subsections describe each of them individually.

2.1. Some internal details

iconv_open() locates the iconv shared object using the fromcode and
tocode input arguments and retrieves the addresses of the first five
internal functions mentioned above. (It does not retrieve
_icv_iconvstr(), which is used only by iconvstr(3C).) The first three,
i.e., _icv_open(), _icv_iconv(), and _icv_close(), are the minimum
required for a valid iconv shared object. iconv_open() also tries to
get the addresses of the other two functions, _icv_open_attr() and
_icv_iconvctl(), from the iconv shared object.
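As a rough illustration of the required-versus-optional lookup described
above, here is a minimal sketch in C. It is not the libc implementation.
The names icv_fn, icv_lookup, icv_methods, icv_bind, icv_symbol, and
icv_table_lookup are hypothetical, and the lookup callback only stands in
for dlsym(3C) so that the sketch can run without a real shared object.

```c
#include <stddef.h>
#include <string.h>

/* Generic function pointer; cast each entry to its real type before use. */
typedef void (*icv_fn)(void);

/* Hypothetical lookup callback standing in for dlsym(3C). */
typedef icv_fn (*icv_lookup)(void *handle, const char *name);

/* The five entry points that iconv_open() asks for, per this section. */
struct icv_methods {
        icv_fn open;            /* _icv_open(): required */
        icv_fn conv;            /* _icv_iconv(): required */
        icv_fn close;           /* _icv_close(): required */
        icv_fn open_attr;       /* _icv_open_attr(): optional */
        icv_fn ctl;             /* _icv_iconvctl(): optional */
};

/*
 * Fill *out from the object behind handle.  Returns 0 if the three
 * required entry points are present and -1 otherwise.  A missing
 * optional entry point is left as NULL and is not an error.
 */
int
icv_bind(icv_lookup lookup, void *handle, struct icv_methods *out)
{
        out->open = lookup(handle, "_icv_open");
        out->conv = lookup(handle, "_icv_iconv");
        out->close = lookup(handle, "_icv_close");
        out->open_attr = lookup(handle, "_icv_open_attr");
        out->ctl = lookup(handle, "_icv_iconvctl");

        if (out->open == NULL || out->conv == NULL || out->close == NULL)
                return (-1);
        return (0);
}

/* Demo stand-in for a loaded object: a NULL-terminated symbol table. */
struct icv_symbol {
        const char *name;
        icv_fn addr;
};

icv_fn
icv_table_lookup(void *handle, const char *name)
{
        const struct icv_symbol *sym;

        for (sym = handle; sym->name != NULL; sym++) {
                if (strcmp(sym->name, name) == 0)
                        return (sym->addr);
        }
        return (NULL);
}
```

With a table that holds only the three required names, icv_bind()
returns 0 and leaves open_attr and ctl as NULL, which corresponds to a
shared object that does not support the features of [7].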
iconvstr(), when called, locates the iconv shared object using the
fromcode and tocode input arguments and retrieves the address of the
remaining function, _icv_iconvstr(), for immediate use.

The lack of the latter three functions means only that the iconv shared
object does not support iconv_open() with code conversion behavior
modification requests, iconvctl(), and iconvstr() as specified in [7].

2.2. _icv_open()

Once the addresses of the three or five functions are obtained,
iconv_open() calls _icv_open(), expecting a conversion descriptor as the
return value. It has the following function prototype:

        iconv_t _icv_open(const char *strp);

Within _icv_open(), you are responsible for the following:

- Create and initialize an internal data structure as needed and return
  its address, cast to (iconv_t), as the return value of the function.
  The internal data structure keeps the code conversion state and any
  other information needed for the code conversion.

- Do any other bookkeeping and initialization work needed before the
  return.

- In case of an error, set errno as specified in the iconv_open(3C) man
  page and return (iconv_t)-1.

- Many existing iconv shared objects have an _icv_open() with a return
  type of void * and no input argument. This is okay at this point,
  since C is not a strongly typed programming language and dlsym(3C)
  does not enforce function prototypes.

2.3. _icv_iconv()

It has the following function prototype:

        size_t _icv_iconv(iconv_t cd, const char **inbuf,
            size_t *inbytesleft, char **outbuf, size_t *outbytesleft);

When iconv() is called, it calls _icv_iconv() with the pointer to the
internal data structure that was returned from _icv_open() or
_icv_open_attr() as the first (leftmost) input argument, "iconv_t cd".
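The descriptor hand-off described so far can be sketched as follows.
This is only an illustration under stated assumptions, not an existing
module: icv_state_t is a hypothetical state structure, the "conversion"
is a plain byte-for-byte copy, the strp argument is ignored because this
memo does not say what it carries, and a matching _icv_close() is
included only so that the sketch does not leak its state.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical per-descriptor state; real modules keep shift state etc. */
typedef struct {
        size_t converted;       /* bytes copied since open or last reset */
} icv_state_t;

iconv_t
_icv_open(const char *strp)
{
        icv_state_t *st;

        (void) strp;            /* assumption: not needed by this sketch */
        st = calloc(1, sizeof (*st));
        if (st == NULL) {
                errno = ENOMEM;
                return ((iconv_t)-1);
        }
        return ((iconv_t)st);
}

size_t
_icv_iconv(iconv_t cd, const char **inbuf, size_t *inbytesleft,
    char **outbuf, size_t *outbytesleft)
{
        icv_state_t *st = (icv_state_t *)cd;

        if (cd == NULL || cd == (iconv_t)-1) {
                errno = EBADF;
                return ((size_t)-1);
        }
        /* A NULL input buffer requests a reset to the initial state. */
        if (inbuf == NULL || *inbuf == NULL) {
                st->converted = 0;
                return (0);
        }
        while (*inbytesleft > 0) {
                if (*outbytesleft == 0) {
                        errno = E2BIG;  /* output buffer is full */
                        return ((size_t)-1);
                }
                *(*outbuf)++ = *(*inbuf)++;
                (*inbytesleft)--;
                (*outbytesleft)--;
                st->converted++;
        }
        return (0);     /* no non-identical conversions were made */
}

void
_icv_close(iconv_t cd)
{
        if (cd != NULL && cd != (iconv_t)-1)
                free(cd);
}
```

A real module would replace the copy loop with its own conversion logic
and would follow iconv(3C) for the EILSEQ and EINVAL cases, which this
sketch does not cover.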
All other arguments to iconv() are simply passed to _icv_iconv() and
returned once _icv_iconv() completes.

Since iconv() is merely a wrapper around _icv_iconv(), refer to the
iconv(3C) man page for what should be done within _icv_iconv(),
including output arguments, the function return value, and possible
errno settings.

Refer also to iconv_open(3C) and iconvctl(3C) of [7]. If you choose to
support them, prepare your _icv_iconv() so that it also handles the code
conversion behavior modification requests described in [7], which arrive
via iconv_open() with _icv_open_attr() or via iconvctl() with
_icv_iconvctl().

Many existing iconv shared objects have an _icv_iconv() whose first
(leftmost) argument is a pointer to a data structure type of their own
rather than iconv_t. This is okay for the same reason as for
_icv_open().

2.4. _icv_close()

It has the following function prototype:

        void _icv_close(iconv_t cd);

When iconv_close() is called, it checks whether the conversion
descriptor cd is NULL and, if so, immediately returns -1 with errno set
to EBADF. Otherwise, it calls _icv_close() with the pointer to the
internal data structure that was returned from _icv_open() as
"iconv_t cd".

Within _icv_close(), you should check whether the pointer to the
internal data structure passed as the input argument cd is valid. If
so, free all allocated system resources associated with it, such as
memory blocks, undo any other initialization as needed, and return. If
not, there is not much to do at this point but return without any
further operation.

Many existing iconv shared objects have an _icv_close() whose input
argument is a pointer to a data structure type of their own rather than
iconv_t. This is okay, as mentioned before.

See iconv_close(3C) for more information on the general responsibilities
of the function.

2.5. _icv_open_attr()

It has the following function prototype:

        iconv_t _icv_open_attr(int flag, void *reserved);

Once the addresses of the three or five functions are obtained, and
either or both of the fromcode and tocode arguments carry one or more
valid code conversion behavior modification requests, for instance,
"//ILLEGAL_DISCARD" as described in iconv_open(3C) of [7], iconv_open()
calls _icv_open_attr() instead of _icv_open(). It passes an integer
input argument, flag, and expects a conversion descriptor as the return
value. The second input argument, reserved, is not used at this point
and is kept for a possible future extension.

Within _icv_open_attr(), you are responsible for the following:

- Create and initialize an internal data structure as needed and return
  its address, cast to (iconv_t), as the return value of the function.
  The internal data structure keeps the code conversion state and any
  other information needed for the code conversion.

- The flag input argument is the bitwise-inclusive OR of the following
  values:

        ICONV_CONV_ILLEGAL_DISCARD
        ICONV_CONV_ILLEGAL_REPLACE_HEX
        ICONV_CONV_ILLEGAL_RESTORE_HEX
        ICONV_CONV_NON_IDENTICAL_DISCARD
        ICONV_CONV_NON_IDENTICAL_REPLACE_HEX
        ICONV_CONV_NON_IDENTICAL_RESTORE_HEX
        ICONV_CONV_NON_IDENTICAL_TRANSLITERATE

  Check these values and set and maintain the corresponding code
  conversion behavior in the conversion descriptor of your iconv code
  conversion, as described in iconv_open(3C) of [7].

  Since iconv_open() screens out all possible conflicting requests as
  specified in iconv_open(3C), you do not need to filter the bit values
  of the flag argument again yourself. However, do expect that there
  could be more than one request, for instance:

        (ICONV_CONV_ILLEGAL_DISCARD | ICONV_CONV_ILLEGAL_RESTORE_HEX |
            ICONV_CONV_NON_IDENTICAL_DISCARD)

  The ICONV_CONV_* macros are defined in header file.
- Do any other bookkeeping and initialization work needed before the
  return.

- In case of an error, set errno as specified in the iconv_open(3C) man
  page and return (iconv_t)-1.

2.6. _icv_iconvctl()

It has the following function prototype:

        int _icv_iconvctl(iconv_t cd, int request, void *arg);

When iconvctl(3C) is called, it checks whether request has any
conflicting code conversion behavior modification requests, corrects
them if necessary as specified in iconvctl(3C) of [7], and then calls
_icv_iconvctl() with the pointer to the internal data structure that was
returned from _icv_open() or _icv_open_attr() as "iconv_t cd".

All other arguments are simply passed to _icv_iconvctl() and returned
once _icv_iconvctl() completes. Refer to the iconvctl(3C) man page for
what should be done within the function, including output arguments, the
function return value, and possible errno settings.

If the code conversion behavior has changed, any subsequent calls to
_icv_iconv() should conform to the new code conversion behavior.

2.7. _icv_iconvstr()

It has the following function prototype:

        size_t _icv_iconvstr(char *inarray, size_t *inlen, char *outarray,
            size_t *outlen, int flag);

When iconvstr() is called, it locates the corresponding iconv shared
object, loads it, and gets the address of _icv_iconvstr() in a manner
similar to what iconv_open() does. Once a valid address for
_icv_iconvstr() is obtained, it calls the function, passing its own
arguments through and returning the results once the execution is
completed.

Refer to iconvstr(3C) of [7] for more detail on what should be done
within the _icv_iconvstr() function. Basically, it is more or less like
doing _icv_open(), _icv_iconv(), and _icv_close() in a row without
conversion descriptor maintenance. (It treats the arguments slightly
differently and has somewhat different code conversion behavior
modifications, to be compatible with kiconvstr(9F). [5])


3. Naming rules and install locations for iconv shared objects

The iconv shared object must conform to the following naming convention
to be recognized and properly loaded into memory:

        fromcode%tocode.so

where fromcode is the name of the source codeset and tocode is the name
of the target codeset. The names can then be used with iconv_open(3C)
and iconvstr(3C) as the fromcode and tocode input arguments to open the
code conversion.

For each codeset name, if needed, you can add aliases to:

        /usr/lib/iconv/alias

where the first column is the alias name and the second column is the
canonical name that you used in the shared object file name. Since
iconv_open(3C) internally uses strcasecmp(), you do not need to worry
about the case of the aliases. For example, suppose you supplied the
following line in the /usr/lib/iconv/alias file:

        Foo     Bar

Then foo, FOO, foO, and so on will also match Foo and resolve to the
canonical name, Bar. The canonical name, however, must match the name
you used in the shared object file name, byte by byte. The alias
matching uses a linear search from top to bottom.

To be usable, a 32-bit iconv shared object must be placed under the
following location:

        /usr/lib/iconv/

and a 64-bit iconv shared object must be placed under a location based
on the target instruction set architecture, for instance, one of the
following:

        /usr/lib/iconv/amd64/
        /usr/lib/iconv/sparcv9/

The iconv shared objects should have 0555 as the octal permission mode,
root as the owner, and bin as the group.


4. Necessary iconv shared objects for the support of char and wchar_t

When iconv_open() encounters "" (i.e., an empty string) or "char" as an
input argument, it interprets the name as the codeset name of the
current locale, as returned by an nl_langinfo(CODESET) call. As an
example, if the current locale is en_US.ISO8859-1 and
iconv_open("", "UTF-8") is called, iconv_open() will internally look for
an iconv code conversion from UTF-8 to ISO8859-1.
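The naming rule, the alias lookup, and the "char" handling described so
far can be sketched together in C as below. This is an illustration
only, not the libc code. The helper names icv_alias, icv_canonical,
icv_resolve_char, and icv_object_path are hypothetical, the locale
codeset is passed in as a parameter instead of being read with
nl_langinfo(CODESET), and both the exact-case match on "char" and the
fallback to the given name when no alias matches are assumptions.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* One line of a hypothetical in-memory copy of /usr/lib/iconv/alias. */
struct icv_alias {
        const char *alias;      /* first column */
        const char *canonical;  /* second column */
};

/* Linear, case-insensitive search from top to bottom; first match wins. */
const char *
icv_canonical(const struct icv_alias *tab, size_t n, const char *name)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (strcasecmp(tab[i].alias, name) == 0)
                        return (tab[i].canonical);
        }
        return (name);  /* assumption: no alias means use the name as is */
}

/* Map "" or "char" to the codeset name of the current locale. */
const char *
icv_resolve_char(const char *name, const char *locale_codeset)
{
        if (name[0] == '\0' || strcmp(name, "char") == 0)
                return (locale_codeset);
        return (name);
}

/* Build "<dir>fromcode%tocode.so"; returns 0, or -1 if buf is too small. */
int
icv_object_path(char *buf, size_t len, const char *dir,
    const char *fromcode, const char *tocode)
{
        int n = snprintf(buf, len, "%s%s%%%s.so", dir, fromcode, tocode);

        return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

Under these assumptions, iconv_open("", "UTF-8") in the en_US.ISO8859-1
locale resolves the empty tocode to ISO8859-1, and the sketch builds the
32-bit path /usr/lib/iconv/UTF-8%ISO8859-1.so. The file name actually
delivered on a system depends on the canonical names used there.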
Similarly, when iconv_open() encounters "wchar_t" as an input argument,
it internally interprets the name as "wchar_t" followed by a dash (i.e.,
'-' or 0x2d) followed by the codeset name of the current locale, as
returned by an nl_langinfo(CODESET) call. That is, it interprets
"wchar_t" as if it were:

        wchar_t-<codeset>

where <codeset> is the codeset name of the current locale. As an
example, if iconv_open("UTF-8", "wchar_t") is called and the current
locale is en_US.UTF-8, iconv_open() will internally look for an iconv
code conversion from wchar_t-UTF-8 to UTF-8.

It is strongly recommended that, at a minimum, each localization center
develop and deliver iconv code conversions between UTF-8 and the codeset
names of all supported locales (if not done yet), and also between UTF-8
and wchar_t-<codeset> for all supported locales of their responsibility.

Additionally, it is recommended that each localization center add more
iconv code conversions between any other popular regional codeset names
and both the codeset names and wchar_t-<codeset> for all supported
locales of their responsibility.


5. References

[1] PSARC/1993/153 iconv/iconv_open/iconv_close
[2] PSARC/1999/292 Addition of geniconvtbl(1)
[3] PSARC/2001/072 GNU gettext support
    (For /usr/lib/iconv/alias and the alias support mechanism in iconv.)
[4] PSARC/2001/659 Non-identical character conversion support in
    geniconvtbl(1)
[5] PSARC/2007/173 kiconv
[6] PSARC/2009/561 Pass-through iconv code conversion
[7] PSARC/2010/XXX Libc iconv enhancement
[8] The latest man pages for iconv(1), iconv_open(3C), iconv(3C),
    iconv_close(3C), iconvctl(3C), and iconvstr(3C).
[9] Man pages for nl_langinfo(3C) and setlocale(3C).

END_OF_MEMO.