#ident "@(#)ldterm-csi.txt 1.23 99/07/14 SMI"

                                                3/20/1998
                                                is@eng.sun.com

Codeset independent ldterm(7M) and stty(1)
------------------------------------------

1. Overview

The current ldterm(7M) and stty(1) implementations are EUC codeset
specific and also have EUC representation dependencies. This memo
provides a design for a codeset independent (CSI) ldterm(7M) module and
stty(1) command.

The design in this memo can be summarized as follows:

- Provide three sets of internal methods in ldterm(7M) to handle
  various codesets:

    (1) EUC codeset methods (default)
    (2) PC environment originated codeset methods
    (3) UTF-8 codeset methods

  The default method set that ldterm(7M) will start and run with is
  (1) above.

- Two new I_STR ioctl message commands specifically for ldterm(7M)
  will be added:

    CSDATA_SET
        This call takes a pointer to an ldterm_cs_data_t data
        structure and uses it to set the line discipline definition
        and, if needed, to switch the internal methods and data to
        those for the current locale's codeset. When this message is
        received, ldterm(7M) will check the validity of the message.
        If the message contains valid data, ldterm(7M) will store the
        data and switch the internal methods if necessary to support
        the requested codeset.

    CSDATA_GET
        This call takes a pointer to an ldterm_cs_data_t structure
        and returns in it the codeset data currently in use by the
        ldterm(7M) module.

  The two new ioctl commands will be added to a header file (see
  section 2.4). EUC_WSET and EUC_WGET will not be removed.

- Any locale that wants to use the (internal) non-EUC codeset methods
  of ldterm will provide a /usr/lib/locale//LC_CTYPE/ldterm.dat file.
  The ldterm.dat file will contain information such as the codeset
  type and the codeset and/or character widths of the current locale.

- Upon user request of the 'defeucw' mode setting, the stty(1) command
  will check whether the current locale has the
  /usr/lib/locale//LC_CTYPE/ldterm.dat file.
  If the locale has the file, the stty(1) command will read the file
  and pass its content down to the ldterm(7M) module by using the
  CSDATA_SET ioctl message command. The current behavior on EUC will
  not be changed.

  For a 'write settings' request, i.e., stty -a, we will not change the
  current implementation. Thus, if stty(1) is executed with the -a
  option and the current locale is not an EUC one, it will print out:

    eucw ?, scrw ?

  If the current locale is an EUC one, stty(1) will print out the byte
  widths and screen column widths for the EUC codesets. For instance,
  for any single byte locale we support, stty -a will give the
  following result:

    eucw 1:1:0:0, scrw 1:1:0:0

2. Detailed design

2.1. ldterm.dat file and header files

The ldterm.dat file will have the file structure shown in
Figure 2.1.1:

                                        File (byte) offset
        +--------------------------+    0
        | ldterm data header info  |
        +--------------------------+    3
        | pad byte                 |
        +--------------------------+    4
        | ldterm eucpc data 1      |
        +--------------------------+    8
        | ldterm eucpc data 2      |
        +--------------------------+    12
        |                          |
        :           ...            :
        |                          |
        +--------------------------+    40
        | ldterm eucpc data 10     |
        +--------------------------+    44

    Note: The size of the ldterm data header info is 3 bytes and it
          consists of the 'version', 'codeset_type', and 'csinfo_num'
          data fields of ldterm_cs_data_t.
          The size of each ldterm eucpc data is 4 bytes and it contains
          width information for one sub-codeset. There will be a total
          of 10 ldterm eucpc data entries.
          The size of the file is 44 bytes including the pad byte.

        Figure 2.1.1: ldterm.dat file structure

Definitions and data types that can be used to create and process the
content of the 'ldterm.dat' file are shown below. They will be added to
the ldterm header file, :

    /* The next version will be the current LDTERM_DATA_VERSION + 1. */
    #define LDTERM_DATA_VERSION         1

    /* Supported codeset types.
     */
    #define LDTERM_CS_TYPE_MIN          1
    #define LDTERM_CS_TYPE_EUC          1
    #define LDTERM_CS_TYPE_PCCS         2
    #define LDTERM_CS_TYPE_UTF8         3
    #define LDTERM_CS_TYPE_MAX          3

    /*
     * The maximum number of bytes in a character of the codeset that
     * can be handled by ldterm.
     */
    #define LDTERM_CS_MAX_BYTE_LENGTH   8

    /*
     * The maximum number of sub-codesets in a codeset that can be
     * handled by ldterm.
     */
    #define LDTERM_CS_MAX_CODESETS      10

    /*
     * The following data structure is to provide codeset-specific
     * information for EUC and PC originated codesets (ldterm_eucpc_data_t)
     */
    struct _ldterm_eucpc_data {
        unsigned char byte_length;
        unsigned char screen_width;
        unsigned char msb_start;
        unsigned char msb_end;
    };
    typedef struct _ldterm_eucpc_data ldterm_eucpc_data_t;

    /* ldterm codeset data information. */
    struct _ldterm_cs_data {
        unsigned char version;          /* version: 1 ~ 255 */
        unsigned char codeset_type;
        unsigned char csinfo_num;       /* the # of codesets */
        unsigned char pad;
        ldterm_eucpc_data_t eucpc_data[LDTERM_CS_MAX_CODESETS];
                                        /* width data */
    };
    typedef struct _ldterm_cs_data ldterm_cs_data_t;

    /*
     * The following data structure is to handle Unicode codeset.
     * Representing a single Unicode plane requires 16384
     * 'ldterm_unicode_data_cell_t' elements.
     */
    struct _ldterm_unicode_data_cell {
        unsigned char u0:2;
        unsigned char u1:2;
        unsigned char u2:2;
        unsigned char u3:2;
    };
    typedef struct _ldterm_unicode_data_cell ldterm_unicode_data_cell_t;

Possible values for each data field of "ldterm_cs_data_t" are as
follows:

- version: LDTERM_DATA_VERSION

- codeset_type:
    LDTERM_CS_TYPE_EUC if the current locale is an EUC one.
    LDTERM_CS_TYPE_PCCS if the current locale uses a PC originated
    codeset.
    LDTERM_CS_TYPE_UTF8 if the current locale is a UTF-8 one.

- csinfo_num:
    If the codeset_type is LDTERM_CS_TYPE_EUC, it will have the number
    of supplementary codesets supported in the locale. Valid values
    are 0 to 3. The number excludes the ASCII primary codeset.
    If the codeset_type is LDTERM_CS_TYPE_PCCS, it will have the number
    of distinguishable sub-codesets in the codeset of the locale.
    Valid values are 1 to 10. The number excludes the ASCII
    sub-codeset.
    If the codeset_type is LDTERM_CS_TYPE_UTF8, the data field has no
    meaning.

Possible values for each data field of "ldterm_eucpc_data_t" are as
follows:

- If the codeset_type is LDTERM_CS_TYPE_EUC, there will be three
  "ldterm_eucpc_data_t" elements:

    -- The first element's:
        byte_length: The byte length of EUC supplementary codeset one.
        screen_width: The screen column width of EUC supplementary
            codeset one.

    -- The second element's:
        byte_length: The byte length of EUC supplementary codeset two.
        screen_width: The screen column width of EUC supplementary
            codeset two.

    -- The third element's:
        byte_length: The byte length of EUC supplementary codeset three.
        screen_width: The screen column width of EUC supplementary
            codeset three.

- If the codeset_type is LDTERM_CS_TYPE_PCCS, each distinguishable
  sub-codeset is represented by one "ldterm_eucpc_data_t" element:

    -- The i'th element's:
        byte_length: The byte length of sub-codeset i.
        screen_width: The screen column width of sub-codeset i.
        msb_start: The start of the range for the first (leading) byte
            of sub-codeset i.
        msb_end: The end of the range for the first (leading) byte of
            sub-codeset i.

- If the codeset_type is LDTERM_CS_TYPE_UTF8, no width data will be
  sent down to ldterm(7M), since ldterm(7M) will have its own Unicode
  width table as shown below. (Unicode width information is quite
  irregular and it is practically not possible to categorize it into
  supplementary codesets or sub-codesets as with EUC or PC originated
  codesets, so we have to provide a character-by-character width
  table.)

    #include

    /*
     * The following two tables contain width information for Unicode.
     * Values in the table "ucode" are indexes into the "width_tbl"
     * vector.
     *
     * There are only three different kinds of widths: zero, one, or two.
     * The value -1 means that particular code point is not yet
     * assigned or not a Unicode character, i.e., U+FFFE and U+FFFF.
     */
    static const int width_tbl[4] = { 0, 1, 2, -1 };

    static const ldterm_unicode_data_cell_t ucode[1][16384] = {
      { /* Plane 00 a.k.a. BMP */
    /*           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F */
    /*           ---------------------------------------------- */
    /* U+0000 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+000F */
    /* U+0010 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+001F */
    /* U+0020 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+002F */
    /* U+0030 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+003F */
    /* U+0040 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+004F */
    /* U+0050 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+005F */
    /* U+0060 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+006F */
    /* U+0070 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, /* U+007F */
    /* U+0080 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+008F */
    /* U+0090 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+009F */
    /* U+00A0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+00AF */
    ...
    /* U+FF50 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, /* U+FF5F */
    /* U+FF60 */ 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF6F */
    /* U+FF70 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF7F */
    /* U+FF80 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF8F */
    /* U+FF90 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF9F */
    /* U+FFA0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FFAF */
    /* U+FFB0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, /* U+FFBF */
    /* U+FFC0 */ 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, /* U+FFCF */
    /* U+FFD0 */ 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 3, 3, /* U+FFDF */
    /* U+FFE0 */ 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 1, 1, 1, 1, 1, 3, /* U+FFEF */
    /* U+FFF0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 3  /* U+FFFF */
      },
    };

The above two data structures will be provided in a separate header
file, "sys/uwidth.h", which will contain only a single plane as defined
in Unicode 3.0. The total size of the above two .rodata data structures
is approximately 16KB.

The tables will be accessed from ldterm(7M) by using the following
algorithm:

    plane = get_plane(utf8);
    if (plane > 0) {
        width = 1;
        return;
    }

    rowcol = get_rowcolumn(utf8);
    i = rowcol / 4;
    j = rowcol % 4;

    switch (j) {
    case 0:
        width = width_tbl[ucode[plane][i].u0];
        break;
    case 1:
        width = width_tbl[ucode[plane][i].u1];
        break;
    case 2:
        width = width_tbl[ucode[plane][i].u2];
        break;
    case 3:
        width = width_tbl[ucode[plane][i].u3];
        break;
    }

Our Unicode/UTF-8 locales conform to Unicode 2.1 and we will also
conform to Unicode 3.0 before Solaris 8 FCS.

2.2. ldtermstd_state_t data structure at

The EUC specific data fields will not be removed because we still need
them.
We will, however, add four more data fields, t_csdata, t_csmethods,
t_scratch[], and t_scratch_len, to support different codeset types.
In the listing below, each added or changed line is marked with a
vertical bar ('|') at the beginning of the line:

|   typedef struct _ldterm_cs_methods {
|       int (*ldterm_dispwidth)(uchar_t, void *, int);
|       int (*ldterm_memwidth)(uchar_t, void *);
|   } ldterm_cs_methods_t;

    typedef struct ldterm_mod {
        ...
        /*
|        * The following are for EUC and also other types of codeset
|        * processing.
         */
        uchar_t t_codeset;  /* current code set indicator (read side) */
        uchar_t t_eucleft;  /* bytes left to get in current char (read) */
        uchar_t t_eucign;   /* bytes left to ignore (output post proc) */
        uchar_t t_eucpad;   /* padding ... for eucwioc */
        eucioc_t eucwioc;   /* eucioc structure (have to use bcopy) */
        uchar_t *t_eucp;    /* ptr to parallel array of column widths */
        mblk_t *t_eucp_mp;  /* the m_blk that holds parallel array */
        uchar_t t_maxeuc;   /* the max length in memory bytes of an EUC */
        int t_eucwarn;      /* bad EUC counter */
|
|       /*
|        * The t_csdata and t_csmethods data fields are to support
|        * various non-EUC codesets.
|        */
|       ldterm_cs_data_t t_csdata;
|       ldterm_cs_methods_t t_csmethods;
|       uchar_t t_scratch[LDTERM_CS_MAX_BYTE_LENGTH];
|       uchar_t t_scratch_len;
    } ldtermstd_state_t;

2.3. UNKNOWN_WIDTH macro and typetab[] change at

We will also have the UNKNOWN_WIDTH macro defined in the header file:

    #define EUC_BSWIDTH     254
    #define EUC_NLWIDTH     253
    #define EUC_CRWIDTH     252
|
|   #define UNKNOWN_WIDTH   251
|
    #define EUC_MAXW        4

Details are described in sections 2.6.1 and 2.6.4.

We will put T_SS2 and T_SS3 at typetab[0x8e] and typetab[0x8f] as shown
below:

    static char typetab[256] = {
    /* 000 */   CONTROL, CONTROL, CONTROL, CONTROL,
    /* 004 */   CONTROL, CONTROL, CONTROL, CONTROL,
    ...
    /* 214 */   CONTROL, CONTROL, T_SS2, T_SS3,
    ...
    };

2.4. The /usr/include/sys/csiioctl.h header file will contain the
following contents:

    #ifndef CSI_IOC
    #define CSI_IOC     (('C' | 128) << 8)
    #endif

    #define CSDATA_SET  (CSI_IOC | 1)
    #define CSDATA_GET  (CSI_IOC | 2)

2.5. stty(1)

2.5.1. stty.h

The header will not be changed.

2.5.2. stty.c and sttyparse.c

Two additional global variables for the support of new codeset types
will be added to stty.c:

    static ldterm_cs_data_t cswp;   /* User side codeset width data */
    static ldterm_cs_data_t kcswp;  /* kernel side codeset width data */

After the setlocale() invocation in the main() routine, the stty(1)
command will try to read the /usr/lib/locale//LC_CTYPE/ldterm.dat file.
If there is no such file in the directory, the command will assume the
locale is an EUC locale. If there is an ldterm.dat file, the command
will read the file and save the data in 'cswp'. It will also check
whether the data just read is valid and whether it describes an EUC
codeset. If the data is invalid or the data is for an EUC codeset, it
will nullify 'cswp' so that it falls back to EUC mode.

Since we don't print out codeset width information for non-EUC codeset
locales, there will be no change to the get_ttymode() function.

In sttyparse(), if the user specified "defeucw" on the command line,
the content of 'cswp' will also be copied into 'kcswp'. (Also, if the
current locale's codeset is a multibyte one, it will enable 'cs8' and
disable 'istrip', 'cs7' and 'parenb'.)

In the set_ttymode() function, if the current locale is a non-EUC
codeset locale and the data from ldterm.dat is valid, the function will
send the CSDATA_SET command with 'kcswp' down to ldterm(7M). If the
data from ldterm.dat is invalid or the current locale is an EUC codeset
locale, it will send the EUC_WSET command with 'kwp' downstream.

2.6. ldterm(7M)

2.6.1. Codeset type specific methods

The internal codeset specific methods are as follows:

- EUC codeset methods:

    static int __ldterm_dispwidth_euc(uchar_t c, void *tp, int mode);
    static int __ldterm_memwidth_euc(uchar_t c, void *tp);

- PC environment originated codeset methods:

    static int __ldterm_dispwidth_pccs(uchar_t c, void *tp, int mode);
    static int __ldterm_memwidth_pccs(uchar_t c, void *tp);

- UTF-8 codeset methods:

    static int __ldterm_dispwidth_utf8(uchar_t c, void *tp, int mode);
    static int __ldterm_memwidth_utf8(uchar_t c, void *tp);

  In the case of the UTF-8 codeset, it is impossible to know the
  display width, i.e., the screen column width, of a character by
  simply looking at its first byte, so __ldterm_dispwidth_utf8() will
  always return UNKNOWN_WIDTH. The macro for UNKNOWN_WIDTH will be
  defined at the .

2.6.2. ldtermopen()

The t_csdata will be initialized with C locale (EUC) width info.
The t_csmethods will be initialized with the EUC codeset methods.

2.6.3. ldtermclose()

There is no need to change this function.

2.6.4. ldterm_docanon()

The check on the last character in this function, which determines
whether it is an ASCII character or a part of a multibyte and/or
multi-column character, will be changed into a more generic and codeset
independent one.

We will replace ldterm_euc_erase() and ldterm_tokerase() with more
generic and codeset independent ones:

    static void ldterm_csi_erase(queue_t *, size_t, ldtermstd_state_t *);
    static void ldterm_csi_werase(queue_t *, size_t, ldtermstd_state_t *);

When ldterm_csi_erase() or ldterm_csi_werase() encounters UNKNOWN_WIDTH
during its erase operation and the current codeset type is
LDTERM_CS_TYPE_UTF8, it will compute the width of the corresponding
character by calling the function:

    static uchar_t ldterm_utf8_width(uchar_t *u8char, int length);

The above function will use the algorithm presented in section 2.1 to
figure out the column width. If the UTF-8 bytes given in 'u8char' do
not form a valid character within 'length', it will return 1.
Otherwise, the function will return the correct width of the character.

If the state of ldterm(7M) has TS_MEUC, i.e., if ldterm(7M) is
processing a codeset that is a multibyte one and/or a multi-column
width one, it will use the current codeset specific methods to figure
out the display width (screen column width) and memory width (byte
length) of each character.

Maintenance of t_eucleft, t_eucp, and t_codeset will be codeset
independent.

2.6.5. ldterm_tabcols()

If the function encounters UNKNOWN_WIDTH in the 't_eucp' vector and the
current codeset type is LDTERM_CS_TYPE_UTF8, it will replace the value
of '*t_eucp' with the return value from the ldterm_utf8_width()
function described in section 2.6.4 so that the correct column position
for the tab can be returned.

2.6.6. ldterm_kill()

If the current t_state contains TS_MEUC, the rubout will be done by
using the values in 't_eucp' instead of actually looking into the
character returned from ldterm_unget(). If '*t_eucp' is 1, we will send
the character returned from ldterm_unget() to ldterm_rubout().
Otherwise, we will send ' ' (an ASCII space character) to
ldterm_rubout().

If '*t_eucp' is UNKNOWN_WIDTH and the current codeset type is
LDTERM_CS_TYPE_UTF8, it will replace '*t_eucp' with the return value
from the ldterm_utf8_width() function described in section 2.6.4 so
that correct rubouts can be done for the UTF-8 character.

2.6.7. ldterm_do_ioctl()

- CSDATA_SET:

  If the ioctl command is CSDATA_SET, it will first check the message
  validity by looking at the user-supplied data. If the user-supplied
  data is not valid, it will negative acknowledge the message.

  If the data provided is valid, it will initialize the following data
  fields of the module state with proper values:

    t_maxeuc: the max byte length of the codeset.

    t_state: TS_MEUC will be bitwise OR'ed in if the current codeset's
        screen column width is bigger than 1.
    t_eucp_mp: if 't_maxeuc' is bigger than 1 and/or 't_state' has
        TS_MEUC set, we will allocate a memory block of CANBSIZ to the
        field if it does not have one yet. Otherwise, this data field
        will be freed and/or nullified.

    t_eucp: if 't_maxeuc' is bigger than 1 and/or 't_state' has TS_MEUC
        set, 't_eucp' will point to an address within 't_eucp_mp'.
        Otherwise, this data field will be nullified.

    t_csdata: the newly received codeset header and width tables will
        be placed here.

    t_csmethods: if the new codeset type is different from the previous
        one, we will also switch the methods to match the new codeset
        type.

  For each command we receive, we will acknowledge or negative
  acknowledge it depending on the validity of the message received, and
  also pass it downstream.

  If the user-supplied data is for an EUC codeset, the function will
  also save the byte length and display width information of the EUC
  codeset to tp->eucwioc.

- CSDATA_GET:

  If the ioctl command is CSDATA_GET, it will copy the necessary data
  from 't_csdata' to the user-supplied memory block and then
  acknowledge the message.

- EUC_WSET:

  If the ioctl command is EUC_WSET, the EUC codeset byte lengths and
  display widths will also be saved in tp->t_csdata, with
  tp->t_csdata.codeset_type set to LDTERM_CS_TYPE_EUC.

2.6.8. ldterm_codeset()

The function will be modified such that if the given leading byte of a
character is not ASCII and the current codeset type is not EUC, it will
simply return 1. Otherwise, it will find out the EUC codeset number.

2.6.9. ldterm_output_msg()

This function will be modified so that it is codeset independent.

Changes:

- In the ICANON/XCASE processing, before we apply the input buffer byte
  to the omaptab[] vector and also before we do the OLCUC processing,
  we will make sure that the current byte is an ASCII character. If the
  current byte is a byte of a multibyte character, neither of the two
  processings will be applied.
  Whether the function applies the processings mentioned above will be
  decided beforehand by looking at the first byte of a character, as in
  the following if expression:

    if ((tp->t_state & TS_MEUC) && tp->t_eucign == 0 && NOTASCII(c))
        tp->t_eucign = tp->t_csmethods.ldterm_memwidth(c, (void *)tp);

  'tp->t_eucign' will be 0 if and only if the current character, 'c',
  is an ASCII character (byte). Otherwise, it will have the byte length
  of a multibyte character. At this point, we also add in the column
  positions that the multibyte character will take up if the current
  codeset type is not LDTERM_CS_TYPE_UTF8. We also save the byte length
  of the multibyte character in 'tp->t_scratch_len'.

- When deciding the character type, if the current codeset type is not
  LDTERM_CS_TYPE_EUC, the function will use 'typetab[]' only when the
  character is a single byte ASCII character. In all other cases, the
  character will have 'ORDINARY' as its type.

- For the ORDINARY, T_SS2, and T_SS3 character byte types, we will do a
  special column position calculation for the UTF-8 codeset, since up
  to this point in the function we have not added in the column
  positions needed for the multibyte UTF-8 character. (This is mainly
  because we cannot decide the display width until we have all the
  necessary bytes of a multibyte character in hand.)

3. Impact to any other components

In the on28-gate, only the crash(1M) command makes use of the
'ldtermstd_state_t' data fields, especially the t_euc* data fields, to
print out the content of the system memory image. We will not change
the crash(1M) command.

One debug info file needs to be changed to incorporate the addition of
the new data fields to 'ldtermstd_state_t':

    ldtermstd_state.dbg
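
As an illustration of the ldterm.dat layout in section 2.1, the
following is a minimal sketch of how a user-level reader such as stty(1)
might check a 44-byte file image before sending it down with
CSDATA_SET. The helper name csdata_parse() is hypothetical, and the
validity rules are assumptions drawn from the field descriptions in
section 2.1 (value ranges for codeset_type and csinfo_num, byte_length
within LDTERM_CS_MAX_BYTE_LENGTH, msb_start not above msb_end). The
memo itself does not spell out the exact checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Constants and types as listed in section 2.1 of this memo. */
#define LDTERM_DATA_VERSION         1
#define LDTERM_CS_TYPE_MIN          1
#define LDTERM_CS_TYPE_EUC          1
#define LDTERM_CS_TYPE_PCCS         2
#define LDTERM_CS_TYPE_UTF8         3
#define LDTERM_CS_TYPE_MAX          3
#define LDTERM_CS_MAX_BYTE_LENGTH   8
#define LDTERM_CS_MAX_CODESETS      10

typedef struct _ldterm_eucpc_data {
    unsigned char byte_length;
    unsigned char screen_width;
    unsigned char msb_start;
    unsigned char msb_end;
} ldterm_eucpc_data_t;

typedef struct _ldterm_cs_data {
    unsigned char version;
    unsigned char codeset_type;
    unsigned char csinfo_num;
    unsigned char pad;
    ldterm_eucpc_data_t eucpc_data[LDTERM_CS_MAX_CODESETS];
} ldterm_cs_data_t;

/* All members are unsigned char, so the struct matches the file image. */
_Static_assert(sizeof (ldterm_cs_data_t) == 44, "ldterm.dat is 44 bytes");

/*
 * Hypothetical helper (not part of the memo): copy a file image into
 * *out if it looks valid. Returns 0 on success and -1 otherwise.
 * The checks below are assumed rules, not the memo's definition.
 */
int
csdata_parse(const unsigned char *buf, size_t len, ldterm_cs_data_t *out)
{
    ldterm_cs_data_t d;
    unsigned int i;

    assert(buf != NULL && out != NULL);
    if (len != sizeof (d))
        return (-1);
    memcpy(&d, buf, sizeof (d));

    if (d.version != LDTERM_DATA_VERSION)
        return (-1);
    if (d.codeset_type < LDTERM_CS_TYPE_MIN ||
        d.codeset_type > LDTERM_CS_TYPE_MAX)
        return (-1);

    if (d.codeset_type == LDTERM_CS_TYPE_UTF8) {
        *out = d;           /* csinfo_num has no meaning for UTF-8 */
        return (0);
    }
    if (d.codeset_type == LDTERM_CS_TYPE_EUC) {
        if (d.csinfo_num > 3)
            return (-1);
    } else if (d.csinfo_num < 1 || d.csinfo_num > LDTERM_CS_MAX_CODESETS) {
        return (-1);
    }

    for (i = 0; i < d.csinfo_num; i++) {
        const ldterm_eucpc_data_t *e = &d.eucpc_data[i];

        if (e->byte_length < 1 ||
            e->byte_length > LDTERM_CS_MAX_BYTE_LENGTH)
            return (-1);
        if (d.codeset_type == LDTERM_CS_TYPE_PCCS &&
            e->msb_start > e->msb_end)
            return (-1);
    }
    *out = d;
    return (0);
}
```

On failure the caller would fall back to EUC mode, as described for
'cswp' in section 2.5.2.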
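
The width lookup of section 2.1 packs four 2-bit indexes into each
table cell. The sketch below shows the same idea with explicit shifts
and masks on a byte array instead of C bit-fields (bit-field layout is
implementation-defined, so this is a swapped-in technique, not the
memo's declaration). The names sample, sample_init(), and
sample_width() are hypothetical, and the table only covers U+0000
through U+00AF, filled with the values quoted in the excerpt of the
"ucode" table above.

```c
#include <assert.h>
#include <stdint.h>

/* U+0000..U+00AF: only the rows quoted in section 2.1 (illustration). */
#define SAMPLE_CODEPOINTS   0xB0u

static const int width_tbl[4] = { 0, 1, 2, -1 };

/* One byte holds four 2-bit indexes into width_tbl[]. */
static uint8_t sample[SAMPLE_CODEPOINTS / 4];

/* Store a 2-bit index for code point cp. */
static void
cell_set(uint8_t *tbl, unsigned int cp, unsigned int idx)
{
    unsigned int shift = 2u * (cp % 4u);

    tbl[cp / 4u] = (uint8_t)((tbl[cp / 4u] & ~(3u << shift)) |
        ((idx & 3u) << shift));
}

/* Fetch the width for cp: i = cp / 4 picks the cell, j = cp % 4 the slot. */
static int
cell_width(const uint8_t *tbl, unsigned int cp)
{
    return (width_tbl[(tbl[cp / 4u] >> (2u * (cp % 4u))) & 3u]);
}

/* Fill the sample table: U+0020..U+007E and U+00A0..U+00AF are width 1. */
void
sample_init(void)
{
    unsigned int cp;

    for (cp = 0; cp < SAMPLE_CODEPOINTS; cp++) {
        int one_col = (cp >= 0x20u && cp <= 0x7eu) || cp >= 0xa0u;

        cell_set(sample, cp, one_col ? 1u : 3u);
    }
}

/*
 * Width of a Unicode code point. Planes above 0 are treated as width 1,
 * as in the memo's algorithm. Code points in plane 0 that fall outside
 * this small sample table also return -1, which is a limitation of the
 * sketch only.
 */
int
sample_width(uint32_t cp)
{
    if (cp > 0xFFFFu)
        return (1);
    if (cp >= SAMPLE_CODEPOINTS)
        return (-1);
    return (cell_width(sample, (unsigned int)cp));
}
```

After sample_init(), sample_width(0x41) yields 1 and sample_width(0x7f)
yields -1, matching the U+0040 and U+0070 rows of the table in
section 2.1.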