#ident "@(#)ldterm-csi.txt 1.9 99/04/28 SMI"

                                        3/20/1998
                                        is@eng.sun.com

Codeset independent ldterm(7M) and stty(1)
------------------------------------------

1. Overview

Current ldterm(7M) and stty(1) implementations are EUC codeset specific and
also have EUC representation dependencies. This memo provides a design for
a codeset independent (CSI) ldterm(7M) module and stty(1) command.

The design in this memo can be summarized as follows:

- Provide three sets of internal methods in ldterm(7M) to handle various
  codesets:

        (1) EUC codeset methods (default)
        (2) PC environment originated codeset methods
        (3) UTF-8 codeset methods

  The default method set that ldterm(7M) starts and runs with is (1)
  above.

- Three new I_STR ioctl message commands specifically for ldterm(7M) will
  be added:

        CSINFO_SET
                This call takes a pointer to a ldterm_cs_header_t data
                structure and uses it to set the line discipline
                definition and, possibly, to switch the internal methods
                and data to those for the current locale's codeset.

                When this message arrives, ldterm(7M) checks its validity
                and, if the message contains correct info, keeps the
                header info.

        CSDATA_SET
                Depending on the header info previously set by the
                'CSINFO_SET' command, in particular the 'csinfo_num' data
                field of the header, ldterm(7M) accepts one or more
                'CSDATA_SET' messages and accumulates them internally.
                When it receives the final 'CSDATA_SET', ldterm(7M)
                validates the messages received so far, installs the
                received data as the data to be used in ldterm(7M), and
                then switches to the corresponding methods. If the
                validation fails, ldterm(7M) negatively acknowledges the
                message.

                It is the responsibility of stty(1) to always send
                exactly 'csinfo_num' 'CSDATA_SET' ioctl messages after
                the 'CSINFO_SET'.
        CSINFO_GET
                This call takes a pointer to a ldterm_cs_header_t
                structure and returns in it the codeset header info
                currently in use by the ldterm(7M) module.

  The three new ioctl commands will be added to a header file (see
  section 2.4). The EUC_WSET and EUC_WGET commands will not be removed.

- Any locale that wants to use the (internal) non-EUC codeset methods of
  ldterm will provide a /usr/lib/locale//LC_CTYPE/ldterm.dat file.
  The ldterm.dat file will contain info such as the codeset type and the
  codeset and/or character widths of the current locale.

- Upon a user request for the 'defeucw' mode setting, the stty(1) command
  will check whether the current locale has the
  /usr/lib/locale//LC_CTYPE/ldterm.dat file. If it does, stty(1) will read
  in the file and pass its content down to the ldterm(7M) module by using
  the CSINFO_SET and CSDATA_SET ioctl message commands.

  The current behavior on EUC will not be changed.

  For a 'write settings' request, i.e., stty -a, we will not change the
  current implementation. Thus, if stty(1) is executed with the -a option
  and the current locale is not an EUC one, it will print out:

        eucw ?, scrw ?

  If the current locale is an EUC one, stty(1) will print out the byte
  widths and screen column widths for the EUC codesets. For instance, for
  any single byte locale we support, stty -a will give the following
  result:

        eucw 1:1:0:0, scrw 1:1:0:0

2. Detail design

2.1. ldterm.dat file and header files

The ldterm.dat file will have one of the two file structures shown in
Figure 2.1.1 and Figure 2.1.2:

                                        File (byte) offset
        +--------------------------+    0
        | ldterm data header info  |
        +--------------------------+    3
        | ldterm eucpc data 1      |
        +--------------------------+    25
        | ldterm eucpc data 2      |
        +--------------------------+    47
        |                          |
        :           ...            :
        |                          |
        +--------------------------+    201
        | ldterm eucpc data 10     |
        +--------------------------+    223

        Note: The size of the ldterm data header info is 3 bytes. It
              consists of the 'version', 'codeset_type', and 'csinfo_num'
              data fields of ldterm_cs_header_t. The 'csinfo_num' data
              field of the ldterm data header is 10 in the above example
              and the size of each ldterm eucpc data is 22 bytes.

  Figure 2.1.1: ldterm.dat file structure example for EUC or PC originated
                codeset

                                        File (byte) offset
        +--------------------------+    0
        | ldterm data header info  |
        +--------------------------+    3
        | Unicode data for Plane00 |
        +--------------------------+    16387
        | Unicode data for Plane01 |
        +--------------------------+    32771
        |                          |
        :           ...            :
        |                          |
        +--------------------------+    262147
        | Unicode data for Plane16 |
        +--------------------------+    278531

        Note: The size of the ldterm data header info is 3 bytes. It
              consists of the 'version', 'codeset_type', and 'csinfo_num'
              data fields of ldterm_cs_header_t. The 'csinfo_num' data
              field of the ldterm data header is 16 planes in the above
              example and the size of each Unicode plane is 16384 bytes.

  Figure 2.1.2: ldterm.dat file structure example for Unicode/UTF-8 codeset

The definitions and data types that can be used to create and process the
content of the 'ldterm.dat' file are listed here. They will be added to
the ldterm header file:

        /* Next version will be the current LDTERM_DATA_VERSION + 1. */
        #define LDTERM_DATA_VERSION             1

        /* Supported codeset types. */
        #define LDTERM_CS_TYPE_MIN              1
        #define LDTERM_CS_TYPE_EUC              1
        #define LDTERM_CS_TYPE_PCCS             2
        #define LDTERM_CS_TYPE_UTF8             3
        #define LDTERM_CS_TYPE_MAX              3

        /* ldterm codeset header information. */
        struct _ldterm_cs_header {
                unsigned char version;          /* version: 1 ~ 255 */
                unsigned char codeset_type;
                unsigned char csinfo_num;       /* the number of */
                                                /* codesets/planes */
        };

        typedef struct _ldterm_cs_header ldterm_cs_header_t;

        /*
         * The maximum number of bytes in a character of the codeset that
         * can be handled by ldterm.
         */
        #define LDTERM_CS_MAX_BYTE_LENGTH       10

        /*
         * The following two data structures provide codeset-specific
         * information for EUC and PC originated codesets
         * (ldterm_eucpc_data_t) and for the Unicode/UTF-8 codeset
         * (ldterm_unicode_data_cell_t).
         */
        struct _ldterm_eucpc_data {
                unsigned char byte_length;
                unsigned char screen_width;
                unsigned char byte_range_start[LDTERM_CS_MAX_BYTE_LENGTH];
                unsigned char byte_range_end[LDTERM_CS_MAX_BYTE_LENGTH];
        };

        typedef struct _ldterm_eucpc_data ldterm_eucpc_data_t;

        /*
         * Representing a single Unicode plane requires 16384
         * 'ldterm_unicode_data_cell_t' elements (65536 code points,
         * four per cell).
         */
        struct _ldterm_unicode_data_cell {
                unsigned char u0:2;
                unsigned char u1:2;
                unsigned char u2:2;
                unsigned char u3:2;
        };

        typedef struct _ldterm_unicode_data_cell ldterm_unicode_data_cell_t;

Possible values for each data field of "ldterm_cs_header_t" are as follows:

- version:
        LDTERM_DATA_VERSION

- codeset_type:
        LDTERM_CS_TYPE_EUC if the current locale is an EUC one.
        LDTERM_CS_TYPE_PCCS if the current locale is a PC originated
        codeset one.
        LDTERM_CS_TYPE_UTF8 if the current locale is a UTF-8 one.

- csinfo_num:
        If the codeset_type is LDTERM_CS_TYPE_EUC, it holds the number of
        supplementary codesets supported in the locale. Valid values are
        0 to 3.

        If the codeset_type is LDTERM_CS_TYPE_PCCS, it holds the number
        of distinguishable sub-codesets in the codeset of the locale.
        Valid values are 1 to 10. The number excludes the ASCII
        sub-codeset.

        If the codeset_type is LDTERM_CS_TYPE_UTF8, it holds the number
        of planes that this Unicode locale supports. Valid values are
        1 to 16.

Possible values for each data field of "ldterm_eucpc_data_t" are as
follows:

- If the 'codeset_type' is LDTERM_CS_TYPE_EUC, there will be three
  "ldterm_eucpc_data_t" elements:

  -- The first element's:
        byte_length: The byte length of EUC supplementary codeset one.
        screen_width: The screen column width of EUC supplementary
                      codeset one.
  -- The second element's:
        byte_length: The byte length of EUC supplementary codeset two.
        screen_width: The screen column width of EUC supplementary
                      codeset two.

  -- The third element's:
        byte_length: The byte length of EUC supplementary codeset three.
        screen_width: The screen column width of EUC supplementary
                      codeset three.

- If the codeset_type is LDTERM_CS_TYPE_PCCS, there is one
  "ldterm_eucpc_data_t" element for each distinguishable sub-codeset:

  -- The i'th element's:
        byte_length: The byte length of sub-codeset i.
        screen_width: The screen column width of sub-codeset i.
        byte_range_start: The start of the range for each byte of
                          sub-codeset i, including the start byte.
        byte_range_end: The end of the range for each byte of
                        sub-codeset i, including the end byte.

- If the codeset_type is LDTERM_CS_TYPE_UTF8, the width info cannot be
  grouped this way: Unicode widths vary too much to categorize into
  supplementary codesets or sub-codesets as with EUC or PC originated
  codesets. We will therefore provide a character-by-character width
  table like the following example source:

        #include

        /*
         * The following two tables contain width information for Unicode.
         * Values in the table "ucode" are indices into the "width_tbl"
         * vector.
         *
         * There are only three different kinds of widths: zero, one, or
         * two. The value -1 means that the particular code point is not
         * yet assigned or is not a Unicode character, i.e., U+FFFE and
         * U+FFFF.
         */
        static const int width_tbl[4] = { 0, 1, 2, -1 };

        ldterm_unicode_data_cell_t ucode[16][16384] = {
            { /* Plane 00 a.k.a. BMP */
            /*           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  */
            /*           ----------------------------------------------  */
            /* U+0000 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+000F */
            /* U+0010 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+001F */
            /* U+0020 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+002F */
            /* U+0030 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+003F */
            /* U+0040 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+004F */
            /* U+0050 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+005F */
            /* U+0060 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+006F */
            /* U+0070 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, /* U+007F */
            /* U+0080 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+008F */
            /* U+0090 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* U+009F */
            /* U+00A0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+00AF */
            ...
            /* U+FF50 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, /* U+FF5F */
            /* U+FF60 */ 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF6F */
            /* U+FF70 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF7F */
            /* U+FF80 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF8F */
            /* U+FF90 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FF9F */
            /* U+FFA0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+FFAF */
            /* U+FFB0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, /* U+FFBF */
            /* U+FFC0 */ 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, /* U+FFCF */
            /* U+FFD0 */ 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 3, 3, /* U+FFDF */
            /* U+FFE0 */ 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 1, 1, 1, 1, 1, 3, /* U+FFEF */
            /* U+FFF0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 3  /* U+FFFF */
            },
            { 0 }, /* Plane 1 */
            { 0 }, /* Plane 2 */
            ...
            { 0 }, /* Plane 16 */
        };

  The table above will be accessed from ldterm(7M) by using the following
  algorithm:

        plane = get_plane(utf8);
        rowcol = get_rowcolumn(utf8);
        i = rowcol / 4;
        j = rowcol % 4;
        switch (j) {
        case 0:
                width = width_tbl[ucode[plane][i].u0];
                break;
        case 1:
                width = width_tbl[ucode[plane][i].u1];
                break;
        case 2:
                width = width_tbl[ucode[plane][i].u2];
                break;
        case 3:
                width = width_tbl[ucode[plane][i].u3];
                break;
        }

  Our Unicode/UTF-8 locales conform to Unicode 2.1 and will also conform
  to Unicode 3.0 when it is available.

2.2. ldtermstd_state_t data structure at

The EUC specific data fields will not be removed because we still need
them. We will, however, add three more data fields, t_csheaderp,
t_csdatap, and t_csmethodsp, to support different codeset types. Each
added or changed line is marked with a vertical bar ('|') at the
beginning of the line:

|       typedef struct _ldterm_cs_methods {
|               void (*ldterm_dispwidth)(uchar_t, void *, int);
|               void (*ldterm_memwidth)(uchar_t, void *);
|               char (*ldterm_non_ascii_trailing_char)(ldtermstd_state_t *);
|               char (*ldterm_output_msg)(queue_t *, mblk_t *, mblk_t **,
|                       ldtermstd_state_t *, size_t, int);
|       } ldterm_cs_methods_t;

        typedef struct ldterm_mod {
                ...
                /*
|                * The following are for EUC and also other types of codeset
|                * processing.
                 */
                uchar_t t_codeset;  /* current code set indicator (read side) */
                uchar_t t_eucleft;  /* bytes left to get in current char (read) */
                uchar_t t_eucign;   /* bytes left to ignore (output post proc) */
                uchar_t t_eucpad;   /* padding ... for eucwioc */
                eucioc_t eucwioc;   /* eucioc structure (have to use bcopy) */
                uchar_t *t_eucp;    /* ptr to parallel array of column widths */
                mblk_t *t_eucp_mp;  /* the m_blk that holds parallel array */
                uchar_t t_maxeuc;   /* the max length in memory bytes of an EUC */
                int t_eucwarn;      /* bad EUC counter */
|
|               /*
|                * The t_csheaderp, t_csdatap, and t_csmethodsp data fields
|                * provide support for various codesets.
|                */
|               ldterm_cs_header_t *t_csheaderp;
|               void *t_csdatap;
|               ldterm_cs_methods_t *t_csmethodsp;
        } ldtermstd_state_t;

2.3. UNKNOWN_WIDTH macro and typetab[] change at

We will also have the UNKNOWN_WIDTH macro defined in the header file:

        #define EUC_BSWIDTH     254
        #define EUC_NLWIDTH     253
        #define EUC_CRWIDTH     252
|
|       #define UNKNOWN_WIDTH   251
|
        #define EUC_MAXW        4

Details are described in sections 2.6.1 and 2.6.4.

We will put T_SS2 and T_SS3 at typetab[0x8e] and typetab[0x8f] as follows:

        static char typetab[256] = {
        /* 000 */       CONTROL, CONTROL, CONTROL, CONTROL,
        /* 004 */       CONTROL, CONTROL, CONTROL, CONTROL,
        ...
        /* 214 */       CONTROL, CONTROL, T_SS2, T_SS3,
        ...
        };

2.4.

The /usr/include/sys/csiioctl.h header file will contain the following:

        #ifndef CSI_IOC
        #define CSI_IOC         (('C' | 128) << 8)
        #endif
        #define CSINFO_SET      (CSI_IOC | 1)
        #define CSDATA_SET      (CSI_IOC | 2)
        #define CSINFO_GET      (CSI_IOC | 3)

2.5. stty(1)

2.5.1. stty.h

The header file will have one more bit flag:

        #define CSI_CSW         32

2.5.2. stty.c and sttyparse.c

Two additional global variables supporting the new codeset types will be
added to stty.c:

        static ldterm_cs_header_t *cswp;  /* User side codeset width header pointer */
        static ldterm_cs_header_t kcswp;  /* kernel side codeset width header */

After the setlocale() invocation in the main() routine, the stty(1)
command will try to read the /usr/lib/locale//LC_CTYPE/ldterm.dat file.
If there is no such file in the directory, the command will assume the
locale is an EUC locale. If there is an ldterm.dat file, the command will
mmap() the header portion of the file to 'cswp'.

The get_ttymode() function will retrieve the current width header info
from the ldterm(7M) module into 'kcswp' by using an ioctl() with the
CSINFO_GET command. If the ioctl() returns zero and the current codeset
is not the EUC codeset, the CSI_CSW bit flag will also be set to indicate
the current terminal mode.
If the current codeset is an EUC one, we will call ioctl() with EUC_WGET
to get the EUC codeset width information. The function will set the EUCW
bit flag if the ioctl() call with the EUC_WGET command is acknowledged.

In sttyparse(), if the user specified "defeucw" on the command line and
the current locale is a non-EUC one, the content of 'cswp' will be saved
into 'kcswp'. (Also, if the current locale's codeset is a multibyte one,
it will enable 'cs8' and disable 'istrip', 'cs7' and 'parenb'.)

The set_ttymode() function will check the 'CSI_CSW' bit flag from the
terminal mode and, if it is set, send the CSINFO_SET command with 'kcswp'
down to ldterm(7M). After the initial CSINFO_SET command is acknowledged,
the function will mmap() the remainder of the ldterm.dat file and then
send down as many CSDATA_SET commands to ldterm(7M) as the current
codeset requires. If the CSI_CSW bit flag is not set but the EUCW bit
flag is, it will send the EUC_WSET command downstream.

2.6. ldterm(7M)

2.6.1. Codeset type specific methods

The internal codeset specific methods are as follows:

- EUC codeset methods:

        static void __ldterm_dispwidth_euc(uchar_t c, void *w, int mode);
        static void __ldterm_memwidth_euc(uchar_t c, void *w);
        static char __ldterm_non_ascii_trailing_char_euc(ldtermstd_state_t *tp);
        static char __ldterm_output_msg_euc(queue_t *q, mblk_t *imp,
                mblk_t **omp, ldtermstd_state_t *tp, size_t bsize,
                int echoing);

- PC environment originated codeset methods:

        static void __ldterm_dispwidth_pccs(uchar_t c, void *w, int mode);
        static void __ldterm_memwidth_pccs(uchar_t c, void *w);
        static char __ldterm_non_ascii_trailing_char_pccs(ldtermstd_state_t *tp);
        static char __ldterm_output_msg_pccs(queue_t *q, mblk_t *imp,
                mblk_t **omp, ldtermstd_state_t *tp, size_t bsize,
                int echoing);

- UTF-8 codeset methods:

        static void __ldterm_dispwidth_utf8(uchar_t c, void *w, int mode);
        static void __ldterm_memwidth_utf8(uchar_t c, void *w);
        static char __ldterm_non_ascii_trailing_char_utf8(ldtermstd_state_t *tp);
        static char __ldterm_output_msg_utf8(queue_t *q, mblk_t *imp,
                mblk_t **omp, ldtermstd_state_t *tp, size_t bsize,
                int echoing);

  For the UTF-8 codeset, it is impossible to know the display width,
  i.e., the screen column width, of a character by simply looking at its
  first byte, so the UTF-8 display width method will always return
  UNKNOWN_WIDTH. The UNKNOWN_WIDTH macro is defined as described in
  section 2.3.

Except for the __ldterm_output_msg_euc() method, the
__ldterm_output_msg_*() methods will not use typetab[], notrantab[],

2.6.2. ldtermopen()

It will allocate memory blocks for t_csheaderp, t_csdatap, and
t_csmethodsp of the ldterm module's state pointer 'tp'. The t_csheaderp
and t_csdatap will be initialized with the C locale (EUC) width info. The
t_csmethodsp will be initialized with the EUC codeset methods. The memory
allocations and initializations will be done before the qprocson()
invocation.

2.6.3. ldtermclose()

It will free the memory blocks assigned to the t_csheaderp, t_csdatap,
and t_csmethodsp data fields if they are not NULL pointers. The memory
deallocation will be done after the qprocsoff() invocation.

2.6.4. ldterm_docanon()

To figure out the type of the character at the end of the canonical
buffer, we will use the current codeset specific method of ldterm(7M),
'tp->ldterm_non_ascii_trailing_char()'.

We will replace ldterm_euc_erase() and ldterm_tokerase() with more
generic, codeset independent ones:

        static void ldterm_csi_erase(queue_t *, size_t, ldtermstd_state_t *);
        static void ldterm_csi_werase(queue_t *, size_t, ldtermstd_state_t *);

When ldterm_csi_erase() or ldterm_csi_werase() encounters UNKNOWN_WIDTH
during its erase operation and the current codeset type is
LDTERM_CS_TYPE_UTF8, it will compute the width of the corresponding
character by calling the function:

        static int ldterm_utf8_width(uchar_t *u8char, int length);

This function will use the algorithm presented in section 2.1 to figure
out the column width. If the UTF-8 bytes given in 'u8char' do not form a
valid character within 'length', it will return -1. Otherwise, it will
return the width of the character.

If the state of ldterm(7M) has TS_MEUC, i.e., if ldterm(7M) is processing
a codeset that is a multibyte one and/or a multi-column width one, it
will use the current codeset specific methods to figure out the display
width (screen column width) and memory width (byte length) of each
character.

Maintenance of t_eucleft, t_eucp, and t_codeset will be codeset
independent.

2.6.5. ldterm_tabcols()

If the function encounters UNKNOWN_WIDTH in the 't_eucp' vector and the
current codeset type is LDTERM_CS_TYPE_UTF8, it will replace the value of
'*t_eucp' with the return value of the ldterm_utf8_width() function
described in section 2.6.4 so that correct column positions for the tab
can be returned.

2.6.6. ldterm_kill()

If the current t_state contains TS_MEUC, the rubout will be done by using
the values in 't_eucp' instead of actually looking into the character
returned from ldterm_unget(). If '*t_eucp' is 1, we will send the
character returned from ldterm_unget() to ldterm_rubout(). Otherwise, we
will send ' ' (an ASCII space character) to ldterm_rubout().

If '*t_eucp' is UNKNOWN_WIDTH and the current codeset type is
LDTERM_CS_TYPE_UTF8, it will replace '*t_eucp' with the return value of
the ldterm_utf8_width() function described in section 2.6.4 so that
correct rubouts can be done for the UTF-8 character.

2.6.7. ldterm_do_ioctl()

- CSINFO_SET:
        If the ioctl command is CSINFO_SET, the function will first check
        the validity of the message by looking at the user-supplied data.
        If the user-supplied data is not valid, it will negatively
        acknowledge the message. If the data is valid, the module will
        save it in a temporary data structure that will later be stored
        in the module's state, 't_csheaderp'.

        After that, the function will wait for CSDATA_SET command(s) from
        stty(1). Once the function has received all necessary codeset
        width data with the CSDATA_SET command(s), it will check the
        validity of the received data and, if the data is correct,
        initialize the following data fields of the module state with
        proper values:

        t_maxeuc: the max byte length of the codeset.

        t_state: TS_MEUC is OR'ed in if the current codeset's screen
                 column width is bigger than 1.

        t_eucp_mp: if 't_maxeuc' is bigger than 1 and/or 't_state' has
                   TS_MEUC set, we will allocate a memory block of
                   CANBSIZ to the field if it does not have one yet.
                   Otherwise, this data field will be freed and/or
                   nullified.

        t_eucp: if 't_maxeuc' is bigger than 1 and/or 't_state' has
                TS_MEUC set, 't_eucp' will point to an address within
                't_eucp_mp'. Otherwise, this data field will be
                nullified.

        t_csheaderp: newly received codeset header information will be
                     placed.
        t_csdatap: newly received codeset width tables will be placed.

        t_csmethodsp: if the new codeset type is different from the
                      previous one, we will also switch the methods to
                      match the new codeset type.

        We will acknowledge or negatively acknowledge each command we
        receive, depending on the validity of the message, and also pass
        it downstream.

- CSINFO_GET:
        If the ioctl command is CSINFO_GET, the function will copy the
        necessary data from 't_csheaderp' to the user-supplied memory
        block and then acknowledge the message.

3. Impact to any other components

In the on998 gate, only the crash(1M) command makes use of the
'ldtermstd_state_t' data fields, especially the t_euc* data fields, to
print out the content of the system memory image. We will not change the
crash(1M) command.

One debug info file needs to be changed to incorporate the new data
fields added to 'ldtermstd_state_t':

        ldtermstd_state.dbg
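
The packed width lookup of section 2.1 and the "return -1 on invalid
input" behavior described for ldterm_utf8_width() in section 2.6.4 can be
sketched in user-level C roughly as follows. This is an illustration
under stated assumptions, not the module's code: every sketch_* name is
hypothetical, the decoder accepts only the 1 to 4 byte UTF-8 forms (the
memo does not specify get_plane() or get_rowcolumn()), and planes are
passed as an array of pointers instead of the memo's two-dimensional
ucode[][] table. The one-byte cell layout relies on the compiler
accepting unsigned char bit-fields, as the memo's own structure does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CELLS_PER_PLANE 16384u	/* 65536 code points, 4 per cell */

/* Same layout idea as the memo's ldterm_unicode_data_cell_t. */
typedef struct sketch_cell {
	unsigned char u0:2;
	unsigned char u1:2;
	unsigned char u2:2;
	unsigned char u3:2;
} sketch_cell_t;

static const int sketch_width_tbl[4] = { 0, 1, 2, -1 };

/* Store a width table index (0..3) for one code point of a plane. */
static void
sketch_cell_set(sketch_cell_t *plane, unsigned rowcol, unsigned idx)
{
	sketch_cell_t *c;

	assert(rowcol <= 0xFFFFu && idx <= 3u);
	c = &plane[rowcol / 4];
	switch (rowcol % 4) {
	case 0:  c->u0 = idx; break;
	case 1:  c->u1 = idx; break;
	case 2:  c->u2 = idx; break;
	default: c->u3 = idx; break;
	}
}

/* Section 2.1 lookup: cell index is rowcol / 4, slot is rowcol % 4. */
static int
sketch_cell_width(const sketch_cell_t *plane, unsigned rowcol)
{
	const sketch_cell_t *c = &plane[rowcol / 4];

	switch (rowcol % 4) {
	case 0:  return (sketch_width_tbl[c->u0]);
	case 1:  return (sketch_width_tbl[c->u1]);
	case 2:  return (sketch_width_tbl[c->u2]);
	default: return (sketch_width_tbl[c->u3]);
	}
}

/*
 * Decode one UTF-8 character from s[0..len-1] into *cp. Returns the
 * number of bytes used, or -1 if the bytes do not form a complete,
 * valid character (assumption: 1 to 4 byte forms only).
 */
static int
sketch_utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	uint32_t c, min;
	int n, i;

	if (len == 0)
		return (-1);
	if (s[0] < 0x80) {
		*cp = s[0];
		return (1);
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return (-1);
	}
	if (len < (size_t)n)
		return (-1);
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return (-1);
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and out-of-range values. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return (-1);
	*cp = c;
	return (n);
}

/* Width of the character in s, or -1 if invalid or not in the tables. */
static int
sketch_utf8_width(sketch_cell_t *const planes[], unsigned nplanes,
    const unsigned char *s, size_t len)
{
	uint32_t cp;
	unsigned plane;

	if (sketch_utf8_decode(s, len, &cp) < 0)
		return (-1);
	plane = cp >> 16;		/* plane number */
	if (plane >= nplanes || planes[plane] == NULL)
		return (-1);
	return (sketch_cell_width(planes[plane], cp & 0xFFFFu));
}
```

A caller would fill one sketch_cell_t array of SKETCH_CELLS_PER_PLANE
elements per supported plane, for example from the per-plane blocks of an
ldterm.dat file, and pass the array of plane pointers to
sketch_utf8_width(). Note that in this sketch -1 covers both malformed
input and code points whose table entry maps to -1.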