L2/02-291

From: Kenneth Whistler
To: unicore@unicode.org
Subject: WG2 report from Dublin
Date: May 31, 2002

UTC participants:

As is my wont, I have gathered together a short report on the doings
at the recent WG2 meeting in Dublin for your delectation. I will be
focussing on items relevant to UTC decisions and work, to indicate in
general how things are progressing for closure on Unicode 4.0, and
what open items will require further UTC decision coming out of the
WG2 meeting.

For a more complete accounting of all that took place, see the list
of resolutions from WG2 meeting 42 (WG2 N2454), and eventually the
minutes from the meeting, when they become available.

--Ken

******************************************************************

1. Amendment 2 to 10646-1

This is the amendment which covers all the additions to the BMP which
will become part of Unicode 4.0. The disposition of comments for this
amendment was completed without too much problem, and the amendment
was progressed for an FPDAM ballot (due to close in October).

A substantial number of additions were made to the set of characters
in the PDAM document. Most of these were already covered by UTC
decisions. Among these were:

   23CF EJECT SYMBOL (cf. WG2 N2415; resolution M42.1)
   12 Indic characters (cf. WG2 N2425; resolution M42.2)

In addition, most of the requests for additional characters that were
explicitly covered in the U.S. ballot comments were accommodated.
These include:

   4 Limbu character additions
   16 Arabic character additions
   24FF NEGATIVE CIRCLED DIGIT ZERO
   43 additional Khmer characters
   2 Bactrian Greek character additions

The following problematical characters were *removed* from the
amendment, again accommodating U.S. ballot comments:

   267E DO NOT LITTER SIGN
   267F RECYCLABLE PACKAGING
   31D0..31D4 vulgar fractions
   2618 CIRCLED UPWARDS INDICATION
   2700 LEFTWARDS SCISSORS

A few of the other symbols were moved to new coding locations, and/or
had their names changed, to accommodate U.S. or other comments.

At the request of the Cambodian national body, two of the Khmer
characters that had been approved by the UTC were removed from the
list to be added to the amendment: the two voicing marks for Krung.
Those were deferred for further study, since there may be others
needed for other minority languages, and their exact status was
unclear.

Other additions involved slight changes from UTC decisions:

3 double diacritics were added, but were encoded in the regular
combining characters block (035D..035F), rather than in the combining
half marks block (FE24..FE26). (cf. WG2 N2457; resolution M42.8)

The 134 characters for the Uralic Phonetic Alphabet were added, but
the U.S. request to put the main group of these symbols on Plane 1
was rejected in favor of a new block on the BMP at 1D00..1D7F.
(cf. WG2 N2442; resolution M42.5)

6 additional Syriac characters that the UTC had considered but not
actually formally accepted yet were added. (cf. WG2 N2422; resolution
M42.3)

4 Arabic characters for Urdu were added, based on discussions in
detail with Dr. Hussain from Pakistan, who attended the meeting.
These were from the list considered by the UTC but not yet formally
accepted. (cf. WG2 N2413-4; resolution M42.6) The four characters
accepted were:

   060F ARABIC SIGN MISRA
   0603 ARABIC SIGN SAFHA
   0659 ARABIC SMALL HIGH TAH
   FDFD ARABIC LIGATURE BISMILLAH ARRAHMAN ARRAHIM

122 Han compatibility characters were added at FA70..FAE9 for
compatibility with the DPRK standard. This proved to be the most
problematical addition, as the DPRK delegation originally brought in
a set of 160 characters, requesting some to be added to the BMP and
others to Plane 2 in Amendment 1 to Part 2. There were strong
objections about apparent duplicates in the set, and 38 unifiable
candidates were identified during the course of the meeting, bringing
the set to be encoded down to 122.
This remaining set should be very carefully checked during the FPDAM
balloting, before the UTC signs off on it, as mappings for these
characters will be unchangeable once they are standardized.

Various other technical and editorial changes were also made to the
amendment. Notable among these were:

The two problematical variation sequences <2278, FE00> and
<2279, FE00> were removed from the list of variation sequences.

Khmer character issues were resolved with agreement to add
explanatory remarks about the 6 characters the UTC is considering for
deprecation; and some other glyph changes were agreed upon.

Details on the character additions and other changes will be showing
up in the Unicode pipeline document on the website soon. And I will
provide a detailed consent document, as usual, for the next UTC/L2
meeting, indicating those additions and changes which will need
further UTC and L2 decisions to bring the agreed-upon repertoires
back into synch.

******************************************************************

2. Amendment 1 to 10646-2

This is the amendment which covers all the additions to the
supplementary planes. These will also become part of Unicode 4.0. The
disposition of comments for this amendment was completed fairly
easily, and the amendment was also progressed for an FPDAM ballot
(due to close in October).

Notable character additions already approved by the UTC include:

Addition of 87 monogram, digram and tetragram Tai Xuan Jing symbols
at 1D300..1D356. (cf. WG2 N2416; resolution M42.16)

Addition of 4 Deseret characters. (cf. WG2 N2473, N2474; resolution
M42.18)

The annotation and glyph changes for Linear B were approved.
(cf. WG2 N2455; resolution M42.17)

A defect report on U+2113 SCRIPT SMALL L was resolved by agreement to
add a disunified version for the mathematical script small l at
U+1D4C1. This will need to be reviewed by the UTC.

A ballot comment on the variation selectors in Plane 14 was resolved
by moving the set left one column (from E0110..E01FF to
E0100..E01EF). The UTC will need to reconfirm this.

And finally, a number of corrections to errors in the source
references and glyphs for Extension B were accepted. These were the
least problematical of the CJK mapping fixes, involving dictionary
mappings, rather than coded character set mappings. A small number of
suggested corrections that may impact coded character set mappings
were deferred, and will need further study to verify their impact.
The UTC experts should very carefully consider the implications, as
these impact normative mappings. (cf. WG2 N2448; resolution M42.20)

******************************************************************

3. Coptic

WG2 formally went on record as agreeing to the supplementation of
Coptic, and invited a full proposal for the addition of Coptic
characters. This puts the UTC and WG2 in synch as agreeing that
Coptic should not be unified with Greek and paves the way for a
complete encoding of the Coptic script in the future.

******************************************************************

4. Khmer

There were extensive meetings between the 4 guests from the Cambodian
national body and interested participants from the U.S., Japanese,
and Irish NBs, and Maurice Bauhahn, to discuss various technical and
procedural issues regarding the Khmer encoding. These discussions
were remarkably cordial and productive, and the ad hoc group was able
to make a number of unanimous, consensual recommendations to the WG2
plenary.

As noted above regarding Amendment 2, 43 character additions were
agreed upon, including the set of 32 lunar date symbols, a set of 10
divination numbers, and the KHMER SIGN ATTHACAN. The two Krung
voicing marks were postponed for future study.

Consensus was reached regarding the 6 characters that the UTC agreed
at its last meeting to consider for deprecation.
The ad hoc recommends formal deprecation of U+17A3 and U+17D3, and
explanatory notes strongly discouraging the use of U+17B4, U+17B5,
U+17A4, and U+17D8. Of course WG2 doesn't do deprecation of
characters, but the WG2 plenary agreed to add appropriate usage notes
in Annex P for these 6 characters. The UTC will need to take the
action regarding the formal deprecations, and the editorial committee
can handle the notes regarding discouragement of use.

Consensus was also reached to change the glyphs for U+17B4 and
U+17B5, to properly reflect the limited use to which they could be
put.

The Cambodian national body expressed their disappointment with the
virama model, but agreed to accept the model in 10646 and Unicode as
a matter of "force majeure", in the interest of clearing away the
uncertainties and making it possible to proceed as quickly as
possible to real implementations of the Khmer script.

The Unicode editors invited the Cambodian national body to make
further contributions which could clarify and extend the discussion
about the Khmer script for the Unicode Standard. And we expect to
find ways to publicly acknowledge the mistakes made in the original
encoding and the contribution of the Cambodian participants in
correcting them.

Everyone agreed to use this experience as an object lesson, and to
redouble the efforts to ensure that full and complete feedback from
stakeholders in other scripts to be encoded is obtained before
standardization of them is completed.

******************************************************************

5. Korean Issues

There were a couple of contentious issues that came up regarding
Korean or Han characters for Korean use. The addition of 122
compatibility Han characters for mapping to the DPRK standard was
discussed above. The UTC needs to carefully review the proposed
mappings and additions.
In addition, there was an unresolved naming issue for four Korean
symbols, which revolved around inconsistent usage of transliteration
schemes for Korean syllables.

These two problems were not completely resolved to all participants'
satisfaction, and accounted for 3 abstentions and the single negative
vote on the relevant resolutions. It is notable that all other
technical resolutions (a total of 26, altogether) were passed
unanimously at this meeting.

******************************************************************

6. Re-editing 10646 into a single standard

WG2 went on record supporting the editor's proposal to coalesce the
current two-part standard (10646-1 BMP, 10646-2 Supplementary Planes)
into a single standard. This will consolidate all the standard's text
into a more manageable and comprehensible chunk (without some clauses
being lost in the middle of the charts), and will markedly simplify
the maintenance of the standard. When completed, we will no longer
have to deal with confusing situations (Amendment 2 to Part 1 and
Amendment 1 to Part 2, etc.) since an amendment will be able to
account for additions to the BMP and the supplementary planes at the
same time, and the editor won't have to track textual
cross-references and consistency fixes between the parts. The fact
that 10646 is now published on CD-ROM eliminates the page count
problems for this very large standard.

The contentious issue of multi-column versus single-column
presentation of the CJK charts was sidestepped for the moment. The
current charts will simply be consolidated as is for now, keeping the
multi-column layout for the BMP and the single-column layout for the
SIP, while future plans can be debated further.

The editor is charged with bringing a working draft of the
consolidated text into the next WG2 meeting in December, at which
point a new work item to do the actual consolidation could be issued
by SC2.
It is conceivable that a consolidated version might be available in the same time frame as the publication of Unicode 4.0, or shortly thereafter.
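
******************************************************************

The range sizes and counts quoted in sections 1, 2 and 4 above can be
cross-checked mechanically, which may be useful when reviewing the
FPDAM text. What follows is only a minimal Python sketch under stated
assumptions, not part of the WG2 record: the names range_size,
QUOTED_RANGES and mismatches are made up for illustration, and every
number is copied from this report rather than from the amendment
text itself.

```python
# Illustrative cross-check only. All ranges and counts are copied from
# the report above; the helper and table names are hypothetical.

def range_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1

# name -> (first code point, last code point, count quoted in the report)
QUOTED_RANGES = {
    "double diacritics": (0x035D, 0x035F, 3),
    "DPRK compatibility Han": (0xFA70, 0xFAE9, 122),
    "Tai Xuan Jing symbols": (0x1D300, 0x1D356, 87),
}

def mismatches() -> list:
    """Names whose quoted count disagrees with the size of the range."""
    return [
        name
        for name, (first, last, count) in QUOTED_RANGES.items()
        if range_size(first, last) != count
    ]

# DPRK set: 160 characters brought in, 38 unifiable candidates removed.
DPRK_REMAINING = 160 - 38

# Khmer additions: 32 lunar date symbols, 10 divination numbers,
# and KHMER SIGN ATTHACAN.
KHMER_TOTAL = 32 + 10 + 1

# Variation selectors in Plane 14: moved "left one column", i.e. the
# whole range shifts down by one 16-cell chart column.
OLD_VS = (0xE0110, 0xE01FF)
NEW_VS = (0xE0100, 0xE01EF)
VS_SHIFT = OLD_VS[0] - NEW_VS[0]

if __name__ == "__main__":
    print("range mismatches:", mismatches())
    print("DPRK remaining:", DPRK_REMAINING)
    print("Khmer total:", KHMER_TOTAL)
    print("variation selector shift:", VS_SHIFT)
    print("variation selector counts:",
          range_size(*OLD_VS), range_size(*NEW_VS))
```

Under these assumptions the quoted figures are internally consistent:
the three ranges match their stated counts, the DPRK set comes to 122,
the Khmer additions sum to 43, and the variation selector range keeps
the same size before and after the move.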