L2/02-291
From:     Kenneth Whistler
To:       unicore@unicode.org 
Subject:  WG2 report from Dublin
Date:     May 31, 2002

UTC participants:

As is my wont, I have gathered together a short report on
the doings at the recent WG2 meeting in Dublin for your delectation.

I will be focussing on items relevant to UTC decisions and work,
to indicate in general how things are progressing for closure on
Unicode 4.0, and what open items will require further UTC decision
coming out of the WG2 meeting. For a more complete accounting
of all that took place, see the list of resolutions from WG2 meeting
42 (WG2 N2454), and eventually the minutes from the meeting, when
they become available.

--Ken

******************************************************************

1. Amendment 2 to 10646-1

This is the amendment which covers all the additions to the BMP
which will become part of Unicode 4.0. The disposition of comments
for this amendment was completed without too much problem, and the
amendment was progressed for an FPDAM ballot (due to close in October).

A substantial number of additions were made to the set of characters
in the PDAM document. Most of these were already covered by UTC
decisions. Among these were:

23CF EJECT SYMBOL (cf. WG2 N2415; resolution M42.1)

12 Indic characters (cf. WG2 N2425; resolution M42.2)

In addition, most of the requests for additional characters that were
explicitly covered in the U.S. ballot comments were accommodated.
These include:

4 Limbu character additions
16 Arabic character additions
24FF NEGATIVE CIRCLED DIGIT ZERO
43 additional Khmer characters
2 Bactrian Greek character additions

The following problematical characters were *removed* from the
amendment -- again accommodating U.S. ballot comments:

267E DO NOT LITTER SIGN
267F RECYCLABLE PACKAGING
31D0..31D4 vulgar fractions
2618 CIRCLED UPWARDS INDICATION
2700 LEFTWARDS SCISSORS

A few of the other symbols were moved to new coding locations, and/or
had their names changed, to accommodate U.S. or other comments.

At the request of the Cambodian national body, two of the Khmer
characters that had been approved by the UTC were removed from
the list to be added to the amendment: the two voicing marks for
Krung. Those were deferred for further study, since there may be
others needed for other minority languages, and their exact
status was unclear.

Other additions involved slight changes from UTC decisions:

3 double diacritics were added, but were encoded in the regular
combining characters block (035D..035F), rather than in the
combining half marks block (FE24..FE26). (cf. WG2 N2457; resolution
M42.8)

The 134 characters for the Uralic Phonetic Alphabet were added,
but the U.S. request to put the main group of these symbols
on Plane 1 was rejected in favor of a new block on the BMP
at 1D00..1D7F. (cf. WG2 N2442; resolution M42.5)

6 additional Syriac characters that the UTC had considered
but not actually formally accepted yet were added. (cf. WG2 N2422;
resolution M42.3)

4 Arabic characters for Urdu were added, based on
discussions in detail with Dr. Hussain from Pakistan, who
attended the meeting. These were from the list considered by
the UTC but not yet formally accepted. (cf. WG2 N2413-4; resolution
M42.6) The four characters accepted were:

060F ARABIC SIGN MISRA
0603 ARABIC SIGN SAFHA
0659 ARABIC SMALL HIGH TAH
FDFD ARABIC LIGATURE BISMILLAH ARRAHMAN ARRAHIM

122 Han compatibility characters were added at FA70..FAE9
for compatibility with the DPRK standard. This proved to
be the most problematical addition, as the DPRK delegation
originally brought in a set of 160 characters, requesting some
to be added to the BMP and others to Plane 2 in Amendment 1 to
Part 2. There were strong objections about apparent duplicates
in the set, and 38 unifiable candidates were identified during
the course of the meeting, bringing the set to be encoded
down to 122. This remaining set should be very carefully checked
during the FPDAM balloting, before the UTC signs off on them,
as mappings for these will be unchangeable once they are
standardized.
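
As a quick sanity check on the counts above, here is a minimal Python
sketch (not part of any WG2 document; the helper name block_size is
just an illustrative assumption) confirming that the range FA70..FAE9
holds exactly as many code points as remain after removing the 38
unifiable candidates from the original 160:

```python
def block_size(first: int, last: int) -> int:
    """Return the number of code points in the inclusive range first..last."""
    return last - first + 1

# Range allocated for the DPRK compatibility characters, per the report.
dprk_compat = block_size(0xFA70, 0xFAE9)

# Original submission minus the unifiable candidates found at the meeting.
remaining = 160 - 38

print(dprk_compat, remaining)
```

This only checks that the numbers are mutually consistent; it says
nothing about whether the individual mappings are correct, which is
the part that needs review during the FPDAM ballot.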

Various other technical and editorial changes were also made
to the amendment. Notable among these were:

The two problematical variation sequences <2278, FE00> and
<2279, FE00> were removed from the list of variation sequences.

Khmer character issues were resolved with agreement to add
explanatory remarks about the 6 characters the UTC is considering
for deprecation; and some other glyph changes were agreed upon.

Details on the character additions and other changes will be
showing up in the Unicode pipeline document on the website
soon. And I will provide a detailed consent document, as usual,
for the next UTC/L2 meeting, indicating those additions and
changes which will need further UTC and L2 decisions to bring
the agreed-upon repertoires back into synch.

******************************************************************

2. Amendment 1 to 10646-2

This is the amendment which covers all the additions to the supplementary
planes. These will also become part of Unicode 4.0. The disposition of
comments for this amendment was completed fairly easily, and the
amendment was also progressed for an FPDAM ballot (due to close in
October).

Notable character additions already approved by the UTC include:

Addition of 87 monogram, digram and tetragram Tai Xuan Jing symbols
at 1D300..1D356. (cf. WG2 N2416; resolution M42.16)

Addition of 4 Deseret characters. (cf. WG2 N2473, N2474; resolution M42.18)

The annotation and glyph changes for Linear B were approved. (cf.
WG2 N2455; resolution M42.17)

A defect report on U+2114 SCRIPT SMALL L was resolved by agreement
to add a disunified version for the mathematical script small l
at U+1D4C1. This will need to be reviewed by the UTC.

A ballot comment on the variation selectors in Plane 14 was resolved
by moving the set left one column (from E0110..E01FF to E0100..E01EF).
The UTC will need to reconfirm this.
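
For anyone checking the arithmetic of that move, a small Python sketch
follows. It is only an illustration under the assumption that one
chart column is 16 code points wide; the names are hypothetical and
not taken from the amendment text.

```python
COLUMN_WIDTH = 0x10  # assumption: one chart column = 16 code points

OLD_FIRST, OLD_LAST = 0xE0110, 0xE01FF  # range before the ballot comment
NEW_FIRST, NEW_LAST = 0xE0100, 0xE01EF  # range after the move

def size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1

old_size = size(OLD_FIRST, OLD_LAST)
new_size = size(NEW_FIRST, NEW_LAST)
shift = OLD_FIRST - NEW_FIRST

# The move should preserve the size of the set and shift it by one column.
print(old_size, new_size, shift == COLUMN_WIDTH)
```

The set keeps the same number of code points either way; only its
starting position changes.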

And finally, a number of corrections to errors in the source references
and glyphs for Extension B were accepted. These were the least problematical
of the CJK mapping fixes, involving dictionary mappings, rather
than coded character set mappings. A small number of suggested corrections
that may impact coded character set mappings were deferred, and
will need further study to verify their impact. The UTC experts should
very carefully consider the implications, as these impact normative
mappings. (cf. WG2 N2448; resolution M42.20)

******************************************************************

3. Coptic

The WG2 formally went on record as agreeing to the supplementation of
Coptic, and invited a full proposal for the addition of Coptic
characters. This puts the UTC and WG2 in synch as agreeing that
Coptic should not be unified with Greek and paves the way for a
complete encoding of the Coptic script in the future.

******************************************************************

4. Khmer

There were extensive meetings between the 4 guest members from the
Cambodian national body and interested participants from the U.S.,
Japanese, and Irish NBs, and Maurice Bauhahn, to discuss various
technical and procedural issues regarding the Khmer encoding.
These discussions were remarkably cordial and productive, and the
ad hoc group was able to make a number of unanimous, consensual
recommendations to the WG2 plenary.

As noted above regarding Amendment 2, 43 character additions were
agreed upon, including the set of 32 lunar date symbols, a set of
10 divination numbers, and the KHMER SIGN ATTHACAN. The two Krung
voicing marks were postponed for future study.

Consensus was reached regarding the 6 characters that the UTC
agreed at its last meeting to consider for deprecation. The
ad hoc recommends formal deprecation of U+17A3 and U+17D3, and
explanatory notes strongly discouraging the use of U+17B4, U+17B5,
U+17A4, and U+17D8. Of course WG2 doesn't do deprecation of
characters, but the WG2 plenary agreed to add appropriate usage
notes in Annex P for these 6 characters. The UTC will need to
take the action regarding the formal deprecations, and the
editorial committee can handle the notes regarding discouragement
of use.

Consensus was also reached to change the glyphs for U+17B4 and
U+17B5, to properly reflect the limited use to which they could
be put.

The Cambodian national body expressed their disappointment with
the virama model, but agreed to accept the model in 10646 and
Unicode as a matter of "force majeure", in the interest of
clearing away the uncertainties and making it possible to proceed
as quickly as possible to real implementations of the Khmer script.

The Unicode editors invited the Cambodian national body to make
further contributions which could clarify and extend the discussion
about the Khmer script for the Unicode Standard. And we expect to
find ways to publicly acknowledge the mistakes made in the original
encoding and the contribution of the Cambodian participants in
correcting them.

Everyone agreed to use this experience as an object lesson, and
to redouble the efforts to ensure that full and complete feedback
from stakeholders in other scripts to be encoded is obtained
before standardization of them is completed.

******************************************************************

5. Korean Issues

There were a couple of contentious issues that came up regarding
Korean or Han characters for Korean use. The addition of 122
compatibility Han characters for mapping to the DPRK standard
was discussed above. The UTC needs to carefully review the
proposed mappings and additions.

In addition, there was an unresolved naming issue for four
Korean symbols, which revolved around inconsistent usage of
transliteration schemes for Korean syllables.

These two problems were not completely resolved to all
participants' satisfaction, and accounted for 3 abstentions
and the single negative vote on the relevant resolutions. It
is notable that all other technical resolutions (a total of 26,
altogether) were passed unanimously at this meeting.

******************************************************************

6. Re-editing 10646 into a single standard

WG2 went on record supporting the editor's proposal to coalesce
the current two-part standard (10646-1 BMP, 10646-2 Supplementary Planes)
into a single standard. This will consolidate all the standard's
text into a more manageable and comprehensible chunk (without some
clauses being lost in the middle of the charts), and will markedly
simplify the maintenance of the standard. When completed, we will
no longer have to deal with confusing situations (Amendment 2 to
Part 1 and Amendment 1 to Part 2, etc.) since an amendment will be
able to account for additions to the BMP and the supplementary
planes at the same time, and the editor won't have to track textual
cross-references and consistency fixes between the parts.

The fact that 10646 is now published on CD-ROM eliminates the
page count problems for this very large standard.

The contentious issue of multi-column versus single-column
presentation of the CJK charts was sidestepped for the moment.
The current charts, as they stand, will simply be consolidated for now,
keeping the multi-column layout for the BMP and the single-column
layout for the SIP, while future plans can be debated further.

The editor is charged with bringing a working draft of the
consolidated text into the next WG2 meeting in December, at which
point a new work item to do the actual consolidation could be
issued by SC2. It is conceivable that a consolidated version might
be available in the same time frame as the publication of Unicode 4.0,
or shortly thereafter.