== Notes on {kddi,docomo,softbank}-*.ucm mappings.

kddi-jisx-208 and softbank-jisx-208 are variants of JIS X 0208 used by
KDDI and SoftBank, two Japanese cell phone carriers.

kddi-shift_jis, docomo-shift_jis, and softbank-shift_jis are variants
of Shift_JIS used by KDDI, DoCoMo, and SoftBank.

- kddi-jisx-208 contains Emoji (emoticon) code points in
  0x75xx, 0x76xx, 0x77xx, 0x78xx, 0x79xx, 0x7Axx, and 0x7Bxx,
  where xx means 21-7E.

- kddi-shift_jis contains Emoji code points in
  0xEBxx, 0xECxx, 0xEDxx, 0xEExx, 0xF3xx, 0xF4xx, 0xF6xx, and 0xF7xx,
  where xx means 40-7E, 80-FC.

- docomo-shift_jis contains Emoji code points in
  0xF8xx and 0xF9xx, where xx means 40-7E, 80-FC.

- softbank-shift_jis contains Emoji code points in
  0xF7xx, 0xF9xx, and 0xFBxx, where xx means 40-7E, 80-FC.

- softbank-jisx-208 contains Emoji code points in
  0x75xx, 0x76xx, 0x77xx, 0x78xx, 0x79xx, 0x7Axx, 0x7Bxx, and 0x7Dxx,
  where xx means 21-7E.

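The ranges listed above can be turned into a small lookup. The following is
only a sketch: the names EMOJI_LEAD_BYTES, valid_trail_byte, and
in_emoji_area are made up for illustration and are not part of ICU or
emoji4unicode. It tests whether a two-byte code lies in a carrier's Emoji
area; it does not say whether that code is actually assigned in the table.

```python
# Lead bytes (first byte of the two-byte code) that carry Emoji, copied from
# the list above. The dictionary name and layout are illustrative only.
EMOJI_LEAD_BYTES = {
    "kddi-jisx-208": {0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B},
    "kddi-shift_jis": {0xEB, 0xEC, 0xED, 0xEE, 0xF3, 0xF4, 0xF6, 0xF7},
    "docomo-shift_jis": {0xF8, 0xF9},
    "softbank-shift_jis": {0xF7, 0xF9, 0xFB},
    "softbank-jisx-208": {0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7D},
}


def valid_trail_byte(charset, trail):
    """Check the "xx" part: 21-7E for JIS X 0208, 40-7E or 80-FC for Shift_JIS."""
    if charset.endswith("jisx-208"):
        return 0x21 <= trail <= 0x7E
    return 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC


def in_emoji_area(charset, code):
    """Return True if the two-byte code lies in the charset's Emoji area."""
    lead, trail = code >> 8, code & 0xFF
    return lead in EMOJI_LEAD_BYTES[charset] and valid_trail_byte(charset, trail)
```

For example, in_emoji_area("softbank-jisx-208", 0x7D21) is true, while the
same code is outside the Emoji area of kddi-jisx-208.
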

== How the -2012.ucm tables were modified in April 2013

The -2012 versions were created by
http://code.google.com/p/emoji4unicode/source/browse/trunk/src/gen_conversion_files.py

using each of the older 2012 versions as the base table files
to avoid non-Emoji changes:

  # gen_google_ucm.sh
  icu_mappings=/google/src/cloud/mscherer/icubranch/google_vendor_src_branch/icu/source/data/mappings
  dest=/home/mscherer/www/no_crawl/emoji
  ./gen_conversion_files.py $icu_mappings/docomo-shift_jis-2012.ucm
  cp ../generated/docomo-shift_jis-2012.ucm $dest
  ./gen_conversion_files.py $icu_mappings/kddi-shift_jis-2012.ucm
  cp ../generated/kddi-shift_jis-2012.ucm $dest
  ./gen_conversion_files.py $icu_mappings/softbank-shift_jis-2012.ucm
  cp ../generated/softbank-shift_jis-2012.ucm $dest
  ./gen_conversion_files.py

The only differences from 2012-sep are in mappings for symbols
that have Unicode Variation Selector (VS) sequences.

The older tables relied on a hack in the ICU conversion code that
ignored the "use fallback" flag for fallbacks from sequences with VS.

The new tables rely on a new feature in ICU4C 51:
For the relevant symbols that have roundtrip mappings,
- the mappings with Emoji Variation Selector
  use the |0 roundtrip precision
- the other mappings (no VS & text VS)
  use the |4 "good one-way" precision

See http://bugs.icu-project.org/trac/ticket/9602

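As a rough illustration of the rule above, the sketch below builds the three
mapping lines for one symbol that has VS sequences. The helper names
(ucm_line, vs_mapping_lines) are hypothetical, the line syntax is modeled on
the .ucm format rather than copied from the generated tables, and the byte
pair in the usage note is only an example value.

```python
EMOJI_VS = 0xFE0F  # VARIATION SELECTOR-16 (emoji presentation)
TEXT_VS = 0xFE0E   # VARIATION SELECTOR-15 (text presentation)


def ucm_line(code_points, charset_bytes, precision):
    """Format one mapping line: <Uxxxx>... \\xNN... |precision."""
    unicode_part = "".join("<U%04X>" % cp for cp in code_points)
    bytes_part = "".join("\\x%02X" % b for b in charset_bytes)
    return "%s %s |%d" % (unicode_part, bytes_part, precision)


def vs_mapping_lines(symbol, charset_bytes):
    """Return the lines for a symbol with VS sequences, per the rule above."""
    return [
        # Sequence with the Emoji VS: |0 roundtrip precision.
        ucm_line([symbol, EMOJI_VS], charset_bytes, 0),
        # Bare symbol (no VS): |4 "good one-way" precision.
        ucm_line([symbol], charset_bytes, 4),
        # Sequence with the text VS: |4 "good one-way" precision.
        ucm_line([symbol, TEXT_VS], charset_bytes, 4),
    ]
```

Calling vs_mapping_lines(0x2600, b"\xF8\x9F") yields one |0 line for
<U2600><UFE0F> and two |4 lines, for <U2600> and <U2600><UFE0E>.
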

== How the -2012.ucm tables were created in September 2012

The 2012 versions were created by
http://code.google.com/p/emoji4unicode/source/browse/trunk/src/gen_conversion_files.py

using each of the 2007 versions as the base table files
to avoid non-Emoji changes:

  icu_mappings=~/p4/emoji/google_vendor_src_branch/icu/source/data/mappings
  dest=~/www/no_crawl/emoji
  ./gen_conversion_files.py $icu_mappings/docomo-shift_jis-2007.ucm
  cp ../generated/docomo-shift_jis-2012.ucm $dest
  ./gen_conversion_files.py $icu_mappings/kddi-shift_jis-2007.ucm
  cp ../generated/kddi-shift_jis-2012.ucm $dest
  ./gen_conversion_files.py $icu_mappings/softbank-shift_jis-2007.ucm
  cp ../generated/softbank-shift_jis-2012.ucm $dest
  ./gen_conversion_files.py

The emoji4unicode code uses the mappings that were established during the
Unicode Emoji standardization process.
The new conversion tables round-trip carrier Emoji symbol codes
to and from Unicode 6 standard code points
and also include fallback mappings from the Google PUA code points
to the carrier codes.

The trailing "|0" etc. on the mapping table lines specify the mapping type:
  |0  round-trip          Unicode <-> charset
  |1  fallback            Unicode  -> charset
  |3  "reverse fallback"  Unicode <-  charset

For details about the .ucm file format see
http://userguide.icu-project.org/conversion/data#TOC-.ucm-File-Format

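To make the mapping-type suffix concrete, here is a minimal sketch that
splits one mapping line into its parts. The function name
parse_mapping_line and the regular expression are assumptions for
illustration; this is a simplified reading of the format (it requires an
explicit |N suffix and ignores comments and header lines), not the parser
ICU uses.

```python
import re

# Mapping types as described above.
MAPPING_TYPES = {
    0: "round-trip Unicode <-> charset",
    1: "fallback Unicode -> charset",
    3: '"reverse fallback" Unicode <- charset',
}

# One or more <Uxxxx> items, then \xNN bytes, then the |N precision flag.
_LINE = re.compile(
    r"^((?:<U[0-9A-Fa-f]{4,6}>\+?)+)\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d)"
)


def parse_mapping_line(line):
    """Return (code_points, charset_bytes, precision), or None if no match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    code_points = [int(h, 16) for h in re.findall(r"<U([0-9A-Fa-f]+)>", m.group(1))]
    charset_bytes = bytes(
        int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
    )
    return code_points, charset_bytes, int(m.group(3))
```

The third element of the result can be looked up in MAPPING_TYPES to get the
description from the list above; other values are simply not described there.
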

== How the -2007.ucm tables were created

So far, we haven't obtained "official" conversion tables from the cell
phone carriers. However, we know empirically that their clients support
vendor-defined characters (VDCs) in MS932, such as U+2460 (CIRCLED DIGIT
ONE). Hence we use MS932 as the base table for the Shift_JIS variants.

kddi-jisx-208-2007.ucm is based on jisx-208.ucm in this directory.
The original table's mappings to codes 0x75xx to 0x7Bxx are excluded
to avoid collisions with emoji.

kddi-shift_jis-2007.ucm is based on windows-932-2000.ucm.
The original table's mappings to codes 0xEBxx to 0xEExx, and 0xF0xx to
0xF9xx (EUDC block), are excluded to avoid collisions with emoji.

docomo-shift_jis-2007.ucm is based on windows-932-2000.ucm.
The original table's mappings to codes 0xF0xx to 0xF9xx (EUDC block)
are excluded to avoid collisions with emoji.

softbank-shift_jis-2007.ucm is based on windows-932-2000.ucm.
The original table's mappings to codes 0xF0xx to 0xF9xx (EUDC block),
and 0xFBxx, are excluded to avoid collisions with emoji.

softbank-jisx-208-2007.ucm is based on jisx-208.ucm in this directory.
The original table's mappings to codes 0x75xx to 0x7Bxx, and 0x7Dxx,
are excluded to avoid collisions with emoji.

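The exclusion step described above can be sketched as a filter over the base
table's mappings. This is not how the -2007 tables were actually produced
(the source does not say); the function names, the (code point, bytes) tuple
layout, and the sample mappings are assumptions for illustration. The sketch
also assumes the EUDC block ends at lead byte 0xF9.

```python
def excluded_lead_bytes(ranges):
    """Expand inclusive (first, last) lead-byte ranges into a set."""
    excluded = set()
    for first, last in ranges:
        excluded.update(range(first, last + 1))
    return excluded


def drop_colliding_mappings(mappings, ranges):
    """Keep only mappings whose two-byte code is outside the excluded rows.

    mappings: iterable of (code_point, charset_bytes) pairs.
    ranges: inclusive lead-byte ranges reserved for emoji.
    """
    excluded = excluded_lead_bytes(ranges)
    return [
        (cp, b) for cp, b in mappings
        if not (len(b) == 2 and b[0] in excluded)
    ]


# Rows excluded for kddi-shift_jis-2007.ucm: 0xEBxx-0xEExx and 0xF0xx-0xF9xx.
KDDI_SHIFT_JIS_EXCLUDED = [(0xEB, 0xEE), (0xF0, 0xF9)]

# Illustrative sample: one ordinary mapping and one in the EUDC block.
sample = [(0x3042, b"\x82\xA0"), (0xE000, b"\xF0\x40")]
kept = drop_colliding_mappings(sample, KDDI_SHIFT_JIS_EXCLUDED)
```

With the sample above, only the first mapping survives, because the second
one has a lead byte inside the excluded EUDC rows.
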

== Google Standard Emoji Unicode Mapping

The Google standard emoji Unicode mapping can be found at:

  /home/build/google3/i18n/encodings/emoji/emoji_unicode_mapping.txt


TODO(mscherer): Use <icu:base> to share most standard JIS mappings
among *-shift_jis-2007.ucm files.