* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
* change log for Unicode updates
For an overview, see https://unicode-org.github.io/icu/processes/unicode-update
Notes:
This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.
Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)
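A minimal sketch, in Python, of checking that the variables from the `export` lines
are set before pasting tool command lines. The variable list mirrors the exports
used in the sections below; the helper name `missing_env_vars` is hypothetical
and not part of any ICU tool.

```python
import os

# Variables exported near the top of each Unicode update section in this log.
REQUIRED_VARS = [
    "UNIDATA_ROOT", "UNICODE_DATA", "CLDR_SRC", "ICU_ROOT", "ICU_SRC",
    "ICU_OUT", "ICUDT", "ICU4C_DATA_IN", "ICU4C_UNIDATA",
    "LD_LIBRARY_PATH", "UNICODE_TOOLS",
]

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    if env is None:
        env = os.environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment;
an empty list means that every variable is set.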
Syntax of this file:
`***` - section heading
`*` - sub heading
`-` - 1st level bullet
`+` - 2nd level bullet
`=` - 1st level bullet
`->` - "the previous thing leads to...", OR a 2nd level bullet/item
---------------------------------------------------------------------------- ***
* New ISO 15924 script codes
Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.
---------------------------------------------------------------------------- ***
Unicode 16.0 update for ICU 76
https://www.unicode.org/versions/Unicode16.0.0/
https://www.unicode.org/versions/beta-16.0.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-33.html
https://unicode-org.atlassian.net/browse/ICU-22707 Unicode 16
https://unicode-org.atlassian.net/browse/CLDR-17226 BRS Unicode 16
https://github.com/unicode-org/unicodetools/pull/774 delete the RecommendedSetGenerator
https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1
* Command-line environment setup
Markus:
export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt76b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src
Elango:
export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt76b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src
*** Unicode version numbers
- icu4c/source/data/makedata.mak
- icu4c/source/common/unicode/uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
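After editing, the version string can be read back from the header as a quick check.
A sketch, assuming that uchar.h defines the version on a line of the form
`#define U_UNICODE_VERSION "16.0"`; the helper name is hypothetical.

```python
import re

# Assumed form of the definition in uchar.h: #define U_UNICODE_VERSION "16.0"
VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)

def unicode_version_from_header(header_text):
    """Return the U_UNICODE_VERSION string found in the header text, or None."""
    match = VERSION_RE.search(header_text)
    return match.group(1) if match else None

# Sample text for illustration; in practice read
# $ICU_SRC/icu4c/source/common/unicode/uchar.h instead.
sample = '/* Unicode version */\n#define U_UNICODE_VERSION "16.0"\n'
version = unicode_version_from_header(sample)
```

The same value should then appear in makedata.mak, VersionInfo, and UCharacterTest.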
*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
so that the makefiles see the new version number.
- FYI: The option that adds the additional Unicode data files for ICU4J is
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data
- Markus's version:
cd $ICU_OUT/icu4c
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ../../src/icu4c/source/runConfigureICU --enable-debug --disable-release Linux/clang --prefix=/usr/local/google/home/mscherer/icu/mine/inst/icu4c > config.out 2>&1 ; tail config.out
- Elango's version (different default C++ compiler & in-source build paths):
cd $ICU_OUT/icu4c/source
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ./runConfigureICU --enable-debug --disable-release Linux/gcc --prefix=/usr/local/google/home/elango/oss/icu/icu4c > config.out 2>&1 ; tail config.out
*** data files & enums & parser code
* download files
- same as for the early Unicode Tools setup and data refresh:
https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
+ use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
+ subfolders: emoji, idna, security, ucd, uca
+ for pre-release (alpha, beta) data files:
~ if one of us produces the alpha.zip or beta.zip collection of data files for publication,
then we can use its contents directly (no FTP from unicode.org necessary)
~ otherwise download from https://www.unicode.org/Public/draft/
~ you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders
~ you can omit or discard UCD/ucd/Unihan.zip
+ alternate way of fetching files, if available:
copy the files from a Unicode Tools workspace that is up to date with
https://github.com/unicode-org/unicodetools
and which might at this point be *ahead* of "Public"
~ before the Unicode release copy files from "dev" subfolders, for example
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
+ for final-release data files, the source of truth is the files in
https://www.unicode.org/Public/(version) [=UCD],
https://www.unicode.org/Public/UCA/(version),
https://www.unicode.org/Public/idna/(version),
etc.
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
or from the UCD/cldr/ output folder of the Unicode Tools:
From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
CLDR used modified grapheme break rules.
This might happen again.
+ To check in the Unicode Tools workspace:
~/unitools/mine/Generated$ meld UCD/16.0.0/auxiliary/*GraphemeBreakTest.txt UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt
+ If different, and after copying into CLDR:
cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
or
cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
+ We may need CLDR versions of WordBreakTest.txt and LineBreakTest.txt
unless Unicode 16 and CLDR 46 eliminate their differences:
unicodetools issue #492
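The meld comparison above shows every difference, including comment-only ones.
A sketch of a narrower check that compares only the test cases of two *BreakTest.txt files,
assuming that everything from a `#` to the end of a line is a comment.
The function names are hypothetical.

```python
def data_lines(text):
    """Return the test cases of a *BreakTest.txt file, without comments."""
    cases = []
    for line in text.splitlines():
        case = line.split("#", 1)[0].strip()
        if case:
            # Normalize runs of spaces and tabs so that layout changes do not count.
            cases.append(" ".join(case.split()))
    return cases

def same_test_cases(text_a, text_b):
    """True if both files contain the same test cases in the same order."""
    return data_lines(text_a) == data_lines(text_b)
```

Read the two files with open(path, encoding="utf-8") and pass their contents;
if the result is False, the CLDR version differs and needs to be copied as described above.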
* process and/or copy files
- cd $ICU_SRC/tools/unicode
py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ For debugging, and tweaking how ppucd.txt is written,
the tool has an --only_ppucd option:
py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
e.g.
py/preparseucd.py $UNICODE_DATA --only_ppucd /tmp/ppucd.txt
* new constants for new property values
- preparseucd.py error:
ValueError: missing uchar.h enum constants for some property values:
[('blk', {'Garay', 'Tulu_Tigalari', 'Todhri', 'Sunuwar', 'Egyptian_Hieroglyphs_Ext_A', 'Kirat_Rai', 'Symbols_For_Legacy_Computing_Sup', 'Myanmar_Ext_C', 'Ol_Onal', 'Gurung_Khema'}),
('sc', {'Gara', 'Onao', 'Todr', 'Krai', 'Tutg', 'Sunu', 'Gukh'}),
('InSC', {'Reordering_Killer'})]
= PropertyValueAliases.txt new property values (diff old & new .txt files)
(cd $UNIDATA_ROOT && diff -u uni15.1/final/ucd/PropertyValueAliases.txt uni16/alpha/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]')
+age; 16.0 ; V16_0
+blk; Egyptian_Hieroglyphs_Ext_A ; Egyptian_Hieroglyphs_Extended_A
+blk; Garay ; Garay
+blk; Gurung_Khema ; Gurung_Khema
+blk; Kirat_Rai ; Kirat_Rai
+blk; Myanmar_Ext_C ; Myanmar_Extended_C
+blk; Ol_Onal ; Ol_Onal
+blk; Sunuwar ; Sunuwar
+blk; Symbols_For_Legacy_Computing_Sup ; Symbols_For_Legacy_Computing_Supplement
+blk; Todhri ; Todhri
+blk; Tulu_Tigalari ; Tulu_Tigalari
+InSC; Reordering_Killer ; Reordering_Killer
-jg ; Teh_Marbuta_Goal ; Hamza_On_Heh_Goal
+jg ; Teh_Marbuta_Goal ; Teh_Marbuta_Goal ; Hamza_On_Heh_Goal
+sc ; Gara ; Garay
+sc ; Gukh ; Gurung_Khema
+sc ; Krai ; Kirat_Rai
+sc ; Onao ; Ol_Onal
+sc ; Sunu ; Sunuwar
+sc ; Todr ; Todhri
+sc ; Tutg ; Tulu_Tigalari
+ copy new API constants from the preparseucd.py output into the .h/.java files,
add/adjust comments, wrap lines, and set numeric values
+ (ignore Age: no API constants for that)
+ Block: uchar.h before UBLOCK_COUNT,
UCharacter.UnicodeBlock IDs, UCharacter.UnicodeBlock objects
+ Script: uscript.h & com.ibm.icu.lang.UScript
+ for new scripts: fix expectedLong names
in cintltst/cucdapi.c/TestUScriptCodeAPI()
and in com.ibm.icu.dev.test.lang.TestUScript.java
+ Indic_Syllabic_Category: uchar.h & UCharacter.IndicSyllabicCategory
+ after adding new API constants, run preparseucd.py again
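The diff & egrep step above can also be sketched in Python, assuming the usual
"property ; short name ; long name" layout of PropertyValueAliases.txt with `#` comments.
The function names are hypothetical. For ccc lines the second field is the numeric value
rather than a short name, which is still good enough for spotting additions.

```python
def parse_aliases(text):
    """Map each property short name to the set of its values' first names."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3:
            continue  # not a property value line
        values.setdefault(fields[0], set()).add(fields[1])
    return values

def new_values(old_text, new_text):
    """Return {property: sorted list of values} present only in new_text."""
    old, new = parse_aliases(old_text), parse_aliases(new_text)
    added = {}
    for prop, vals in new.items():
        diff = vals - old.get(prop, set())
        if diff:
            added[prop] = sorted(diff)
    return added
```

Unlike the line-based diff, this does not report lines that only gained an extra alias,
such as the jg Teh_Marbuta_Goal change above.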
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
(not strictly necessary for NOT_ENCODED scripts)
$ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
* build ICU
to make sure that there are no syntax errors
$ICU_OUT/icu4c$ echo;echo; date; make -j20 tests &> out.txt ; tail -n 30 out.txt ; date
* Bazel build process
See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
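A sketch of the version bump, under two assumptions that are not confirmed by this log:
that `bazelisk --version` prints a line such as "bazel 7.1.1", and that .bazeliskrc
holds the version in a `USE_BAZEL_VERSION=...` line. The helper names are hypothetical.

```python
import re

def parse_bazel_version(version_output):
    """Extract 'X.Y.Z' from output such as 'bazel 7.1.1' (assumed format)."""
    match = re.search(r"\b(\d+\.\d+\.\d+)\b", version_output)
    return match.group(1) if match else None

def set_bazel_version(rc_text, version):
    """Return .bazeliskrc text with USE_BAZEL_VERSION set to `version`."""
    line = "USE_BAZEL_VERSION=" + version
    pattern = re.compile(r"^USE_BAZEL_VERSION=.*$", re.MULTILINE)
    if pattern.search(rc_text):
        return pattern.sub(lambda m: line, rc_text)
    # No such line yet: append one, keeping existing content intact.
    separator = "" if (not rc_text or rc_text.endswith("\n")) else "\n"
    return rc_text + separator + line + "\n"
```

Working on text rather than on the file makes the change easy to review before writing it back,
and easy to revert if the new version turns out to be incompatible.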
* generate data files
- remember to define the environment variables
(see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional, usually not necessary:
bazelisk clean
or even
bazelisk clean --expunge
- build/bootstrap/generate new files:
icu4c/source/data/unidata/generate.sh
* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
so at this time we might get some collation test failures.
Ignore these for now.
- Some properties are hardcoded in the ICU libraries because they apply to
few characters or ranges, and are not expected to change often.
They are tested at least in C++ intltest (e.g., against ppucd.txt).
If these tests fail, then update the implementation and the tests.
- update CLDR GraphemeBreakTest.txt
(see the download section above about this file)
cd ~/unitools/mine/Generated
cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures
* collation: CLDR collation root, UCA DUCET
- UCA DUCET goes into Mark's Unicode tools,
and a tool-tailored version goes into CLDR, see
https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
(note removing the underscore before "Rules")
cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
and (ICU4J)/main/collate/src/test/resources/com/ibm/icu/dev/data/CollationTest_*.txt
from the CLDR root files (..._CLDR_..._SHORT.txt)
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/collate/src/test/resources/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
CLDR common/collation/root.xml <collation type="private-unihan">
and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
- update CollationFCD.java:
copy & paste the initializers of lcccIndex[] etc.
from
$ICU_SRC/icu4c/source/i18n/collationfcd.cpp
to
$ICU_SRC/icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationFCD.java
- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)
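The renames in the cp commands above are easy to get wrong, so here is a sketch that
lists them as (source, destination) pairs. The table follows the commands in this section;
the names `UCA_COPIES` and `copy_plan` are hypothetical. It only builds the plan and
copies nothing, and it does not restore the TODO diffs in UCARules.txt.

```python
from pathlib import PurePosixPath

# CLDR common/uca file name -> (directory below $ICU_SRC, ICU file name)
UCA_COPIES = {
    "FractionalUCA_SHORT.txt":
        ("icu4c/source/data/unidata", "FractionalUCA.txt"),
    "UCA_Rules_SHORT.txt":
        ("icu4c/source/data/unidata", "UCARules.txt"),
    "CollationTest_CLDR_NON_IGNORABLE_SHORT.txt":
        ("icu4c/source/test/testdata", "CollationTest_NON_IGNORABLE_SHORT.txt"),
    "CollationTest_CLDR_SHIFTED_SHORT.txt":
        ("icu4c/source/test/testdata", "CollationTest_SHIFTED_SHORT.txt"),
}

def copy_plan(cldr_src, icu_src):
    """Return a list of (source path, destination path) string pairs."""
    plan = []
    for name, (subdir, target) in UCA_COPIES.items():
        source = PurePosixPath(cldr_src) / "common/uca" / name
        destination = PurePosixPath(icu_src) / subdir / target
        plan.append((str(source), str(destination)))
    return plan
```

Each pair could be passed to shutil.copyfile() once the plan looks right.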
* Unihan collators
https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
instructions inspired by
https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
+ setup:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
(didn't work without setting JAVA_HOME,
nor with the Google default of /usr/local/buildtools/java/jdk
[Google security limitations in the XML parser])
export TOOLS_ROOT=$ICU_SRC/tools
export CLDR_DIR=$CLDR_SRC
export CLDR_DATA_DIR=$CLDR_DIR
(pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
cd "$TOOLS_ROOT/cldr/lib"
./install-cldr-jars.sh "$CLDR_DIR"
+ generate the files we need
cd "$TOOLS_ROOT/cldr/cldr-to-icu"
ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
+ diff
cd $ICU_SRC
meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
+ copy into the source tree
cd $ICU_SRC
cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C
* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
(the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
utility/MultithreadTest/TestCollators will fail as well;
fix the conformance test before looking into the multi-thread test
* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
you need to reconfigure with unicore data; see the "configure" line above.
output:
...
make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt76b
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt76l.dat ./out/icu4j/icudt76b.dat -s ./out/build/icudt76l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt76b
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b"
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt76b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt76b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
cd $ICU_OUT/icu4c/data/out/icu4j
cp -v com/ibm/icu/impl/data/icudata/coll/* $ICU_SRC/icu4j/main/collate/src/main/resources/com/ibm/icu/impl/data/icudata/coll
cp -v com/ibm/icu/impl/data/icudata/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr
cp -v com/ibm/icu/impl/data/icudata/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
cp -v com/ibm/icu/impl/data/icudata/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
cd com/ibm/icu/impl/data/icudata/
ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata";}' | sh
- The procedure above is very conservative:
It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
It avoids dealing with any other discrepancies
between the source and generated data files.
*If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
$ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
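The `ls *.icu | egrep -v ... | awk ... | sh` pipeline in the conservative procedure above
selects every .icu file except cnvalias.icu. A sketch of the same selection in Python;
the function name is hypothetical.

```python
import fnmatch

def icu_files_to_copy(file_names, excluded=("cnvalias.icu",)):
    """Return the *.icu names to copy, sorted, without the excluded ones."""
    return sorted(
        name for name in file_names
        if fnmatch.fnmatchcase(name, "*.icu") and name not in excluded
    )
```

Pass it the result of os.listdir() for the icudata folder and copy each returned name
into the ICU4J core resources folder.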
* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
cd $ICU_SRC/icu4c/source/data/unidata
cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
cd ../../test/testdata
cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
* run & fix ICU4J tests
*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
for example:
~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.1.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
-->
+10D40..10D49 ; Nd # [10] GARAY DIGIT ZERO..GARAY DIGIT NINE
+116D0..116E3 ; Nd # [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
+11BF0..11BF9 ; Nd # [10] SUNUWAR DIGIT ZERO..SUNUWAR DIGIT NINE
+16130..16139 ; Nd # [10] GURUNG KHEMA DIGIT ZERO..GURUNG KHEMA DIGIT NINE
+16D70..16D79 ; Nd # [10] KIRAT RAI DIGIT ZERO..KIRAT RAI DIGIT NINE
+1CCF0..1CCF9 ; Nd # [10] OUTLINED DIGIT ZERO..OUTLINED DIGIT NINE
+1E5F1..1E5FA ; Nd # [10] OL ONAL DIGIT ZERO..OL ONAL DIGIT NINE
--> https://github.com/unicode-org/cldr/pull/3658
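A sketch of turning such diff output into a list of digit sets, assuming that each added
Nd range consists of whole runs of 10 digits starting with digit zero. Ranges of any other
length are reported as errors so that they get checked by hand. The names are hypothetical,
and the numeric values should still be confirmed against the UCD.

```python
import re

# Matches added lines such as "+10D40..10D49 ; Nd # [10] GARAY DIGIT ZERO..."
RANGE_RE = re.compile(
    r"^\+?([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*Nd\b")

def digit_sets(diff_lines):
    """Return the first code point of each run of 10 decimal digits."""
    starts = []
    for line in diff_lines:
        match = RANGE_RE.match(line)
        if not match:
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        if (last - first + 1) % 10 != 0:
            raise ValueError("unexpected Nd range length: " + line)
        starts.extend(range(first, last + 1, 10))
    return starts
```

For the Myanmar line above, a range of 20 code points yields two digit sets.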
*** merge the Unicode update branch back onto the main branch
- make sure that changes to Unicode tools are checked in:
https://github.com/unicode-org/unicodetools
---------------------------------------------------------------------------- ***
Unicode 15.1 update for ICU 74
https://www.unicode.org/versions/Unicode15.1.0/
https://www.unicode.org/versions/beta-15.1.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-31.html
https://unicode-org.atlassian.net/browse/ICU-22404 Unicode 15.1
https://unicode-org.atlassian.net/browse/CLDR-16669 BRS Unicode 15.1
https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1
* Command-line environment setup
Markus:
export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src
Elango:
export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/snapshot
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src
*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
so that the makefiles see the new version number.
cd $ICU_OUT/icu4c
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
*** data files & enums & parser code
* download files
- same as for the early Unicode Tools setup and data refresh:
https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
+ new since Unicode 15.1:
for the pre-release (alpha, beta) data files,
download all of https://www.unicode.org/Public/draft/
(you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders)
+ if one of us produces the alpha.zip or beta.zip collection of data files for publication,
then we can use its contents directly (no FTP from unicode.org necessary)
+ for final-release data files, the source of truth is the files in
https://www.unicode.org/Public/(version) [=UCD],
https://www.unicode.org/Public/UCA/(version),
https://www.unicode.org/Public/idna/(version),
etc.
+ use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
+ subfolders: emoji, idna, security, ucd, uca
+ whichever way you download the files:
~ inside ucd: extract Unihan.zip to "here" (.../UCD/ucd/Unihan/*.txt), delete Unihan.zip
~ split Unihan into single-property files
~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/UCD/ucd/Unihan
~ FYI: for updating ICU, we do not actually need Unihan.zip contents
+ alternate way of fetching files, if available:
copy the files from a Unicode Tools workspace that is up to date with
https://github.com/unicode-org/unicodetools
and which might at this point be *ahead* of "Public"
~ before the Unicode release copy files from "dev" subfolders, for example
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
or from the UCD/cldr/ output folder of the Unicode Tools:
From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
CLDR used modified grapheme break rules.
This might happen again.
cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
or
cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
+ Done: figure out whether we need a CLDR version of LineBreakTest.txt:
unicodetools issue #492
We should have had one, and instead rbbitst.cpp has a "known issue" exception.
Unicode 16 and CLDR 46 might get back to having the same behavior.
- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+ done in ICU 76: modify preparseucd.py to copy this file
* Note: Since Unicode 15.1, data files are no longer published with version suffixes
even during the alpha or beta.
Thus we no longer need steps & tools to remove those suffixes.
(remove this note next time)
* process and/or copy files
- cd $ICU_SRC/tools/unicode
py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ For debugging, and tweaking how ppucd.txt is written,
the tool has an --only_ppucd option:
py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
* new constants for new property values
- preparseucd.py error:
ValueError: missing uchar.h enum constants for some property values: [('blk', {'CJK_Ext_I'}), ('lb', {'VF', 'VI', 'AS', 'AK', 'AP'})]
= PropertyValueAliases.txt new property values (diff old & new .txt files)
cd $UNIDATA_ROOT
$ diff -u uni15.0/ucd/PropertyValueAliases.txt uni15.1/snapshot/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
+age; 15.1 ; V15_1
+blk; CJK_Ext_I ; CJK_Unified_Ideographs_Extension_I
+IDSU; N ; No ; F ; False
+IDSU; Y ; Yes ; T ; True
+ID_Compat_Math_Continue; N ; No ; F ; False
+ID_Compat_Math_Continue; Y ; Yes ; T ; True
+ID_Compat_Math_Start; N ; No ; F ; False
+ID_Compat_Math_Start; Y ; Yes ; T ; True
+lb ; AK ; Aksara
+lb ; AP ; Aksara_Prebase
+lb ; AS ; Aksara_Start
+lb ; VF ; Virama_Final
+lb ; VI ; Virama
-> add new blocks to uchar.h before UBLOCK_COUNT
use long property names for enum constants,
for the trailing comment get the block start code point: diff old & new Blocks.txt
cd $UNIDATA_ROOT
$ diff -u uni15.0/ucd/Blocks.txt uni15.1/snapshot/UCD/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
+2EBF0..2EE4F; CJK Unified Ideographs Extension I
(ignore blocks whose end code point changed)
-> add new blocks to UCharacter.UnicodeBlock IDs
Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
replace public static final int \1_ID = \2; \3
-> add new blocks to UCharacter.UnicodeBlock objects
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
-> add new line break values to uchar.h & UCharacter.LineBreak
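The Eclipse find/replace patterns for the block constants above can be sketched with
Python's re module. The patterns are the ones given above; the function names are hypothetical,
and the numeric value in the sample line is for illustration only.

```python
import re

# uchar.h enum line -> UCharacter.UnicodeBlock ID constant
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \g<1>_ID = \g<2>; \g<3>"

# uchar.h enum line -> UCharacter.UnicodeBlock object
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = (r'public static final UnicodeBlock \g<1> = '
               r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>')

def block_id_line(c_line):
    """Convert one uchar.h block enum line into the Java ID constant."""
    return ID_FIND.sub(ID_REPLACE, c_line)

def block_object_line(c_line):
    """Convert one uchar.h block enum line into the Java UnicodeBlock object."""
    return OBJ_FIND.sub(OBJ_REPLACE, c_line)

# Sample line; the numeric value is illustrative.
sample = "UBLOCK_CJK_UNIFIED_IDEOGRAPHS_EXTENSION_I = 328, /*[2EBF0]*/"
```

The `\g<1>` form is used instead of `\1` so that the group reference stays unambiguous
next to the following `_ID` text.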
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
(not strictly necessary for NOT_ENCODED scripts)
$ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
* build ICU
to make sure that there are no syntax errors
$ICU_OUT/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
* update spoof checker UnicodeSet initializers:
inclusionPat & recommendedPat in i18n/uspoof.cpp
INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
cd $UNICODE_TOOLS
mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.tools.RecommendedSetGenerator" \
-Dexec.args="" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
- copy & paste from the Console output into the .cpp & .java files
* check hardcoded IDS_Unary_Operator
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that it has not changed:
(cd $UNICODE_DATA && grep -r --include=PropList.txt IDS_Unary_Operator)
- if it has changed, then update the implementation and the tests
- Since ICU 75, this property is tested in C++ intltest against ppucd.txt.
* check hardcoded ID_Compat_Math_Start & ID_Compat_Math_Continue
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that they have not changed:
(cd $UNICODE_DATA && grep -r --include=PropList.txt ID_Compat_Math)
- if they have changed, then update the implementation and the tests
- Since ICU 75, these properties are tested in C++ intltest against ppucd.txt.
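The grep checks above for the hardcoded properties can be made more precise by extracting
the code point ranges and comparing them with what the implementation expects.
A sketch, assuming the usual "range ; property # comment" layout of PropList.txt;
the function name is hypothetical, and the sample lines only illustrate the format.

```python
def property_ranges(prop_list_text, prop_name):
    """Return [(start, end)] code point ranges listed for prop_name."""
    ranges = []
    for line in prop_list_text.splitlines():
        data = line.split("#", 1)[0]
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 2 or fields[1] != prop_name:
            continue
        start, _, end = fields[0].partition("..")
        # A single code point has no "..", so it becomes a one-element range.
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges

# Sample lines in PropList.txt format, for illustration.
sample = (
    "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>\n"
    "0020          ; White_Space # Zs       SPACE\n"
    "2FFE..2FFF    ; IDS_Unary_Operator # So   [2] ...\n"
)
```

If the returned ranges differ from the hardcoded ones, update the implementation and the tests.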
* Bazel build process
See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
* generate data files
- remember to define the environment variables
(see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional, usually not necessary:
bazelisk clean
or even
bazelisk clean --expunge
- build/bootstrap/generate new files:
icu4c/source/data/unidata/generate.sh
* Since Unicode 15.1, the UTS #46 data derivation no longer looks at the decompositions (NFD).
These characters are now just valid, no longer disallowed_STD3_valid.
Remove special handling of U+2260, U+226E, U+226F (isNonASCIIDisallowedSTD3Valid())
from uts46.cpp & UTS46.java,
and special test code from uts46test.cpp & UTS46Test.java.
(remove this section next time)
* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
so at this time we might get some collation test failures.
Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
cd ~/unitools/mine/Generated
cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures
* collation: CLDR collation root, UCA DUCET
- UCA DUCET goes into Mark's Unicode tools,
and a tool-tailored version goes into CLDR, see
https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
(note removing the underscore before "Rules")
cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
from the CLDR root files (..._CLDR_..._SHORT.txt)
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
CLDR common/collation/root.xml <collation type="private-unihan">
and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
copy & paste the initializers of lcccIndex[] etc. from
ICU4C/source/i18n/collationfcd.cpp to
ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)
* Unihan collators
https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
instructions inspired by
https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
+ setup:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
(didn't work without setting JAVA_HOME,
nor with the Google default of /usr/local/buildtools/java/jdk
[Google security limitations in the XML parser])
export TOOLS_ROOT=$ICU_SRC/tools
export CLDR_DIR=$CLDR_SRC
export CLDR_DATA_DIR=$CLDR_DIR
(pointing to the "raw" data rather than to cldr-staging/.../production should be OK for the relevant files)
cd "$TOOLS_ROOT/cldr/lib"
./install-cldr-jars.sh "$CLDR_DIR"
+ generate the files we need
cd "$TOOLS_ROOT/cldr/cldr-to-icu"
ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
+ diff
cd $ICU_SRC
meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
+ copy into the source tree
cd $ICU_SRC
cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C
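Note on the "copy into the source tree" step: a small sketch, to be used only after the diffs have been reviewed, that copies a generated file only when its content differs from the file in the source tree. The function name copy_if_changed is hypothetical.

```shell
# Copy a generated file over its target only if the content differs.
copy_if_changed() {
    generated=$1
    target=$2
    if [ ! -f "$generated" ]; then
        echo "missing: $generated" >&2
        return 1
    fi
    # cmp -s is silent; it fails if the files differ or the target is missing
    if cmp -s "$generated" "$target"; then
        echo "unchanged: $target"
    else
        cp "$generated" "$target"
        echo "updated: $target"
    fi
}

# Example with the paths used above, run from $ICU_SRC:
# copy_if_changed /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
# copy_if_changed /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
```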
* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
(the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
utility/MultithreadTest/TestCollators will fail as well;
fix the conformance test before looking into the multi-thread test
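The ordering in the intltest note can be sketched as a small wrapper: run the conformance test first, and run the multi-thread test only if it passes. The wrapper name run_collation_tests is hypothetical; the test runner command (for example ./intltest in the intltest build folder) is passed in as arguments.

```shell
# Usage: run_collation_tests <test runner command...>
# Runs collate/UCAConformanceTest first; only if it passes,
# runs utility/MultithreadTest/TestCollators.
run_collation_tests() {
    if ! "$@" collate/UCAConformanceTest; then
        echo "fix collate/UCAConformanceTest before looking at MultithreadTest" >&2
        return 1
    fi
    "$@" utility/MultithreadTest/TestCollators
}

# Example, from the folder that contains the intltest binary:
# run_collation_tests ./intltest
```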
* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
you need to reconfigure with unicore data; see the "configure" line above.
output:
...
make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt74b
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt74l.dat ./out/icu4j/icudt74b.dat -s ./out/build/icudt74l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt74b
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b"
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt74b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt74b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
cd $ICU_OUT/icu4c/data/out/icu4j
cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/brkitr
cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
cd com/ibm/icu/impl/data/$ICUDT/
ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT";}' | sh
- The procedure above is very conservative:
It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
It avoids dealing with any other discrepancies
between the source and generated data files.
*If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
$ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
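Note on the `ls *.icu | egrep | awk | sh` pipeline above: the same copy can be sketched as a plain loop, which avoids generating commands as text. The function name copy_icu_files is hypothetical; it copies every .icu file except cnvalias.icu.

```shell
# Copy all *.icu files except cnvalias.icu from one folder to another.
copy_icu_files() {
    src_dir=$1
    dest_dir=$2
    for f in "$src_dir"/*.icu; do
        [ -e "$f" ] || continue          # no .icu files at all
        case ${f##*/} in
            cnvalias.icu) continue ;;    # skipped, as in the pipeline above
        esac
        cp -v "$f" "$dest_dir"
    done
}

# Example, from $ICU_OUT/icu4c/data/out/icu4j:
# copy_icu_files com/ibm/icu/impl/data/$ICUDT \
#     $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
```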
* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
cd $ICU_SRC/icu4c/source/data/unidata
cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
cd ../../test/testdata
cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
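A quick way to confirm that the copies above landed: a sketch that compares each named file in a source folder with its copy in the ICU4J test data folder. The function name report_stale_copies is hypothetical.

```shell
# Usage: report_stale_copies <src_dir> <dest_dir> <file name>...
# Prints each file whose copy is missing or differs; returns 1 if any do.
report_stale_copies() {
    src_dir=$1
    dest_dir=$2
    shift 2
    status=0
    for name in "$@"; do
        if ! cmp -s "$src_dir/$name" "$dest_dir/$name"; then
            echo "differs or missing: $name"
            status=1
        fi
    done
    return $status
}

# Example:
# report_stale_copies $ICU_SRC/icu4c/source/test/testdata \
#     $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode \
#     BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt
```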
* run & fix ICU4J tests
*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
for example:
~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.1.txt
~/icu/uni/src$ diff -u /tmp/icu/nv4-15.txt /tmp/icu/nv4-15.1.txt
-->
(empty this time)
or:
~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
-->
(empty this time)
Unicode 15.1:
(none this time)
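The egrep/diff steps above can be sketched as one function that takes an old and a new ppucd.txt and prints the matching lines that exist only in the new file. The function name new_nv4_lines is hypothetical. Like the diff above, it compares whole lines, so a digit whose other properties changed is reported as well.

```shell
# Usage: new_nv4_lines <old ppucd.txt> <new ppucd.txt>
# Prints lines with gc=Nd and nv=4 that are only in the new file.
new_nv4_lines() {
    old=$1
    new=$2
    tmp_old=$(mktemp)
    tmp_new=$(mktemp)
    grep -E ';gc=Nd.+;nv=4' "$old" | sort > "$tmp_old"
    grep -E ';gc=Nd.+;nv=4' "$new" | sort > "$tmp_new"
    # comm -13: suppress lines unique to the old file and lines common to both
    comm -13 "$tmp_old" "$tmp_new"
    rm -f "$tmp_old" "$tmp_new"
}
```

Empty output corresponds to the "(empty this time)" result noted above.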
*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
+ remove icudata.jar from the commit so that rebasing is trivial
+ ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
+ ~/icu/uni/src$ git commit -a --amend
+ switch to main, pull updates, switch back to the dev branch
+ ~/icu/uni/src$ git rebase main
+ rebuild icudata.jar
+ ~/icu/uni/src$ git commit -a --amend
+ ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
https://github.com/unicode-org/unicodetools
---------------------------------------------------------------------------- ***
CLDR 43 root collation update for ICU 73
Partial update only for the root collation.
See
- https://unicode-org.atlassian.net/browse/CLDR-15946
Treat quote marks as equivalent when strength=UCOL_PRIMARY
- https://github.com/unicode-org/cldr/pull/2691
CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
- https://github.com/unicode-org/cldr/pull/2833
CLDR-15946 make fancy quotes secondary-different from each other
The related changes to tailorings were already integrated in an earlier PR for
https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.
This update is for the root collation,
which is handled by different tools than the locale data updates.
* Command-line environment setup
export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt73b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
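Since most command lines in this log depend on these variables, a small guard can catch a console window where the exports were not pasted. This is a sketch only; the function name check_update_env is hypothetical.

```shell
# Report which of the update variables are unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_update_env() {
    missing=0
    for name in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA; do
        # indirect lookup of the variable named by $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return $missing
}

# Example: check_update_env && icu4c/source/data/unidata/generate.sh
```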
*** Configure: Build Unicode data for ICU4J
cd $ICU_ROOT/dbg/icu4c
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
* Bazel build process
See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
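Copying the version number into the config file can be sketched as follows. This assumes that $ICU_SRC/.bazeliskrc pins the version with a USE_BAZEL_VERSION=<version> line (the setting that Bazelisk documents for this file); check the actual file before relying on it. The function name set_bazel_version is hypothetical.

```shell
# Usage: set_bazel_version <version> <path to .bazeliskrc>
# Replaces any existing USE_BAZEL_VERSION line and keeps all other lines.
set_bazel_version() {
    version=$1
    rc_file=$2
    tmp=$(mktemp)
    if [ -f "$rc_file" ]; then
        grep -v '^USE_BAZEL_VERSION=' "$rc_file" > "$tmp"
    fi
    printf 'USE_BAZEL_VERSION=%s\n' "$version" >> "$tmp"
    # write back in place so that the file keeps its permissions
    cat "$tmp" > "$rc_file"
    rm -f "$tmp"
}

# Example (the version number is a placeholder):
# set_bazel_version 1.2.3 "$ICU_SRC/.bazeliskrc"
```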
* generate data files
- remember to define the environment variables
(see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional, usually not necessary:
bazelisk clean
or even
bazelisk clean --expunge
- build/bootstrap/generate new files:
icu4c/source/data/unidata/generate.sh
* collation: CLDR collation root, UCA DUCET
- UCA DUCET goes into Mark's Unicode tools,
and a tool-tailored version goes into CLDR, see
https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
(note removing the underscore before "Rules")
cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
from the CLDR root files (..._CLDR_..._SHORT.txt)
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
CLDR common/collation/root.xml <collation type="private-unihan">
and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)
* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
(the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
utility/MultithreadTest/TestCollators will fail as well;
fix the conformance test before looking into the multi-thread test
* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
you need to reconfigure with unicore data; see the "configure" line above.
output:
...
make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
separate from the other data files,
and then refresh ICU4J
cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- new for ICU 73: also copy the binary data files directly into the ICU4J tree
cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
cd $ICU_SRC/icu4c/source/data/unidata
cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
cd ../../test/testdata
cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
* run & fix ICU4J tests
*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
+ remove icudata.jar from the commit so that rebasing is trivial
+ ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
+ ~/icu/uni/src$ git commit -a --amend
+ switch to main, pull updates, switch back to the dev branch
+ ~/icu/uni/src$ git rebase main
+ rebuild icudata.jar
+ ~/icu/uni/src$ git commit -a --amend
+ ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
https://github.com/unicode-org/unicodetools
---------------------------------------------------------------------------- ***
Unicode 15.0 update for ICU 72
https://www.unicode.org/versions/Unicode15.0.0/
https://www.unicode.org/versions/beta-15.0.0.html
https://www.unicode.org/Public/15.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-29.html
https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)
* Command-line environment setup
export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt72b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
so that the makefiles see the new version number.
cd $ICU_ROOT/dbg/icu4c
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
*** data files & enums & parser code
* download files
- same as for the early Unicode Tools setup and data refresh:
https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
+ subfolders: emoji, idna, security, ucd, uca
+ old way of fetching files: from the "Public" area on unicode.org
~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
~ split Unihan into single-property files
~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
+ new way of fetching files, if available:
copy the files from a Unicode Tools workspace that is up to date with
https://github.com/unicode-org/unicodetools
and which might at this point be *ahead* of "Public"
~ before the Unicode release copy files from "dev" subfolders, for example
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
+ get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
or from the UCD/cldr/ output folder of the Unicode Tools:
Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
or
cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
* for manual diffs and for Unicode Tools input data updates:
remove version suffixes from the file names
~$ unidata/desuffixucd.py $UNICODE_DATA
(see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ For debugging, and tweaking how ppucd.txt is written,
the tool has an --only_ppucd option:
py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
* new constants for new property values
- preparseucd.py error:
ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
= PropertyValueAliases.txt new property values (diff old & new .txt files)
~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
+age; 15.0 ; V15_0
+blk; Arabic_Ext_C ; Arabic_Extended_C
+blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
+blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
+blk; Devanagari_Ext_A ; Devanagari_Extended_A
+blk; Kaktovik_Numerals ; Kaktovik_Numerals
+blk; Kawi ; Kawi
+blk; Nag_Mundari ; Nag_Mundari
+sc ; Kawi ; Kawi
+sc ; Nagm ; Nag_Mundari
-> add new blocks to uchar.h before UBLOCK_COUNT
use long property names for enum constants,
for the trailing comment get the block start code point: diff old & new Blocks.txt
~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
+10EC0..10EFF; Arabic Extended-C
+11B00..11B5F; Devanagari Extended-A
+11F00..11F5F; Kawi
-13430..1343F; Egyptian Hieroglyph Format Controls
+13430..1345F; Egyptian Hieroglyph Format Controls
+1D2C0..1D2DF; Kaktovik Numerals
+1E030..1E08F; Cyrillic Extended-D
+1E4D0..1E4FF; Nag Mundari
+31350..323AF; CJK Unified Ideographs Extension H
(ignore blocks whose end code point changed)
-> add new blocks to UCharacter.UnicodeBlock IDs
Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
replace public static final int \1_ID = \2; \3
-> add new blocks to UCharacter.UnicodeBlock objects
Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
-> add new scripts to uscript.h & com.ibm.icu.lang.UScript
Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
replace public static final int \1 = \2; \3
-> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
and in com.ibm.icu.dev.test.lang.TestUScript.java
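Note on finding the new blocks: instead of reading the Blocks.txt diff and ignoring blocks whose end code point changed, the new blocks can be listed directly by name. A sketch, assuming Blocks.txt data lines have the form "start..end; Block Name" as in the diff output above; the function name new_blocks is hypothetical.

```shell
# Usage: new_blocks <old Blocks.txt> <new Blocks.txt>
# Prints the data lines of the new file whose block name
# does not occur in the old file.
new_blocks() {
    awk -F'; ' '
        # first file: remember the block names
        NR == FNR { if ($0 ~ /^[0-9A-F]/) old[$2] = 1; next }
        # second file: print data lines with an unknown block name
        /^[0-9A-F]/ && !($2 in old)
    ' "$1" "$2"
}

# Example with the folders used above:
# new_blocks uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt
```

The start of each printed range is the code point for the trailing comment in uchar.h.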
* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
(not strictly necessary for NOT_ENCODED scripts)
$ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
* build ICU
to make sure that there are no syntax errors
$ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
* update spoof checker UnicodeSet initializers:
inclusionPat & recommendedPat in i18n/uspoof.cpp
INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files
* Bazel build process
See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
* generate data files
- remember to define the environment variables
(see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional, usually not necessary:
bazelisk clean
- build/bootstrap/generate new files:
icu4c/source/data/unidata/generate.sh
* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
- Unicode 6.0..15.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update
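The grep above also lists ASCII characters. A sketch that keeps only the non-ASCII lines, assuming each data line starts with a code point or range of at least four hex digits; the function name non_ascii_std3_valid is hypothetical.

```shell
# Usage: non_ascii_std3_valid <IdnaMappingTable.txt>
# Prints disallowed_STD3_valid lines whose first code point is above U+007F.
non_ascii_std3_valid() {
    grep 'disallowed_STD3_valid' "$1" |
        # drop lines that start with a code point in 0000..007F
        grep -Ev '^00[0-7][0-9A-F]([^0-9A-F]|$)'
}

# Example:
# non_ascii_std3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
```

For Unicode 6.0..15.0 the expected result is the three characters listed above.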
* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
so at this time we might get some collation test failures.
Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
(no rule changes in Unicode 15)
- update CLDR GraphemeBreakTest.txt
cd ~/unitools/mine/Generated
cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures
* collation: CLDR collation root, UCA DUCET
- UCA DUCET goes into Mark's Unicode tools,
and a tool-tailored version goes into CLDR, see
https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
(note removing the underscore before "Rules")
cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
from the CLDR root files (..._CLDR_..._SHORT.txt)
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
CLDR common/collation/root.xml <collation type="private-unihan">
and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
copy & paste the initializers of lcccIndex[] etc. from
ICU4C/source/i18n/collationfcd.cpp to
ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)
* Unihan collators
https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
instructions inspired by
https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
+ setup:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
(didn't work without setting JAVA_HOME,
nor with the Google default of /usr/local/buildtools/java/jdk
[Google security limitations in the XML parser])
export TOOLS_ROOT=~/icu/uni/src/tools
export CLDR_DIR=~/cldr/uni/src
export CLDR_DATA_DIR=~/cldr/uni/src
(pointing to the "raw" data rather than to cldr-staging/.../production should be OK for the relevant files)
cd "$TOOLS_ROOT/cldr/lib"
./install-cldr-jars.sh "$CLDR_DIR"
+ generate the files we need
cd "$TOOLS_ROOT/cldr/cldr-to-icu"
ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
+ diff
cd $ICU_SRC
meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
+ copy into the source tree
cd $ICU_SRC
cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C
* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
(the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
utility/MultithreadTest/TestCollators will fail as well;
fix the conformance test before looking into the multi-thread test
* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
you need to reconfigure with unicore data; see the "configure" line above.
output:
...
make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
mkdir -p /tmp/icu4j/main/shared/data
cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
separate from the other data files,
and then refresh ICU4J
cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
cd $ICU_SRC/icu4c/source/data/unidata
cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
cd ../../test/testdata
cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
* run & fix ICU4J tests
*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
for example:
~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt