	Unicode support in busybox

There are several scenarios where we need to handle Unicode
correctly.

	Shell input

We want to correctly handle input of Unicode characters.
There are several problems with it. Just handling input
as a sequence of bytes would break any editing. This was fixed,
and lineedit now operates on an array of wchar_t's.
But we also need to handle the following problems:

* It is unreasonable to expect that the output device supports
  _any_ Unicode char. Perhaps we need to avoid printing
  those chars which are not supported by the output device.
  Examples: chars which are not present in the font,
  chars which are not assigned in Unicode,
  combining chars (especially trying to combine bad pairs:
  a_chinese_symbol + "combining grave accent" = ??!)

* We need to account for the fact that Unicode chars have
  different widths: 0 for combining chars, 1 for usual,
  2 for ideograms (are there 3+ wide chars?).

* Bidirectional handling. If the user wants to echo a phrase
  in Hebrew, he types: echo "srettel werbeH"
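
The width rule in the second bullet can be sketched as follows. This is
only an illustration in Python, not busybox code: char_width() and
display_width() are hypothetical names, and the classification relies on
Python's unicodedata tables, which may differ from what a given terminal
or font does. Control characters are not treated specially here. The
East Asian Width data (tr11) only distinguishes narrow from wide, so the
sketch never returns more than 2.

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate number of terminal columns taken by one code point."""
    # Combining marks and format controls are assumed to take no column.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Wide and fullwidth characters (e.g. CJK ideographs) take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is assumed to take one column.
    return 1

def display_width(text: str) -> int:
    """Sum of the per-character widths of a string."""
    return sum(char_width(ch) for ch in text)
```

A line editor would use such a function to decide how far to move the
cursor when a character is inserted or deleted.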

	Editors (vi, ed)

This case is a bit similar to "shell input", but unlike the shell,
editors may encounter many more unexpected Unicode sequences
(try to load a random binary file...), and they need to preserve
them, unlike the shell, which can afford to drop bogus input.
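
One possible way to preserve bogus input is to map every undecodable
byte to a private placeholder and map it back on save. The sketch below
shows the idea with Python's "surrogateescape" error handler. It is not
how busybox vi works; load_buffer() and save_buffer() are hypothetical
names used only for illustration.

```python
def load_buffer(raw: bytes) -> str:
    """Decode file contents, keeping invalid bytes as placeholders."""
    # Each invalid byte 0xNN becomes the lone surrogate U+DCNN
    # instead of being dropped or replaced.
    return raw.decode("utf-8", errors="surrogateescape")

def save_buffer(text: str) -> bytes:
    """Encode the buffer, turning placeholders back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")
```

With this scheme, loading a file and saving it unchanged yields the same
bytes, even if the file is not valid UTF-8.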

	more, less

Need to correctly display any input file. Ideally, with an
ASCII/unicode/filtered_unicode option or keyboard switch.
Note: need to handle tabs and backspaces specially
(bksp is for manpage compat).
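
A rough sketch of the tab and backspace handling, assuming that
backspaces come from manpage-style overstrikes ("N\bN" for bold,
"_\bN" for underline) and that tab stops are every 8 columns. Both
render_line() and char_width() are hypothetical helpers, and the sketch
simply discards the overstruck character rather than showing bold or
underline.

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate column width: 0 combining, 2 wide, 1 otherwise."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def render_line(line: str, tabstop: int = 8) -> str:
    """Collapse overstrikes and expand tabs by display column."""
    cells = []  # characters kept for display, in order
    for ch in line:
        if ch == "\b":
            if cells:
                cells.pop()  # the next character replaces this one
        elif ch == "\t":
            # Column position is the summed width, not the character count.
            col = sum(char_width(c) for c in cells)
            cells.extend(" " * (tabstop - col % tabstop))
        else:
            cells.append(ch)
    return "".join(cells)
```

Note that a backspace after a base char plus combining mark removes only
the mark here; a real pager would need to treat the pair as one cell.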

	cut, fold, watch

May need the ability to cut a Unicode string to a specified number
of wchars and/or to a specified screen width. Need to handle tabs
specially.
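
Cutting to a number of wchars is plain indexing; cutting to a screen
width needs per-character widths. A minimal sketch of the latter,
assuming the same width rule as in "Shell input" (0 for combining,
2 for wide, 1 otherwise). truncate_to_width() is a hypothetical name,
and tabs are not handled here.

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate column width: 0 combining, 2 wide, 1 otherwise."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def truncate_to_width(text: str, max_cols: int) -> str:
    """Return the longest prefix of text that fits in max_cols columns."""
    out = []
    used = 0
    for ch in text:
        w = char_width(ch)
        if used + w > max_cols:
            break  # a wide char that would straddle the limit is dropped
        out.append(ch)
        used += w
    return "".join(out)
```

Since zero-width chars never exceed the limit, a combining mark stays
attached to the last base char that fits.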

	sed, awk, grep

Need to handle Unicode-aware regexp matching.
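
To show what "unicode-aware" changes, the sketch below runs the same
pattern once over decoded characters and once over raw UTF-8 bytes,
using Python's re module. It is only an illustration of the difference
and says nothing about the regex engine busybox uses; matches_chars()
and matches_bytes() are hypothetical helpers.

```python
import re

def matches_chars(pattern: str, text: str) -> bool:
    """Match against decoded characters: '.' consumes one code point."""
    return re.fullmatch(pattern, text) is not None

def matches_bytes(pattern: str, text: str) -> bool:
    """Match against raw UTF-8: '.' consumes one byte, \\w is ASCII-only."""
    return re.fullmatch(pattern.encode("utf-8"), text.encode("utf-8")) is not None
```

For the five-letter word "naïve" (six bytes in UTF-8), ".{5}" matches
only in character mode, and "\w+" fails in byte mode because the bytes
of "ï" are not ASCII word characters.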

	ls (multi-column display)

ls will fail to line up columnar output if it does not account
for character widths (and maybe filter out some of them, see
above). OTOH, non-columnar views (ls -1, ls -l, ls | cat)
should NOT filter out bad Unicode (but need to filter out
control chars; coreutils does that). Note that unlike more/less,
tabs and backspaces need no special handling.
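
A small sketch of both steps, control-char filtering and width-aware
padding. Replacing control chars with '?' is an assumption modeled on
"ls -q"; sanitize(), pad_to_width() and char_width() are hypothetical
names, and the width rule is the simplified one from "Shell input".

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate column width: 0 combining, 2 wide, 1 otherwise."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def sanitize(name: str) -> str:
    """Replace control characters (category Cc) with '?'."""
    return "".join("?" if unicodedata.category(ch) == "Cc" else ch for ch in name)

def pad_to_width(name: str, cols: int) -> str:
    """Pad with spaces so the name occupies cols screen columns."""
    width = sum(char_width(ch) for ch in name)
    return name + " " * max(0, cols - width)
```

Padding by len(name) instead would misalign any column that contains
wide or combining characters.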

	top, ps

Need to perform filtering similar to ls.

	Filename display (in error messages and elsewhere)

Need to perform filtering similar to ls.


TODO: write an email to Asmus Freytag (asmus@unicode.org),
author of http://unicode.org/reports/tr11/