Unicode support in busybox

There are several scenarios where we need to handle unicode
correctly.

Shell input

We want to correctly handle input of unicode characters.
Handling input as a sequence of bytes would break editing.
This was fixed: lineedit now operates on an array of wchar_t's.
But we still need to handle the following problems:

* It is unreasonable to expect that the output device supports
  _any_ unicode char. Perhaps we need to avoid printing
  those chars which are not supported by the output device.
  Examples: chars which are not present in the font,
  chars which are not assigned in unicode,
  combining chars (especially trying to combine bad pairs:
  a_chinese_symbol + "combining grave accent" = ??!)

* We need to account for the fact that unicode chars have
  different widths: 0 for combining chars, 1 for usual,
  2 for ideograms (are there 3+ wide chars?).

* Bidirectional handling. If a user wants to echo a phrase
  in Hebrew, he types: echo "srettel werbeH"

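The width classes listed above can be sketched as a small lookup.
This is an illustration only: sketch_char_width() is a hypothetical
name, not an existing busybox function, and its ranges are a small,
incomplete subset picked for the example. A real table would be
derived from the East Asian Width data described in
http://unicode.org/reports/tr11/.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of terminal columns for one code point.
 * Returns -1 for control chars, 0 for combining marks, 2 for wide
 * (mostly CJK) chars, 1 for everything else.
 * ASSUMPTION: the ranges below are illustrative and incomplete.
 */
static int sketch_char_width(uint32_t cp)
{
	if (cp == 0)
		return 0;
	/* C0 and C1 control characters have no printable width */
	if (cp < 0x20 || (cp >= 0x7f && cp < 0xa0))
		return -1;
	/* Combining Diacritical Marks block */
	if (cp >= 0x0300 && cp <= 0x036f)
		return 0;
	/* A few well-known wide ranges */
	if ((cp >= 0x1100 && cp <= 0x115f)	/* Hangul Jamo initial consonants */
	 || (cp >= 0x2e80 && cp <= 0xa4cf)	/* CJK radicals .. Yi */
	 || (cp >= 0xac00 && cp <= 0xd7a3)	/* Hangul syllables */
	 || (cp >= 0xf900 && cp <= 0xfaff)	/* CJK compatibility ideographs */
	 || (cp >= 0xff00 && cp <= 0xff60)	/* fullwidth forms */
	) {
		return 2;
	}
	return 1;
}
```

With such a helper, the cursor column after printing a string is the
sum of the widths of its code points, not their count.
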
Editors (vi, ed)

This case is similar to "shell input", but editors may encounter
many more unexpected unicode sequences (try to load a random
binary file...), and they need to preserve them, whereas the
shell can afford to drop bogus input.

more, less

Need to correctly display any input file. Ideally, with an
ASCII/unicode/filtered_unicode option or keyboard switch.
Note: need to handle tabs and backspaces specially
(bksp is for manpage compat).

cut, fold, watch

May need the ability to cut a unicode string to a specified number
of wchars and/or to a specified screen width. Need to handle tabs
specially.

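Cutting to a screen width, as opposed to a number of characters,
might look roughly like the sketch below. All names
(sketch_utf8_decode, sketch_width, sketch_fit_columns) are
hypothetical, the width table is deliberately incomplete, tab stops
are assumed to be every 8 columns, and the decoder does not reject
overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (n >= 1 bytes available).
 * Stores the code point in *cp and returns the bytes consumed.
 * A malformed or truncated sequence yields U+FFFD and consumes 1 byte.
 */
static size_t sketch_utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t len, i;
	uint32_t c = s[0];

	if (c < 0x80) { *cp = c; return 1; }
	if (c >= 0xc2 && c <= 0xdf) { len = 2; c &= 0x1f; }
	else if (c >= 0xe0 && c <= 0xef) { len = 3; c &= 0x0f; }
	else if (c >= 0xf0 && c <= 0xf4) { len = 4; c &= 0x07; }
	else { *cp = 0xfffd; return 1; }
	if (len > n) { *cp = 0xfffd; return 1; }
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80) { *cp = 0xfffd; return 1; }
		c = (c << 6) | (s[i] & 0x3f);
	}
	*cp = c;
	return len;
}

/* ASSUMPTION: tiny illustrative width table, not a complete one */
static unsigned sketch_width(uint32_t cp)
{
	if (cp >= 0x0300 && cp <= 0x036f)
		return 0;	/* combining diacritical marks */
	if ((cp >= 0x1100 && cp <= 0x115f) || (cp >= 0x2e80 && cp <= 0xa4cf)
	 || (cp >= 0xac00 && cp <= 0xd7a3) || (cp >= 0xff00 && cp <= 0xff60))
		return 2;	/* wide, mostly CJK */
	return 1;
}

/*
 * Return how many leading bytes of str (len bytes) fit into max_cols
 * screen columns. Never splits a multibyte character. A tab advances
 * to the next multiple of 8 columns. If used_cols is not NULL, it
 * receives the number of columns actually occupied.
 */
static size_t sketch_fit_columns(const char *str, size_t len,
		unsigned max_cols, unsigned *used_cols)
{
	const unsigned char *s = (const unsigned char *)str;
	size_t pos = 0;
	unsigned col = 0;

	while (pos < len) {
		uint32_t cp;
		size_t n = sketch_utf8_decode(s + pos, len - pos, &cp);
		unsigned w = (cp == '\t') ? 8 - col % 8 : sketch_width(cp);

		if (col + w > max_cols)
			break;
		col += w;
		pos += n;
	}
	if (used_cols)
		*used_cols = col;
	return pos;
}
```

Because zero-width chars never push the column past the limit, a
combining mark that follows the last fitting char stays attached to it.
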
sed, awk, grep

Need to handle unicode-aware regexp matching.

ls (multi-column display)

ls will fail to line up columnar output if it does not account
for character widths (and maybe filter out some chars, see
above). OTOH, non-columnar views (ls -1, ls -l, ls | cat)
should NOT filter out bad unicode (but they need to filter out
control chars; coreutils does that). Note that unlike more/less,
tabs and backspaces need no special handling.

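The control char filtering for non-columnar views could be as simple
as the sketch below. sketch_mask_controls() is a hypothetical name,
and the choice of '?' as the replacement is an assumption made for
the example. Bytes >= 0x80 are left untouched, so unicode (valid or
not) passes through unfiltered.

```c
#include <stddef.h>

/*
 * Replace ASCII control characters (0x01..0x1f and 0x7f) in a
 * NUL-terminated name with '?', in place.
 * Returns the number of bytes replaced.
 */
static size_t sketch_mask_controls(char *name)
{
	size_t replaced = 0;
	unsigned char *p;

	for (p = (unsigned char *)name; *p; p++) {
		if (*p < 0x20 || *p == 0x7f) {
			*p = '?';
			replaced++;
		}
	}
	return replaced;
}
```

This keeps an escape sequence embedded in a filename from reaching
the terminal, while leaving the rest of the name byte-for-byte intact.
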
top, ps

Need to perform filtering similar to ls.

Filename display (in error messages and elsewhere)

Need to perform filtering similar to ls.


TODO: write an email to Asmus Freytag (asmus@unicode.org),
author of http://unicode.org/reports/tr11/