		Keeping data small

When many applets are compiled into busybox, the rw data and
bss of all applets are concatenated, including those from libc
if busybox is built statically. When busybox is started, _all_ of this
data is allocated, not just the part belonging to the selected applet.

What "allocated" means exactly depends on the arch.
On NOMMU it probably bites the most, since rwdata and bss
occupy real RAM there. On i386, bss is lazily allocated
by COWed zero pages. Not sure about rwdata - also COW?

In order to keep busybox friendly to NOMMU and small-mem systems,
we should avoid large global data in our applets, and should
minimize usage of libc functions which implicitly use
such structures.

A small experiment to measure "parasitic" bbox memory consumption:
here we start 1000 "busybox sleep 10" processes in parallel.
The busybox binary is practically an allyesconfig static one,
built against uclibc. Run on an x86-64 machine with a 64-bit kernel:

bash-3.2# nmeter '%t %c %m %p %[pn]'
23:17:28 .......... 168M    0  147
23:17:29 .......... 168M    0  147
23:17:30 U......... 168M    1  147
23:17:31 SU........ 181M  244  391
23:17:32 SSSSUUU... 223M  757 1147
23:17:33 UUU....... 223M    0 1147
23:17:34 U......... 223M    1 1147
23:17:35 .......... 223M    0 1147
23:17:36 .......... 223M    0 1147
23:17:37 S......... 223M    0 1147
23:17:38 .......... 223M    1 1147
23:17:39 .......... 223M    0 1147
23:17:40 .......... 223M    0 1147
23:17:41 .......... 210M    0  906
23:17:42 .......... 168M    1  147
23:17:43 .......... 168M    0  147

Memory usage rises from 168M to 223M, so the 1000 processes require
55M of memory. Thus one trivial busybox applet takes 55k of memory
on a 64-bit x86 kernel.

On a 32-bit kernel we need ~26k per applet.

Script:

i=1000; while test $i != 0; do
	echo -n .
	busybox sleep 30 &
	i=$((i - 1))
done
echo
wait

(Data from NOMMU arches are sought. Provide 'size busybox' output too.)


		Example 1

One example of how to reduce global data usage is in
archival/libarchive/decompress_unzip.c:

/* This is somewhat complex-looking arrangement, but it allows
 * to place decompressor state either in bss or in
 * malloc'ed space simply by changing #defines below.
 * Sizes on i386:
 * text    data     bss     dec     hex
 * 5256       0     108    5364    14f4 - bss
 * 4915       0       0    4915    1333 - malloc
 */
#define STATE_IN_BSS 0
#define STATE_IN_MALLOC 1

(see the rest of the file to get the idea)

This example completely eliminates globals in that module.
The required memory is allocated in unpack_gz_stream() [the module's
main routine] and then passed down as a parameter to all subroutines
which need to access 'globals'.


		Example 2

In case you don't want to pass this additional parameter everywhere,
take a look at archival/gzip.c. Here all global data is replaced by
a single global pointer (ptr_to_globals) to allocated storage.

In order to not duplicate ptr_to_globals in every applet, you can
reuse a single common one. It is defined in libbb/messages.c
as struct globals *const ptr_to_globals, but struct globals itself is
NOT defined in libbb.h. You first define your own struct:

	struct globals { int a; char buf[1000]; };

and then declare that ptr_to_globals is a pointer to it:

	#define G (*ptr_to_globals)

ptr_to_globals is declared as a constant pointer.
This helps gcc understand that it won't change, resulting in noticeably
smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:

	SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));

Typically this is done in <applet>_main().

Now you can reference "globals" as G.a, G.buf and so on, in any function.


		bb_common_bufsiz1

There is one big common buffer in bss - bb_common_bufsiz1. It is a much
earlier mechanism to reduce bss usage. Each applet can use it for
its needs. Library functions are prohibited from using it.

The 'G.' trick can be done using bb_common_bufsiz1 instead of a malloced buffer:

	#define G (*(struct globals*)&bb_common_bufsiz1)

Be careful, though, and use it only if globals fit into bb_common_bufsiz1.
Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
from one libc to another, you have to add a compile-time check for it:

	if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
		BUG_<applet>_globals_too_big();


		Drawbacks

You have to initialize the storage by hand. xzalloc() can be helpful in
clearing allocated storage to 0, but anything more must be done by hand.

All global variables are prefixed by 'G.' now. If this makes code
less readable, use #defines:

	#define dev_fd (G.dev_fd)
	#define sector (G.sector)


		Word of caution

If an applet doesn't use much global data, converting it to use
one of the above methods is not worth the resulting code obfuscation.
If you have less than ~300 bytes of global data - don't bother.


		Finding non-shared duplicated strings

	strings busybox | sort | uniq -c | sort -nr


		gcc's data alignment problem

Adding the following attribute in vi.c:

	static int tabstop;
	static struct termios term_orig __attribute__ ((aligned (4)));
	static struct termios term_vi __attribute__ ((aligned (4)));

reduces bss size by 32 bytes, because gcc sometimes aligns structures to
ridiculously large values. The asm output diff for the above example:

 tabstop:
 	.zero	4
 	.section	.bss.term_orig,"aw",@nobits
-	.align 32
+	.align 4
 	.type	term_orig, @object
 	.size	term_orig, 60
 term_orig:
 	.zero	60
 	.section	.bss.term_vi,"aw",@nobits
-	.align 32
+	.align 4
 	.type	term_vi, @object
 	.size	term_vi, 60

gcc doesn't seem to have options for altering this behaviour.

gcc 3.4.3 and 4.1.1 tested:

	char c = 1;
	// gcc aligns to 32 bytes if sizeof(struct) >= 32
	struct {
		int a,b,c,d;
		int i1,i2,i3;
	} s28 = { 1 };    // struct will be aligned to 4 bytes
	struct {
		int a,b,c,d;
		int i1,i2,i3,i4;
	} s32 = { 1 };    // struct will be aligned to 32 bytes
	// same for arrays
	char vc31[31] = { 1 }; // unaligned
	char vc32[32] = { 1 }; // aligned to 32 bytes

-fpack-struct=1 reduces alignment of s28 to 1 (but probably
will break layout of many libc structs) but s32 and vc32
are still aligned to 32 bytes.

I will try to cook up a patch to add a gcc option for disabling it.
Meanwhile, this is where it can be disabled in gcc source
(sizes and alignments in this function are in bits, so 256 means 32 bytes):

gcc/config/i386/i386.c
int
ix86_data_alignment (tree type, int align)
{
#if 0
  if (AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
           || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
    return 256;
#endif

Result (non-static busybox built against glibc):

# size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
   text    data     bss     dec     hex filename
 634416    2736   23856  661008   a1610 busybox
 632580    2672   22944  658196   a0b14 busybox_noalign



		Keeping code small

Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
run "make bloatcheck", and look at the biggest auto-inlined functions.
Now set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
to some of these functions. In the 1.16.x timeframe, the results were
(annotated "make bloatcheck" output):

function                             old     new   delta
expand_vars_to_list                    -    1712   +1712 win
lzo1x_optimize                         -    1429   +1429 win
arith_apply                            -    1326   +1326 win
read_interfaces                        -    1163   +1163 loss, leave w/o NOINLINE
logdir_open                            -    1148   +1148 win
check_deps                             -    1148   +1148 loss
rewrite                                -    1039   +1039 win
run_pipe                             358    1396   +1038 win
write_status_file                      -    1029   +1029 almost the same, leave w/o NOINLINE
dump_identity                          -     987    +987 win
mainQSort3                             -     921    +921 win
parse_one_line                         -     916    +916 loss
summarize                              -     897    +897 almost the same
do_shm                                 -     884    +884 win
cpio_o                                 -     863    +863 win
subCommand                             -     841    +841 loss
receive                                -     834    +834 loss

855 bytes saved in total.

scripts/mkdiff_obj_bloat may be useful to automate this process: run
"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
and select modules which shrank.