libbb/sha1: in unrolled x86-64 code, pass initial W[] in registers, not on stack

This can be faster on some CPUs.
On Skylake, load latency from L1 (or store-to-load forwarding
in the LSU) is evidently low enough to completely hide the
memory reference latencies here, so the change makes little
difference there.
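
As a rough C illustration of the idea (not the actual assembly in this
commit), the sketch below runs SHA-1 rounds 0..15 in two ways: one
stages the byteswapped W[] in a stack array and reads it back, the
other keeps each pair of words in a 64-bit local that a compiler can
hold in a register. The names rounds_0_15_mem, rounds_0_15_reg,
round_ch and load_be32 are hypothetical, and packing two words per
64-bit value is an assumption made for the example.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Read one big-endian 32-bit word from the input block. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
	     | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* One SHA-1 round from the 0..19 group: f = Ch(b,c,d), K = 0x5A827999. */
static void round_ch(uint32_t s[5], uint32_t w)
{
	uint32_t a = s[0], b = s[1], c = s[2], d = s[3], e = s[4];
	uint32_t t = rotl32(a, 5) + (((c ^ d) & b) ^ d) + e + 0x5A827999u + w;

	s[4] = d;
	s[3] = c;
	s[2] = rotl32(b, 30);
	s[1] = a;
	s[0] = t;
}

/* Variant 1: store all of W[0..15] to the stack, then load it back. */
static void rounds_0_15_mem(uint32_t s[5], const uint8_t *blk)
{
	uint32_t W[16];
	int i;

	for (i = 0; i < 16; i++)
		W[i] = load_be32(blk + 4 * i);
	for (i = 0; i < 16; i++)
		round_ch(s, W[i]);
}

/* Variant 2: keep two words per 64-bit local, no W[] array at all.
 * (Assumed packing; whether it stays in a register is up to the compiler.)
 */
static void rounds_0_15_reg(uint32_t s[5], const uint8_t *blk)
{
	int i;

	for (i = 0; i < 16; i += 2) {
		uint64_t pair = ((uint64_t)load_be32(blk + 4 * i) << 32)
		              | load_be32(blk + 4 * i + 4);

		round_ch(s, (uint32_t)(pair >> 32));
		round_ch(s, (uint32_t)pair);
	}
}
```

Both variants compute the same state. They differ only in whether the
first 16 message words make a round trip through memory, which is the
trade-off described above.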

function                                             old     new   delta
sha1_process_block64                                3495    3514     +19

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
2 files changed