Nearly Tight Bounds on the Encoding Length of the Burrows-Wheeler Transform.

Document Type

Article

Publication Date

January 2008

Abstract

In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT), motivated by the text indexing setting. For a text T of n symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the hth-order empirical entropy H_h of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT by constructing an infinite (non-trivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound on the encoding length is nearly tight with this lower bound for the class of texts we describe.

In designing our BWT encoding and its lower bound, we also address the t-subset problem; here, the goal is to store a subset of t items drawn from a universe [1‥n] using just lg C(n, t) + O(1) bits of space, where C(n, t) is the binomial coefficient "n choose t". A number of solutions to this basic problem are known; however, encoding or decoding usually requires either O(t) operations on large integers [Knu05, Rus05] or O(n) operations. We provide a novel approach that reduces the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.
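To make the t-subset space bound concrete, here is a minimal Python sketch of the classical combinatorial number system, which maps each t-subset of [1‥n] to a distinct integer in [0, C(n, t) − 1], where C(n, t) is the binomial coefficient, and therefore fits in ⌈lg C(n, t)⌉ bits. This is not the paper's method: it works on arbitrarily large integers, which is the cost the abstract says the new approach avoids. The function names `rank_subset`, `unrank_subset`, and `bits_needed` are illustrative and do not come from the paper.

```python
from math import comb


def rank_subset(subset):
    """Map a t-subset of {1..n} to an integer in [0, C(n, t) - 1].

    Classical combinatorial number system (illustrative, not the paper's
    scheme): with elements shifted to 0-based and sorted ascending, the
    element c at position i (0-based) contributes C(c, i + 1).
    """
    items = sorted(x - 1 for x in subset)
    return sum(comb(c, i + 1) for i, c in enumerate(items))


def unrank_subset(rank, t):
    """Invert rank_subset: recover the sorted t-subset from its rank."""
    result = []
    for i in range(t, 0, -1):
        # Greedily pick the largest 0-based value c with C(c, i) <= rank.
        c = i - 1
        while comb(c + 1, i) <= rank:
            c += 1
        rank -= comb(c, i)
        result.append(c + 1)  # shift back to 1-based
    return sorted(result)


def bits_needed(n, t):
    """Bits required to store any rank in [0, C(n, t) - 1]."""
    return (comb(n, t) - 1).bit_length()
```

For example, with n = 5 and t = 3 there are C(5, 3) = 10 subsets, so `bits_needed(5, 3)` is 4, and `unrank_subset(rank_subset({3, 4, 5}), 3)` recovers `[3, 4, 5]`. The linear search in `unrank_subset` is kept for clarity and is not meant to match the running time discussed in the abstract.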
