A. Algorithm to generate the Prefix Code Table for the Pattern of size 3 to 9 bases with Codeword
1. To create Codeword for each Pattern of size 3 to 9 bases, the Patterns present in the DNA sequence file are sorted in increasing order by frequency. Totally 7 Prefix code Table is required for the Pattern of size 3 to 9 bases. For each Prefix code Table, Extended Binary Tree called Huffman coding is generated and unique Codeword is assigned. Procedure for building a Huffman Prefix code tree is given as follows.
a. Create a list of free nodes, where each node corresponds to each Pattern P0, P1, P2, and P3 and so on.
b. Select two free nodes with the lowest frequency Pattern from the list.
c. Create a parent node for two nodes found in b) with a frequency equal to the sum of the two child nodes.
d. Remove the two nodes found in b) from the list of free nodes, and add the parent node created in c) to the list.
e. Repeat the process starting from b) until only a single tree remains.
f. After building the Huffman Pattern Code tree, Assign 0 for a left branch and 1 for a right branch.
g. Create a Prefix Codeword for each Pattern by traversing the binary tree from the root to the node, which corresponds to the Pattern. The Prefix Codeword generated by the Huffman Pattern Code tree is uniquely decodable because Prefix Codeword is a Codeword in which prefix part of a Codeword is not a Codeword by itself.
2. Repeat the steps a to g for Pattern of size 3 to 9 bases and generate 7 PatternCode Table.
3. The number of patterns in each of the Prefix code Table is limited by checking the number of bits required to represent a pattern is less than or equal to n x 2, where n is the number of bases in the pattern. Length-limited Huffman coding is used to attain a minimum weighted path length.
4. The remaining Patterns from the Prefix code Table can be removed for better compression.
5. The compressed file consists of three different regions: Header, Compressed file and the length of the file header. The Header contains all the information like Prefix code Table content that must be known to decode the compressed code regions.
The compressed file is structured as:
Ø Blocks of bits representing the content of the Prefix code Table 000-110 of exact 3 bytes to 9 bytes pattern that provide the maximum amount of compression in the encoding pass generated by the Extended Binary Tree based on Huffman Coding required at the time of encoding as well as decoding process.
Ø Blocks of bits representing the Pattern Code 000-111.
Ø Blocks of bit representing the Codeword of identified patterns with the pattern type 0/1 in a contiguous form to indicate the repeated patterns and also the reversed pattern.
Ø Blocks of bit patterns representing compressed data in a contiguous form.