Representation of Data in Computer Systems
- The computer systems , each character is associated with a fixed length of encoding , say each character occupies a byte of the space and is represented by the 8 bits , in binary format . Thus we have the ASCII code , wherein each character is assigned a unique 8 bit representation and in decimal , the number ranges from
0-255
- However , in practice , when we tend to store the data each character becomes associated with a variable frequency , some have a higher frequency of occurrence than others , and under the
uniform
space allocation scheme , it is nit the optimum strategy to store data in this fashion , as it wastes CPU's resources MEMORY
, thus the need for variable
length encoding arises !!!
- If variable length encoding were allowed , one could opt for the greedy choice strategy by giving the most frequently occurring character the minimum bit encoding and , least frequently occurring character , largest bit encoding amongst the encoding for other characters.
Huffman Encoding
provides a way to implement the same .
- In Huffman Encoding implementation , we first iterate over the given text , to generate a character frequency table containing , and arrange them in a tree structure such that the node represents the character with their corresponding frequencies , such that the characters with least frequency are placed at the bottom of the tree .
- We construct a complete binary tree , where the leaf nodes represent the unique characters along with their frequencies , such that the minimum freq ones are at bottom of the tree.
- Next , going bottom-up approach , we replace the root node with frequency equal to the sum of the frequency of their children , and repeat this step until we reach the root of the tree and no more characters are available .
- Now , we use the convention of assigning bit 0 to the left child and 1 to the right child of the non-leaf nodes , and starting from the root of the tree , do this until an encoding is assigned to all the characters of the data.
- The tree has prefix-free property , that ensures that no code word could be a prefix of any other code word , this prevents ambiguity , and thus a variable length encoding could be carried out.
Code Implementation