Similarity-based deduplication is a typical storage technique that finds similar data segments across a large number of data blocks and applies compression methods to reduce the storage space the data occupies.
Identifying the data blocks that contain similar segments in a large data set is the key to the algorithm. Students should aim to design and implement an intuitive mechanism that helps identify these data blocks and thereby improves the data reduction ratio over the baseline algorithm.
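As an illustration only, one possible identification mechanism can be sketched as follows. This is not the baseline algorithm. It assumes a bottom-k sketch over byte shingles as the similarity signal and uses zlib with a preset dictionary as a stand-in for delta compression. The function names (`sketch`, `resemblance`, `choose_bases`, `compress_against`) and the parameters `window`, `k`, and `threshold` are hypothetical choices made for this example.

```python
import hashlib
import zlib


def sketch(block: bytes, window: int = 8, k: int = 32) -> frozenset:
    """Bottom-k sketch: the k smallest hashes over all window-byte shingles."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(block[i:i + window], digest_size=8).digest(), "big"
        )
        for i in range(max(1, len(block) - window + 1))
    }
    return frozenset(sorted(hashes)[:k])


def resemblance(a: frozenset, b: frozenset) -> float:
    """Fraction of sketch features shared by two blocks (a rough similarity score)."""
    return len(a & b) / max(len(a), len(b))


def choose_bases(blocks, threshold: float = 0.5):
    """For each block, pick the most similar earlier block as its base, or None.

    Assumption: a quadratic scan is acceptable for a small demo. A real system
    would index sketch features instead of comparing every pair.
    """
    sketches = [sketch(b) for b in blocks]
    bases = []
    for i, s in enumerate(sketches):
        best, best_score = None, threshold
        for j in range(i):
            score = resemblance(s, sketches[j])
            if score >= best_score:
                best, best_score = j, score
        bases.append(best)
    return bases


def compress_against(base: bytes, block: bytes) -> bytes:
    """Compress block using base as a preset zlib dictionary (delta-like)."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(block) + comp.flush()


def decompress_against(base: bytes, payload: bytes) -> bytes:
    """Inverse of compress_against; the same base must be supplied."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(payload) + decomp.flush()
```

A block whose entry in `choose_bases` is not `None` can be stored as `compress_against(blocks[base], block)`, while the remaining blocks are compressed on their own. Comparing the total stored size with and without this step gives a simple measure of the data reduction ratio.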