presents the results of an investigation comparing the
proposed distributed storage deduplication with
traditional backup storage deduplication on a range of
datasets. Figures 4 to 7 show the test results. The
results of data deduplication in a Hadoop environment
presented above demonstrate significant improvements
in storage efficiency and overall system performance.
By utilizing MD5 hashing to identify and eliminate
redundant data chunks, the total storage space required
for the dataset was markedly reduced. This reduction
lowers storage expenses and makes room for more data
by reducing the need for additional storage
infrastructure.
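The effect of chunk-level deduplication on storage space can be illustrated with a minimal sketch. It assumes fixed-size chunking and an in-memory dictionary as the chunk store; the names `dedup_chunks`, `restore`, `space_saved` and the 4-byte chunk size are hypothetical choices made for illustration and are not taken from the implementation evaluated here.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny, for illustration only (assumption)


def dedup_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and keep one copy per MD5 digest."""
    store = {}   # digest -> chunk bytes (unique chunks only)
    recipe = []  # ordered digests needed to rebuild the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.md5(chunk).hexdigest()
        store.setdefault(digest, chunk)  # store the chunk only the first time
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the chunk store and recipe."""
    return b"".join(store[d] for d in recipe)


def space_saved(data: bytes, store) -> float:
    """Fraction of the original size that no longer needs to be stored."""
    stored = sum(len(c) for c in store.values())
    return 1 - stored / len(data)


data = b"ABCDABCDABCDWXYZ"           # three identical chunks plus one unique
store, recipe = dedup_chunks(data)
print(len(store), space_saved(data, store))
```

In this toy input, four chunks collapse to two stored chunks, so half of the original bytes no longer need to be kept, while `restore` still reproduces the data exactly.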
4 CONCLUSIONS
Storage efficiency in HDFS relies heavily on data
deduplication techniques. Methods such as MD5 hashing
can be effectively employed to identify and eliminate
redundant data, thereby decreasing storage expenses and
enhancing processing effectiveness. This paper presents
a hash-based method for data deduplication using MD5,
applied in a distributed setting made possible by the
Hadoop architecture. By making use of the mapper and
reducer functions, the method maximizes storage space
by ensuring that only unique files are kept and
redundant ones are removed. Each file is hashed with
MD5, and the resulting digest acts as a fingerprint that
enables bucket-based indexing of the files. By
accelerating hash value computation, this
all-encompassing method not only improves storage
efficiency but also dramatically increases
computational performance. Additionally, by greatly
raising the deduplication ratio, the method reduces the
overall storage footprint. Its remarkable ability to
recognize and handle redundant components is among the
factors that make it so successful for data management
jobs. By combining Hadoop's distributed processing
capability with the benefits of MD5 hashing, this work
can provide a dependable solution for data redundancy
issues in large-scale storage systems.
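The mapper and reducer roles summarized above can be sketched in plain Python, without a Hadoop cluster. This is only an illustrative simulation under stated assumptions: the functions `mapper`, `shuffle`, `reducer` and `deduplicate`, the example paths, and the rule of keeping the alphabetically first path per digest are all hypothetical and do not reproduce the paper's actual job code.

```python
import hashlib
from collections import defaultdict


def mapper(path, content):
    """Emit one (MD5 digest, path) pair per file."""
    yield hashlib.md5(content).hexdigest(), path


def shuffle(pairs):
    """Group paths by digest, standing in for the framework's shuffle phase."""
    buckets = defaultdict(list)  # digest -> list of paths (the "bucket")
    for key, value in pairs:
        buckets[key].append(value)
    return buckets


def reducer(digest, paths):
    """Keep one path per digest; the remaining paths are duplicates."""
    ordered = sorted(paths)
    return ordered[0], ordered[1:]


def deduplicate(files):
    """Run map, shuffle and reduce over a {path: bytes} mapping."""
    pairs = [kv for path, content in files.items()
             for kv in mapper(path, content)]
    kept, dropped = [], []
    for digest, paths in shuffle(pairs).items():
        keep, dups = reducer(digest, paths)
        kept.append(keep)
        dropped.extend(dups)
    return sorted(kept), sorted(dropped)


# Hypothetical input: two files share the same content.
files = {
    "/data/a.txt": b"hello",
    "/data/b.txt": b"hello",
    "/data/c.txt": b"world",
}
kept, dropped = deduplicate(files)
print(kept, dropped)
```

Because MD5 collisions are possible in principle, a cautious deployment might compare file contents byte for byte before removing a file flagged as a duplicate; that check is omitted here for brevity.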
ACKNOWLEDGEMENTS
Endless thanks go to the Lord Almighty for all the
blessings He has showered on me, which have enabled me
to write this last note in my research work. During the
period of my research, as in the rest of my life, I have
been blessed by the Almighty with some extraordinary
people who have spun a web of support around me. Words
can never be enough to express how grateful I am to
those incredible people in my life who made this thesis
possible. I would like to attempt to thank them for
making my time during my research at the Institute a
period I will treasure. I am deeply indebted to my
research supervisor, Professor Dr G UMA DEVI, for such
an interesting thesis topic. Each meeting with her added
valuable aspects to the implementation and broadened
my perspective. She has guided me with her invaluable
suggestions, lit the way in my darkest times, and
encouraged me greatly in my academic life.