Apache Hadoop for large-scale data processing using machine learning techniques

Nidaa Ghalib Ali; Mohanaed Ajmi Falih; Ali Ajmi Falih

Abstract

As Big Data grows in volume and variety, more advanced processing technology is needed. This paper discusses Volume, Variety, and Velocity, known as the 3Vs of Big Data, along with Valence and Veracity. As organizations struggle with these complexities, Apache Spark emerges as a technology that can overcome the limitations of Hadoop MapReduce and enable real-time analytics. The study evaluates the K-Nearest Neighbors (KNN) algorithm on structured data, Decision Tree regression on unstructured data, and logistic regression on semi-structured data. The algorithms performed well on structured data; however, all the models failed to predict unstructured data. An examination of framework performance demonstrates the computational efficiency of Apache Hadoop and Apache Spark; in terms of processing speed, Spark outperformed Hadoop across all data types and algorithms. These results indicate that Big Data requires advanced analytical tools, and that Apache Spark is a modern, high-performance data processing framework that enables organizations to manage Big Data in real time.

Keywords

Big Data, Hadoop, Spark, Machine learning

How to Cite

Ghalib Ali, N., Ajmi Falih, M., & Ajmi Falih, A. (2026). Apache Hadoop for large-scale data processing using machine learning techniques. Future Technology, 5(3), 128–138. Retrieved from https://fupubco.com/futech/article/view/762


