Big Data Analysis of Households Income and Expenditure by Applying Hadoop Distributed File System

Alipour, Reza; Entezari Maleki, Reza


Reza Alipour*, Reza Entezari Maleki

Iran University of Science and Technology

Abstract:

Big data is one of the most important resources in today's world; valuable information and knowledge are obtained by analyzing it. Over the last two decades, its volume has grown steadily. The Hadoop framework, written in the Java programming language, is one of the most widely used tools for distributing and processing big data. Hadoop makes it possible to process large data sets on clusters of machines and facilitates the management of semi-structured and unstructured data.

In Iran, as in other countries, household data are collected every year as part of official statistics. These data contain valuable information, but results are published only at the national and provincial levels, and so far no results have been extracted at the city level. The purpose of this study is to use the Hadoop framework to distribute and process household data for the cities of a province and then to use the extracted information for analysis.

Based on the proposed model, the data of the country's 31 provinces were grouped into 4 clusters, and 4 virtual machine servers with 4 nodes were used. The raw data were converted from SQL to CSV, uploaded into HDFS, and then processed with MapReduce operations. In line with the objectives of this research, outputs such as the average household communication expenditure and the Internet indicator were extracted at the city level for province 01, and comparisons were also presented.
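The MapReduce step described above can be illustrated with a minimal local sketch in Python. It is not the authors' implementation: the CSV layout, the column names (`province`, `city`, `communication_expenditure`), and the sample values are all hypothetical, and the map and reduce phases run in a single process rather than on a Hadoop cluster. It only shows how a per-city average could be expressed as a map phase that emits (city, (expenditure, 1)) pairs and a reduce phase that sums and divides.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout and values; the real survey schema is not given in the abstract.
SAMPLE_CSV = """province,city,household_id,communication_expenditure
01,A,1,100
01,A,2,300
01,B,3,50
02,C,4,999
"""


def map_phase(lines, province="01"):
    """Emit (city, (expenditure, 1)) for every household in the chosen province."""
    for row in csv.DictReader(lines):
        if row["province"] == province:
            yield row["city"], (float(row["communication_expenditure"]), 1)


def reduce_phase(pairs):
    """Sum expenditures and household counts per city, then divide to get the mean."""
    totals = defaultdict(lambda: [0.0, 0])
    for city, (amount, count) in pairs:
        totals[city][0] += amount
        totals[city][1] += count
    return {city: total / n for city, (total, n) in totals.items()}


def average_by_city(csv_text, province="01"):
    """Run the map and reduce phases locally over CSV text."""
    return reduce_phase(map_phase(io.StringIO(csv_text), province))


if __name__ == "__main__":
    print(average_by_city(SAMPLE_CSV))
```

Emitting a (sum, count) pair rather than a per-record average keeps the reduce step associative, so partial results from different nodes could be combined before the final division.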

The same information and indicators can be extracted and analyzed more broadly: for the cities of other provinces and even at the village level. The results suggest that, with the Hadoop distributed file system, household data, which are currently collected centrally, offline, and with a delay, can be prepared faster than in the past. Timely outputs would in turn allow faster and better analyses. The results also suggest that the Hadoop distributed system would make it possible to link the annual household information extracted at the city level with the country's population census information, filling the statistical gap in household access indicators.

Keywords: Hadoop framework, Hadoop Distributed File System, MapReduce, Big Data, Household Data.


Type of Study: Research | Subject: Special
Received: 2024/06/12 | Accepted: 2021/08/24

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
