Iranian Journal of Official Statistics Studies

Big Data Analysis of Households Income and Expenditure by Applying Hadoop Distributed File System

Language: Persian. Subject: Special. Article type: Research.
Big data is one of the most important resources in today's world: the analyses performed on it yield valuable information and knowledge. Over the last two decades, the volume of such data has grown steadily. The Hadoop framework, written in Java, is one of the most widely used tools for distributing and processing big data. It allows large data sets to be processed on a cluster of machines and facilitates the management of semi-structured and unstructured data.

In Iran, as in other countries, household data are collected every year as part of official statistics. These data contain valuable information, but results are published only at the national and provincial levels; so far, no results have been extracted at the county level. The purpose of this study is to use the Hadoop framework to distribute and process household data at the level of the counties of a province, and then to use the extracted information for analysis.

Based on the proposed model, the data of the country's 31 provinces were grouped into 4 clusters, and 4 virtual machine servers with 4 nodes were allocated for the deployment. The raw data were converted from SQL to CSV, loaded into HDFS, and processed with Map/Reduce operations.
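To make the Map/Reduce step concrete, the following is a minimal local sketch of a per-county aggregation of the kind this abstract describes. It is not the authors' implementation: the CSV column names (`province`, `county`, `comm_expenditure`, `uses_internet`), the county codes, and the sample values are all hypothetical, and the map, shuffle, and reduce phases are simulated in plain Python rather than run on a Hadoop cluster.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout and values (not from the paper): one row per household.
SAMPLE_CSV = """province,county,comm_expenditure,uses_internet
01,0101,120.0,1
01,0101,80.0,0
01,0102,200.0,1
01,0102,100.0,1
01,0102,60.0,0
"""

def map_household(row):
    """Map step: emit (county, (expenditure, internet_flag, 1)) for one household."""
    key = row["county"]
    value = (float(row["comm_expenditure"]), int(row["uses_internet"]), 1)
    return key, value

def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_county(key, values):
    """Reduce step: average expenditure and share of Internet-using households."""
    total_exp = sum(v[0] for v in values)
    total_net = sum(v[1] for v in values)
    count = sum(v[2] for v in values)
    return key, {
        "households": count,
        "avg_comm_expenditure": total_exp / count,
        "internet_share": total_net / count,
    }

def run_job(csv_text):
    """Run map, shuffle, and reduce over CSV text and return per-county results."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapped = (map_household(row) for row in reader)
    grouped = shuffle(mapped)
    return dict(reduce_county(k, vs) for k, vs in sorted(grouped.items()))

if __name__ == "__main__":
    for county, stats in run_job(SAMPLE_CSV).items():
        print(county, stats)
```

Emitting partial sums and a count from the mapper, rather than a ready-made average, keeps the reducer correct when one county's households are spread across several input splits.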
In line with the objectives of this research, outputs such as a household's average communication expenditure and the Internet-use indicator were extracted at the county level for province 01 and compared.

The same information and indicators can also be extracted and analyzed more broadly, for the counties of other provinces and even at the village level. Based on the results, it is suggested that the Hadoop distributed file system be used to prepare household data, which are currently collected centrally, offline, and with a delay, faster than in the past. Timely outputs would in turn allow faster and better analyses. It is also suggested that a Hadoop-based distributed system could link the annual county-level household information with the country's population census data, filling statistical gaps and completing the household access indicators.

Keywords: Hadoop framework, Hadoop Distributed File System, MapReduce, Big Data, Household Data.

Pages: 97-123

http://ijoss.srtc.ac.ir/browse.php?a_code=A-10-341-2&slc_lang=fa&sid=1

Reza Alipour (corresponding author), rezaalipour955@gmail.com, Iran University of Science and Technology
Reza Entezari Maleki, entezari@iust.ac.ir, Iran University of Science and Technology