KR101378348B1

KR101378348B1 - Basic prototype of Hadoop cluster based on private cloud infrastructure

Info

Publication number: KR101378348B1
Application number: KR1020130065978A
Authority: KR
Inventors: 김영배; 차병래
Original assignee: 남도정보통신(주); 차병래
Priority date: 2013-06-10
Filing date: 2013-06-10
Publication date: 2014-03-27
Anticipated expiration: 2033-06-10

Abstract

The present invention relates to a basic prototype of a Hadoop cluster based on a private cloud infrastructure. The private cloud uses blade-style hardware so that hardware can be added and performance improved by scaling out; it is applicable to a variety of Hadoop-based workloads and can provide high availability at the hardware, operation, and application levels together with open source management.
To achieve the above object, the basic prototype of a private cloud infrastructure-based Hadoop cluster according to the present invention is a Hadoop cluster including the Hadoop Distributed File System and MapReduce, wherein the Hadoop Distributed File System includes one PC-type name node comprising a central processing unit and a mainboard and at least one PC-type data node comprising a central processing unit and a mainboard, and the cluster further comprises a storage (50) connected to store data of the name node and the data node, a network (60) for connecting to the Internet (70), a power supply (40) for operating the name node and the data node, and an OSM (Open Source Management: 91) for open software management, which supports automatic upgrades according to the version of the private cloud when the open source software of Hadoop is upgraded.

Description

BASIC PROTOTYPE OF HADOOP CLUSTER BASED ON PRIVATE CLOUD INFRASTRUCTURE

The present invention relates to a basic prototype of a Hadoop cluster based on a private cloud infrastructure. The private cloud uses blade-style hardware so that hardware can be added and performance improved by scaling out; it is applicable to a variety of Hadoop-based workloads and can provide high availability at the hardware, operation, and application levels together with open source management.

Cloud computing can be defined as "a model that enables convenient, on-demand network access, anytime and anywhere, to a pool of computing resources (networks, servers, storage, applications, services) that can be rapidly provisioned with minimal management effort." With cloud computing, users can use services over the Internet without expertise in, or control over, the underlying technical infrastructure. In particular, because software and other computing resources are provided as services that are paid for when needed, initial expenditure is low.

In addition, computer availability is high because virtualization and distributed computing technologies pool or partition server resources and provide them as services to the users who need them. A consistent user environment can be provided through abstracted services, and user data can be kept safe by storing it on highly reliable servers.

Hadoop was developed to support distributed processing for Nutch. It is a data processing platform that provides a foundation for building and operating applications that process data ranging from hundreds of gigabytes to terabytes or petabytes. Because the data Hadoop processes is typically at least several hundred gigabytes in size, it is not stored on a single computer but is divided into blocks and distributed across multiple computers. Hadoop therefore includes the Hadoop Distributed File System (HDFS), which allows input data to be divided for processing, and the distributed data is processed by MapReduce, which was developed for parallel processing of large data sets in a cluster environment.

HDFS and MapReduce run together on the same physical servers. Both have a master/slave architecture consisting of one master and multiple slaves. In HDFS the master is called the NameNode and the slaves DataNodes; in MapReduce the master is called the JobTracker and the slaves TaskTrackers.

In HDFS, the NameNode, as master, manages file metadata, while the actual data is distributed and replicated across multiple DataNodes. A single MapReduce program is called a job, and a job usually consists of one or more map tasks and reduce tasks.
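The division of labor between the NameNode (metadata) and the DataNodes (replicated blocks) can be illustrated with a minimal Python sketch. The 64 MB block size, the replication factor of 3, the round-robin placement, and the function names `split_into_blocks` and `place_replicas` are all assumptions made for illustration; the text does not specify them, and real HDFS placement is more involved.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB), not stated in the text
REPLICATION = 3                # assumed replication factor, not stated in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy NameNode metadata: block index -> DataNodes holding a replica.

    Round-robin placement is a simplification chosen for this sketch.
    """
    copies = min(replication, len(data_nodes))
    return {
        block_id: [data_nodes[(block_id + r) % len(data_nodes)] for r in range(copies)]
        for block_id in range(num_blocks)
    }
```

In this sketch the dictionary returned by `place_replicas` plays the role of the metadata held by the NameNode, while the block contents themselves would live only on the listed DataNodes.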

The JobTracker is the master service of the Hadoop MapReduce framework; it receives job execution requests from users and manages each job until it finishes.

FIG. 5 is a conceptual diagram showing the flow of data during job processing in Hadoop MapReduce. The input file holds the data on which MapReduce will run and is normally stored in HDFS. Hadoop supports various data formats in addition to text.

When a job is started at a client's request, the input format (InputFormat, 101) determines how the input file is split and read. That is, it divides the input file over the data of the relevant blocks and returns InputSplits, and it creates and returns a RecordReader (102) that converts an InputSplit into (key, value) form readable by a mapper. An InputSplit is the unit of data processed by a single map task in MapReduce. Hadoop provides input formats such as TextInputFormat, KeyValueInputFormat, and SequenceInputFormat. The representative one is TextInputFormat, which divides an input file stored in blocks along line boundaries to form InputSplits, the logical unit of input, and returns a LineRecordReader, which extracts records of the form (LongWritable, Text) from each InputSplit.

During the normal map phase, the returned RecordReader reads key-value records from the InputSplit and passes them to the mapper. The mapper applies the processing defined in Map to each record and produces records with new keys and values. The output format (OutputFormat, 103) is the format for writing the data produced by MapReduce to files in HDFS; through RecordWriter (104), it writes the key-value records resulting from MapReduce processing to HDFS, which completes the data processing.
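The record flow described above (split, record reader, mapper, output) can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Hadoop's Java API: `line_record_reader` mimics how a line-oriented reader yields (byte offset, line) pairs, corresponding to (LongWritable, Text), and the word-count mapper and the in-memory grouping are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict


def line_record_reader(split_bytes, start_offset=0):
    """Yield (byte_offset, line_text) records from one input split."""
    offset = start_offset
    for raw in split_bytes.splitlines(keepends=True):
        # The key is the byte offset of the line, the value is its text.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def word_count_mapper(key, value):
    """Hypothetical mapper: emit (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1


def run_job(split_bytes):
    """Read records, map them, group by key, and reduce by summing."""
    grouped = defaultdict(list)
    for key, value in line_record_reader(split_bytes):
        for out_key, out_value in word_count_mapper(key, value):
            grouped[out_key].append(out_value)
    # The returned dict stands in for what a record writer would persist.
    return {word: sum(counts) for word, counts in sorted(grouped.items())}
```

For the input `b"a b\nb c\n"`, the reader yields the records `(0, "a b")` and `(4, "b c")`, and `run_job` returns `{"a": 1, "b": 2, "c": 1}`.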

Big data generally refers to data so large that it is difficult to manage with ordinary databases and software; it can be defined as a data set that exceeds the collection, storage, management, and analysis capabilities of conventional database processing. Big data is commonly characterized by three attributes: volume, an amount of data far beyond conventional units; velocity, the very high speed at which data is generated and flows; and variety, information in diverse forms such as photos and videos rather than conventional structured data. More recently, big data has also been described as 3V+1C (Volume, Velocity, Variety, Complexity). Whereas data used to be mostly text with a fixed format, it is now dominated by unstructured data such as images, video, and audio with no fixed format. As a result, the methods and tools that handled tens of millions of text-centric structured records struggle to collect, store, search, analyze, and visualize tens of billions of unstructured records. There is therefore a growing need to respond effectively to the coming big data era through research and development of analysis technologies for big data.

Cloud computing is one of several approaches to handling big data. Cloud computing delivers services in the form of IaaS (Infrastructure as a Service), in which the infrastructure is virtualized; PaaS (Platform as a Service), in which a platform built on IaaS is offered to software developers and others; or SaaS (Software as a Service), in which software developed on PaaS is offered to individual users. For the cloud provider, the advantage is that surplus resources can be reduced; for users, the advantage is that they can use only as many resources as they need, or run several pieces of software as if on independent hardware.

Data mining can be defined as the process of exploring and analyzing large volumes of data with automated or semi-automated tools in order to discover meaningful patterns and rules. With the arrival of the big data era, data mining has been applied in many fields and has been researched and developed in combination with them. The main areas of data mining research that are currently most active are as follows.

Business data mining - Analyzes large business databases and provides an interface through which end users can easily use the results, even without knowledge of statistics, to support rational decision-making.

Bio data mining - The process of extracting knowledge about life phenomena such as evolution, heredity, adaptation to the environment, and learning from the vast amounts of biomolecular sequence data generated and stored by molecular biology research. It can contribute to the development of new drugs, new therapies, preventive medicine, and new antibiotics, as well as to advances in pharmacology, chemistry, and ecology.

Spatial data mining - The process of discovering interesting information, spatial correlations, and spatial patterns latent in a spatial database. Spatial data includes not only general attribute information in text form but also spatial information made up of objects such as points, lines, and surfaces in two- and three-dimensional space.

3D visualization - Aims to make the overall data mining process more efficient by combining visualization and data mining technologies; combining the two allows a large effect to be obtained with little effort.

Cloud computing can be classified by deployment model into public cloud computing and private cloud computing. Public cloud computing is operated over the Internet for the general public and is provided as utility computing that uses external data centers, in the manner of a portal site. It places no particular restriction on who may use it, and fees are paid according to usage. By using cloud services suited to the purpose at hand, the elasticity and utilization of services can be maximized and the greatest results obtained from the smallest investment. Services are available when needed, and users pay only for what they use. However, depending on the payment scheme, paying a monthly fee can be a burden, and support costs may rise because specialized support for each service is difficult to provide. In addition, because customers cannot know where and how a service is delivered, they lack control over its use. Examples of such services include AWS (Amazon Web Services), Google Apps, Salesforce, and Twitter, and according to Carolyn Purcell & David Floyer, public cloud is appropriate for companies with revenue below 1 billion dollars.

Private cloud computing builds a cloud environment around a company's internal cloud data center and serves internal customers, which reduces the burden on individual members of managing their own systems. Because applications are generally organized around specific missions, the company can easily consolidate and manage its data and retains control over the entire infrastructure. This control improves security and reliability, reduces network bandwidth constraints, and makes service level management (SLA: Service Level Agreement) possible. On the other hand, costs cannot be settled according to usage, and separate build-out costs may arise: equipment, hardware, and virtualization technology must be paid for, separate data center construction costs and high personnel costs are expected, and elasticity is relatively low. Large vendors such as IBM, HP, VMware, and EMC can provide such services, and according to Carolyn Purcell & David Floyer, building a private cloud is advantageous for companies with revenue of 1 billion dollars or more.

Under these circumstances, a prototype for building a private cloud infrastructure for big data processing that suits small and medium-sized businesses (SMBs) is urgently needed. SMBs adopt cloud computing for three main reasons: first, outsourcing infrastructure, platforms, and services avoids capital expenditure on information security, IT support, software, and hardware; second, IT resources become flexible and scalable; and third, business continuity and disaster recovery improve. Hence the urgent need for a prototype for building a private cloud infrastructure for big data processing suited to SMBs.

The present invention has been proposed to solve the above problems. Its object is to provide a basic prototype for building a Hadoop cluster based on a private cloud infrastructure, as part of a dedicated effort to support small and medium-sized businesses (SMBs).

In other words, the object is to provide a basic prototype of a private cloud infrastructure-based Hadoop cluster in which the SMB private cloud is applicable to a variety of workloads, offers high availability and scalability in computing and storage, and provides functions such as blade server technology, security, high availability, open source management, and scalability.

To achieve the above object, the basic prototype of a private cloud infrastructure-based Hadoop cluster according to the present invention is a Hadoop cluster including the Hadoop Distributed File System and MapReduce, wherein the Hadoop Distributed File System includes one PC-type name node comprising a central processing unit and a mainboard and at least one PC-type data node comprising a central processing unit and a mainboard, and the cluster further comprises a storage (50) connected to store data of the name node and the data node, a network (60) for connecting to the Internet (70), a power supply (40) for operating the name node and the data node, and an OSM (Open Source Management: 91) for open software management, which supports automatic upgrades according to the version of the private cloud when the open source software of Hadoop is upgraded.

Here, the private cloud is configured in a blade-server-based scale-out (100) manner. For the security of Hadoop, the outside of the cloud is protected by a firewall (80) and the inside by anomaly detection and a honeypot. The prototype may further include XA (Extended Availability: 92) for providing high availability, and, for scalability on the computing side, resources of an external public cloud (210) may be connected through the network (60) to form a hybrid cloud (200).


As described above, the prototype for building a private cloud infrastructure for big data processing suited to SMBs avoids capital expenditure on information security, IT support, software, and hardware by outsourcing infrastructure, platforms, and services, and can provide cloud computing with flexible and scalable IT resources, business continuity, and disaster recovery capability.

It is also applicable to a variety of workloads, offers high availability and scalability in computing and storage, and provides SMBs with private cloud functions such as blade server technology, security, high availability, open source management, and scalability at low cost and with high efficiency.

FIG. 1 is a basic prototype of a private cloud infrastructure-based Hadoop cluster according to one embodiment of the present invention.
FIG. 2 is a basic prototype of a private cloud infrastructure-based Hadoop cluster according to another embodiment of the present invention.
FIG. 3 is a basic prototype of a private cloud infrastructure-based Hadoop cluster according to yet another embodiment of the present invention.
FIG. 4 is a hybrid cloud combined with the scale-out of a private cloud according to one embodiment of the present invention.
FIG. 5 is a conceptual diagram showing the flow of data during job processing in conventional Hadoop.

Hereinafter, the basic prototype of a private cloud infrastructure-based Hadoop cluster according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a basic prototype of a private cloud infrastructure-based Hadoop cluster according to one embodiment of the present invention, FIG. 2 is a basic prototype according to another embodiment, FIG. 3 is a basic prototype according to yet another embodiment, and FIG. 4 is a hybrid cloud combined with the scale-out of a private cloud according to one embodiment of the present invention.

In assigning reference numerals to the elements of the drawings, the same elements are given the same numerals wherever possible, even when they appear in different drawings, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the present invention are omitted. Directional terms such as 'upper', 'lower', 'front', 'back', 'leading end', 'forward', and 'rear end' are used with reference to the orientation of the drawing(s) described. Because elements of the embodiments of the present invention can be positioned in various orientations, the directional terms are used for illustration and are not limiting.

As shown in FIG. 1, a preferred basic prototype of a private cloud infrastructure-based Hadoop cluster according to one embodiment of the present invention is a PC-type basic prototype with the cases removed. It comprises Hadoop consisting of one name node and at least one data node, a storage (50) connected to store data of the name node and data nodes, a network (60) for connecting to the Internet (70), and a power supply (40) for operating the name node and data nodes.

Here, as is well known, the name node and the data nodes are each configured as a PC including a CPU (11, 21, 31) and a mainboard (12, 22, 32).

For development in rack form, the prototype may be configured as in any one of FIGS. 1 to 4.

In other words, it is also preferable to add a firewall (80) for security, high availability, and OSM (Open Source Management: 91) for open software management to the rack-mounted Hadoop, as shown in FIG. 2.

The private cloud (110) for SMBs must be applicable to a variety of workloads and must provide high availability and scalability in computing and storage, as shown in FIG. 5. That is, the private cloud requires functions such as blade server technology, security, high availability, open source management, and scalability.

Server virtualization technology can be broadly divided into approaches that use partitioning and virtual machines and the blade server approach; the private cloud (110) for SMBs is built using the blade server approach. Building with blades allows hardware to be added and performance improved at low cost by scaling out (100). When large amounts of computing or storage (50) are needed, a public cloud (210) can be used as shown in FIG. 5, drawing on the scalability and elasticity that are the strengths of cloud computing, to form a hybrid cloud (200).

The security strategy of the private cloud (110) is divided into inside and outside, relative to the cloud. On the outside, security relies on a firewall (80); on the inside, on anomaly detection and a honeypot.

XA (Extended Availability: 92) provides high availability. The SMB private cloud (110) is applicable to a variety of Hadoop-based workloads, and to support them it implements high availability and an open source management function (Open Source Management). High availability covers the hardware, operation, and application levels. Hardware availability provides fault detection and reporting for hardware such as the SMB private cloud devices, storage devices, and network equipment. Operational availability supports the availability of the system layer on top of the hardware, such as the operating system, storage system, and network drivers. Application availability supports the continuous operation of applications such as Hadoop.
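The text does not say how XA detects and reports hardware faults. One common approach, shown here purely as an assumed illustration, is heartbeat monitoring: each device reports periodically, and a device that has been silent for longer than a timeout is flagged. The function names, the node names, and the 30-second timeout below are hypothetical.

```python
def find_failed_nodes(last_heartbeat, now, timeout_seconds=30.0):
    """Return the names of nodes whose last heartbeat is older than the timeout.

    last_heartbeat maps a node name to the time (in seconds) it last reported.
    """
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout_seconds
    )


def format_report(failed_nodes):
    """Build a one-line fault report for an operator."""
    if not failed_nodes:
        return "all nodes healthy"
    return "no heartbeat from: " + ", ".join(failed_nodes)
```

With heartbeats last seen at 100.0, 95.0, and 60.0 seconds for three nodes and a current time of 100.0, only the third node exceeds the 30-second timeout and appears in the report.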

Open source management (OSM: Open Source Management, 91) supports automatic upgrades according to the version of the SMB private cloud when open source software that makes up the SMB private cloud (110), such as Hadoop, is upgraded.
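How OSM decides what to upgrade is not detailed in the text. A minimal sketch, assuming that each private cloud release carries a table of supported component versions, could compare installed versions against that table. The component names, version numbers, and function names below are hypothetical, and versions are assumed to be purely numeric dotted strings.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.0.4' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def plan_upgrades(installed, supported):
    """Return {component: target_version} where the supported version is newer.

    installed: versions currently running in the private cloud.
    supported: versions approved for the current private cloud release
               (an assumed data structure, not taken from the text).
    """
    plan = {}
    for name, current in installed.items():
        target = supported.get(name)
        if target is not None and parse_version(target) > parse_version(current):
            plan[name] = target
    return plan
```

Comparing parsed tuples rather than raw strings matters because, for example, "0.10.0" is newer than "0.9.0" even though it sorts earlier as text.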

As for scalability on the computing side, when usage of the private cloud's virtualized resources reaches its limit, available resources are used elastically by virtue of the scalability that is a strength of cloud computing. To support this elastic use, resources in the public cloud (210) are connected over the Internet.
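The bursting rule described above (use private resources first, and turn to the public cloud only once the private limit is reached) can be written as a small allocation function. This is a sketch under assumed units (CPU cores) with a hypothetical function name; the text does not define how the limit is measured.

```python
def plan_burst(requested_cores, private_total, private_used):
    """Split a resource request between the private and the public cloud.

    Private capacity is consumed first; only the remainder bursts out.
    """
    free = max(private_total - private_used, 0)
    from_private = min(requested_cores, free)
    from_public = requested_cores - from_private
    return {"private": from_private, "public": from_public}
```

For instance, with 16 private cores of which 12 are in use, a request for 6 cores would take the 4 free private cores and obtain the remaining 2 from the public cloud.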

In the present invention, to design a private cloud for SMBs, PCs were first configured as a name node (10) and data nodes (20, 30), and a Hadoop installation on them was set up as the basic prototype.

Here, the PC-type basic prototype has its cases removed and, for development in rack form, may be configured as in any one of FIGS. 1 to 4.

It is more preferable to give the basic prototype scalability. One way to do so is to extend it to a public cloud (210) such as AWS (Amazon Web Services: 95) using open source software such as the boto library (93), so that the basic prototype forms a private cloud in the true sense.

Meanwhile, Table 1 shows the hardware specification of an embodiment in which the private cloud for small and medium-sized businesses is implemented on PCs as one name node and three data nodes.

Item      Details                          Quantity
CPU       Intel i5, 2.8 GHz, 4-core CPU    4
Memory    4 GB                             4
Disk      500 GB                           4
Network   3Com 8-port switching hub        1

Alternatively, Table 2 shows the hardware specification when the name node and data nodes of the basic prototype are built from four small mainboards, each with an Intel i3 CPU, 4 GB of RAM, and a 320 GB hard disk, connected through a router.

Item      Details                             Quantity
CPU       Intel i3, 3.3 GHz, dual-core CPU    4
Memory    4 GB                                4
Disk      320 GB                              4
Network   NetGear 4-port switching hub        1

In addition, a wireless LAN can be added to the basic prototype for high-availability fault tolerance. Duplicating the network over wired and wireless links provides fault tolerance and improves network speed.
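
The fault-tolerance side of this wired and wireless duplication can be illustrated with a simple failover rule. The sketch below is an assumption about how link selection might work; the patent does not describe the mechanism, and the interface names `eth0` and `wlan0` are hypothetical. It does not model the speed improvement.

```python
from typing import Dict, List, Optional


def choose_link(link_up: Dict[str, bool], preference: List[str]) -> Optional[str]:
    """Return the first preferred link that is up, or None if all are down."""
    for name in preference:
        if link_up.get(name, False):
            return name
    return None


if __name__ == "__main__":
    # Wired link preferred; fall back to wireless when the wired link fails.
    status = {"eth0": False, "wlan0": True}
    print(choose_link(status, ["eth0", "wlan0"]))
```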

For the private cloud basic prototype configurations described above and shown in FIGS. 1 to 4, performance tests on big data were carried out as shown in Table 3.

The tests in Table 3 used 11 GB of US flight operation statistics published by the ASA (American Standard Association); in most cases processing took at least 5 but less than 6 minutes.

Test data             Prototype version                  Processing time
ASA US flight data    Embodiment (FIG. 1)                5 min 10 s
ASA US flight data    Another embodiment (FIG. 2)        5 min 42 s
ASA US flight data    Yet another embodiment (FIG. 3)    5 min 42 s
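
The processing times in Table 3 can be converted into an approximate throughput for the 11 GB data set. The short calculation below is a reading aid, not part of the reported test: it assumes 1 GB = 1024 MB and that the whole data set was processed in each run, neither of which the patent states.

```python
def throughput_mb_per_s(size_gb: float, minutes: int, seconds: int) -> float:
    """Average throughput in MB/s, assuming 1 GB = 1024 MB."""
    elapsed_s = minutes * 60 + seconds
    return size_gb * 1024 / elapsed_s


# Processing times from Table 3 as (minutes, seconds).
RESULTS = {"FIG. 1": (5, 10), "FIG. 2": (5, 42), "FIG. 3": (5, 42)}

if __name__ == "__main__":
    for name, (m, s) in RESULTS.items():
        print(f"{name}: {throughput_mb_per_s(11, m, s):.1f} MB/s")
```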

Meanwhile, the present invention may implement a federation function with a public cloud to provide scalability and elasticity in the basic prototype.

The embodiments of the present invention described above and shown in the drawings should not be construed as limiting the technical idea of the present invention. The scope of protection of the present invention is limited only by the matters recited in the claims, and those of ordinary skill in the art can improve and modify the technical idea of the present invention in various forms. Accordingly, such improvements and modifications fall within the scope of protection of the present invention insofar as they are obvious to those of ordinary skill in the art.

10: name node                       11, 21, 31: CPU
20, 30: data node                   40: power supply
50: storage                         60: network
70: Internet                        80: firewall
91: OSM (Open Source Management)    92: XA (Extended Availability)
93: boto library                    95: AWS (Amazon Web Services)
100: scale-out                      110: private cloud
200: hybrid cloud                   210: public cloud

Claims

A basic prototype of a Hadoop cluster based on private cloud infrastructure, the Hadoop cluster including the Hadoop Distributed File System and MapReduce, characterized in that
the Hadoop Distributed File System includes one PC-type name node including a central processing unit and a mainboard, and at least one PC-type data node including a central processing unit and a mainboard, and
the basic prototype comprises:
a storage (50) connected to store data of the name node and the data node;
a network (60) for connecting to the Internet (70);
a power supply (40) for operating the name node and the data node; and
an OSM (Open Source Management: 91) for open software management, which performs a function of supporting automatic upgrades according to the version of the private cloud when the open source Hadoop software is upgraded.

The basic prototype of a Hadoop cluster based on private cloud infrastructure according to claim 1, characterized in that
the private cloud is configured in a scale-out (100) manner using blade servers,
for Hadoop security, the outside of the cloud is protected by a firewall (80) and the inside is composed of anomaly detection and a honeypot,
XA (Extended Availability) is included to provide high availability, and
a hybrid cloud (200) is formed by connecting resources of an external public cloud (210) through the network (60) for scalability on the computing side.

delete