1. Data acquisition and preprocessing: Flume NG is a real-time log collection system that supports customizing various data senders in the log system for data acquisition. ZooKeeper is an open source coordination service for distributed applications, which provides data synchronization.
2. Data storage: Hadoop is an open source framework designed for offline, large-scale data analysis, and HDFS, its core storage layer, is widely used for data storage. HBase is a distributed, column-oriented open source NoSQL database that can be regarded as a storage layer built on top of HDFS.
3. Data cleaning: MapReduce, the computing engine of Hadoop, is used for parallel processing of large-scale data sets.
4. Data query analysis: The core work of Hive is to translate SQL statements into MapReduce (MR) programs; it maps structured data to database tables and provides HQL (Hive SQL) queries. Spark supports in-memory distributed data sets, which let it serve interactive queries and also speed up iterative workloads.
5. Data visualization: On BI platforms, the results of the analysis are visualized to support decision-making.
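
To make the data cleaning step (item 3) more concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop code: the function names, the sample log format (timestamp, level, message), and the cleaning rule are all assumptions made for illustration, and a real job would run the same phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: clean each raw log line and emit (level, 1) pairs."""
    for line in lines:
        # Assumed format: "<timestamp> <level> <message>".
        parts = line.strip().split(maxsplit=2)
        # Cleaning rule (assumed): drop blank or malformed records.
        if len(parts) < 3:
            continue
        _timestamp, level, _message = parts
        # Normalize the level so "error" and "ERROR" share one key.
        yield level.upper(), 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: aggregate the values of each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(lines):
    """Chain the three phases, as a MapReduce framework would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, including two records that get cleaned out.
SAMPLE_LOGS = [
    "2024-01-01T00:00:00 INFO service started",
    "2024-01-01T00:00:05 error disk full",
    "",
    "malformed",
    "2024-01-01T00:00:09 INFO request served",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LOGS))
```

In this sketch the map phase is where cleaning happens, since each record can be validated independently of the others; that independence is what allows the work to be split across many machines.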