The Design and Implementation of a Massive Advertising Log Analysis System Based on Hadoop


Dissertation for the Master's Degree in Engineering

THE DESIGN AND IMPLEMENTATION OF A MASSIVE ADVERTISING LOG ANALYSIS SYSTEM BASED ON HADOOP

Candidate: Zhang Weixing (章伟星)
Supervisor: Senior Lecturer Su Tonghua (苏统华)
Associate Supervisor: Senior Engineer Qi Jiayin (戚佳音)
Academic Degree Applied for: Master of Engineering
Speciality: Software Engineering
Affiliation: School of Software
Date of Defence: June 2013
Degree-Conferring Institution: Harbin Institute of Technology
Classified Index: TP311
U.D.C.: 621.3
University Code: 10213
Security Classification: Public

Harbin Institute of Technology, Master of Engineering Thesis

Abstract

Baidu FengChao (凤巢) is the new advertisement auction system launched by Baidu. Backed by hundreds of millions of web searches every day, it brings large economic benefits to advertisers and large revenue to Baidu: by the end of the third quarter of 2010, revenue from FengChao accounted for more than 20% of Baidu's total revenue. However, online operation and user feedback show that FengChao still has many problems in ad quality score calculation, ad display, and ad optimization. These problems cause financial losses for users and have a negative impact on FengChao.

To address these problems, this thesis targets FengChao's key business points and designs and implements a massive advertising log analysis system based on Hadoop. The system mines abnormal data from massive advertising logs, computes statistics on the abnormal data along different dimensions, and presents the results visually. This helps FengChao discover potential problems, analyze the underlying causes of the anomalies, locate the source of each problem, and propose effective solutions.

First, the thesis determines the actual requirements of the log analysis system from FengChao's business functions. Based on these requirements, it designs the functional structure of the system, which consists of a log parsing module, a log analysis and mining module, and a Web presentation module. The log parsing module preprocesses the raw logs. The log analysis and mining module is the core of the system: it builds a computation rule model for each business monitoring item, mines the abnormal data of each business point from the preprocessed logs, and then filters and aggregates the abnormal data along multiple dimensions. This module covers three business topics: ad quality score, ad review, and ad optimization suggestions. The Web presentation module visualizes the analysis results on web pages as dynamic trend charts and tables.

In terms of implementation, the log parsing module and the log analysis and mining module exploit Hadoop's strengths in processing massive data. The raw logs and the analysis results are stored in HDFS (Hadoop Distributed File System), and the data processing is implemented as sets of MapReduce programs built on Hadoop MapReduce. The Web presentation module uses the LAMP stack (Linux + Apache + MySQL + PHP) and is implemented with CakePHP, a popular open-source web application framework.

Finally, the system was tested and verified both functionally and non-functionally. In commercial use, the system uncovered potential problems in time, effectively reduced FengChao's online error rate, and provided a sound basis for decision making.

Keywords: log analysis, massive data, Hadoop, MapReduce

Contents
Abstract (in Chinese)
Abstract (in English)
Contents
Chapter 1 Introduction
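
The parse, mine, and aggregate flow summarized in the abstract can be illustrated with a minimal sketch. It emulates the map and reduce phases in plain Python and does not call any Hadoop API. The log format (tab-separated ad ID, region, quality score), the anomaly rule (score outside 1 to 10), and all function and variable names are assumptions made for illustration; none of them come from the thesis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical log format (not from the thesis): "ad_id<TAB>region<TAB>quality_score"
# Hypothetical anomaly rule: a quality score outside MIN_SCORE..MAX_SCORE is abnormal.
MIN_SCORE, MAX_SCORE = 1, 10


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Parse raw lines and emit (region, 1) for every abnormal record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # preprocessing: drop malformed lines
        _ad_id, region, raw_score = fields
        try:
            score = int(raw_score)
        except ValueError:
            continue  # preprocessing: drop lines with a non-numeric score
        if not MIN_SCORE <= score <= MAX_SCORE:
            yield region, 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the emitted counts per key, here one dimension (region)."""
    counts: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


sample_log = [
    "ad1\tbeijing\t7",
    "ad2\tbeijing\t0",
    "ad3\tshanghai\t15",
    "ad4\tbeijing\t-3",
    "broken line",
    "ad5\tshanghai\t9",
]

anomaly_counts = reduce_phase(map_phase(sample_log))
print(anomaly_counts)
```

In an actual Hadoop deployment the map and reduce functions would run as distributed tasks over files in HDFS; the sketch only shows how an anomaly rule and a per-dimension count fit the map and reduce roles.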
