Performance Profiling of Multi-tier Data Center Applications

Uploaded by: xzh****18 | Document ID: 44655092 | Uploaded: 2018-06-14 | Format: PDF | Pages: 29 | Size: 1.98 MB

Profiling Network Performance in Multi-tier Datacenter Applications (slide 1)
  Minlan Yu, Princeton University
  Joint work with Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, Changhoon Kim
  Scalable Net-App Profiler

Applications inside Data Centers (slide 2)
  Diagram: Front end Server, Aggregator, Workers

Challenges of Datacenter Diagnosis (slide 3)
  - Large complex applications
      - Hundreds of application components
      - Tens of thousands of servers
  - New performance problems
      - Update code to add features or fix bugs
      - Change components while app is still in operation
  - Old performance problems (human factors)
      - Developers may not understand network well
      - Nagle's algorithm, delayed ACK, etc.

Diagnosis in Today's Data Center (slide 4)
  Diagram: Host (App, OS), Packet sniffer
  - App logs: #Reqs/sec, response time (1% req. 200ms delay). Application-specific.
  - Switch logs: #bytes/pkts per minute. Too coarse-grained.
  - Packet trace: Filter out trace for long delay req. Too expensive.
  - SNAP: Diagnose net-app interactions. Generic, fine-grained, and lightweight.

SNAP: A Scalable Net-App Profiler that runs everywhere, all the time (slide 5)

SNAP Architecture (slide 6)
  - At each host, for every connection: Collect data

Collect Data in TCP Stack (slide 7)
  - TCP understands net-app interactions
      - Flow control: How much data apps want to read/write
      - Congestion control: Network delay and congestion
  - Collect TCP-level statistics
      - Defined by RFC 4898
      - Already exists in today's Linux and Windows OSes

TCP-level Statistics (slide 8)
  - Cumulative counters
      - Packet loss: #FastRetrans, #Timeout
      - RTT estimation: #SampleRTT, #SumRTT
      - Receiver: RwinLimitTime
      - Calculate the difference between two polls
  - Instantaneous snapshots
      - #Bytes in the send buffer
      - Congestion window size, receiver window size
      - Representative snapshots based on Poisson sampling

SNAP Architecture (slide 9)
  - At each host, for every connection: Collect data, Performance Classifier
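The "difference between two polls" step for cumulative counters can be sketched as follows. This is a minimal illustration, not SNAP's implementation: the `TcpCounters` container, its field names, and `counter_delta` are hypothetical names chosen to mirror the counters listed on the TCP-level Statistics slide, and the units are assumed.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TcpCounters:
    """Hypothetical container for the cumulative counters named on the slide."""
    fast_retrans: int      # #FastRetrans
    timeout: int           # #Timeout
    sample_rtt: int        # #SampleRTT (number of RTT samples)
    sum_rtt: int           # #SumRTT (sum of sampled RTTs; unit assumed)
    rwin_limit_time: int   # RwinLimitTime (unit assumed)


def counter_delta(prev: TcpCounters, curr: TcpCounters) -> TcpCounters:
    """Return per-interval activity: the current poll minus the previous poll."""
    return TcpCounters(**{
        f.name: getattr(curr, f.name) - getattr(prev, f.name)
        for f in fields(TcpCounters)
    })
```

Because the counters only ever grow, subtracting consecutive polls yields what happened during that polling interval, which is the quantity a classifier would reason about.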

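For the instantaneous snapshots, the slides mention Poisson sampling but give no details. One common way to realize it, assumed here, is to draw exponentially distributed gaps between snapshot times so that samples are not synchronized with periodic application behavior. The function name and parameters below are hypothetical.

```python
import random


def poisson_sample_times(rate_per_sec, duration_sec, rng=None):
    """Return snapshot times in [0, duration_sec) forming a Poisson process.

    Gaps between consecutive snapshots are exponential with mean 1/rate_per_sec.
    """
    rng = rng or random.Random()
    times = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_sec:
        times.append(t)
        t += rng.expovariate(rate_per_sec)
    return times
```

At each returned time a collector would read values such as the bytes in the send buffer and the congestion and receiver window sizes.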
Life of Data Transfer (slide 10)
  Diagram: Sender App, Send Buffer, Network, Receiver
  - Application generates the data
  - Copy data to send buffer
  - TCP sends data to the network
  - Receiver receives the data and ACK

Taxonomy of Network Performance (slide 11)
  - Sender App: No network problem
  - Send Buffer: Send buffer not large enough
  - Network: Fast retransmission, Timeout
  - Receiver: Not reading fast enough (CPU, disk, etc.); Not ACKing fast enough (Delayed ACK)

Identifying Performance Problems (slide 12)
  - Sender App: Not any other problems
  - Send Buffer: #bytes in send buffer
  - Network: #Fast retransmission, #Timeout
  - Receiver: RwinLimitTime; Delayed ACK: diff(SumRTT) > diff(SampleRTT) * MaxQueuingDelay
  - Methods: Direct measure, Sampling, Inference

SNAP Architecture (slide 13)
  - At each host, for every connection: Collect data, Performance Classifier
  - Management System: Topology, routing; Conn proc/app
  - Cross-connection correlation
  - Offending app, host, link, or switch

Pinpoint Problems via Correlation (slide 14)
  - Correlation over shared switch/link/host
      - Packet loss for all the connections going through one switch/host
      - Pinpoint the problematic switch

Pinpoint Problems via Correlation (slide 15)
  - Correlation over application
      - Same application has problem on all machines
      - Report aggregated application behavior

SNAP Architecture (slide 16)
  - At each host, for every connection: Collect data, Performance Classifier
      - Online, lightweight processing & diagnosis
  - Management System: Topology, routing; Conn proc/app
  - Cross-connection correlation
      - Offline, cross-conn diagnosis
  - Offending app, host, link, or switch

Reducing SNAP Overhead (slide 17)
  - SNAP overhead
      - Data volume: 120 Bytes per connection per poll
      - CPU overhead: 5% for polling 1K connections with 500 ms interval
      - Increases with #connections and polling freq.
  - Solution: Adaptive tuning of polling frequency
      - Reduce polling frequency to stay within a target CPU
      - Devote more polling to more problematic connections

SNAP in the Real World (slide 18)

Key Diagnosis Steps (slide 19)
  - Identify performance problems
      - Correlate across connections
      - Identify applications with severe problems
  - Expose simple, useful information to developers
      - Filter important statistics and classification results
  - Identify root cause and propose solutions
      - Work with operators and developers
      - Tune TCP stack or change application code

SNAP Deployment (slide 20)
  - Deployed in a production data center
      - 8K machines, 700 applications
      - Ran SNAP for a week, collected terabytes of data
  - Diagnosis results
      - Identified 15 major performance problems
      - 21% applications have network performance problems

Characterizing Perf. Limitations (slide 21)
  - Stages: Send Buffer, Receiver, Network
  - #Apps that are limited for 50% of the time: 1 App, 6 Apps, 8 Apps, 144 Apps
  - Limitations: Send buffer not large enough; Fast retransmission; Timeout; Not reading fast enough (CPU, disk, etc.); Not ACKing fast enough (Delayed ACK)

Three Example Problems (slide 22)
  - Delayed ACK affects delay sensitive apps
  - Congestion window allows sudden burst
  - Significant timeouts for low-rate flows

Problem 1: Delayed ACK
  - Delayed ACK affected many delay sensitive apps
  - even #pkts per record: 1,000 records/sec
  - odd #pkts per record: 5 record
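The rules on the Identifying Performance Problems slide can be sketched as a small per-interval classifier. This is an illustration under stated assumptions, not SNAP's code: `IntervalStats`, `classify`, and the label strings are hypothetical; the "send buffer full" test and the "greater than zero" thresholds are assumed; and the delayed-ACK rule is read as diff(SumRTT) > diff(SampleRTT) * MaxQueuingDelay, with `max_queuing_delay` supplied by the caller in the same unit as the RTT sums.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    """Hypothetical per-connection inputs for one polling interval."""
    send_buffer_bytes: int     # sampled #bytes in the send buffer
    send_buffer_capacity: int  # assumed known buffer size
    d_fast_retrans: int        # diff(#FastRetrans)
    d_timeout: int             # diff(#Timeout)
    d_rwin_limit_time: int     # diff(RwinLimitTime)
    d_sample_rtt: int          # diff(#SampleRTT)
    d_sum_rtt: float           # diff(#SumRTT)


def classify(stats: IntervalStats, max_queuing_delay: float) -> list:
    """Return the list of limitations observed in this interval."""
    problems = []
    if stats.send_buffer_bytes >= stats.send_buffer_capacity:
        problems.append("send buffer not large enough")
    if stats.d_fast_retrans > 0:
        problems.append("fast retransmission")
    if stats.d_timeout > 0:
        problems.append("timeout")
    if stats.d_rwin_limit_time > 0:
        problems.append("not reading fast enough")
    # RTTs far above the worst plausible queuing delay suggest the
    # receiver is holding back ACKs rather than the network being slow.
    if stats.d_sample_rtt > 0 and stats.d_sum_rtt > stats.d_sample_rtt * max_queuing_delay:
        problems.append("delayed ACK")
    if not problems:
        # "Not any other problems": the sender application is the limit.
        problems.append("no network problem")
    return problems
```

Returning a list rather than a single label keeps intervals with several simultaneous limitations visible, which is a design choice of this sketch.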

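The "correlation over shared switch/link/host" idea can be sketched as counting, for each switch, how many of the connections routed through it show packet loss. The function name, the input shapes, and the `min_conns` and `loss_fraction` parameters are assumptions made for illustration; the slides do not specify how SNAP computes the correlation.

```python
from collections import defaultdict


def suspect_switches(conn_paths, lossy_conns, min_conns=2, loss_fraction=1.0):
    """Return switches where at least loss_fraction of connections saw loss.

    conn_paths:  dict mapping connection id -> iterable of switch ids on its path
    lossy_conns: set of connection ids classified as having packet loss
    """
    total = defaultdict(int)
    lossy = defaultdict(int)
    for conn, path in conn_paths.items():
        for switch in set(path):
            total[switch] += 1
            if conn in lossy_conns:
                lossy[switch] += 1
    return sorted(
        switch for switch in total
        if total[switch] >= min_conns
        and lossy[switch] / total[switch] >= loss_fraction
    )
```

With the default `loss_fraction=1.0` a switch is reported only when every connection through it is lossy, matching the slide's "all the connections going through one switch/host"; the same grouping applied per application instead of per switch would give the application-level correlation.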