etl性能问题分析与调整方法

资源描述

《etl性能问题分析与调整方法》由会员分享，可在线阅读，更多相关《etl性能问题分析与调整方法（35页珍藏版）》请在金锄头文库上搜索。

1、BOCN 数据仓库 ETL 性能分析与调整方法2 *前言随着数据仓库基础数据和应用的不断增加，数据仓库的性能问题日益突出。目前超过1小时的作业有2个 30分钟到1小时的作业有10个 10分钟到30分钟的作业有48个因此对数据仓库进行性能优化的需求越来越迫切3 *作业时长分析共有12个作业运行时间在30分钟以上系统作业运行时间 SDDT05_EVENT_FUND_TXN_EXP1:47:03 ALSALM_DEPOSIT_FIX_DATA1:05:00 BRSBRS_EXP_DATA0:51:52 MAST_AGMT_INFO_A0:47:01 SDDSRDFTPONE0:46

2、:08 MAST_AGMT_INFO_B0:46:02 PCRPCR_EXPORT0:43:20 SDDSRDFTPTHREE0:43:07 MASOTHER0:39:33 SDDSRDFTPTWO0:38:43 OCM99_OCRM_JOB_END0:37:20 ALSALM_GROUP_INT_H0:33:194 *共有48个作业运行时间为 10分钟到30分钟系统作业运行时间系统作业运行时间 BAKBAK_ESSAPS_SRC0:29:14DQCT03_AGMT_AMT_H0:15:18 T03T03_AGREEMENT_RT0:28:20CCRGRDJKFEXPDATA0:15:03

3、 PMMJXKH_C_LDM0:28:19DQCDQC_CHECK_KPI0:14:50 BAKBAK_ETLENV_710:27:34PMMJXKH_P_LDM_NET0:14:08 T03T03_CARD_ACCT_BAL_COMPO_H0:24:39ALSALM_LOAN_DATA0:13:57 T03T03_AGREEMENT_S020:24:19MAST_MA_BALANCE0:13:53 T03T03_AGMT_ACCU_H0:23:59CBST_CBS_BAL_DL0:13:18 MAST_AGMT_INFO_C0:22:49BAKBAK_ETLENV_400:12:58 T03

4、T03_AGMT_BAL_H0:22:17MASFTPFILE0:12:32 PMMJXKH_P_LDM_SAV0:21:33ACMCUST_FIX_IND_POST0:12:31 CFST_CFS_AC_TREE0:21:31T05T05_CRDCARD_TXN_DTL0:12:05 SDDBCNTDEPO0:20:57ECLNB_CUST_TXN_SUMM0:12:03 DQCT03_CARD_ACCT_BAL_COMPO_H0:20:32ECLNB_NEW_CUST_SUMM0:11:58 MDLM03_FIN_ACCT0:19:37BDET03_ACCT_CRDCARD0:11:53

5、BDET03_AGREEMENT0:19:26IFST_IFS_BAL_BASIC0:11:20 IRPT_DEP_POT_INT_RATE0:17:24MDLM01_CUST_SIGN_INFO0:11:16 PCRCC_CR_ID_I0:17:08IRPT_LOAN_POT_INT_RATE0:11:07 SDDREPAYDET0:17:06RIRDEPOS_BAL_DATA0:11:07 T03T03_AGMT_BAL_H_S010:17:01T05T05_EVT_MESSAGE_SEND0:10:47 ACMCUST_BASIC_INFO_PRE0:16:51CRST_REP_GJYW

6、_ZB0:10:46 CABCAB_EXP_DATA0:16:40MDLM03_CARD_TXN_STAT0:10:44 PMMJXKH_P_LDM_CARD0:16:37IRPT_IRP_FLUIDITY_BASIC0:10:37 MDLM03_FIN_ACCT_STAT0:16:20PCRCC_CR_BASE_I0:10:28 PFXT05_VCH_FINANCE_TMP0:15:31ACMTHIRD_PTY_MON_SUMM0:10:205 *月末超长作业列表月末跑批：系统名作业名跑批时间(分钟/min)IRPT_IRP_SAME_BIZ_AVGBALSQL语句无法跑批IRPT_IRP

7、_DEPOSIT_ACTIVITYSQL语句无法跑批IFST_IFS_AC_DLSQL语句无法跑批OCMOCRM_HIGH_CUST_INFOSQL语句无法跑批6 *ETL脚本性能问题与原因分析业务规则理解数据访问路径选取的合理性 SQL策略与算法 Multi-Statement 实现并行InsertSelect 比对算法不标准加载策略不优-大数据量Update、向非空表进行Insert 一个任务内多脚本，过多的数据库录入/录出 ACRM中居多模型优化 PPI和PI的选择对小表或代码表的部分字段进行Collect Statistics,优化sql的执行路径实际并发运行环境的影响7

8、 *Duplicate RowsA duplicate row is a row of a table whose column values are all identical to another row in the same table.col_a col_bcol_c2050A2550A2550ADuplicate RowsBecause a PK uniquely identifies each row, ideally a relational table should not have duplicate rows! The ANSI standard, however, pe

9、rmits duplicate rows for specialized situations, thus Teradata permits them as well. You may select whether your table will or will not allow them.* Note: If a UPI is selected on a SET table, the duplicate row check is replaced by a check for duplicate index values.CREATE SET TABLE table_A:CREATE MU

10、LTISET TABLE table_B:Checks for * and disallows duplicate rows.Doesnt check for and allows duplicate rows.The Teradata defaultThe ANSI default8 *Row Distribution Using a UPI Case 1Notes: Often, but not always, the PK column(s) will be used as a UPI. PI values for Order_Number are known to be unique

11、(its a PK). Teradata will distribute different index values evenly across all AMPs. Resulting row distribution among AMPs is very uniform. Assures maximum efficiency for parallel operations.AMPAMPAMPAMPo_#c_#o_dto_st720224/09C741514/13Co_#c_#o_dto_st732524/13O710314/10O740234/16Co_#c_#o_dto_st718814

12、/13C722524/15Co_#c_#o_dto_st732434/13O738414/12COrder9 *Row Distribution Using a NUPI Case 2Notes: Customer_Number may be the preferred access column for ORDER table, thus a good index candidate. Values for Customer_Number are somewhat non-unique. Choice of Customer_Number is therefore a NUPI. Rows

13、with the same PI value distribute to the same AMP. Row distribution is less uniform or skewed.o_#c_#o_dto_st732524/13O720224/09C722524/15Co_#c_#o_dto_st738414/12C710314/10O741514/13C718814/13Co_#c_#o_dto_st740234/16C732434/13OAMPAMPAMPAMPOrder10 *Row Distribution Using a Highly Non-Unique Primary In

14、dex (NUPI) Case 3OrderNotes: Values for Order_Status are “highly” non- unique. Choice of Order_Status column is a NUPI. Only two values exist, so only two AMPs will ever be used for this table. Table will not perform well in parallel operations. Highly non-unique columns are poor PI choices generall

15、y. The degree of uniqueness is critical to efficiency.AMPAMPAMPAMPo_#c_#o_dto_st740234/16C720224/09C722524/15C741514/13C718814/13C738414/12Co_#c_#o_dto_st710314/10O732434/13O732524/13O11 *Join RedistributionSELECT. . . FROMTable1 T1 INNER JOINTable2 T2 ONT1.A = T2.A;Join columns are from the same do

16、main.No Redistribution needed.T1 ABCPI100214433T2 ABCPI100725002SELECT. . . FROMTable3 T3 INNER JOINTable4 T4 ONT3.A = T4.B;Join columns are from the same domain.Redistribution needed.T3 ABCPI255345225T4 ABCPI867255566SPOOL ABCPI867255566Redistribute T4 rows in spool on column B.12 *Join Redistribution (cont.)Join is on columns that isnt the Primary Index of either

展开阅读全文