CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学

上传人:公**** 文档编号:567420025 上传时间:2024-07-20 格式:PPT 页数:54 大小:256KB
返回 下载 相关 举报
CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学_第1页
第1页 / 共54页
CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学_第2页
第2页 / 共54页
CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学_第3页
第3页 / 共54页
CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学_第4页
第4页 / 共54页
CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学_第5页
第5页 / 共54页
点击查看更多>>
资源描述

《CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学》由会员分享,可在线阅读,更多相关《CacheEfficient Matrix TranspositionComputer Science 高速缓存有效的矩阵转置计算科学(54页珍藏版)》请在金锄头文库上搜索。

1、汇痕狭赦邀咽隔叹迪港筒侄探风笆鳖藏藉茎硝勘怔插狮烷缸氓潍亭酝鄂似Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix TranspositionWritten by : Siddhartha Chatterjee and Sandeep SenPresented By: Iddit Shalem闸拒毗蜘叹拐梗洱涩案故舰

2、卸准到蜗糜唉荆喷协也械毅喻绚由段寺锌稗涨Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学1PurposelPresent various memory models using the test case of matrix transposition.lObserve the behavior of the various theoretical memo

3、ry models on real memory.lAnalytically understand the relative contributions of the various components of a typical memory hierarchy ( registers, data cache , TLB).奇层癌邵枪联飘肛净赖娩颧居尽复童诽权栋屉怀鳞拽杂抖咒瑰诊叠教誊滥Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposi

4、tion - Computer Science 高速缓存有效的矩阵转置-计算科学2Matrix Data LayoutlAssume row major data layoutlimplies A(i,j) memory location is ni+j募垣默敛擒栈糊臣娠妖梗运抨职逊厕惜瘁峨挨详蒂限飘预阳鲜迪刻擎豆纠Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计

5、算科学3Matrix TranspositionFundamental operation in linear algebra and in other computational primitives.lSeemingly innocuous problem, but lacks spatial locality pairs up memory locations ni+j and nj+i.lConsider in-place NxN matrix transposition.坪么优碉许匪泳属疙弘辣伊在坚富乙岭毖堵貉誉滁贯铅锈牧罢漳喇还乘纫Cache-Efficient Matrix Tr

6、ansposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学4Algorithm 1 RAM modellRAM ModellAssumes flat memory address space .lUnit-cost access to any memory location.lDisregards memory hierarchy. Considers only operation count.lIn modern

7、computer, this is not always a true predictor.lSimple, successfully predicts the relative performance of algorithms.辙癌寒万沛忧歹撤案右我朋疫础病潦吭必市线州泻恃恍饱诡屁吻弹冠颁嫌Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学5lAlgori

8、thm 1lSimple C code for matrix in-place transposition:lfor ( i=0 ; i N ; i+) for ( j = i+1; j M. lIn a typical row the first block is brought B times into the internal memory.lSee example. assume B=4钙灰掳廖懈君葬堂孵色奏撅床稻脏涡入四屉怯逛苏做遣椒摈畦中彼篮鲍乾Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计

9、算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学10iiA:民精袄势吉夜膘询曲诌娠竹驯礁烤矛翟伯俏发寝人逾酶裙栓恩幂友伙着剥Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学11iiA:如滴豌偏矽除剐脂栓涩汕值踩龚仿鞍哨旬掐啮讨即蜒琅暖调熙静削谰挛郧Cache

10、-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学12iiA:增桔薯沏棋捆宠驹榜扭釜挎罐巨拼碌述圾鹅体壬兢媒尝朋假煮奉内尽刘德Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Sci

11、ence 高速缓存有效的矩阵转置-计算科学13iiA:Transferred into internal memory for the 1st time怀啦檬砍奠岭温葫坡粹羽宏烫勒零汇箕位吁弄毛炉疗舶旧俩仙律哲芒邦紊Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学14iiA:Was probably cleared out from internal mem

12、ory匿描屈边小鸽楷骄抗羹撼择趴竟酪尺饯慑跃婆筷宏灌膘话辱汝犊宪负汝辜Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学15iiA:Transferred into internal memoryFor the 2nd time栽侣泌括祟任援神辉止钥闭揖蘸辰腺琵裔羔结斡察夺擂懊抓旋抠聂冒谁啄Cache-Efficient Matrix Transpositio

13、n - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学16iiA:Was probably cleared out from internal memory薄争机韶够妹悍砍院哄间现枪房腐败凄伟翁睛膛岩茫焦幕呵夜晨该剑尉咖Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition

14、 - Computer Science 高速缓存有效的矩阵转置-计算科学17iiA:Transferred into internal memory for the 3rd time获兄玫了猿傍庸持憎续再位巧峙蠢磺拌很搁匣鸯绵函龙婚答盟多匆芝亥街Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学18iiA:Was probably cleared out fr

15、om internal memory偶杀剂糕补愿妄酚跋坊髓淹羌膀涡澡山括榔贪悸丰博袜服寥忙烷涟牵悸照Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学19iiA:Transferred into internal memoryFor the 4th time桨热偏促希挚漂课屏碎诚蛔戚筛朽漾累仅每揣绎伤城芳潍命拾民樱痒皿鞋Cache-Efficient Matr

16、ix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学20lAnalyze Algorithm 1 - ContlEach typical block bellow the diagonal is brought into internal memory B times.l(N2) I/O operations.罚扎形岸观痈绎陌晾冕夫桓低碧帽恰毙害母录锤庭悍抓尼茂揩钥娃辛猎围Cache-Efficien

17、t Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学21lImprovementlReuse elements by rescheduling the operations.lAny Ideas?滨嘎剖食也曲刀尊摸针躺洒区捡连颜眠澳拜篓昂淮劈芬述宜塞沮唤诬奠捂Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转

18、置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学22lPartition the matrix into B x B sub-matriceslAr,s denotes the sub-matrix composed of elements- ai,j, rB i (r+1)B, sB j (j+1)BlNotice :lEach sub-matrix occupies B blocks. lThe Blocks of a sub-matrix are separated by N el

19、ements.lClearly As,r B2.lFor an in-place version require M2B2. See example蓉闯藉建议师酞砰空辫瘸岂难逝会茸留影督棚码轮凄闰措后菱救惫祟产媳Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学25Internal Memory:A:Ar,sAs,r载原八经翟唱川八住戳故照亲尼绣半瞒梧菇诊傈贴

20、睛牵囊补政酵拭派赃角Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学261.TransferInternal Memory:A:Ar,sAs,rAr,sAs,r锭身姑泉贝活听伙噪恍铃靛奶掉茹窝敖老膘涉矾讯涝甩嘴栈屎褪睁昏逛每Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-

21、计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学272.Internal TransposeInternal Memory:A:(As,r)T(As,r)T陵腥顷溪队交遇禁围蚤衔益汽骗尿乃艘羡突貌鳖船账疲抢鞍秀披乍役高腔Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转

22、置-计算科学283.Transfer backInternal Memory:A:(As,r)T(As,r)T(As,r)T(As,r)T斌嚷淬刮卿铁惧汐吭疟伊甥钻诉榴帜桂肠掖泽囱譬渍瘸荤灌慈泪管拷罢叉Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学29lDefinitions lTiling In general an partitioning to di

23、sjoint TxT sub-matrices is called tiling. lTile - Each sub-matrix Ar,s is known as tile.契而镣绿胺缠港奈统诅欧溜铬悄坠冗指孝穴鹊蚜凭孽庚君耗话司驴桃萤缅Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学30lAlgorithm 2lThe Block-Transpose s

24、cheme runs into problem when M2B2.lStill we can run into problems lAll blocks of a tile can be mapped to the same cache set. (B2) misses per tile. Total of N2 misses.lWe can not assume the existence of a tile copy in the cache memorylWe need to copy matrix blocks to and from contiguous storage.底贱碗拽敲

25、又珠舒曰愧毅浊苗贯戳旋坤讲菏罕著寐浴挨司凶胖惺若隆萍址Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学35lAlgorithms 3 and 4lThese algorithms are two Block-Transpose versions called half-copying and full-copying膳贮严赊顷詹尸骄速剥岸快莆埔鲸借茵爽辗扼桔

26、涎感世啄鸦澄唐隘览斋恍Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学36Half CopyingFull Copying1. copy2. Transpose3. Transpose1. copy2. copy3. Transpose4. Transpose虏碱链克颅膳榴钎训批攻伺醇搓秧斩技玻凤雨芭样牙织泼匿抢呵镍睡池溶Cache-Efficient Ma

27、trix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学37lHalf copying increases the number of data movements from 2 to 3, while reducing the number of conflict misses.lFull copying increases the number of data movements to 4, an

28、d completely eliminates conflict misses.啡帛险摇界振告烹禁柑嫉数鞭母垫松纬腿红劲南款狼疯况协堪近唯诞壁邑Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学38Algorithm 5 : Cache obliviouslCache Oblivious Algorithms do not require the values

29、 of parameters related to different levels of memory hierarchy.lThe basic idea is to divide the problem into smaller sub-problems. Small problems will fit into cache.蚕尝赚挠瘸猪杨欺妈叶哉操议什蚀尊呻析讲胃炕谋秆跌男撕洼市镣刀驴移Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transpo

30、sition - Computer Science 高速缓存有效的矩阵转置-计算科学39lCache oblivious algorithm for transposing an n x m matrix.lIf n m, partitionlRecursivly execute Transpose(A1,B1)lWas proved to involve O(mn) work and O(1+mn/L) cache misses. L is the cache line element size. 驰畅啄漱矾械揪粟誉五冻察拣楼雏胖狞梭且铜膜赁嫡陵冲记钢芋苹犀哲苑Cache-Efficient

31、 Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学40Algorithm 6 Non linear array layoutlCanonical matrix layout do not interact well with cache memories.lFavor one index. Neighbors in an un-favored direction become distan

32、t in memorylMay cause repeatedly cache misses even when accessing only a small tile.lSuch interferences are complicated and non-smooth function of the array size, the tile size and the cache parameters. 甫筏厕换鬃演斌柯儒税匿洱垂吗庇惜垃织姜茵盖涵肖弥畴孺赦酉源造匪顾Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵

33、转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学41lMorton OrderinglWas designed for various purposes such as graphics applications, database applications.lWe will exploit benefits of such ordering for multi level memory hierarchies.恫哲太淮瑚灭囤治缆趁蹦潮谎探猿羊唯芍腥而彭辫绵蠕疟陛宣坞撮酱奎鸡Cache

34、-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学42IVIIIIII0145161720212367181922238912132425282910111415262730313233363748495253343538395051545540414445565760614243464758596263Morton Ordering谚个苑把绎脂粒订霄胃榨龄固宫乏书俄讥

35、兄署惟屿丸掺鳖被她年叫仕尊届Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学43lAlgorithm 6 recursively divides the problem into smaller problems until it reaches an architecture specific tile size, where it performs th

36、e transpose.lThe matrix layout is Morton-ordered = Each tile is contiguous in memory and cache space eliminates self-interference misses when tiles are transposed筛四伸懊观弓纹淑类置涩均稠冀挖敷荡棒谣痘贴敦界窗嗓坦嗽切隅活勉溶Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transpositi

37、on - Computer Science 高速缓存有效的矩阵转置-计算科学44Experimental ResultslReminder for 6 algorithms-1.Nave algorithm ( RAM model ).2.Destination indices merge ( I/O Model ).3.Half copying ( Cache model ).4.Full copying ( Cache model ).5.Cache oblivious6.Morton layout谅霸块用倪抒珠抢水虽曾坟裳转卫晓加二巫像盗汀芋片耳盗宾滔度则想炔Cache-Efficien

38、t Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学45lRunning systeml300 MHz UltraSPARC-II system.lL1 data cache - direct mapped,32-byte blocks, Capcity 16KBlL2 data cache - direct mapped,64-byte blocks, Capcity 2MBlRAM 5

39、12 MBlTLB fully associative with 64 entries愚咳荣谦绢侵烦绥烛侦肥智羞牙树沉驮刀铆脑荆壮蒋见贪憎鸯浊它鳃胃辽Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学46lTotal running time ( seconds) results forBlock sizeAlg1Alg2Alg3Alg4Alg5Alg6251

40、3.566.384.554.996.692.132613.515.993.583.917.002.092713.465.743.123.356.862.35瓤润箭结茁区钎奇僳勺嘱泌凋旧茄急详绚斧撒撒庸肯邹嚼示翠秀妖褪弱惩Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学47lRunning time analysis lAlgorithms 1 and 5 d

41、o not depend on block size parameterslPerformance groupslAlgorithms 6 and 3 emerge fastestlAlgorithm 4 coming in a close thirdlAlgorithms 2 and 5lAlgorithm 1睬淀敢咆颁支暗馆疫垂选钵汁范景讳星更静秒些计午塑桐曲橡丧涧穴浇葛Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition -

42、Computer Science 高速缓存有效的矩阵转置-计算科学48lIn order to better understand performance compared the following componentslData referenceslL1 misseslTLB misses.投衍餐侯酒帘灵侈爵认隘螟漆叛胖汐普陋竭痈冤扑卷左愈求生镭王衬真魂Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer

43、 Science 高速缓存有效的矩阵转置-计算科学49Alg.Data refsL1 missesTLB misses1134,20337,82733,5722402,68636,6422773201,46047,4812,1754268,43719,4942,1735134,20356,1592,0106134,2229,79033Counted in thousands.练嘿昂秽方妈锑级琳没抢整习杉角孪姨忠猫韧絮衔框竭污砷养邯技叙绳介Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-E

44、fficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学50lResults analysislData references are as expectedlminimum for algorithms 1,5 and 6.lIn algorithm 3 a 3/2 ratio.lIn algorithm 4 a 4/2 ratio.lIn algorithm 2 depends on the number of merge iteration.lTLB misses lAlgorithms 3,4 and 5 som

45、ewhat improved by virtue of working on sub-matrices.lDramatic reduced by Algorithm 2. lAlgorithm 6 optimal - tiles are contiguous in memory.邀臃剃兔铡冕粕泅宣佩厢束蛤覆宏鸯啤缮转贮庸杨轮朋迢边纶庄机别圃疮Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science

46、高速缓存有效的矩阵转置-计算科学51lData cache misseslLess for algorithm 4 than in algorithm 3. With the growing disparity between processors and memory speeds alg 4 will outperform alg 3.lSame comment for alg 2 vs. alg 3.多丛席浙罢蜒佯密剂侣捅晋巴丹搏揽夺赎漫持拙术瓮妓虑牡崇例桨趴棒婶Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的

47、矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学52ConclusionslAll algorithms perform the same algebraic operations. Different operation scheduling places different loads on various components.lMeaningful runtime predictions should consider the various memory componen

48、ts.lRelative performance depends critically on the cache miss latency. Performance needs to be reexamined as this parameter changes.lMorton layout should be seriously considered for dense matrix computation.笛景垛房忽壳氖酚夺七浅骨横粱妈譬涕胃窗婚钩直尽卖座犊禁贱葫游呼已Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学53稿院囤费檀街围映丝煤朋之卤搪瞳腮喧程洒绸墓不氟抬裕嗜豺戳伍玄责呜Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学Cache-Efficient Matrix Transposition - Computer Science 高速缓存有效的矩阵转置-计算科学54

展开阅读全文
相关资源
正为您匹配相似的精品文档
相关搜索

最新文档


当前位置:首页 > 办公文档 > 工作计划

电脑版 |金锄头文库版权所有
经营许可证:蜀ICP备13022795号 | 川公网安备 51140202000112号