Journal of Computer Research and Development, 52(5): 995-1004, 2015
DOI: 10.7544/issn1000-1239.2015.20131548

SIMD-Based Inverted Index Compression Algorithms
(Chinese title: Inverted Index Compression Algorithms Based on Instruction-Level Parallelism)

Yan Hongfei, Zhang Xudong, Shan Dongdong, Mao Xianling, and Zhao Xin
(Institute of Network Computing and Information Systems, Peking University, Beijing 100871)
(Taobao (China) Software Co., Ltd, Hangzhou 312000)
(Beijing Institute of Technology, Beijing 100081)
(yhf@net.pku.edu.cn)

Abstract  The rapid growth of text information has brought new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire the important data they need, and inverted index techniques strongly influence the efficiency of query processing in such systems. The data in an inverted index is stored as arrays of integers, and compression techniques are required to reduce the cost of storing such data on disk and in memory, as well as to boost the CPU cache hit rate and speed up data transfer. Therefore, it is necessary to choose a highly efficient compression algorithm to process queries effectively. In this paper, we propose two instruction-level-parallelized algorithms, SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms, Packed Binary and PForDelta, respectively, and exploit SIMD instructions to accelerate the Pack and Unpack procedures in these algorithms. Experiments on the public datasets GOV2 and ClueWeb09B show that our novel algorithms achieve good encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17% with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms perform better, with respect to decoding speed, on longer posting lists and larger block sizes.

Keywords  single instruction multiple data (SIMD); inverted index; compression; integer encoding; information retrieval

Abstract (translated from the Chinese)  The rapid growth in the volume of text information poses new challenges to traditional information retrieval techniques. Search engines typically use inverted indexes to process queries efficiently. To reduce storage overhead and speed up access, inverted indexes are usually stored in compressed form. Choosing a high-performance compression algorithm is therefore essential for efficient query processing. Building on the existing inverted list compression algorithms Packed Binary and PForDelta, and exploiting the superscalar features of the CPU and the SIMD vector instruction set, their compression and de…