Extended Domain Model Based Named Attribute Extraction*

Wang Yu 1,2, Tan Songbo 1, Liao Xiangwen 1,2, and Zeng Yiling 1,2
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
2 Graduate School of Chinese Academy of Sciences, Beijing 100190
wangyu2005@software.ict.ac.cn

* This research was supported by the Key Program of the National Natural Science Foundation of China (project no. 60933005), the National Basic Research Program of China (973 Program; project nos. 2004CB318109 and 2007CB311100), and the National High Technology Research and Development Program of China (863 Program; project nos. 2007AA01Z441 and 2007AA01Z438).

Abstract: Web information extraction is an important task in web mining. Various applications could benefit from advances in this area, including the semantic web, vertical search, and sentiment analysis. Current techniques require a great deal of human interaction, which precludes the universal application of web information extraction. To automate the extraction process, recent research identifies specific features of particular domains and extracts information with machine learning techniques. However, because of this dependence on specific features, it is very difficult to extend such methods to other domains. In this paper, the web information extraction problem is analyzed and a subtask is proposed, called the named attribute extraction task. Statistics from multiple datasets show that the named attribute extraction task covers more than 60% of the attributes in these domains, which demonstrates the importance of this subtask. Named attributes are attributes of objects that are encoded in name-value pair form; that is, the name and the value of an attribute are placed close to each other on the web page. Therefore, once the names of the attributes are located, the values can be extracted automatically. In this paper, an extended domain model is proposed to summarize the attribute names of a domain, and an information extraction method based on this model is developed. Experiments show that our method can extract named attributes with a precision of 80% and a recall higher than 90%.

Keywords: Information Extraction, Attribute Extraction, Named Attribute, Extended Domain Model, Visual Web Page Analysis

Abstract (translated from the Chinese): Web page information extraction is an important topic in web mining. To automate the extraction process, recent research exploits the features of specific domains and models the information extraction process in a unified way with machine learning methods. However, the dependence on domain features makes such methods difficult to generalize to other domains. Therefore, the information extraction problem is analyzed, and an information extraction subtask that can be fully automated is separated from it,