1000-9825/2003/14(03)0435 ©2003 Journal of Software Vol.14, No.3

A Text Filtering System Based on Vector Space Model*

HUANG Xuan-Jing+, XIA Ying-Ju, WU Li-De

(Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China)

+ Corresponding author: Phn: 86-21-65642192, E-mail: xjhuang@fudan.edu.cn
http://www.fudan.edu.cn

Received 2001-09-14; Accepted 2002-04-10

Huang XJ, Xia YJ, Wu LD. A text filtering system based on vector space model. Journal of Software, 2003,14(3):435~442.

Abstract: Text filtering is the procedure of retrieving documents relevant to the requirements of specific users from a large-scale text data stream. First, the TREC (Text REtrieval Conference), the most authoritative international evaluation conference on text retrieval, and its text filtering track are introduced from the aspects of tasks, topics, corpus and evaluation metrics. Then a text filtering system based on the vector space model is presented. The system consists of two phases: training and adaptive filtering. During the training phase, feature selection and pseudo feedback are used to select the initial filtering profiles and thresholds. During the filtering phase, user feedback is used to modify the profiles and thresholds adaptively. The system participated in the 9th Text Retrieval Conference in 2000 and ranked high among all 15 systems from many countries. Good performance was achieved: the average precisions of adaptive and batch filtering are 26.5% and 31.7%, respectively.

Key words: text retrieval; text filtering; text categorization; machine learning; vector space model

Abstract (translated from the Chinese version): Text filtering is the process of finding, in a large text data stream, the texts that satisfy the needs of a specific user. The paper first introduces the Text REtrieval Conference (TREC), the most authoritative international evaluation conference in the field of text retrieval, and its text filtering track, covering tasks, test topics, corpus and evaluation metrics. It then describes in detail a text filtering system based on the vector space model. The system consists of two phases, training and adaptive filtering. In the training phase, the initial filtering profiles are built through feature extraction and pseudo feedback, and the initial thresholds are set; in the filtering phase, the profiles and thresholds are adjusted adaptively according to user feedback. The system took part in the evaluation of the 9th Text Retrieval Conference, held in 2000, and achieved very good results, among those from many

* Supported by the National Natural Science Foundation of China under Grant Nos. 69873011, 69935010, 60103014; the National High Technology Development 863 Program of China under Grant No. 863-306-ZD02-02-4; the National High-Tech Research and Development Plan of China under Grant No. 2001AA114
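The two-phase design summarized in the abstract (an initial profile and threshold built from training data, then adapted from user feedback while filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name AdaptiveFilter, the Rocchio-style profile update, the fixed threshold step and all parameter values are assumptions chosen for illustration, and the feature selection and pseudo feedback steps mentioned in the abstract are omitted.

```python
import math
from collections import Counter


def vectorize(text):
    """Map text to a sparse term-frequency vector (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AdaptiveFilter:
    """Hypothetical filter: one profile vector plus one similarity threshold."""

    def __init__(self, training_docs, threshold=0.3, alpha=0.2, step=0.02):
        # Training phase (simplified): the profile is the sum of the
        # term vectors of the known relevant documents.
        self.profile = Counter()
        for doc in training_docs:
            self.profile.update(vectorize(doc))
        self.threshold = threshold  # assumed initial value
        self.alpha = alpha          # assumed feedback weight
        self.step = step            # assumed threshold step

    def score(self, text):
        return cosine(self.profile, vectorize(text))

    def accepts(self, text):
        return self.score(text) >= self.threshold

    def feedback(self, text, relevant):
        """Adapt profile and threshold after the user judges a delivered text."""
        sign = 1.0 if relevant else -1.0
        for term, weight in vectorize(text).items():
            updated = self.profile.get(term, 0.0) + sign * self.alpha * weight
            if updated > 0:
                self.profile[term] = updated
            else:
                self.profile.pop(term, None)  # drop non-positive weights
        # Assumed rule: tighten after a false alarm, relax after a hit.
        if relevant:
            self.threshold = max(0.0, self.threshold - self.step)
        else:
            self.threshold = min(1.0, self.threshold + self.step)
```

A filter built as AdaptiveFilter(["stock market prices rise", "market shares fall"]) would deliver a text such as "stock market news"; if the user then marks that text as not relevant via feedback, its terms are down-weighted in the profile and the threshold rises, so similar texts are less likely to pass.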