A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

ZHANG Le, LU Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software & Theory, School of Information Science & Engineering, Northeastern University, Shenyang, 110004 China
Email: ejoy@xinhuanet.com, studystrong@sohu.com, neusyn@sohu.com, tsyao@mail.neu.edu.cn

Abstract

The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract large N-grams (up to 20-gram) from the raw corpus. Then two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms can be applied to the initial N-gram set to remove some unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a time complexity of O(n)) and can effectively reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it is the first to apply the Fast Statistical Substring Reduction algorithm to large corpora and to demonstrate the effectiveness and efficiency of this algorithm, which, we hope, will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that this method can efficiently extract chunk candidates from gigabyte-scale corpora with current computational power. We obtain an extraction accuracy of 86.3% on the People Daily 2000 news corpus.

Key Words: Chunk extraction, N-gram, Substring Reduction, Corpus

1 Introduction

With the rapid development of computational power and the availability of large online corpora (BNC (Clear, 1993), People Daily (YU et al., 2002)), there has been a dramatic shift in computational linguistics from manually constructed knowledge bases to partially or totally automatic knowledge acquisition by applying statistical learning methods to large corpora (see SU, 1996, for an overview).

The concept of a chunk was first raised by Abney (1991) in the early nineties to make the task of language parsing easier. He suggested developing a chunk-based parser that decomposes sentences into chunks, with each chunk being a syntacti