A Simple Java Web Crawler Implementation
First, what each class does:

DownloadPage.java downloads the page source for a given hyperlink.
FunctionUtils.java provides a set of static methods: matching page links with a regular expression, extracting the elements of a URL, deciding whether a file should be created, obtaining a page's URL and converting it to a canonical URL, and extracting the target content from a page's source.
HrefOfPage.java extracts the hyperlinks from a page's source.
UrlDataHanding.java ties the other classes together, taking a URL through data retrieval to data processing.
UrlQueue.java is the queue of unvisited URLs.
VisitedUrlQueue.java is the queue of visited URLs.

The source code of each class follows.

DownloadPage.java

This class uses the HttpClient component.

    package com.sreach.spider;

    import java.io.IOException;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class DownloadPage
    {

        /**
         * Fetch the content of a web page by URL.
         *
         * @param url
         * @return
         */
        public static String getContentFormUrl(String url)
        {
            /* Instantiate an HttpClient client */
            HttpClient client = new DefaultHttpClient();
            HttpGet getHttp = new HttpGet(url);

            String content = null;

            HttpResponse response;
            try
            {
                /* Get the response entity */
                response = client.execute(getHttp);
                HttpEntity entity = response.getEntity();

                VisitedUrlQueue.addElem(url);

                if (entity != null)
                {
                    /* Convert the entity to text */
                    content = EntityUtils.toString(entity);

                    /* Check whether the page qualifies for saving its source locally */
                    if (FunctionUtils.isCreateFile(url)
                            && FunctionUtils.isHasGoalContent(content) != -1)
                    {
                        FunctionUtils.createFile(FunctionUtils
                                .getGoalContent(content), url);
                    }
                }

            } catch (ClientProtocolException e)
            {
                e.printStackTrace();
            } catch (IOException e)
            {
                e.printStackTrace();
            } finally
            {
                client.getConnectionManager().shutdow