[R] Web Scraping with rvest

In R we can scrape web pages with the rvest package. Here, url is the link you want to scrape, and read_html(url) %>% html_nodes("section p") %>% html_text() reads that page and extracts the text wrapped in p tags that sit under a section tag. A simple example follows:
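The pipeline above can be sketched as a small, self-contained example. To keep it runnable without network access, an inline HTML string (illustrative, not from the original post) stands in for a live page; read_html() accepts either a URL or a literal HTML string.

```r
library(rvest)

# Inline HTML used as a stand-in for a live page, so the
# example runs offline. In practice you would pass a URL,
# e.g. read_html("https://example.com").
html <- '<html><body>
  <section>
    <p>First paragraph inside a section.</p>
    <p>Second paragraph inside a section.</p>
  </section>
  <p>A paragraph outside any section.</p>
</body></html>'

page <- read_html(html)

# Select every <p> under a <section>, then pull out the text.
texts <- page %>% html_nodes("section p") %>% html_text()
print(texts)
# "First paragraph inside a section." "Second paragraph inside a section."
```

Note that newer versions of rvest recommend html_elements() as the successor to html_nodes(); both accept the same CSS selectors.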



If we want to scrape several known links, we can put the URLs in a vector and fetch each page's content in a loop. The detail to watch is Sys.sleep(runif(1, 2, 4)), which pauses the program for a random 2 to 4 seconds between requests. Without the pause we would hit the same website too frequently (much like a denial-of-service attack), and the site may temporarily block our access. An example follows:
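The loop can be sketched as below. As before, inline HTML strings (illustrative) stand in for real pages so the sketch runs offline; with real URLs on the same site, keep the Sys.sleep() call between requests.

```r
library(rvest)

# Stand-in pages so the sketch runs offline; in practice
# `pages` would be a character vector of real URLs.
pages <- c(
  '<html><section><p>Page one text.</p></section></html>',
  '<html><section><p>Page two text.</p></section></html>'
)

results <- vector("list", length(pages))
for (i in seq_along(pages)) {
  results[[i]] <- read_html(pages[i]) %>%
    html_nodes("section p") %>%
    html_text()

  # Pause a random 2-4 seconds between requests so we do not
  # hammer the same website and get ourselves blocked.
  Sys.sleep(runif(1, 2, 4))
}
print(results)
```

Storing each page's result in a pre-allocated list keeps all the scraped text together for later processing, rather than overwriting it on each iteration.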

(Another write-up with a code example is on GitHub: Simple way to scrape website by rvest.)
