We can do web crawling with rvest in R. In the code below, url is the link to the page you want to scrape, and read_html(url) %>% html_nodes("section p") %>% html_text() extracts the text wrapped in p tags that are nested under a section tag on that page.
A simple example:
library(rvest)  # also provides the %>% pipe
# Link to the page to scrape
url <- 'http://yahooanswerstw.tumblr.com/post/117661866896/%E5%B8%B8%E8%A6%8B%E5%95%8F%E9%A1%8C%E6%94%B9%E7%89%88%E5%BE%8C%E6%88%91%E5%9C%A8%E8%88%8A%E7%89%88yahoo%E5%A5%87%E6%91%A9%E7%9F%A5%E8%AD%98-%E7%99%BC%E8%A1%A8%E9%81%8E%E7%9A%84%E5%85%A7%E5%AE%B9%E6%9C%83%E6%9C%89%E8%AE%8A%E5%8C%96%E5%97%8E'
# Text of every <p> nested inside a <section>
content_css <- read_html(url) %>% html_nodes("section p") %>% html_text()
# Convert the text from UTF-8 to the native encoding of the current locale
temp <- iconv(content_css, from = 'UTF-8', to = '')
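To see what the "section p" selector matches without touching a live website, here is a minimal sketch that parses an inline HTML string instead of a URL. The HTML snippet and the variable names page and inside are made up for illustration; only read_html(), html_nodes() and html_text() come from the code above.

```r
library(rvest)

# Hypothetical page: two <p> elements inside <section>, one <p> outside it
page <- read_html('<html><body>
  <section><p>first</p><div><p>second</p></div></section>
  <p>outside</p>
</body></html>')

# "section p" is a descendant selector: it matches any <p> nested
# anywhere inside a <section>, not only direct children
inside <- page %>% html_nodes("section p") %>% html_text()
print(inside)
```

The result should contain "first" and "second" but not "outside", since the last paragraph is not under a section tag.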
If we want to crawl several known pages, we can put the URLs in a vector and use a for loop to collect the text from each page. The part to watch is Sys.sleep(runif(1,2,4)), which pauses the program for a random 2 to 4 seconds between requests. Without a pause, the requests hit the same website in rapid succession, which can look like a DDoS attack, and the website may block you so that you temporarily cannot access it.
An example:
library(rvest)
# Pages to scrape
links <- c('http://yahooanswerstw.tumblr.com/post/127239659396/%E5%B8%B8%E8%A6%8B%E5%95%8F%E9%A1%8C%E6%96%B0%E7%89%88%E8%A8%88%E7%AE%97%E7%AD%89%E7%B4%9A%E6%99%82%E9%82%84%E6%9C%83%E7%9C%8B%E6%8E%A1%E7%94%A8%E7%8E%87%E5%97%8E',
           'http://yahooanswerstw.tumblr.com/post/117661866896/%E5%B8%B8%E8%A6%8B%E5%95%8F%E9%A1%8C%E6%94%B9%E7%89%88%E5%BE%8C%E6%88%91%E5%9C%A8%E8%88%8A%E7%89%88yahoo%E5%A5%87%E6%91%A9%E7%9F%A5%E8%AD%98-%E7%99%BC%E8%A1%A8%E9%81%8E%E7%9A%84%E5%85%A7%E5%AE%B9%E6%9C%83%E6%9C%89%E8%AE%8A%E5%8C%96%E5%97%8E')
data <- c()
for (i in seq_along(links)) {
  url <- links[i]
  # Text of every <p> nested inside a <section>
  content_css <- read_html(url) %>% html_nodes("section p") %>% html_text()
  # Convert the text from UTF-8 to the native encoding of the current locale
  temp <- iconv(content_css, from = 'UTF-8', to = '')
  data <- c(data, temp)
  # Sleep 2 to 4 seconds before the next request
  Sys.sleep(runif(1, 2, 4))
}
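One possible variation on the loop above, as a rough sketch: wrap the fetch in tryCatch so that a single unreachable page does not stop the whole crawl. This error handling is an addition and not part of the original code. The function names fetch_paragraphs and crawl_politely and the parameters min_wait and max_wait are hypothetical, chosen only for this illustration.

```r
library(rvest)

# Hypothetical helper: return the "section p" text of one page,
# or an empty character vector if the page cannot be read
fetch_paragraphs <- function(src) {
  tryCatch(
    read_html(src) %>% html_nodes("section p") %>% html_text(),
    error = function(e) {
      message("Skipping source: ", conditionMessage(e))
      character(0)
    }
  )
}

# Hypothetical helper: visit every source in turn and sleep a random
# min_wait..max_wait seconds after each one (2 to 4 by default)
crawl_politely <- function(sources, min_wait = 2, max_wait = 4) {
  results <- vector("list", length(sources))
  for (i in seq_along(sources)) {
    results[[i]] <- fetch_paragraphs(sources[i])
    Sys.sleep(runif(1, min_wait, max_wait))
  }
  # Flatten the per-page results into one character vector
  unlist(results)
}
```

With the links vector from the example above, data <- crawl_politely(links) would collect the same text as the loop while skipping any page that fails to load.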