For example, to crawl every article on page 1657 of the Marvel board, I need to read the href attribute of the a tag under .title. The structure is much the same as in the previous post; the difference is that html_attr('href') is used to extract the links, and those links are then used to crawl each article's content. To crawl articles across multiple pages of the Marvel board, simply wrap this in a for loop (see the sketch after the code below). Example code is given below.
When crawling web pages, URLs do not always follow a predictable pattern, so we can use html_attr('href') to collect a page's article URLs from its index page, then use those URLs to crawl the content. The code is as follows:
library(rvest)

# Read the index page of the Marvel board and collect every article link
url <- 'https://www.ptt.cc/bbs/marvel/index1657.html'
links_data_ptt <- read_html(url) %>% html_nodes(".title a") %>% html_attr('href')

# Visit each article link and collect its main content
ptt_data <- c()
for (i in 1:length(links_data_ptt)) {
  url <- paste0('https://www.ptt.cc', links_data_ptt[i])
  content_css <- read_html(url) %>% html_nodes("#main-content") %>% html_text()
  utf8_text_content <- iconv(content_css, 'utf8')
  ptt_data <- c(ptt_data, utf8_text_content)
  Sys.sleep(runif(1, 2, 5))  # random pause between requests to avoid hammering the server
}
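To crawl several index pages of the Marvel board, the same idea extends with an outer for loop over the page numbers. Below is a minimal sketch, assuming the PTT index URLs differ only in their page number; the page range 1655:1657 and the variable name all_links are illustrative choices of mine, not part of the original example:

library(rvest)

# Collect article links from a range of index pages (page range is illustrative)
all_links <- c()
for (page in 1655:1657) {
  index_url <- paste0('https://www.ptt.cc/bbs/marvel/index', page, '.html')
  page_links <- read_html(index_url) %>% html_nodes(".title a") %>% html_attr('href')
  all_links <- c(all_links, page_links)
  Sys.sleep(runif(1, 2, 5))  # pause between index-page requests as well
}
# all_links can then replace links_data_ptt in the article-crawling loop above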
(Another write-up with a code example is on GitHub: Simple way to scrape website by rvest)
Related article: [R] 使用rvest進行網路爬蟲