[R] Web Scraping with rvest (Part 2)

When scraping the web, page URLs do not always follow a regular pattern (PTT articles are one example). In that case, first collect each page's URL from the site's index pages (on PTT, the URL of every index page does follow a pattern), then crawl the URLs you collected.

For example, to scrape every article on page 1657 of the marvel board, read the href attribute of the a tag under .title. The structure is much the same as in the previous post; the difference is that html_attr('href') is used to get the links. Once you have the links, use them to fetch the article contents. To scrape articles from several index pages of the marvel board, wrap the same steps in a for loop.
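
The two steps described above (collect links from an index page, then visit each link) can be sketched roughly as follows. This is a minimal illustration and not the author's original code: the helper names get_article_urls and crawl_pages are made up, and the index URL pattern, the "#main-content" selector, and the sample hrefs are assumptions about PTT's markup that may need adjusting. Some boards may also show an age-confirmation page first, which this sketch does not handle.

```r
library(rvest)

# Hypothetical helper (not from the original post): collect the article links
# listed on one index page. Assumes each post title is an <a> inside an
# element with class "title"; entries without a link are simply skipped.
get_article_urls <- function(index_page, base_url = "https://www.ptt.cc") {
  hrefs <- index_page %>%
    html_nodes(".title a") %>%
    html_attr("href")
  if (length(hrefs) == 0) {
    return(character(0))
  }
  # The hrefs are site-relative, so prepend the host to get full URLs.
  paste0(base_url, hrefs)
}

# Offline stand-in for an index page, so the link-extraction step can be tried
# without network access. The hrefs here are invented for illustration.
sample_index <- read_html('
  <div class="r-ent"><div class="title"><a href="/bbs/marvel/M.1.A.001.html">Post A</a></div></div>
  <div class="r-ent"><div class="title">(deleted post, no link)</div></div>
  <div class="r-ent"><div class="title"><a href="/bbs/marvel/M.2.A.002.html">Post B</a></div></div>
')
article_urls <- get_article_urls(sample_index)
print(article_urls)

# Hypothetical crawler for several index pages. The index URL pattern and the
# "#main-content" selector are assumptions, not taken from the original post.
crawl_pages <- function(page_numbers, board = "marvel") {
  contents <- list()
  for (n in page_numbers) {
    index_url <- sprintf("https://www.ptt.cc/bbs/%s/index%d.html", board, n)
    for (url in get_article_urls(read_html(index_url))) {
      contents[[url]] <- read_html(url) %>%
        html_node("#main-content") %>%
        html_text()
      Sys.sleep(1)  # pause between requests to go easy on the server
    }
  }
  contents
}
```

With network access, a call such as crawl_pages(1657:1658) would return a named list keyed by article URL. On rvest 1.0.0 or later, html_elements() and html_element() are the preferred names for html_nodes() and html_node().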

(A separate write-up with a code example is on GitHub: Simple way to scrape website by rvest)

Related article: [R] Web Scraping with rvest

Mao's notes.
