Read robots.txt and crawl according to its rules #154

Open

futabato opened this issue Apr 9, 2024 · 1 comment

Labels: Crawler (Crawler module), good first issue (Good for newcomers), help wanted (Extra attention is needed), priority: high (High Priority Issue)

futabato commented Apr 9, 2024

Overview

robots.txt tells search-engine crawlers which URLs on a site they may access.
This issue is to read robots.txt and perform crawling that follows its rules.

Approach

This likely breaks down into two stages (a rough end-to-end sketch follows the list):

  1. Fetch the contents of robots.txt and understand its rules, using a robots.txt parser such as grobotstxt.

  2. Crawl the site, also making use of the information recorded in robots.txt.
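
Below is a minimal sketch of how the two stages might fit together, using the temoto/robotstxt parser suggested in the comment below rather than grobotstxt. The base URL, the MyCrawler user agent, and the fetchRobots helper are hypothetical illustrations, not the final design.

package main

import (
    "fmt"
    "io"
    "net/http"

    "github.com/temoto/robotstxt"
)

// fetchRobots downloads and parses <base>/robots.txt
// (hypothetical helper; error handling kept minimal).
func fetchRobots(base string) (*robotstxt.RobotsData, error) {
    resp, err := http.Get(base + "/robots.txt")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return robotstxt.FromResponse(resp)
}

func main() {
    base := "https://example.com" // hypothetical crawl target
    agent := "MyCrawler"          // hypothetical user agent string

    // Stage 1: fetch robots.txt and parse it into rule groups.
    robots, err := fetchRobots(base)
    if err != nil {
        panic(err)
    }
    group := robots.FindGroup(agent)

    // Stage 2: consult the parsed rules before every page fetch.
    for _, path := range []string{"/", "/private/secret"} {
        if !group.Test(path) {
            fmt.Println("skipping disallowed path:", path)
            continue
        }
        resp, err := http.Get(base + path)
        if err != nil {
            continue
        }
        io.Copy(io.Discard, resp.Body) // stand-in for real page processing
        resp.Body.Close()
    }
}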
futabato (Collaborator) commented:

robots.txt parser: https://github.com/temoto/robotstxt

import "github.com/temoto/robotstxt"

func main() {
    data, err := robotstxt.FromBytes([]byte("user-agent: *\ndisallow: /private\n"))
    if err != nil {
        // handle error
    }
    
    group := data.FindGroup("*")
    fmt.Println(group.Test("/private"))  // false, disallowed
    fmt.Println(group.Test("/public"))   // true, allowed
}
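
Possibly also useful for stage 2: when robots.txt carries a Crawl-delay directive, temoto/robotstxt appears to expose it as a time.Duration on the matched group (the Group.CrawlDelay field), so a polite crawler can wait that long between requests. A minimal sketch, assuming that field:

package main

import (
    "fmt"

    "github.com/temoto/robotstxt"
)

func main() {
    data, err := robotstxt.FromString("user-agent: *\ncrawl-delay: 2\n")
    if err != nil {
        panic(err)
    }
    // CrawlDelay is a time.Duration; sleep this long between fetches.
    group := data.FindGroup("*")
    fmt.Println(group.CrawlDelay) // 2s
}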

sitemap.xml handling: https://github.com/ikeikeikeike/go-sitemap-generator (note: this library generates sitemap.xml rather than parsing it)

package main

import (
    "fmt"

    gsitemap "github.com/ikeikeikeike/go-sitemap-generator/stm"
)

func main() {
    sm := gsitemap.NewSitemap(1)
    sm.SetDefaultHost("https://example.com") // placeholder host
    sm.Create()
    fmt.Println(string(sm.XMLContent())) // XMLContent returns []byte
}
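
One way to connect the two halves: Sitemap: lines in robots.txt are collected by temoto/robotstxt at the top level (the RobotsData.Sitemaps field, assuming the current API), so the crawler can discover sitemap URLs while parsing robots.txt. A small sketch:

package main

import (
    "fmt"

    "github.com/temoto/robotstxt"
)

func main() {
    txt := "user-agent: *\ndisallow: /private\nsitemap: https://example.com/sitemap.xml\n"
    data, err := robotstxt.FromString(txt)
    if err != nil {
        panic(err)
    }
    // Sitemap directives live outside any user-agent group.
    for _, u := range data.Sitemaps {
        fmt.Println("sitemap to crawl:", u)
    }
}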
