Read robots.txt and crawl according to its rules #154

Open

futabato opened this issue Apr 9, 2024 · 1 comment

Labels: Crawler (Crawler module), good first issue (Good for newcomers), help wanted (Extra attention is needed), priority: high (High Priority Issue)

futabato commented Apr 9, 2024

Overview

robots.txt tells search-engine crawlers which URLs on a site they may access.
This issue is to read robots.txt and perform crawling that follows its rules.

Approach

This likely breaks down into two stages (a rough end-to-end sketch follows the list):

  1. Fetch the contents of robots.txt and understand its rules, using a robots.txt parser such as grobotstxt.

  2. Crawl the site, also making use of the information recorded in robots.txt.
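
Below is a minimal sketch of how the two stages might fit together, using the temoto/robotstxt parser suggested in the comment below rather than grobotstxt. The base URL, the MyCrawler user agent, and the fetchRobots helper are hypothetical illustrations, not the final design.

package main

import (
    "fmt"
    "io"
    "net/http"

    "github.com/temoto/robotstxt"
)

// fetchRobots downloads and parses <base>/robots.txt
// (hypothetical helper; error handling kept minimal).
func fetchRobots(base string) (*robotstxt.RobotsData, error) {
    resp, err := http.Get(base + "/robots.txt")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return robotstxt.FromResponse(resp)
}

func main() {
    base := "https://example.com" // hypothetical crawl target
    agent := "MyCrawler"          // hypothetical user agent string

    // Stage 1: fetch robots.txt and parse it into rule groups.
    robots, err := fetchRobots(base)
    if err != nil {
        panic(err)
    }
    group := robots.FindGroup(agent)

    // Stage 2: consult the parsed rules before every page fetch.
    for _, path := range []string{"/", "/private/secret"} {
        if !group.Test(path) {
            fmt.Println("skipping disallowed path:", path)
            continue
        }
        resp, err := http.Get(base + path)
        if err != nil {
            continue
        }
        io.Copy(io.Discard, resp.Body) // stand-in for real page processing
        resp.Body.Close()
    }
}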
futabato (Collaborator) commented:

robots.txt parser: https://github.com/temoto/robotstxt

import "github.com/temoto/robotstxt"

func main() {
    data, err := robotstxt.FromBytes([]byte("user-agent: *\ndisallow: /private\n"))
    if err != nil {
        // handle error
    }
    
    group := data.FindGroup("*")
    fmt.Println(group.Test("/private"))  // false, disallowed
    fmt.Println(group.Test("/public"))   // true, allowed
}
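
Possibly also useful for stage 2: when robots.txt carries a Crawl-delay directive, temoto/robotstxt appears to expose it as a time.Duration on the matched group (the Group.CrawlDelay field), so a polite crawler can wait that long between requests. A minimal sketch, assuming that field:

package main

import (
    "fmt"

    "github.com/temoto/robotstxt"
)

func main() {
    data, err := robotstxt.FromString("user-agent: *\ncrawl-delay: 2\n")
    if err != nil {
        panic(err)
    }
    // CrawlDelay is a time.Duration; sleep this long between fetches.
    group := data.FindGroup("*")
    fmt.Println(group.CrawlDelay) // 2s
}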

sitemap.xml handling: https://github.com/ikeikeikeike/go-sitemap-generator (note: this library generates sitemap.xml rather than parsing it)

package main

import (
    "fmt"

    gsitemap "github.com/ikeikeikeike/go-sitemap-generator/stm"
)

func main() {
    sm := gsitemap.NewSitemap(1)
    sm.SetDefaultHost("https://example.com") // placeholder host
    sm.Create()
    fmt.Println(string(sm.XMLContent())) // XMLContent returns []byte
}
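
One way to connect the two halves: Sitemap: lines in robots.txt are collected by temoto/robotstxt at the top level (the RobotsData.Sitemaps field, assuming the current API), so the crawler can discover sitemap URLs while parsing robots.txt. A small sketch:

package main

import (
    "fmt"

    "github.com/temoto/robotstxt"
)

func main() {
    txt := "user-agent: *\ndisallow: /private\nsitemap: https://example.com/sitemap.xml\n"
    data, err := robotstxt.FromString(txt)
    if err != nil {
        panic(err)
    }
    // Sitemap directives live outside any user-agent group.
    for _, u := range data.Sitemaps {
        fmt.Println("sitemap to crawl:", u)
    }
}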
