Add Golang package #348
Merged
Changes from 15 commits (23 commits in total)
f440902 fix typos in dates in JSON (starius)
fb4ca37 add Go package (starius)
1e03d93 README: update Go instructions (starius)
099a79d golang: add benchmark (starius)
eb84428 golang: use go-re2 for regular expresion matching (starius)
b741c0b add github workflow for Go (starius)
eacd6fd golang: add benchmark for MatchingCrawlers (starius)
65e66d1 golang: speed-up MatchingCrawlers (starius)
51acf89 README: instruct to install C++ RE2 (starius)
3f3adbf fix link in README (starius)
18e9e1b Merge branch 'master' into golang (monperrus)
ecb680d Merge branch 'master' into golang (monperrus)
9262954 simplify CI (monperrus)
473be9d simplify CI (monperrus)
ee4872b explicit validation (monperrus)
cc59a03 golang: println(pattern), use pattern as subtest (starius)
c402815 README: add example of Go program (starius)
0bc397e golang: remove copy-paste from benchmark test (starius)
9fc7a2e golang: benchmark on browser UA (starius)
266245e golang: don't use stretchr/testify (starius)
f08c3de golang: check results in banchmark, test FP (starius)
a32a149 golang: switch back to standard Go Regexp (starius)
2945d30 golang: add test against false negatives (starius)
@@ -23,3 +23,4 @@ jobs:
 - run: py.test -vv
 - run: python3 validate.py
 - run: php validate.php
+- run: go test
@@ -0,0 +1,16 @@
module github.com/monperrus/crawler-user-agents

go 1.19

require (
	github.com/stretchr/testify v1.9.0
	github.com/wasilibs/go-re2 v1.5.1
)

require (
	github.com/davecgh/go-spew v1.1.1 // indirect
	github.com/magefile/mage v1.14.0 // indirect
	github.com/pmezard/go-difflib v1.0.0 // indirect
	github.com/tetratelabs/wazero v1.7.0 // indirect
	gopkg.in/yaml.v3 v3.0.1 // indirect
)
@@ -0,0 +1,15 @@
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/magefile/mage v1.14.0 h1:6QDX3g6z1YvJ4olPhT1wksUcSa/V0a1B+pJb73fBjyo=
github.com/magefile/mage v1.14.0/go.mod h1:z5UZb/iS3GoOSn0JgWuiw7dxlurVYTu+/jHXqQg881A=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/tetratelabs/wazero v1.7.0 h1:jg5qPydno59wqjpGrHph81lbtHzTrWzwwtD4cD88+hQ=
github.com/tetratelabs/wazero v1.7.0/go.mod h1:ytl6Zuh20R/eROuyDaGPkp82O9C/DJfXAwJfQ3X6/7Y=
github.com/wasilibs/go-re2 v1.5.1 h1:a+Gb1mx6Q7MmU4d+3BCnnN28U2/cnADmY1oRRanQi10=
github.com/wasilibs/go-re2 v1.5.1/go.mod h1:UqqxQ1O99boQUm1r61H/IYGiGQOS/P88K7hU5nLNkEg=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
@@ -0,0 +1,183 @@
package agents

import (
	_ "embed"
	"encoding/json"
	"fmt"
	"strings"
	"time"

	regexp "github.com/wasilibs/go-re2"
)

//go:embed crawler-user-agents.json
var crawlersJson []byte

// Crawler contains information about one crawler.
type Crawler struct {
	// Regexp of User Agent of the crawler.
	Pattern string `json:"pattern"`

	// Discovery date.
	AdditionDate time.Time `json:"addition_date"`

	// Official url of the robot.
	URL string `json:"url"`

	// Examples of full User Agent strings.
	Instances []string `json:"instances"`
}

// Private type needed to convert addition_date from/to the format used in JSON.
type jsonCrawler struct {
	Pattern      string   `json:"pattern"`
	AdditionDate string   `json:"addition_date"`
	URL          string   `json:"url"`
	Instances    []string `json:"instances"`
}

const timeLayout = "2006/01/02"

func (c Crawler) MarshalJSON() ([]byte, error) {
	jc := jsonCrawler{
		Pattern:      c.Pattern,
		AdditionDate: c.AdditionDate.Format(timeLayout),
		URL:          c.URL,
		Instances:    c.Instances,
	}
	return json.Marshal(jc)
}

func (c *Crawler) UnmarshalJSON(b []byte) error {
	var jc jsonCrawler
	if err := json.Unmarshal(b, &jc); err != nil {
		return err
	}

	c.Pattern = jc.Pattern
	c.URL = jc.URL
	c.Instances = jc.Instances

	if c.Pattern == "" {
		return fmt.Errorf("empty pattern in record %s", string(b))
	}

	if jc.AdditionDate != "" {
		tim, err := time.ParseInLocation(timeLayout, jc.AdditionDate, time.UTC)
		if err != nil {
			return err
		}
		c.AdditionDate = tim
	}

	return nil
}

// The list of crawlers, built from contents of crawler-user-agents.json.
var Crawlers = func() []Crawler {
	var crawlers []Crawler
	if err := json.Unmarshal(crawlersJson, &crawlers); err != nil {
		panic(err)
	}
	return crawlers
}()

func joinRes(begin, end int) string {
	regexps := make([]string, 0, len(Crawlers))
	for _, crawler := range Crawlers[begin:end] {
		regexps = append(regexps, "("+crawler.Pattern+")")
	}
	return strings.Join(regexps, "|")
}

var allRegexps = joinRes(0, len(Crawlers))

var allRegexpsRe = regexp.MustCompile(allRegexps)

// Returns whether the User Agent string matches any of the crawler patterns.
func IsCrawler(userAgent string) bool {
	return allRegexpsRe.MatchString(userAgent)
}

// With RE2 it is fast to check the text against a large regexp.
// To find matching regexps faster, build a binary tree of regexps.

type regexpNode struct {
	re    *regexp.Regexp
	left  *regexpNode
	right *regexpNode
	index int
}

var regexpsTree = func() *regexpNode {
	nodes := make([]*regexpNode, len(Crawlers))
	starts := make([]int, len(Crawlers)+1)
	for i, crawler := range Crawlers {
		nodes[i] = &regexpNode{
			re:    regexp.MustCompile(crawler.Pattern),
			index: i,
		}
		starts[i] = i
	}
	starts[len(Crawlers)] = len(Crawlers) // To get end of interval.

	for len(nodes) > 1 {
		// Join into pairs.
		nodes2 := make([]*regexpNode, (len(nodes)+1)/2)
		starts2 := make([]int, 0, len(nodes2)+1)
		for i := 0; i < len(nodes)/2; i++ {
			leftIndex := 2 * i
			rightIndex := 2*i + 1
			nodes2[i] = &regexpNode{
				left:  nodes[leftIndex],
				right: nodes[rightIndex],
			}
			if len(nodes2) != 1 {
				// Skip regexp for root node, it is not used.
				joinedRe := joinRes(starts[leftIndex], starts[rightIndex+1])
				nodes2[i].re = regexp.MustCompile(joinedRe)
			}
			starts2 = append(starts2, starts[leftIndex])
		}
		if len(nodes)%2 == 1 {
			nodes2[len(nodes2)-1] = nodes[len(nodes)-1]
			starts2 = append(starts2, starts[len(starts)-2])
		}
		starts2 = append(starts2, starts[len(starts)-1])

		nodes = nodes2
		starts = starts2
	}

	root := nodes[0]

	if root.left == nil {
		panic("the algorithm does not work with just one regexp")
	}

	return root
}()

// Finds all crawlers matching the User Agent and returns the list of their indices in Crawlers.
func MatchingCrawlers(userAgent string) []int {
	indices := []int{}

	var visit func(node *regexpNode)
	visit = func(node *regexpNode) {
		if node.left != nil {
			if node.left.re.MatchString(userAgent) {
				visit(node.left)
			}
			if node.right.re.MatchString(userAgent) {
				visit(node.right)
			}
		} else {
			// Leaf.
			indices = append(indices, node.index)
		}
	}

	visit(regexpsTree)

	return indices
}
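The technique used in this file (one joined alternation for the yes/no check, plus a binary tree of joined regexps that prunes whole ranges of patterns when looking for the ones that match) can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the PR's code: it uses the standard library regexp package rather than go-re2, builds the tree recursively instead of pairing nodes bottom-up, also checks the root regexp, and runs on four made-up patterns. The names node, join, build, and matching are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical sample patterns; the real package loads its list from JSON.
// The sketch assumes at least one pattern.
var patterns = []string{`Googlebot`, `bingbot`, `Google-PageRenderer`, `curl/`}

// node covers a contiguous range of patterns; re matches a string
// if any pattern in that range matches it.
type node struct {
	re          *regexp.Regexp
	left, right *node
	index       int // meaningful for leaves only
}

// join builds a single alternation out of several patterns.
func join(ps []string) string {
	parts := make([]string, 0, len(ps))
	for _, p := range ps {
		parts = append(parts, "("+p+")")
	}
	return strings.Join(parts, "|")
}

// build creates the subtree for patterns[begin:end] by halving the range.
func build(begin, end int) *node {
	n := &node{re: regexp.MustCompile(join(patterns[begin:end]))}
	if end-begin == 1 {
		n.index = begin
		return n
	}
	mid := begin + (end-begin)/2
	n.left = build(begin, mid)
	n.right = build(mid, end)
	return n
}

var root = build(0, len(patterns))

// matching returns the indices of all patterns that match s.
// A subtree is skipped as soon as its joined regexp does not match.
func matching(s string) []int {
	var out []int
	var visit func(n *node)
	visit = func(n *node) {
		if !n.re.MatchString(s) {
			return
		}
		if n.left == nil {
			out = append(out, n.index)
			return
		}
		visit(n.left)
		visit(n.right)
	}
	visit(root)
	return out
}

func main() {
	ua := "Mozilla/5.0 Chrome/56.0 Google-PageRenderer Google"
	fmt.Println(root.re.MatchString(ua))               // true
	fmt.Println(matching(ua))                          // [2]
	fmt.Println(matching("Mozilla/5.0 Firefox/120.0")) // []
}
```

For a string that matches only a few patterns, most subtrees are rejected by a single regexp check near the root, which is the point of the tree.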
@@ -0,0 +1,45 @@
package agents

import (
	"fmt"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestPatterns(t *testing.T) {
	// loading all crawlers with go:embed
	// some validation happens in UnmarshalJSON
	allCrawlers := Crawlers

	// there are at least 10 crawlers
	require.GreaterOrEqual(t, len(allCrawlers), 10)

	for i, crawler := range allCrawlers {
		t.Run(crawler.URL, func(t *testing.T) {
			// print pattern to console for quickcheck in CI
			fmt.Print(crawler.Pattern)

			for _, instance := range crawler.Instances {
				require.True(t, IsCrawler(instance), instance)
				require.Contains(t, MatchingCrawlers(instance), i, instance)
			}
		})
	}
}

func BenchmarkIsCrawler(b *testing.B) {
	userAgent := "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 Google-PageRenderer Google (+https://developers.google.com/+/web/snippet/)"
	b.SetBytes(int64(len(userAgent)))
	for n := 0; n < b.N; n++ {
		IsCrawler(userAgent)
	}
}

func BenchmarkMatchingCrawlers(b *testing.B) {
	userAgent := "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 Google-PageRenderer Google (+https://developers.google.com/+/web/snippet/)"
	b.SetBytes(int64(len(userAgent)))
	for n := 0; n < b.N; n++ {
		MatchingCrawlers(userAgent)
	}
}
Use fmt.Println to print each pattern on a separate line.
Also, maybe it is better to use crawler.Pattern as the subtest name (the first argument of t.Run) and run with go test -v; it will then print each subtest name (which would be a pattern).
great idea, could you do it?
Done. I pushed to the branch.