Jay Taylor's notes

fanyang01/crawler

Original source (github.com)
Tags: golang go web-crawler github.com
Clipped on: 2018-06-14
Web crawler
_sitemeta Improve queue implementation 2 years ago
bloom Add package document 2 years ago
cache Fix CI 2 years ago
codec Add package document 2 years ago
download Small changes inspired by example project 2 years ago
electron Add an example project and fix problems found by it 2 years ago
example/static-crawler Update README 2 years ago
extract Change rate limit implementation 2 years ago
media Improve error handling 2 years ago
mux Small changes inspired by example project 2 years ago
proxy Fix CI 2 years ago
queue Fix diskqueue test 2 years ago
ratelimit Small changes inspired by example project 2 years ago
sample Change rate limit implementation 2 years ago
sim Small changes inspired by example project 2 years ago
sitemap Add package document 2 years ago
storage Fix #4; add Close() to the Store interface 2 years ago
urlx Small changes inspired by example project 2 years ago
util Don't enqueue redirect URL by default; reduce alloction 2 years ago
.gitignore Improve queue implementation 2 years ago
README.md Update README 2 years ago
all_test.go Update README 2 years ago
circle.yml Fix CI 2 years ago
client.go Make client error retryable 2 years ago
client_test.go Improve error handling 2 years ago
config.go Add an example project and fix problems found by it 2 years ago
context.go Small changes inspired by example project 2 years ago
crawler.go Update README 2 years ago
crawler.png Update README 2 years ago
crawler.xml Update README 2 years ago
ctrl.go Don't enqueue redirect URL by default; reduce alloction 2 years ago
error.go Fix #4; add Close() to the Store interface 2 years ago
fetch.go Add an example project and fix problems found by it 2 years ago
fetch_test.go Adjust unit test 2 years ago
glide.lock Add an example project and fix problems found by it 2 years ago
glide.yaml Add an example project and fix problems found by it 2 years ago
godoc_test.go Update README 2 years ago
handle.go Make client error retryable 2 years ago
make.go Small changes inspired by example project 2 years ago
memqueue.go Derive new context at the scheduler 2 years ago
memqueue_test.go Derive new context at the scheduler 2 years ago
option.go Don't enqueue redirect URL by default; reduce alloction 2 years ago
request.go Small changes inspired by example project 2 years ago
response.go Make client error retryable 2 years ago
schedule.go Small changes inspired by example project 2 years ago
store.go Fix #4; add Close() to the Store interface 2 years ago
url.go Improve error handling 2 years ago
wercker.yml Fix CI 2 years ago
worker.go Add an example project and fix problems found by it 2 years ago

README.md

crawler

crawler is a flexible web crawler framework written in Go.

Quick Start

package main

import (
	"log"
	"net/url"
	"strings"

	"github.com/fanyang01/crawler"
)

// controller customizes the crawl by overriding Accept and Handle.
type controller struct {
	crawler.NopController
}

// Accept restricts the crawl to package documentation under golang.org/pkg/.
func (c *controller) Accept(_ *crawler.Response, u *url.URL) bool {
	return u.Host == "golang.org" && strings.HasPrefix(u.Path, "/pkg/")
}

// Handle logs each fetched URL and extracts hyperlinks for further crawling.
func (c *controller) Handle(r *crawler.Response, ch chan<- *url.URL) {
	log.Println(r.URL.String())
	crawler.ExtractHref(r.NewURL, r.Body, ch)
}

func main() {
	ctrl := &controller{}
	c := crawler.New(&crawler.Config{
		Controller: ctrl,
	})
	if err := c.Crawl("https://golang.org"); err != nil {
		log.Fatal(err)
	}
	c.Wait()
}
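
Because controller embeds crawler.NopController, it only has to override the hooks it cares about; the embedded type supplies implementations for the remaining methods of the Controller interface described below.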

Design

Controller interface:

// Controller controls the working progress of the crawler.
type Controller interface {
	// Prepare sets options (client, headers, ...) for an HTTP request.
	Prepare(req *Request)

	// Handle handles a response (writing to disk/DB, ...). Handle should
	// also extract hyperlinks from the response and send them to the
	// channel. Note that r.NewURL may differ from r.URL if r.URL has been
	// redirected, so r.NewURL should also be sent to the channel if
	// following redirects is expected.
	Handle(r *Response, ch chan<- *url.URL)

	// Accept determines whether a URL should be processed. It is redundant
	// because you can do this in Handle, but it is provided for
	// convenience. It acts as a filter that prevents unnecessary URLs
	// from being processed.
	Accept(r *Response, u *url.URL) bool

	// Sched issues a ticket for a new URL. The ticket specifies the next
	// time at which this URL should be crawled.
	Sched(r *Response, u *url.URL) Ticket

	// Resched is like Sched, but for URLs that have been crawled at least
	// once. If r.URL should not be crawled again, return true for done.
	Resched(r *Response) (done bool, t Ticket)

	// Retry gives the delay before a retry and the maximum number of retries.
	Retry(c *Context) (delay time.Duration, max int)

	// Etc.
}
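
The sketch below shows these hooks in use: it narrows the crawl to a single host, extracts links as in the Quick Start, and bounds retries. The host filter, seed URL, and retry values are illustrative choices, not defaults taken from the library.

package main

import (
	"log"
	"net/url"
	"time"

	"github.com/fanyang01/crawler"
)

// myController is a sketch only: the host filter and the retry policy
// below are illustrative values, not library defaults.
type myController struct {
	crawler.NopController
}

// Accept keeps the crawl inside a single host.
func (c *myController) Accept(_ *crawler.Response, u *url.URL) bool {
	return u.Host == "example.com"
}

// Handle logs each response and feeds extracted links back to the crawler.
func (c *myController) Handle(r *crawler.Response, ch chan<- *url.URL) {
	log.Println(r.URL.String())
	crawler.ExtractHref(r.NewURL, r.Body, ch)
}

// Retry asks for a 10-second delay before each retry and at most 3 retries.
func (c *myController) Retry(_ *crawler.Context) (delay time.Duration, max int) {
	return 10 * time.Second, 3
}

func main() {
	c := crawler.New(&crawler.Config{Controller: &myController{}})
	if err := c.Crawl("https://example.com"); err != nil {
		log.Fatal(err)
	}
	c.Wait()
}

Sched and Resched can be overridden in the same way to control when a URL is first crawled and when it is revisited.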

Flowchart: see crawler.png in the repository.
