Jay Taylor's notes

fanyang01/crawler

Original source (github.com)
Tags: golang go web-crawler github.com
Clipped on: 2018-06-14
Web crawler
_sitemeta Improve queue implementation 2 years ago
bloom Add package document 2 years ago
cache Fix CI 2 years ago
codec Add package document 2 years ago
download Small changes inspired by example project 2 years ago
electron Add an example project and fix problems found by it 2 years ago
example/static-crawler Update README 2 years ago
extract Change rate limit implementation 2 years ago
media Improve error handling 2 years ago
mux Small changes inspired by example project 2 years ago
proxy Fix CI 2 years ago
queue Fix diskqueue test 2 years ago
ratelimit Small changes inspired by example project 2 years ago
sample Change rate limit implementation 2 years ago
sim Small changes inspired by example project 2 years ago
sitemap Add package document 2 years ago
storage Fix #4; add Close() to the Store interface 2 years ago
urlx Small changes inspired by example project 2 years ago
util Don't enqueue redirect URL by default; reduce alloction 2 years ago
.gitignore Improve queue implementation 2 years ago
README.md Update README 2 years ago
all_test.go Update README 2 years ago
circle.yml Fix CI 2 years ago
client.go Make client error retryable 2 years ago
client_test.go Improve error handling 2 years ago
config.go Add an example project and fix problems found by it 2 years ago
context.go Small changes inspired by example project 2 years ago
crawler.go Update README 2 years ago
crawler.png Update README 2 years ago
crawler.xml Update README 2 years ago
ctrl.go Don't enqueue redirect URL by default; reduce alloction 2 years ago
error.go Fix #4; add Close() to the Store interface 2 years ago
fetch.go Add an example project and fix problems found by it 2 years ago
fetch_test.go Adjust unit test 2 years ago
glide.lock Add an example project and fix problems found by it 2 years ago
glide.yaml Add an example project and fix problems found by it 2 years ago
godoc_test.go Update README 2 years ago
handle.go Make client error retryable 2 years ago
make.go Small changes inspired by example project 2 years ago
memqueue.go Derive new context at the scheduler 2 years ago
memqueue_test.go Derive new context at the scheduler 2 years ago
option.go Don't enqueue redirect URL by default; reduce alloction 2 years ago
request.go Small changes inspired by example project 2 years ago
response.go Make client error retryable 2 years ago
schedule.go Small changes inspired by example project 2 years ago
store.go Fix #4; add Close() to the Store interface 2 years ago
url.go Improve error handling 2 years ago
wercker.yml Fix CI 2 years ago
worker.go Add an example project and fix problems found by it 2 years ago

README.md

crawler

crawler is a flexible web crawler framework written in Go.

Quick Start

package main

import (
	"log"
	"net/url"
	"strings"

	"github.com/fanyang01/crawler"
)

// controller customizes the crawl by overriding Accept and Handle.
type controller struct {
	crawler.NopController
}

// Accept restricts the crawl to package documentation under golang.org/pkg/.
func (c *controller) Accept(_ *crawler.Response, u *url.URL) bool {
	return u.Host == "golang.org" && strings.HasPrefix(u.Path, "/pkg/")
}

// Handle logs each fetched URL and extracts hyperlinks for further crawling.
func (c *controller) Handle(r *crawler.Response, ch chan<- *url.URL) {
	log.Println(r.URL.String())
	crawler.ExtractHref(r.NewURL, r.Body, ch)
}

func main() {
	ctrl := &controller{}
	c := crawler.New(&crawler.Config{
		Controller: ctrl,
	})
	if err := c.Crawl("https://golang.org"); err != nil {
		log.Fatal(err)
	}
	c.Wait()
}
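
Because controller embeds crawler.NopController, it only has to override the hooks it cares about; the embedded type supplies implementations for the remaining methods of the Controller interface described below.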

Design

Controller interface:

// Controller controls the working progress of the crawler.
type Controller interface {
	// Prepare sets options (client, headers, ...) for an HTTP request.
	Prepare(req *Request)

	// Handle handles a response (writing to disk/DB, ...). Handle should
	// also extract hyperlinks from the response and send them to the
	// channel. Note that r.NewURL may differ from r.URL if r.URL has been
	// redirected, so r.NewURL should also be sent to the channel if
	// following redirects is expected.
	Handle(r *Response, ch chan<- *url.URL)

	// Accept determines whether a URL should be processed. It is redundant
	// because you can do this in Handle, but it is provided for
	// convenience. It acts as a filter that prevents unnecessary URLs
	// from being processed.
	Accept(r *Response, u *url.URL) bool

	// Sched issues a ticket for a new URL. The ticket specifies the next
	// time at which this URL should be crawled.
	Sched(r *Response, u *url.URL) Ticket

	// Resched is like Sched, but for URLs that have been crawled at least
	// once. If r.URL should not be crawled again, return true for done.
	Resched(r *Response) (done bool, t Ticket)

	// Retry gives the delay before a retry and the maximum number of retries.
	Retry(c *Context) (delay time.Duration, max int)

	// Etc.
}
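
The sketch below shows these hooks in use: it narrows the crawl to a single host, extracts links as in the Quick Start, and bounds retries. The host filter, seed URL, and retry values are illustrative choices, not defaults taken from the library.

package main

import (
	"log"
	"net/url"
	"time"

	"github.com/fanyang01/crawler"
)

// myController is a sketch only: the host filter and the retry policy
// below are illustrative values, not library defaults.
type myController struct {
	crawler.NopController
}

// Accept keeps the crawl inside a single host.
func (c *myController) Accept(_ *crawler.Response, u *url.URL) bool {
	return u.Host == "example.com"
}

// Handle logs each response and feeds extracted links back to the crawler.
func (c *myController) Handle(r *crawler.Response, ch chan<- *url.URL) {
	log.Println(r.URL.String())
	crawler.ExtractHref(r.NewURL, r.Body, ch)
}

// Retry asks for a 10-second delay before each retry and at most 3 retries.
func (c *myController) Retry(_ *crawler.Context) (delay time.Duration, max int) {
	return 10 * time.Second, 3
}

func main() {
	c := crawler.New(&crawler.Config{Controller: &myController{}})
	if err := c.Crawl("https://example.com"); err != nil {
		log.Fatal(err)
	}
	c.Wait()
}

Sched and Resched can be overridden in the same way to control when a URL is first crawled and when it is revisited.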

Flowchart: see crawler.png in the repository.
