gocolly/colly


Elegant Scraper and Crawler Framework for Golang http://go-colly.org/

golang scraper framework crawler scraping crawling spider go

Go 100.0%


Latest commit 488e630 a day ago

asciimoo [fix] reduce cyclomatic complexity

.github	Added .github/ISSUE_TEMPLATE.md (optional)	6 months ago
_examples	[fix] update instagram scraper according to site changes	2 days ago
cmd/colly	[mod] add license header	2 months ago
debug	[mod] add license header	2 months ago
extensions	[fix] package comment should be of the form "Package extensions ..."	2 months ago
proxy	[enh] extend proxy docs with https proxy support	6 days ago
queue	Add Request.Do() method which repects AllowURLRevisit	6 days ago
storage	[mod] simplify the cookie layer in storage interface	2 months ago
.codecov.yml	turn off codecov comments	4 months ago
.travis.yml	[mod] update go versions for travis tests	27 days ago
CHANGELOG.md	[enh] v1.0.0	2 days ago
CONTRIBUTING.md	Update CONTRIBUTING.md	5 months ago
LICENSE.txt	[enh] add request & response callbacks ++ cookie handling ++ readme	8 months ago
README.md	[doc] add gowap to projects	28 days ago
VERSION	[enh] v1.0.0	2 days ago
colly.go	[fix] reduce cyclomatic complexity	a day ago
colly_test.go	Add support to URL regexp blacklisting	18 days ago
context.go	[mod] add license header	2 months ago
context_test.go	[mod] add license header	2 months ago
htmlelement.go	[mod] lint: shorten []string declaration	a month ago
http_backend.go	[mod] add license header	2 months ago
request.go	[fix] invert AllowURLRevisit parameter in request.Do()	2 days ago
response.go	[fix] valid response filename handling - closes #119	a month ago
unmarshal.go	[mod] add license header	2 months ago
unmarshal_test.go	[mod] add license header	2 months ago
xmlelement.go	fix ChildText panic if xpath does not exist in xml	5 days ago
xmlelement_test.go	[mod] add license header	2 months ago

README.md

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Features

Clean API
Fast (>1k requests/sec on a single core)
Manages request delays and maximum concurrency per domain
Automatic cookie and session handling
Sync/async/parallel scraping
Caching
Automatic encoding of non-unicode responses
Robots.txt support
Distributed scraping
Configuration via environment variables
Extensions

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log each request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See examples folder for more detailed examples.
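
The example passes the raw `href` attribute to `e.Request.Visit`, and `href` values are often relative. Colly is understood to resolve such links against the URL of the page being processed; the sketch below shows that resolution step on its own with the standard library's `net/url`, without using Colly. The helper name `resolveLink` and the `/docs/intro/` and `articles/first` paths are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink (hypothetical helper) resolves href, which may be relative,
// against the URL of the page it was found on and returns an absolute URL.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the RFC 3986 reference resolution rules.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://go-colly.org/docs/intro/" // assumed page URL
	hrefs := []string{
		"/docs/",         // root-relative
		"articles/first", // relative to the current path
		"https://github.com/gocolly/colly", // already absolute
	}
	for _, href := range hrefs {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skipping", href, "because of", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

Resolving links before visiting them also makes it straightforward to compare hosts or deduplicate URLs, since every link is then in absolute form.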

Installation

go get -u github.com/gocolly/colly/...

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

greenpeace/check-my-pages Scraping script to test the Spanish Greenpeace web archive
altsab/gowap Wappalyzer implementation in Go

If you are using Colly in a project please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! [Become a backer]

License


Jay Taylor's notes
