Jay Taylor's notes

gocolly/colly

Original source (github.com)
Tags: golang go scraping crawling github.com
Clipped on: 2018-05-14
Go
.github Added .github/ISSUE_TEMPLATE.md (optional) 6 months ago
_examples [fix] update instagram scraper according to site changes 2 days ago
cmd/colly [mod] add license header 2 months ago
debug [mod] add license header 2 months ago
extensions [fix] package comment should be of the form "Package extensions ..." 2 months ago
proxy [enh] extend proxy docs with https proxy support 6 days ago
queue Add Request.Do() method which repects AllowURLRevisit 6 days ago
storage [mod] simplify the cookie layer in storage interface 2 months ago
.codecov.yml turn off codecov comments 4 months ago
.travis.yml [mod] update go versions for travis tests 27 days ago
CHANGELOG.md [enh] v1.0.0 2 days ago
CONTRIBUTING.md Update CONTRIBUTING.md 5 months ago
LICENSE.txt [enh] add request & response callbacks ++ cookie handling ++ readme 8 months ago
README.md [doc] add gowap to projects 28 days ago
VERSION [enh] v1.0.0 2 days ago
colly.go [fix] reduce cyclomatic complexity a day ago
colly_test.go Add support to URL regexp blacklisting 18 days ago
context.go [mod] add license header 2 months ago
context_test.go [mod] add license header 2 months ago
htmlelement.go [mod] lint: shorten []string declaration a month ago
http_backend.go [mod] add license header 2 months ago
request.go [fix] invert AllowURLRevisit parameter in request.Do() 2 days ago
response.go [fix] valid response filename handling - closes #119 a month ago
unmarshal.go [mod] add license header 2 months ago
unmarshal_test.go [mod] add license header 2 months ago
xmlelement.go fix ChildText panic if xpath does not exist in xml 5 days ago
xmlelement_test.go [mod] add license header 2 months ago

README.md

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Features

  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log each URL just before it is requested
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See the examples folder (_examples) for more detailed examples.

Installation

go get -u github.com/gocolly/colly/...

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License
