Jay Taylor's notes


Top 50 open source web crawlers for data mining


Original source (bigdata-madesimple.com)

Tags: open-source web-crawler bigdata-madesimple.com

Clipped on: 2016-08-08

Big Data - Made Simple

Data Mining

Baiju NT


23rd Jan '15, 02:22 PM in Data Mining


A web crawler (also known as an ant, automatic indexer, bot, web spider, web robot or web scutter) is an automated program or script that methodically scans, or “crawls”, web pages to build an index of the data it is set to look for. This process is called web crawling or spidering.
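The crawl-and-index loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any crawler in the list below is implemented: the names `PageParser`, `crawl` and `fetch` are hypothetical, the "index" is a plain word-to-URL mapping, and the pages come from an in-memory dictionary on the made-up host `example.test` so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageParser(HTMLParser):
    """Collects link targets and text words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every <a> tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Naive tokenisation: lower-case, split on whitespace.
        self.words.extend(data.lower().split())


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns {word: set of URLs containing it}.

    `fetch` is an assumed callable mapping a URL to an HTML string,
    or to None when the page is unavailable.
    """
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited += 1
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index


# Toy "web" held in memory (hypothetical pages).
PAGES = {
    "http://example.test/": '<a href="/a">crawler</a> home',
    "http://example.test/a": '<a href="/">back</a> spider',
}
index = crawl("http://example.test/", PAGES.get)
print(sorted(index["spider"]))  # → ['http://example.test/a']
```

A real crawler would replace `PAGES.get` with an HTTP client and would also need politeness rules such as honouring robots.txt and rate limits, which this sketch leaves out.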

Web crawlers have various uses, but essentially a web crawler collects or mines data from the Internet. Most search engines use crawlers to keep their data up to date and to find what’s new on the Internet. Analytics companies and market researchers use web crawlers to determine customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available on the web for data mining.

Name	Language	Platform
Heritrix	Java	Linux
Nutch	Java	Cross-platform
Scrapy	Python	Cross-platform
DataparkSearch	C++	Cross-platform
GNU Wget	C	Linux
GRUB	C#, C, Python, Perl	Cross-platform
ht://Dig	C++	Unix
HTTrack	C/C++	Cross-platform
ICDL Crawler	C++	Cross-platform
mnoGoSearch	C	Windows
Norconex HTTP Collector	Java	Cross-platform
Open Source Server	C/C++, Java, PHP	Cross-platform
PHP-Crawler	PHP	Cross-platform
YaCy	Java	Cross-platform
WebSPHINX	Java	Cross-platform
WebLech	Java	Cross-platform
Arale	Java	Cross-platform
JSpider	Java	Cross-platform
HyperSpider	Java	Cross-platform
Arachnid	Java	Cross-platform
Spindle	Java	Cross-platform
Spider	Java	Cross-platform
LARM	Java	Cross-platform
Metis	Java	Cross-platform
SimpleSpider	Java	Cross-platform
Grunk	Java	Cross-platform
CAPEK	Java	Cross-platform
Aperture	Java	Cross-platform
Smart and Simple Web Crawler	Java	Cross-platform
Web Harvest	Java	Cross-platform
Aspseek	C++	Linux
Bixo	Java	Cross-platform
crawler4j	Java	Cross-platform
Ebot	Erlang	Linux
Hounder	Java	Cross-platform
Hyper Estraier	C/C++	Cross-platform
OpenWebSpider	C#, PHP	Cross-platform
Pavuk	C	Linux
Sphider	PHP	Cross-platform
Xapian	C++	Cross-platform
Arachnode.net	C#	Windows
Crawwwler	C++	Java
Distributed Web Crawler	C, Java, Python	Cross-platform
iCrawler	Java	Cross-platform
pycreep	Java	Cross-platform
Opese	C++	Linux
Andjing	Java
Ccrawler	C#	Windows
WebEater	Java	Cross-platform
JoBo	Java	Cross-platform


