Jay Taylor's notes

Top 50 open source web crawlers for data mining

Original source (bigdata-madesimple.com)
Tags: open-source web-crawler bigdata-madesimple.com
Clipped on: 2016-08-08

23rd Jan '15, 02:22 PM in Data Mining

A web crawler (also known as an ant, automatic indexer, bot, web spider, web robot, or web scutter) is an automated program or script that methodically scans, or “crawls”, web pages to build an index of the data it is set to look for. This process is called web crawling or spidering.
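
To make the idea of crawling pages and building an index concrete, here is a minimal sketch in Python. It is not how any crawler in the list below is implemented. The `crawl` function, the `fetch` callable, and the `PAGES` dictionary are hypothetical names introduced for illustration, and the "web" is a small in-memory dictionary so the example runs without network access. The crawl is breadth-first: it visits a page, records which words appear on it, and queues any links it has not seen yet.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags and the text words of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns {word: set of URLs}.

    `fetch` is a hypothetical callable that maps a URL to its HTML text,
    or returns None when the page cannot be retrieved.
    """
    index = {}
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited += 1
        parser = LinkParser()
        parser.feed(html)
        parser.close()          # flush any buffered trailing text
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Toy in-memory "web" (an assumption for this sketch, not real pages).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> home',
    "http://example.test/a": '<a href="/">back</a> crawler notes',
    "http://example.test/b": '<a href="/a">A</a> crawler index',
}

index = crawl("http://example.test/", PAGES.get)
print(sorted(index["crawler"]))
```

A real crawler would replace `PAGES.get` with an HTTP client and would also need politeness delays, robots.txt handling, and error handling, all of which this sketch leaves out.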

Web crawlers have various uses, but essentially they collect and mine data from the Internet. Most search engines use them to keep their data up to date and to discover what is new on the Internet. Analytics companies and market researchers use web crawlers to determine customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available on the web for data mining.

Name                          Language             Platform
Heritrix                      Java                 Linux
Nutch                         Java                 Cross-platform
Scrapy                        Python               Cross-platform
DataparkSearch                C++                  Cross-platform
GNU Wget                      C                    Linux
GRUB                          C#, C, Python, Perl  Cross-platform
ht://Dig                      C++                  Unix
HTTrack                       C/C++                Cross-platform
ICDL Crawler                  C++                  Cross-platform
mnoGoSearch                   C                    Windows
Norconex HTTP Collector       Java                 Cross-platform
Open Source Server            C/C++, Java, PHP     Cross-platform
PHP-Crawler                   PHP                  Cross-platform
YaCy                          Java                 Cross-platform
WebSPHINX                     Java                 Cross-platform
WebLech                       Java                 Cross-platform
Arale                         Java                 Cross-platform
JSpider                       Java                 Cross-platform
HyperSpider                   Java                 Cross-platform
Arachnid                      Java                 Cross-platform
Spindle                       Java                 Cross-platform
Spider                        Java                 Cross-platform
LARM                          Java                 Cross-platform
Metis                         Java                 Cross-platform
SimpleSpider                  Java                 Cross-platform
Grunk                         Java                 Cross-platform
CAPEK                         Java                 Cross-platform
Aperture                      Java                 Cross-platform
Smart and Simple Web Crawler  Java                 Cross-platform
Web Harvest                   Java                 Cross-platform
Aspseek                       C++                  Linux
Bixo                          Java                 Cross-platform
crawler4j                     Java                 Cross-platform
Ebot                          Erlang               Linux
Hounder                       Java                 Cross-platform
Hyper Estraier                C/C++                Cross-platform
OpenWebSpider                 C#, PHP              Cross-platform
Pavuk                         C                    Linux
Sphider                       PHP                  Cross-platform
Xapian                        C++                  Cross-platform
Arachnode.net                 C#                   Windows
Crawwwler                     C++                  Java
Distributed Web Crawler       C, Java, Python      Cross-platform
iCrawler                      Java                 Cross-platform
pycreep                       Java                 Cross-platform
Opese                         C++                  Linux
Andjing                       Java
Ccrawler                      C#                   Windows
WebEater                      Java                 Cross-platform
JoBo                          Java                 Cross-platform



Copyright © 2016 Crayon Data. All rights reserved.