Jay Taylor's notes


Show HN: Till – Unblock and scale your web scrapers, with minimal code changes | Hacker News

Original source (news.ycombinator.com)
Tags: scraping till news.ycombinator.com
Clipped on: 2021-10-08

> Till helps you circumvent being detected as a web scraper by identifying your scraper as a real web browser. It does this by generating random user-agent headers and randomizing proxy IPs (that you supply) on every HTTP request.
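The rotation the README describes can be sketched in a few lines of stdlib Python. The user-agent strings and proxy addresses below are placeholder examples, not Till's actual pools:

```python
import random
import urllib.request

# Placeholder pools -- with Till, the proxies are user-supplied.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
PROXIES = ["http://203.0.113.10:3128", "http://203.0.113.11:3128"]

def pick_identity():
    """Choose a fresh user agent and proxy for the next request."""
    proxy = random.choice(PROXIES)
    return {"User-Agent": random.choice(USER_AGENTS)}, proxy

def fetch(url):
    """Fetch `url`, appearing as a different browser/IP each time."""
    headers, proxy = pick_identity()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(urllib.request.Request(url, headers=headers), timeout=10)
```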

How did we start with consent using robots.txt and end up here?

A good neighbor doesn't use circumvention.

We ended up here since some people wrongly decided that only Google has a right to scrape.

I've run sites that have a lot of pages where 80%+ of the traffic is web crawlers.

Google sends some traffic so I can afford to let them scrape. Bing crawls almost as much as Google but sends 5% as much traffic. Baidu crawls more than Google and never sends a single visitor.

I hate reinforcing the Google monopoly, but a crawler that doesn't send any traffic is expensive to serve.

You might want to ask yourself, or your readers, what it is people are trying to access on your site that they cannot by other means.

The interfaces for many sites actively and with brutal effectiveness deny ready access to any content not currently featured on the homepage or stream feed. Search features are frequently nonexistent, crippled, or dysfunctional.

Last week I found myself stumbling across a now-archived radio programme on a website which afforded exceedingly poor access to the content. The show ran weekly for over a decade, with 531 episodes. Those are visible ... ten at a time ... through either the Archive or Search features.

Scraping the site gives me the full archive listing, all 11 years, in a single webpage, that loads in under a second. I can search that by date, title, or guest to find episodes of interest.

The utility of this (a few hours work on my part) is much higher than that of the site itself.

Often current web sites / apps are little more than wrappers around a JSON delivery-and-parsing engine. Dumping the raw JSON can be much more useful for a technical user. (Reddit, Diaspora, Mastodon, and Ello are several sites in which this is at least somewhat possible.)

Much of the suck is imposed by monetisation schemes. One project of mine, decrufting the Washington Post's website, resulted in article pages with two percent of the weight of the originally-delivered payload. The de-cruftified version strips not only extraneous JS and CSS, but nags and teasers which are really nothing but distraction to me. Again, that's typical.

I'm aware that many scrapers are not benign. More than you might think are, and the fact that casual scraping is a problem for your delivery system reflects more poorly on it than them.

Ad junk, sidebar junk, JavaScript junk, popup-window junk, and the outlook that the most precious resource in the world is a few seconds of your distracted attention are what motivate the Washington Post and most of the commercial web.

What you are doing stripping out the junk threatens those organizations at the core.

On mobile, ads, trackers and all that crap cost the consumer more than the ads make.

If mobile phone companies kicked back a fraction of the revenue they get to content creators they'd be better paid than they are now and Verizon would get the love that it has sought in vain. (e.g. who would say a bad word about the phone company?)

That's my argument.

Global ad spend, which mostly accrues to the wealthiest 1 billion or so, is about $600 billion. Some complex maths tells us that's $600 per person in the industrialised countries (G-20 / OECD, close enough). Global content spend is somewhere around $100 -- 200/year per capita. That's roughly the annual online ad spend.

Bundled into network provisioning, that's about $30--40 per household per month, all-you-can-eat. Information as a public good.

(My preference is for higher rates in more affluent areas, ideally by income.)

Trying to figure out WCPGW.

My personal model (emerging, there is a manifesto but I am rewriting it as we speak) is to rigorously control costs, focus on quality, stay small.

Think of the old phone company slogan "reach out and touch someone." If I can accomplish that and spend less than I do on food or clothes or my car then I win.

I'd be interested in seeing what you're developing.

The challenge, as I see it, is that information is a public good (in the economic sense: nonrivalrous, nonexcludable, zero marginal cost, high fixed costs), and provision at scale requires either a complementary rents service (advertising, patronage, propaganda, fancy professional-services "shingle") or a tax. Busking or its public-broadcasting equivalent is another option, though that's highly lossy.

Any truthful publishing also requires a strong self-defence mechanism (protection against lawsuits, coercion, intimidation, protection rackets, etc.), a frequently underappreciated role played by publishers.

Charles Perrow's description of the music industry (recorded and broadcast) circa 1945 -- 1985 is informative here (see his Complex Organizations https://www.worldcat.org/title/complex-organizations-a-criti...), notably the roles of publishers vs. front-line and studio musicians.

Not sure why you are attacking a poster specifically talking about Google, Bing and Baidu doing massive scraping. What you are talking about is something entirely different.

I don't feel attacked. I also don't blame him for being inflamed about the problem he's been inflamed at because I am inflamed about it too!

Fortunately I think we both managed to realise that before too many rounds of this ;-)

Bing powers other search engines like DuckDuckGo, Ecosia, and Yahoo!. But I’m sure that even cumulatively the numbers are still small.

Then Google decided that no one has a right to scrape Google.

The web needs a more efficient system for distributing an index of its content. Having web developers design websites in a gazillion different permutations, when they are all basically doing the same thing, and then having a handful of companies "scrape" them is neither an efficient nor sensibly-designed system.

The web (more generally, the internet) is a giant copy machine. Google, NSA and others have copies. Yet if we were to allow everyone to have copies by facilitating this through technical means (e.g., Wikipedia-style data dumps), many folks would panic. When Google first started indexing, it was not a business, and many folks did panic and there were many complaints. It's 2021; folks are still spooked by others being able to easily copy what they publish to the web. However it's OK to them if it's Google doing the copying. If there were healthy competition and many search engines to choose from, if one search engine did not have the lion's share of web traffic, it's doubtful Google would be given "exclusive" permission in robots.txt.

We ended up here, because a) the promise of easy to use structured data - that is inherent in data processing systems - has not been fulfilled: you have to scrape HTML to reconstruct some relational table, and b) information, despite being almost free to transmit in large amounts today, is treated as if it were a physical good, and each copy costs money. There is just lots of money in there, because our economic system looks more backwards than forwards.

If a real user can access the content then that user should be able to delegate the work to a machine.

But what about when there are separate interfaces specifically designed for real users and machines?

Case in point: OpenStreetMap (run by volunteer sysadmins) provides completely open, machine-readable data dumps. You can use these to set up your own geocoding, rendering or similar service. There is copious documentation, several post-processed dumps for particular purposes, etc. etc.

OSM also provides a friendly, human-scale, user-facing interface for map browsing and search. There are clearly documented limitations/TOUs for automated use of this interface.

Does that prevent people from setting up massive scrapers to scrape the human-facing interface, rather than using the machine-facing data dumps? No, it does not; and the volunteer sysadmins have to spend an inordinate amount of time protecting the service against these people.

DataHen's proudly admitted practices ("No need to worry about IP bans, we auto rotate IPs on any requests that are made."; "our massive pool of auto-rotating proxies, user agents, and 'secret-sauce' helps get around them") are directly antithetical to this sort of scenario. I find this incredibly irresponsible and unethical.

Almost always the API is nerfed relative to the web site.

Almost all web sites that authenticate use "submit a form with username and password and respect cookies"; often sites that don't require authentication to use the web site still require it for the API. Every API uses a different authentication scheme and requires custom programming; for web sites, you have the URL of the form and the names of the username and password fields and you are done.
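The form-based flow described above can be sketched with the stdlib alone. The `username`/`password` field names are configurable defaults here, not any particular site's:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def encode_credentials(username, password,
                       user_field="username", pass_field="password"):
    """Build the form body; only the field names vary between sites."""
    return urllib.parse.urlencode(
        {user_field: username, pass_field: password}
    ).encode()

def login(form_url, username, password, **fields):
    """POST the login form and return an opener whose cookie jar now
    holds the session cookie -- reuse it for authenticated requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    opener.open(form_url, data=encode_credentials(username, password, **fields))
    return opener
```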

If you feed most HTML pages through a DOM parser like BeautifulSoup you can extract the links and interpret them through regexes. You might be done right then and there. If you need more, usually you can use CSS classes and IDs and... done!
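A stdlib-only sketch of that pattern, with `html.parser` standing in for BeautifulSoup; the `/episode/<id>` URL scheme is a made-up example:

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags (BeautifulSoup's
    soup.find_all('a') does the same with less code)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def episode_links(html, pattern=r"/episode/(\d+)"):
    """Extract links, then interpret them with a regex --
    here a hypothetical /episode/<id> URL scheme."""
    parser = LinkExtractor()
    parser.feed(html)
    rx = re.compile(pattern)
    return [(m.group(1), href) for href in parser.links
            if (m := rx.search(href))]
```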

I wrote a metadata extractor for Wikimedia HTML that I had working provisionally on Flickr HTML immediately and had working at 100% in 20 minutes. No way I could have gotten the authentication working for an API in 20 minutes.

Of course if there is another, unlimited way of getting the data, then it's perfectly fine to "redirect" people to another avenue.

What this conversation is about (since we're talking about web scrapers here, not "data downloaders" or whatever you would call it) is when there is no other avenue to get the data, you should be able to access the same data via a machine as you could when you're a person.

Sure, but Till/DataHen appears to have a host of measures in place to ignore that "redirect".

Yes, how are they supposed to know if the "redirect" is good or not? Facebook would helpfully redirect you to their Terms and Conditions, something you definitely should be able to ignore while scraping your data to your heart's content anyway.

Similarly to HTTP, the tool does not decide if usage is "nice" or "evil", only the user with their use case decides that.

>you should be able to access the same data via a machine as you could when you're a person

I generally agree with this, but I do see the problem for certain spaces.

Concert tickets, for example. People write scrapers to get the best seats and sit on them so they can scalp them later at inflated prices. Or other first-come/first-serve situations, online auction "sniping", etc.

Same thing applies here as I said in another comment (https://news.ycombinator.com/item?id=28060567), some people (me included) also write scrapers for getting the best seats in concerts/shows as I'm a big fan of the artist and really want to go there with good seats.

Perhaps it would be better to not gate by whether you happen to have visited a website fast enough relative to other people. For auctions, at least, there is a simple solution in the form of extending the auction timer whenever someone bids.
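The anti-sniping idea is simple to sketch; the five-minute extension window is an arbitrary choice:

```python
from datetime import datetime, timedelta

class Auction:
    """Anti-sniping: any bid landing near the deadline pushes the
    deadline back, so a last-second bot bid gains nothing."""
    def __init__(self, close_at, extension=timedelta(minutes=5)):
        self.close_at = close_at
        self.extension = extension

    def place_bid(self, now):
        if now >= self.close_at:
            return False                          # auction already closed
        if self.close_at - now < self.extension:
            self.close_at = now + self.extension  # extend the timer
        return True
```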

Technically, this is the same rabbit hole that leads to DRM and device intent superseding user intent.

Because that's the only way to be sure.

It's unfortunate that it also allows havoc and burdens good services. But device-over-user is not a future I want to live in.

Surely you must have a way to throttle human-originated abuse (such as someone spamming F5)? If so then this would work equally well for the machine.

Determine a reasonable rate limit and apply it to the human-facing version, with maybe a link to the machine-readable version in the error message?
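That shared limit could be a plain token bucket applied to every client, human or bot alike, with the error response pointing at the machine-readable endpoint. The rates here are arbitrary:

```python
import time

class RateLimiter:
    """Token bucket: allow up to `rate` requests per `per` seconds."""
    def __init__(self, rate=10, per=60.0, now=None):
        self.rate = rate
        self.per = per
        self.allowance = float(rate)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens in proportion to elapsed time, capped at `rate`.
        self.allowance = min(
            self.rate,
            self.allowance + (now - self.last) * self.rate / self.per,
        )
        self.last = now
        if self.allowance < 1.0:
            return False          # over the limit: serve 429 plus a link
        self.allowance -= 1.0     # to the machine-readable version
        return True
```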

Says who?

And only a few paragraphs down from a heading that says

> Till was architected to follow best practices that DataHen has accumulated

I know there are plenty of sites out there out to prevent any scraping whatsoever, or to improperly prevent some situations most of us would agree are reasonable behavior, but this appears to be blatantly hostile to web admins out there.

I agree there are cases where entities don't want things scraped that really should be scraped to preserve the history they'd like to rewrite.

My concern is that tools like these that use circumvention by default become the go-to when someone needs scraping, and make life hell for those of us running sites on hardware that's enough to service our small customer base but not an army of bots.

Admin of a 5000+ website platform (with customer inventory) here. This is the exact kind of thing I've been working so hard to block lately.

Since Feb 2021 we've seen a substantial increase in scraping of our customers' inventory to the point we now have more bot traffic than human traffic. Annoyingly, only 5-10 requests come from a single IP so we've had to resort to always challenging requests that come from certain ASNs (typically those owned by datacenters).

This type of project frustrates me because it knowingly goes against a site's stated bot policy (via robots.txt).
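For comparison, honouring robots.txt takes one stdlib call; the rules below are an illustrative example:

```python
import urllib.robotparser

def allowed(robots_txt, user_agent, url):
    """Check a fetch against the site's published robots.txt policy
    before making it -- the opposite of what circumvention tools do."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```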

I work on a lot of web scraping and we have business agreements with every site that we scrape explicitly allowing us to do so and with pre-approval for the scraping rate (which we carefully control).

None of this gets around over-eager Cloudflare or Akamai rules set up years ago by some contractor that the businesses have no real ability to change.

Why scrape at all if there are agreements in place? Seems like an API task.

> None of this gets around over-eager Cloudflare or Akamai rules set up years ago by some contractor that the businesses have no real ability to change.

If the business has no ability to even change the CloudFlare settings I don't expect them to be able to provide an API.

Only about 1 in 20 of our partners has the in-house tech necessary to provide us an API. In other words, we were able to accelerate our integrations by 20x by scraping.

In that case, the web site IS the API… the only difference is that the data is wrapped in HTML, not JSON.

Good neighbors don't track people's every move online either, but here we are.

In order for the CIA to continue its legendary fight against transphobia, it is important for them to use these technologies to gather OSINT on regions that house millions of backwards non-progressives and all of their content.

I can't believe HN is such a hotbed for transphobics like this to make comments without moderation or punishment.

Why is this on Github if I still need an API and cannot run it locally since the actual code is not public?

Or is this some sort of "source available" content marketing nonsense?

This sounds technically interesting but ethically dubious; also requires sign-up - even though free - which (I think?) is discouraged for Show HN.

If you are able to show how this service is administered responsibly with due regard for those running websites and blocking for good reason, best of luck with it.

> This sounds technically interesting but ethically dubious

> If you are able to show how this service is administered responsibly

Imagine if these were the comments about HTTP when it was first launched (honestly, there surely were some, but not as many as the praise).

Protocols and tools are not ethically dubious or responsible for anything. The users who use those protocols and tools are. Anything that can be used for good/bad can also be used for bad/good.

> also requires sign-up - even though free - which (I think?) is discouraged for Show HN.

I don't doubt it's 'community discouraged' because people don't want to have to, but I think the only rule is that there must be something to try out - no announcement-only, sign up for wait-list, hype-building type thing.

> also requires sign-up - even though free - which (I think?) is discouraged for Show HN.

Look at past Show HNs and tell me how many respect this. It's such an obsolete rule that it's irrelevant and almost OT to even comment on.

What's the difference between the linked repo and the hosted service at https://till.datahen.com/api/v1 ? Why do I need to sign up for a token if I am hosting my own server? Or otherwise, why do I need to host a server if I'm just calling your APIs? What sort of scraped data is exposed to the till hosted service?

We don't have a hosted service at the moment. It's all run on premise. The API link you mentioned is for a Till instance to validate the token and to report usage stats for each user. The auth token validates each user's free and premium features and limits based on usage stats, specifically total request counts and the cache-hit counter. We don't record or track anything related to the requests made through Till.

This looks really interesting, I'm going to bookmark it!

> Proxy IP address rotation

I wonder where you would get a decent proxy list nowadays. In the late 90s, you could find lists of accidental open proxies that "hackers" collected. But nowadays I've only seen really shady "residential proxies" that are basically private PCs with malware. Is there a decent source for proxies that are not widely blocked and not criminal and not too expensive?

And by the way, while many people here question the morality of scraping, I have at least two legitimate use cases:

1) I built a little "vertical" search engine for a niche, using a combination of scraping and RSS. It doesn't consume much traffic and does really help people discover content in this niche.

2) A friend does research on urban development and rent prices and they scrape rent ads to get their data (or let students type it off manually, I'm not sure...)

I run a number of proxies for (legally) scraping government data sites with overly restrictive views/hour policies.

They're just squid boxes spun up on various cheap VPS providers. A $5 digitalocean instance will power a lot of squid traffic, and as long as you stay away from AWS, the IP ranges tend not to be banned.

Just make sure to lock down your allowed source list, so you don't become one of those hacked open proxies!
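A minimal sketch of such a locked-down squid.conf; the source IP is a placeholder for your scraper's address:

```
# /etc/squid/squid.conf -- minimal private forward proxy
acl scraper src 198.51.100.7   # placeholder: ONLY your scraper's IP
http_access allow scraper
http_access deny all           # refuse everyone else: not an open proxy
http_port 3128
via off                        # don't announce the proxy in headers
forwarded_for delete           # don't leak the real client IP upstream
```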

If anyone needs a screenshot bot: https://github.com/kordless/grub-2.0

Still need to do some work on it, but it does do the job!

> Till was architected to follow best practices that DataHen has accumulated over the years of scraping at a massive scale.

The best practice is not to build a business that's built on evading the security of the people you're depending on.

Just get into crime if you want to do that.

Does this have any sort of JavaScript emulation/support? Or is it purely HTTP requests?

The GID signature piece is especially interesting. I have run into blockers in scaling scrapers in the past, and this sort of organizational/tracking platform sounds awesome.

You can connect your local browser to Till. But you need to have your browser accept a custom CA (Certificate Authority). Instructions here: https://till.datahen.com/docs/installation#ca

What kinds of scraping are being talked about here? As a maintainer of many a website I hate seeing so much bot traffic these days. So many sketchy foreign things. And then throw in the spam bots, vulnerability-testing bots, random URLs, etc. Sigh. Filling up the logs.

Almost totally dependent on Cloudflare these days to mitigate, as an in-between.

I have simple PHP code that adds misbehaving IP addresses to iptables. Keeps the logs clean, because they can't even connect over port 80.
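The same idea in a Python sketch (the original is PHP; the log format and ban threshold here are assumptions): count hits per source IP and drop repeat offenders at the firewall.

```python
import subprocess
from collections import Counter

def offenders(log_lines, threshold=20):
    """Return IPs (assumed to be the first field of each access-log
    line) that appear `threshold` or more times."""
    hits = Counter(line.split()[0] for line in log_lines if line.strip())
    return [ip for ip, n in hits.items() if n >= threshold]

def ban(ip):
    """Insert an iptables DROP rule for `ip` (requires root)."""
    subprocess.run(
        ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
        check=True,
    )

# Usage: for ip in offenders(open("/var/log/access.log")): ban(ip)
```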

Looks extremely useful. Starred for if I ever need to do scraping

Can this bypass cloudflare?

These guys bypass cloudflare https://microlink.io it's open source but it's just more convenient to pay and forget about it.

Cloudflare have a few configuration options for the level of bot protection, so it's probably more a question of what percentage would be blocked for a particular site.

This is entirely correct.

It will also depend on the proxy's ASN, the rate of requests, the ability to solve captchas and the presence of an actual js environment in the browser (as opposed to, say, curl - Cloudflare uses a javascript environment check script to detect puppeteer/selenium etc.)

> repository contains research from CloudFlare's AntiDDoS, JS Challenge, Captcha Challenges, and CloudFlare WAF.


Sometimes I, a human being, can’t bypass Cloudflare
