Jay Taylor's notes


Show HN: Till – Unblock and scale your web scrapers, with minimal code changes | Hacker News

Original source (news.ycombinator.com)
Tags: scraping till news.ycombinator.com
Clipped on: 2021-10-08

> Till helps you circumvent being detected as a web scraper by identifying your scraper as a real web browser. It does this by generating random user-agent headers and randomizing proxy IPs (that you supply) on every HTTP request.
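The rotation the README describes can be sketched in a few lines of stdlib Python. The user-agent strings and proxy addresses below are placeholder examples, not Till's actual pools:

```python
import random
import urllib.request

# Placeholder pools -- with Till, the proxies are user-supplied.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
PROXIES = ["http://203.0.113.10:3128", "http://203.0.113.11:3128"]

def pick_identity():
    """Choose a fresh user agent and proxy for the next request."""
    proxy = random.choice(PROXIES)
    return {"User-Agent": random.choice(USER_AGENTS)}, proxy

def fetch(url):
    """Fetch `url`, appearing as a different browser/IP each time."""
    headers, proxy = pick_identity()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(urllib.request.Request(url, headers=headers), timeout=10)
```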

How did we start with consent using robots.txt and end up here?

A good neighbor doesn't use circumvention.

We ended up here since some people wrongly decided that only Google has a right to scrape.

I've run sites that have a lot of pages where 80%+ of the traffic is web crawlers.

Google sends some traffic so I can afford to let them scrape. Bing crawls almost as much as Google but sends 5% as much traffic. Baidu crawls more than Google and never sends a single visitor.

I hate reinforcing the Google monopoly, but a crawler that doesn't send any traffic is expensive to serve.

You might want to ask yourself, or your readers, what it is people are trying to access on your site that they cannot by other means.

The interfaces for many sites actively and with brutal effectiveness deny ready access to any content not currently featured on the homepage or stream feed. Search features are frequently nonexistent, crippled, or dysfunctional.

Last week I found myself stumbling across a now-archived radio programme on a website which afforded exceedingly poor access to the content. The show ran weekly for over a decade, with 531 episodes. Those are visible ... ten at a time ... through either the Archive or Search features.

Scraping the site gives me the full archive listing, all 11 years, in a single webpage, that loads in under a second. I can search that by date, title, or guest to find episodes of interest.

The utility of this (a few hours work on my part) is much higher than that of the site itself.

Often current web sites / apps are little more than wrappers around a JSON delivery-and-parsing engine. Dumping the raw JSON can be much more useful for a technical user. (Reddit, Diaspora, Mastodon, and Ello are several sites in which this is at least somewhat possible.)

Much of the suck is imposed by monetisation schemes. One project of mine, decrufting the Washington Post's website, resulted in article pages with two percent of the weight of the originally-delivered payload. The de-cruftified version strips not only extraneous JS and CSS, but nags and teasers which are really nothing but distraction to me. Again, that's typical.

I'm aware that many scrapers are not benign. More than you might think are, and the fact that casual scraping is a problem for your delivery system reflects more poorly on it than them.

Ad junk, sidebar junk, JavaScript junk, popup-window junk, and the outlook that the most precious resource in the world is a few seconds of your distracted attention are what motivate the Washington Post and most of the commercial web.

What you are doing stripping out the junk threatens those organizations at the core.

On mobile, ads, trackers and all that crap cost the consumer more than the ads make.

If mobile phone companies kicked back a fraction of the revenue they get to content creators they'd be better paid than they are now and Verizon would get the love that it has sought in vain. (e.g. who would say a bad word about the phone company?)

That's my argument.

Global ad spend, which mostly accrues to the wealthiest 1 billion or so, is about $600 billion. Some complex maths tells us that's $600 per person in the industrialised countries (G-20 / OECD, close enough). Global content spend is somewhere around $100 -- 200/year per capita. That's roughly the annual online ad spend.

Bundled into network provisioning, that's about $30--40 per household per month, all-you-can-eat. Information as a public good.

(My preference is for higher rates in more affluent areas, ideally by income.)

Trying to figure out WCPGW.

My personal model (emerging, there is a manifesto but I am rewriting it as we speak) is to rigorously control costs, focus on quality, stay small.

Think of the old phone company slogan "reach out and touch someone." If I can accomplish that and spend less than I do on food or clothes or my car then I win.

I'd be interested in seeing what you're developing.

The challenge, as I see it, is that information is a public good (in the economic sense: nonrivalrous, nonexcludable, zero marginal cost, high fixed costs), and provision at scale requires either a complementary rents service (advertising, patronage, propaganda, fancy professional-services "shingle") or a tax. Busking or its public-broadcasting equivalent is another option, though that's highly lossy.

Any truthful publishing also requires a strong self-defence mechanism (protection against lawsuits, coercion, intimidation, protection rackets, etc.), a frequently underappreciated role played by publishers.

Charles Perrow's description of the music industry (recorded and broadcast) circa 1945 -- 1985 is informative here (see his Complex Organizations https://www.worldcat.org/title/complex-organizations-a-criti...), notably the roles of publishers vs. front-line and studio musicians.

Not sure why you are attacking a poster specifically talking about Google, Bing and Baidu doing massive scraping. What you are talking about is something entirely different.

I don't feel attacked. I also don't blame him for being inflamed about the problem he's been inflamed at because I am inflamed about it too!

Fortunately I think we both managed to realise that before too many rounds of this ;-)

Bing powers other search engines like DuckDuckGo, Ecosia, and Yahoo!. But I’m sure that even cumulatively the numbers are still small.

Then Google decided that no one has a right to scrape Google.

The web needs a more efficient system for distributing an index of its content. Having web developers design websites in a gazillion different permutations, when they are all basically doing the same thing, and then having a handful of companies "scrape" them is neither an efficient nor sensibly-designed system.

The web (more generally, the internet) is a giant copy machine. Google, NSA and others have copies. Yet if we were to allow everyone to have copies by facilitating this through technical means (e.g., Wikipedia-style data dumps), many folks would panic. When Google first started indexing, it was not a business, and many folks did panic and there were many complaints. It's 2021; folks are still spooked by others being able to easily copy what they publish to the web. However it's OK to them if it's Google doing the copying. If there were healthy competition and many search engines to choose from, if one search engine did not have the lion's share of web traffic, it's doubtful Google would be given "exclusive" permission in robots.txt.

We ended up here, because a) the promise of easy to use structured data - that is inherent in data processing systems - has not been fulfilled: you have to scrape HTML to reconstruct some relational table, and b) information, despite being almost free to transmit in large amounts today, is treated as if it were a physical good, and each copy costs money. There is just lots of money in there, because our economic system looks more backwards than forwards.

If a real user can access the content then that user should be able to delegate the work to a machine.

But what about when there are separate interfaces specifically designed for real users and machines?

Case in point: OpenStreetMap (run by volunteer sysadmins) provides completely open, machine-readable data dumps. You can use these to set up your own geocoding, rendering or similar service. There is copious documentation, several post-processed dumps for particular purposes, etc. etc.

OSM also provides a friendly, human-scale, user-facing interface for map browsing and search. There are clearly documented limitations/TOUs for automated use of this interface.

Does that prevent people from setting up massive scrapers to scrape the human-facing interface, rather than using the machine-facing data dumps? No, it does not; and the volunteer sysadmins have to spend an inordinate amount of time protecting the service against these people.

DataHen's proudly admitted practices ("No need to worry about IP bans, we auto rotate IPs on any requests that are made."; "our massive pool of auto-rotating proxies, user agents, and 'secret-sauce' helps get around them") are directly antithetical to this sort of scenario. I find this incredibly irresponsible and unethical.

Almost always the API is nerfed relative to the web site.

Almost all web sites that authenticate use "submit a form with username and password and respect cookies"; often sites that don't require authentication to use the web site still require it for the API. Every API uses a different authentication scheme and requires custom programming; for web sites, you have the URL of the form and the names of the username and password fields and you are done.
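The form-based flow described above can be sketched with the stdlib alone. The `username`/`password` field names are configurable defaults here, not any particular site's:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def encode_credentials(username, password,
                       user_field="username", pass_field="password"):
    """Build the form body; only the field names vary between sites."""
    return urllib.parse.urlencode(
        {user_field: username, pass_field: password}
    ).encode()

def login(form_url, username, password, **fields):
    """POST the login form and return an opener whose cookie jar now
    holds the session cookie -- reuse it for authenticated requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    opener.open(form_url, data=encode_credentials(username, password, **fields))
    return opener
```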

If you feed most HTML pages through a DOM parser like BeautifulSoup you can extract the links and interpret them through regexes. You might be done right then and there. If you need more, usually you can use CSS classes and IDs and... done!
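A stdlib-only sketch of that pattern, with `html.parser` standing in for BeautifulSoup; the `/episode/<id>` URL scheme is a made-up example:

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags (BeautifulSoup's
    soup.find_all('a') does the same with less code)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def episode_links(html, pattern=r"/episode/(\d+)"):
    """Extract links, then interpret them with a regex --
    here a hypothetical /episode/<id> URL scheme."""
    parser = LinkExtractor()
    parser.feed(html)
    rx = re.compile(pattern)
    return [(m.group(1), href) for href in parser.links
            if (m := rx.search(href))]
```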

I wrote a metadata extractor for Wikimedia HTML that I had working provisionally on Flickr HTML immediately and had working at 100% in 20 minutes. No way I could have gotten the authentication working for an API in 20 minutes.

Of course if there is another, unlimited way of getting the data, then it's perfectly fine to "redirect" people to another avenue.

What this conversation is about (since we're talking about web scrapers here, not "data downloaders" or whatever you would call it) is when there is no other avenue to get the data, you should be able to access the same data via a machine as you could when you're a person.

Sure, but Till/DataHen appears to have a host of measures in place to ignore that "redirect".

Yes, how are they supposed to know if the "redirect" is good or not? Facebook would helpfully redirect you to their Terms and Conditions, something you definitely should be able to ignore while scraping your data to your heart's content anyway.

Similarly to HTTP, the tool does not decide if usage is "nice" or "evil", only the user with their use case decides that.

>you should be able to access the same data via a machine as you could when you're a person

I generally agree with this, but I do see the problem for certain spaces.

Concert tickets, for example. People write scrapers to get the best seats and sit on them so they can scalp them later at inflated prices. Or other first-come/first-serve situations, online auction "sniping", etc.

Same thing applies here as I said in another comment (https://news.ycombinator.com/item?id=28060567), some people (me included) also write scrapers for getting the best seats in concerts/shows as I'm a big fan of the artist and really want to go there with good seats.

Perhaps it would be better to not gate by whether you happen to have visited a website fast enough relative to other people. For auctions, at least, there is a simple solution in the form of extending the auction timer whenever someone bids.
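The anti-sniping idea is simple to sketch; the five-minute extension window is an arbitrary choice:

```python
from datetime import datetime, timedelta

class Auction:
    """Anti-sniping: any bid landing near the deadline pushes the
    deadline back, so a last-second bot bid gains nothing."""
    def __init__(self, close_at, extension=timedelta(minutes=5)):
        self.close_at = close_at
        self.extension = extension

    def place_bid(self, now):
        if now >= self.close_at:
            return False                          # auction already closed
        if self.close_at - now < self.extension:
            self.close_at = now + self.extension  # extend the timer
        return True
```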

Technically, this is the same rabbit hole that leads to DRM and device intent superseding user intent.

Because that's the only way to be sure.

It's unfortunate that it also allows havoc and burdens good services. But device-over-user is not a future I want to live in.

Surely you must have a way to throttle human-originated abuse (such as someone spamming F5)? If so then this would work equally well for the machine.

Determine a reasonable rate limit and apply it to the human-facing version, with maybe a link to the machine-readable version in the error message?
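That shared limit could be a plain token bucket applied to every client, human or bot alike, with the error response pointing at the machine-readable endpoint. The rates here are arbitrary:

```python
import time

class RateLimiter:
    """Token bucket: allow up to `rate` requests per `per` seconds."""
    def __init__(self, rate=10, per=60.0, now=None):
        self.rate = rate
        self.per = per
        self.allowance = float(rate)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens in proportion to elapsed time, capped at `rate`.
        self.allowance = min(
            self.rate,
            self.allowance + (now - self.last) * self.rate / self.per,
        )
        self.last = now
        if self.allowance < 1.0:
            return False          # over the limit: serve 429 plus a link
        self.allowance -= 1.0     # to the machine-readable version
        return True
```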

Says who?

And only a few paragraphs down from a heading that says

> Till was architected to follow best practices that DataHen has accumulated

I know there are plenty of sites out there out to prevent any scraping whatsoever, or to improperly prevent some situations most of us would agree are reasonable behavior, but this appears to be blatantly hostile to web admins out there.

I agree there are cases where entities don't want things scraped that really should be scraped to preserve the history they'd like to rewrite.

My concern is that tools like these that use circumvention by default become the go-to when someone needs scraping, and make life hell for those of us running sites on hardware that's enough to service our small customer base but not an army of bots.

Admin of a 5000+ website platform (with customer inventory) here. This is the exact kind of thing I've been working so hard to block lately.

Since Feb 2021 we've seen a substantial increase in scraping of our customers' inventory to the point we now have more bot traffic than human traffic. Annoyingly, only 5-10 requests come from a single IP so we've had to resort to always challenging requests that come from certain ASNs (typically those owned by datacenters).

This type of project frustrates me because it knowingly goes against a site's stated bot policy (via robots.txt).
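For comparison, honouring robots.txt takes one stdlib call; the rules below are an illustrative example:

```python
import urllib.robotparser

def allowed(robots_txt, user_agent, url):
    """Check a fetch against the site's published robots.txt policy
    before making it -- the opposite of what circumvention tools do."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```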

I work on a lot of web scraping and we have business agreements with every site that we scrape explicitly allowing us to do so and with pre-approval for the scraping rate (which we carefully control).

None of this gets around over-eager Cloudflare or Akamai rules set up years ago by some contractor that the businesses have no real ability to change.

Why scrape at all if there are agreements in place? Seems like an API task.

> None of this gets around over-eager Cloudflare or Akamai rules set up years ago by some contractor that the businesses have no real ability to change.

If the business has no ability to even change the CloudFlare settings I don't expect them to be able to provide an API.

Only about 1 in 20 of our partners has the in-house tech necessary to provide us an API. In other words, we were able to accelerate our integrations by 20x by scraping.

In that case, the web site IS the API… the only difference is that the data is wrapped in HTML, not JSON.

Good neighbors don't track people's every move online either, but here we are.

In order for the CIA to continue its legendary fight against transphobia, it is important for them to use these technologies to gather OSINT on regions that house millions of backwards non-progressives and all of their content.

I can't believe HN is such a hotbed for transphobics like this to make comments without moderation or punishment.

Why is this on Github if I still need an API and cannot run it locally since the actual code is not public?

Or is this some sort of "source available" content marketing nonsense?

This sounds technically interesting but ethically dubious; also requires sign-up - even though free - which (I think?) is discouraged for Show HN.

If you are able to show how this service is administered responsibly with due regard for those running websites and blocking for good reason, best of luck with it.

> This sounds technically interesting but ethically dubious

> If you are able to show how this service is administered responsibly

Imagine if these were the comments about HTTP when it was first launched (honestly, there surely were some, but not as many as the praise).

Protocols and tools are not ethically dubious or responsible for anything. The users who use those protocols and tools are. Anything that can be used for good/bad can also be used for bad/good.

> also requires sign-up - even though free - which (I think?) is discouraged for Show HN.

I don't doubt it's 'community discouraged' because people don't want to have to, but I think the only rule is that there must be something to try out - no announcement-only, sign up for wait-list, hype-building type thing.

> also requires sign-up - even though free - which (I think?) is discouraged for Show HN.

Look at past Show HNs and tell me how many respect this. It's such an obsolete rule that it's irrelevant and almost OT to even comment on.

What's the difference between the linked repo and the hosted service at https://till.datahen.com/api/v1 ? Why do I need to sign up for a token if I am hosting my own server? Or otherwise, why do I need to host a server if I'm just calling your APIs? What sort of scraped data is exposed to the till hosted service?

We don't have a hosted service at the moment. It's all run on premise. The API link you mentioned is for a Till instance to validate the token and to report usage stats for each user. The auth token validates each user's free and premium features and limits based on usage stats, specifically total request counts and the cache-hit counter. We don't record or track anything related to the requests made through Till.

This looks really interesting, I'm going to bookmark it!

> Proxy IP address rotation

I wonder where you would get a decent proxy list nowadays. In the late 90s, you could find lists of accidental open proxies that "hackers" collected. But nowadays I've only seen really shady "residential proxies" that are basically private PCs with malware. Is there a decent source for proxies that are not widely blocked and not criminal and not too expensive?

And by the way, while many people here question the morality of scraping, I have at least two legitimate use cases:

1) I built a little "vertical" search engine for a niche, using a combination of scraping and RSS. It doesn't consume much traffic and does really help people discover content in this niche.

2) A friend does research on urban development and rent prices and they scrape rent ads to get their data (or let students type it off manually, I'm not sure...)

I run a number of proxies for (legally) scraping government data sites with overly restrictive views/hour policies.

They're just squid boxes spun up on various cheap VPS providers. A $5 digitalocean instance will power a lot of squid traffic, and as long as you stay away from AWS, the IP ranges tend not to be banned.

Just make sure to lock down your allowed source list, so you don't become one of those hacked open proxies!
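A minimal sketch of such a locked-down squid.conf; the source IP is a placeholder for your scraper's address:

```
# /etc/squid/squid.conf -- minimal private forward proxy
acl scraper src 198.51.100.7   # placeholder: ONLY your scraper's IP
http_access allow scraper
http_access deny all           # refuse everyone else: not an open proxy
http_port 3128
via off                        # don't announce the proxy in headers
forwarded_for delete           # don't leak the real client IP upstream
```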

If anyone needs a screenshot bot: https://github.com/kordless/grub-2.0

Still need to do some work on it, but it does do the job!

> Till was architected to follow best practices that DataHen has accumulated over the years of scraping at a massive scale.

The best practice is not to build a business that's built on evading the security of the people you're depending on.

Just get into crime if you want to do that.

Does this have any sort of JavaScript emulation/support? Or is it purely HTTP requests?

The GID signature piece is especially interesting. I have run into blockers in scaling scrapers in the past, and this sort of organizational/tracking platform sounds awesome.

You can connect your local browser to Till. But you need to have your browser accept a custom CA (Certificate Authority). Instructions here: https://till.datahen.com/docs/installation#ca

What kinds of scraping are being talked about here? As a maintainer of many a website I hate seeing so much bot traffic these days. So many sketchy foreign things. And then throw in the spam bots, vulnerability-testing bots, random URLs, etc. Sigh. Filling up the logs.

Almost totally dependent on Cloudflare these days to mitigate, as an in-between.

I have simple PHP code that adds misbehaving IP addresses to iptables. Keeps the logs clean, because they can't even connect over port 80.
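The same idea in a Python sketch (the original is PHP; the log format and ban threshold here are assumptions): count hits per source IP and drop repeat offenders at the firewall.

```python
import subprocess
from collections import Counter

def offenders(log_lines, threshold=20):
    """Return IPs (assumed to be the first field of each access-log
    line) that appear `threshold` or more times."""
    hits = Counter(line.split()[0] for line in log_lines if line.strip())
    return [ip for ip, n in hits.items() if n >= threshold]

def ban(ip):
    """Insert an iptables DROP rule for `ip` (requires root)."""
    subprocess.run(
        ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
        check=True,
    )

# Usage: for ip in offenders(open("/var/log/access.log")): ban(ip)
```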

Looks extremely useful. Starred for if I ever need to do scraping

Can this bypass cloudflare?

These guys bypass cloudflare https://microlink.io it's open source but it's just more convenient to pay and forget about it.

Cloudflare have a few configuration options for the level of bot protection, so it's probably more a question of what percentage would be blocked for a particular site.

This is entirely correct.

It will also depend on the proxy's ASN, the rate of requests, the ability to solve captchas and the presence of an actual js environment in the browser (as opposed to, say, curl - Cloudflare uses a javascript environment check script to detect puppeteer/selenium etc.)

> repository contains research from CloudFlare's AntiDDoS, JS Challenge, Captcha Challenges, and CloudFlare WAF.


Sometimes I, a human being, can’t bypass Cloudflare
