At Superfeedr, we do a lot of polling, and we do it in aggressive ways sometimes. We obviously try anything we can to be nice with servers, by implementing ETags, If-Modified-Since headers, but also (and more importantly), by avoiding to make any request that doesn’t need to be made (using PubSubHubbub, RSSCloud, XML-RPC pings… etc). We also do back-off when we see errors from a given url/domain, to avoid making an existing issue worse. Finally, we do not hide ourselves. The list of IPs we use is public and our User-Agents1 are very explicit. Yet, some people block us and we think it’s sad.
Why it is bad
The first thing is that many of the people who are blocked are probably legitimate users. By also implementing a blind traffic shaper, you prevent new, innovative services to come up and build thing that would probably bring your content to a whole new level. The web can make your content better, richer, smarter, and more interesting for your users. If you start rate limiting everyone, you basically put a big red flag on your service and everyone who wants to build something knows that at some point, whatever they’re building will be blocked by your service and may have to shut down. Not really a safe environment to foster innovation!
The worst thing about that is that bad guys_, spammers, spamdexers, MFAs, and other will actually find ways to get your content. No matter how hard you try, they will find ways to access your content and do anything with it, whether it’s scrapping1.php, using Tor networks, botnets… etc. Additionally, as they use black hat techniques, it’s probably going to be very hard for you to enforce any rule or Terms of Service. It’s the jungle! There are already black and grey markets for data from Twitter, Craigslist, Facebook and many many other protected content.
Why do they do it
I understand that I’m in the white knight position here. There are business and legal concerns that should be enforced. I’m not going to go into much details, but Twitter
is selling sold their firehose access; it must not be very simple to actually grant access for free to a bunch of services. Facebook has enough privacy issues that they don’t want 3rd party apps to access content that they shouldn’t have asked in the first place :D.
There may also be technical reasons. Doing thousands of requests per second on a service is costly, in terms of resources. Sites have been taken down by other crawling them. Rate limiting can be considered an as a simple solution to not degrading the whole app experience to the benefit of a single 3rd party app. It’s legit, but it’s the wrong approach. As economists told us reducing supply very rarely reduces the demand!
What can they do
At Superfeedr, we believe there is room for a better content distribution model, where there is no need for rate limit : pushing content.
You can rate limit polling, because it does kill your servers, but make sure that there is a way to get content in an non-throttled way.
By doing this, you can also very easily know what content was pushed to who at any time, and you can certainly enforce your business and legal concerns. Additionally, hackers are lazy, they will always take the simplest route to achieve their goal and if you make sure your way of pushing content away is the simplest way to get your content, there is no doubt that it’s the one they will adopt, rather than using black and grey techniques that you can’t control.
1 Superfeedr: Superparser/1.0 http://superfeedr.com – Please read this http://blog.superfeedr.com/publishers.html or get in touch if we’re polling too hard