You click, you scroll, then a stern page stops you cold. It asks if you are human. Your heart skips a beat.
Across major news sites, protective systems now judge every tap and swipe. One British tabloid group has spelled out its stance, warning that automated access and text or data mining are banned under its terms, even for AI models. That stance affects ordinary readers as much as scrapers, because the same filters decide who gets through and who faces a wall.
Why readers are seeing bot checks
News Group Newspapers Limited, publisher of The Sun, has reiterated that it prohibits automated access, scraping, and text or data mining of its content. The warning includes use for artificial intelligence, machine learning, and large language models. These terms aim to protect original reporting and to prevent wholesale copying.
Automated access and data mining are prohibited under the publisher’s terms, including for AI, machine learning, and LLMs.
The group also notes that its system can misread legitimate behaviour as automated. That means fast scrolling, repeated refreshes, or multiple tabs can trigger a challenge. A short session can become a dead stop, even when you only came to read a match report or check a celebrity update.
The Sun’s warning and what it means
The message is blunt: the outlet does not permit automated collection of its stories by any means, whether direct or via an intermediary. It directs businesses seeking commercial access to email [email protected]. Readers who believe they were blocked by mistake can contact [email protected] for assistance.
If you are a legitimate user who hit a block, the publisher asks you to contact [email protected].
The signals that trigger a block
Bot-detection systems watch dozens of signals at once. They build a risk score in seconds and decide whether to let you pass, show a challenge, or block. While vendors keep exact recipes secret, these common cues often raise flags:
- Opening many tabs for the same site within a short burst.
- Very high scroll speed that repeats the same pattern.
- No mouse movement between clicks on desktop.
- Non-standard browser fingerprints or missing fonts and plugins.
- Requests from data centre IP ranges rather than consumer broadband or mobile networks.
- Using headless browsers or automation frameworks.
- Blocked or constantly changing cookies and local storage.
- Inconsistent time zones and language settings.
- Copying large chunks of text at high frequency.
- Rapid-fire page requests that skip images, ads, or scripts.
- VPN, proxy, or Tor endpoints with known scraping activity.
- Abnormal touch events on mobile that look machine-generated.
One or two of these may not stop you. A cluster often does. The aim is to protect content and keep site performance steady for real readers.
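To make that scoring idea concrete, here is a minimal sketch of how a detector might weigh signals like those above into a single verdict. The signal names, weights, and thresholds are invented for illustration only; as noted, vendors keep their real recipes secret and use far more inputs.

```typescript
// Simplified sketch of a bot-detection layer combining signals into a risk
// score. Signal names, weights, and thresholds are assumptions for
// illustration, not any vendor's actual configuration.

type Signal =
  | "many_tabs_same_site"
  | "repetitive_scroll_pattern"
  | "no_mouse_movement"
  | "datacentre_ip"
  | "headless_browser"
  | "cookies_blocked"
  | "timezone_language_mismatch"
  | "rapid_requests_no_assets";

const WEIGHTS: Record<Signal, number> = {
  many_tabs_same_site: 10,
  repetitive_scroll_pattern: 15,
  no_mouse_movement: 10,
  datacentre_ip: 25,
  headless_browser: 40,
  cookies_blocked: 10,
  timezone_language_mismatch: 5,
  rapid_requests_no_assets: 30,
};

type Verdict = "allow" | "challenge" | "block";

function scoreSession(observed: Signal[]): Verdict {
  // One or two weak signals rarely matter; a cluster pushes the score up.
  const score = observed.reduce((sum, s) => sum + WEIGHTS[s], 0);
  if (score >= 70) return "block";      // looks strongly automated
  if (score >= 35) return "challenge";  // ask for a human check
  return "allow";                       // typical reader behaviour
}

// A reader with strict cookies and an odd locale still passes...
console.log(scoreSession(["cookies_blocked", "timezone_language_mismatch"])); // "allow"
// ...while a headless browser on a data-centre IP firing rapid requests does not.
console.log(scoreSession(["headless_browser", "datacentre_ip", "rapid_requests_no_assets"])); // "block"
```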
What you can do in 60 seconds
You can reduce false flags with small changes to your routine. These steps keep you on the right side of the gate.
- Use a standard, up-to-date browser with JavaScript turned on.
- Allow first-party cookies for the news site.
- Avoid rapid refreshes and give pages time to load fully.
- Scroll at a natural pace and click normally.
- Turn off ad-blockers or add the site to your allowlist if challenges persist.
- Stay on a stable connection; switch off a noisy VPN for reading sessions.
- If you hit a block, wait a moment, then retry; contact support if it repeats.
Commercial access is separate from casual reading. For licences, email [email protected]; for reader help, use [email protected].
For businesses, researchers, and developers
Different goals require different routes. Here is a quick guide to staying compliant and avoiding disruption.
| Use case | Risk | Action |
|---|---|---|
| AI training or dataset building | High: terms prohibit automated mining | Seek a licence via [email protected]; design compliant pipelines |
| Media monitoring for clients | Medium to high: volume triggers blocks | Use licensed feeds or approved APIs; avoid scraping live pages |
| Academic TDM for non-commercial research | Varies: legal exceptions are narrow | Request permissions; store only permitted excerpts; respect technical controls |
| Personal reading with a VPN | Low to medium: some IPs raise flags | Switch to a residential connection if challenges repeat |
The legal and ethical layer
UK publishers rely on copyright, database rights, and website terms to control reuse. They also deploy technical measures such as rate limits and bot challenges. While some research uses benefit from narrow exceptions, commercial scraping and AI corpus building sit firmly in the restricted zone for many outlets. The practical point is simple: rights holders expect paid licences for systematic reuse, and they will enforce technical barriers when traffic patterns look automated.
This debate now shapes the AI supply chain. Model builders want broad text access. Newsrooms seek fair value for their reporting. Both sides point to public interest, but the balance often rests on contracts and permissions. Readers sit in the middle, affected by the same filters that separate humans from harvesters.
What happens behind the scenes
Modern bot detection looks at behaviour, browser makeup, network clues, and even small quirks such as how fast a device draws a canvas. It asks whether your pattern matches known automation or a typical person with a phone. These checks run in under a second, and they keep running as you move from page to page. If the score rises, you hit a challenge or a hard block.
Sometimes a real person fails that test. A shared office IP, a strict privacy plugin, or a jittery network can push the score over the line. That is why publishers include a backstop: a contact route for readers and a licensing channel for companies.
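For a feel of how one of those device quirks is measured, here is a small browser-side sketch that times how long a canvas draw takes. The drawing content and the idea of folding the number into a wider score are illustrative assumptions, not any vendor's actual probe.

```typescript
// Times a simple canvas draw, one of the "device quirk" signals mentioned
// above. On its own the number proves nothing, which is part of why real
// people are sometimes misjudged.

function canvasDrawTimeMs(): number {
  const canvas = document.createElement("canvas");
  canvas.width = 200;
  canvas.height = 50;
  const ctx = canvas.getContext("2d");
  if (!ctx) return -1; // canvas unsupported: itself a weak signal

  const start = performance.now();
  ctx.font = "16px Arial";
  ctx.fillStyle = "#069";
  ctx.fillText("sample fingerprint text", 2, 20);
  ctx.strokeStyle = "rgba(120, 60, 200, 0.5)";
  ctx.arc(100, 25, 20, 0, Math.PI * 2);
  ctx.stroke();
  // Reading the pixels back forces the draw to finish before timing stops.
  canvas.toDataURL();
  return performance.now() - start;
}

const drawTime = canvasDrawTimeMs();
console.log(`canvas draw took ~${drawTime.toFixed(2)} ms`);
```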
Key information highlighted
News Group Newspapers Limited states that automated access, collection, and text/data mining of its content are not permitted.
Legitimate users who are blocked are invited to contact [email protected] for assistance.
Extra help, examples, and risks to weigh
Text and data mining, often shortened to TDM, means using software to systematically collect content and extract facts or patterns. A newsroom may use TDM internally on its own archives. An AI firm may try to mine many sites to build a training corpus. Those are very different risk profiles. One is within a publisher’s control. The other often collides with its rights and business model.
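As a rough illustration of the lower-risk case, the sketch below counts how often a few club names appear across an archive the newsroom already owns and is licensed to process. The directory path and club list are placeholders; the point is extracting patterns, not copying anyone else's articles.

```typescript
// Tiny TDM example run over content you control: count club mentions across
// local archive files. ARCHIVE_DIR and CLUBS are made-up placeholders.

import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const ARCHIVE_DIR = "./own-archive"; // assumption: your own, licensed text
const CLUBS = ["Arsenal", "Chelsea", "Liverpool"];

const counts = new Map<string, number>(CLUBS.map((c) => [c, 0] as [string, number]));

for (const file of readdirSync(ARCHIVE_DIR)) {
  const text = readFileSync(join(ARCHIVE_DIR, file), "utf8");
  for (const club of CLUBS) {
    const matches = text.match(new RegExp(`\\b${club}\\b`, "g")) ?? [];
    counts.set(club, (counts.get(club) ?? 0) + matches.length);
  }
}

console.log(Object.fromEntries(counts)); // e.g. { Arsenal: 12, Chelsea: 7, Liverpool: 9 }
```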
Try this simple routine to avoid a false flag: open one tab, let the page load completely, scroll to the middle, pause to read a paragraph, then continue. Avoid opening ten tabs at once from a social feed. If you need multiple pages, space out your clicks. This mirrors how most readers behave and lowers your chance of a challenge.
There are gains from these systems. Pages stay fast, data theft drops, and ad fraud shrinks. There are also trade-offs. Privacy tools can resemble scrapers, and shared networks can look noisy. If you run privacy extensions, tune them per site and keep a reading profile separate from heavy scraping tools used for work. That separation keeps your day-to-day browsing smooth.
If you are a company building a product that relies on news content, plan for compliant access from day one. Budget for licences, narrow your data needs to the minimum, and store content responsibly. For small teams, start with public headlines and metadata where allowed, and add licensed text later. Early design choices will save you from service blocks and legal dead ends.
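One way to picture "narrow your data needs to the minimum" is a storage schema that holds only headline-level metadata until a licence covers full text. The field names and the licence flag below are assumptions for illustration, not anyone's real pipeline.

```typescript
// Keep only metadata by default; store body text only once a licence allows it.
// Field names and the hasLicenceForFullText guard are hypothetical.

interface ArticleRecord {
  url: string;          // canonical link back to the publisher
  headline: string;
  publishedAt: string;  // ISO 8601 date
  outlet: string;
  fullText?: string;    // only populated once a licence permits it
}

function storeArticle(record: ArticleRecord, hasLicenceForFullText: boolean): ArticleRecord {
  if (!hasLicenceForFullText && record.fullText !== undefined) {
    // Drop body text rather than keep something the terms do not allow.
    const { fullText, ...metadataOnly } = record;
    return metadataOnly;
  }
  return record;
}

// Example: a small team without a text licence keeps only the metadata.
console.log(storeArticle(
  {
    url: "https://example.com/story",
    headline: "Example headline",
    publishedAt: "2024-05-01",
    outlet: "Example Outlet",
    fullText: "...",
  },
  false
));
```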



Thanks for laying out the triggers so clearly. The distinction between casual reading and commercial access is definitely useful.
I get the need to protect content, but this reads like punishing privacy. Why should a VPN or strict cookies be a red flag for readers?