It looks like TechCrunch blocks GoDaddy server access

  sonic0002        2024-08-17 12:31:52

Recently, I ran into an issue with an app I maintain: it suddenly stopped pulling RSS feeds from TechCrunch. At first, I suspected that the RSS feed URL had changed. After further investigation, however, I found a different story. The URL itself was unchanged, but the response varied depending on where the request came from.

To troubleshoot, I started by setting up a local web server and running a test with my script to see if it could still pull the RSS feed. The script, written in PHP, simply sends a cURL request to the TechCrunch RSS feed.

<?php
function get($url, $header = [], $timeout = 120){
    $userAgent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36';

    $ch = curl_init();

    // The next two options disable SSL certificate verification (acceptable for a quick test only)
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // send a browser-like User-Agent
    curl_setopt($ch, CURLOPT_POST, false); // send a GET request, not a POST
    curl_setopt($ch, CURLOPT_URL, $url); // set the URL to request
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout); // give up after $timeout seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body from curl_exec() instead of printing it
    if (!empty($header)) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // optional extra request headers
    }

    $result = curl_exec($ch); // execute the request; returns false on failure
    if ($result === false) {
        var_dump(curl_error($ch)); // show why the transfer failed
    }

    $responseInfo = curl_getinfo($ch); // transfer details: status, timings, IP addresses
    var_dump($responseInfo);

    curl_close($ch);

    return $result;
}

$url = "https://techcrunch.com/feed/?size=20";
$output = get($url);
var_dump($output);

On my local machine the output looks fine: the request returns the expected XML response.

array (size=26)
  'url' => string 'https://techcrunch.com/feed/?size=20' (length=36)
  'content_type' => string 'application/rss+xml; charset=UTF-8' (length=34)
  'http_code' => int 200
  'header_size' => int 2809
  'request_size' => int 194
  'filetime' => int -1
  'ssl_verify_result' => int 20
  'redirect_count' => int 0
  'total_time' => float 1.266
  'namelookup_time' => float 0.016
  'connect_time' => float 0.235
  'pretransfer_time' => float 0.75
  'size_upload' => float 0
  'size_download' => float 23380
  'speed_download' => float 18467
  'speed_upload' => float 0
  'download_content_length' => float -1
  'upload_content_length' => float -1
  'starttransfer_time' => float 0.969
  'redirect_time' => float 0
  'redirect_url' => string '' (length=0)
  'primary_ip' => string '98.137.27.7' (length=11)
  'certinfo' => 
    array (size=0)
      empty
  'primary_port' => int 443
  'local_ip' => string '192.168.1.73' (length=12)
  'local_port' => int 6578

C:\wamp64\pxlet\cron\test.php:29:string '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"

However, once the script is uploaded to a GoDaddy server, the output is completely different.

array(
  'url' => string 'https://techcrunch.com/feed/?size=20' (length=36),
  'content_type' => string 'text/html' (length=9),
  'http_code' => int 200,
  'header_size' => int 341,
  'request_size' => int 192,
  'filetime' => int -1,
  'ssl_verify_result' => int 20,
  'redirect_count' => int 0,
  'total_time' => float 0.28804,
  'namelookup_time' => float 0.000452,
  'connect_time' => float 0.199998,
  'pretransfer_time' => float 0.243256,
  'size_upload' => float 0,
  'size_download' => float 4,
  'speed_download' => float 13,
  'speed_upload' => float 0,
  'download_content_length' => float 4,
  'upload_content_length' => float 0,
  'starttransfer_time' => float 0.287981,
  'redirect_time' => float 0,
  'redirect_url' => string '' (length=0),
  'primary_ip' => string '98.137.27.7' (length=11),
  'certinfo' => array(),
  'primary_port' => int 443,
  'local_ip' => string '132.148.180.244' (length=15),
  'local_port' => int 45552,
  'http_version' => int 3,
  'protocol' => int 2,
  'ssl_verifyresult' => int 0,
  'scheme' => string 'HTTPS' (length=5),
  'appconnect_time_us' => int 443116,
  'connect_time_us' => int 199998,
  'namelookup_time_us' => int 452,
  'pretransfer_time_us' => int 243256,
  'redirect_time_us' => int 0,
  'starttransfer_time_us' => int 287981,
  'total_time_us' => int 288040,
)

string(4) "OK"

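On the GoDaddy server, the response is just the text OK, served as text/html with a 4-byte body, instead of the XML I expected. The status code is still 200, so the request does not look like a failure at first glance.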
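Because both runs report http_code 200, the status code alone cannot separate the real feed from the placeholder. Below is a minimal sketch of a sanity check on the body and content type. The function name looksLikeRssFeed is hypothetical and not part of the original script, and the heuristics (content type contains "xml", body contains "<rss" near the start) are assumptions based only on the two responses shown above.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): decide whether a
 * response is plausibly an RSS feed rather than a placeholder such as "OK".
 */
function looksLikeRssFeed($body, ?string $contentType): bool
{
    // curl_exec() returns false on failure; an empty body is not a feed either.
    if (!is_string($body) || trim($body) === '') {
        return false;
    }

    // In the dumps above, the real feed was application/rss+xml and the
    // placeholder was text/html. A missing content type is not treated as fatal.
    if ($contentType !== null && stripos($contentType, 'xml') === false) {
        return false;
    }

    // Finally, require an <rss element near the start of the document.
    return stripos(substr($body, 0, 512), '<rss') !== false;
}

// The two cases from the article:
var_dump(looksLikeRssFeed(
    '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"',
    'application/rss+xml; charset=UTF-8'
)); // bool(true)
var_dump(looksLikeRssFeed('OK', 'text/html')); // bool(false)
```

With a check like this, the app could log a warning when the feed is replaced by a placeholder instead of silently storing nothing.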

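After reviewing the response information, both requests go to the same URL and the same destination IP (98.137.27.7), and both return HTTP 200. The only suspicious difference on the request side is the IP address of the caller (local_ip). It looks like TechCrunch is refusing to serve the feed to scripts hosted on GoDaddy web servers.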
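Comparing the two curl_getinfo() dumps by eye is error-prone, so here is a small sketch that lists the fields that differ. The helper name diffCurlInfo is hypothetical; the sample arrays contain only a subset of the fields, with values copied from the two dumps above.

```php
<?php
/**
 * Hypothetical helper: return the fields whose values differ between two
 * curl_getinfo() arrays, looking only at keys present in both.
 */
function diffCurlInfo(array $first, array $second, array $ignore = []): array
{
    $diff = [];
    foreach (array_intersect_key($first, $second) as $key => $value) {
        if (in_array($key, $ignore, true)) {
            continue; // caller asked to skip this field (e.g. timings)
        }
        // Loose comparison on purpose: different PHP/cURL builds may report
        // the same number as int on one host and float on another.
        if ($value != $second[$key]) {
            $diff[$key] = ['first' => $value, 'second' => $second[$key]];
        }
    }
    return $diff;
}

// Subset of the local dump shown earlier.
$local = [
    'url'           => 'https://techcrunch.com/feed/?size=20',
    'content_type'  => 'application/rss+xml; charset=UTF-8',
    'http_code'     => 200,
    'header_size'   => 2809,
    'size_download' => 23380,
    'primary_ip'    => '98.137.27.7',
    'local_ip'      => '192.168.1.73',
];

// Subset of the GoDaddy dump shown earlier.
$godaddy = [
    'url'           => 'https://techcrunch.com/feed/?size=20',
    'content_type'  => 'text/html',
    'http_code'     => 200,
    'header_size'   => 341,
    'size_download' => 4,
    'primary_ip'    => '98.137.27.7',
    'local_ip'      => '132.148.180.244',
];

echo implode(', ', array_keys(diffCurlInfo($local, $godaddy))), PHP_EOL;
// prints: content_type, header_size, size_download, local_ip
```

The URL, status code, and destination IP match, which is consistent with the server answering differently based on who is calling rather than on what is being requested.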

To verify this, I uploaded the same script to a different web server I have with HostGator. This time, the script successfully retrieved the correct RSS feed data, reinforcing my suspicion that TechCrunch was selectively blocking access from GoDaddy-hosted servers.
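When running the same test from several hosts, the response headers can be as informative as the body, since they may hint at which layer produced the answer. The following is a sketch only: collectHeaderLine and fetchWithHeaders are hypothetical names, not part of the original script, and the sketch keeps certificate verification enabled, unlike the test script above.

```php
<?php
/**
 * Hypothetical helper: record one raw header line into $headers.
 * Shaped for use inside a CURLOPT_HEADERFUNCTION callback, which must
 * return the number of bytes it handled.
 */
function collectHeaderLine(array &$headers, string $line): int
{
    $parts = explode(':', $line, 2);
    if (count($parts) === 2) {
        // Header names are case-insensitive and may repeat, so store a
        // list of values under the lower-cased name.
        $headers[strtolower(trim($parts[0]))][] = trim($parts[1]);
    }
    // Lines without a colon (status line, blank separator) are skipped.
    return strlen($line);
}

/**
 * Hypothetical wrapper: fetch $url and return status, headers, and body.
 * If redirects are followed, headers from every hop are collected.
 */
function fetchWithHeaders(string $url, int $timeout = 30): array
{
    $headers = [];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION,
        function ($ch, $line) use (&$headers) {
            return collectHeaderLine($headers, $line);
        });

    $body = curl_exec($ch); // false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}
```

Calling print_r(fetchWithHeaders('https://techcrunch.com/feed/?size=20')['headers']) on each host and comparing the results would show whether anything besides the body and content type changes.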

It seems plausible that TechCrunch is restricting non-organic access to their data, possibly to prevent data scraping for purposes like AI training.

Has anyone else seen the same behavior, or is there a problem with my script? Feel free to share your findings.

AI  GODADDY  TECH CRUNCH  BLOCK ACCESS 

       
