PHP curl - 413 Payload too large on retrieving certain web pages

garethphp · December 24, 2020, 5:08pm

This code works fine for the majority of web pages, but I have started receiving the following error output for certain pages.

Unexpected HTTP code: 413
payload too large

I have included a webpage that produces the error in the code variable below.

I have increased the following PHP settings to their maximum with no luck: post_max_size, upload_max_filesize. I have also raised the Apache directive LimitRequestBody to its maximum value.

Any suggestions would be greatly appreciated.

function get_data($url) 
{
	$ch = curl_init();
	
	curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0");
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);

	$data = curl_exec($ch);
	
	if (!curl_errno($ch)) 
	{
		switch ($http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE)) 
		{
			case 200:  # OK
			break;
			default:
				echo 'Unexpected HTTP code: ', $http_code, "\n";
		}
	}
	curl_close($ch);
	return $data;
}

$url = "https://www.oddschecker.com/golf/open-championship/2021-open-championship/winner";
$returned_content = get_data($url);
echo "<Br>".$returned_content;

ErnieAlex · December 24, 2020, 8:25pm

Most likely it is the PHP memory size causing this error. You can beef up the memory given to your PHP page and the time limit too. Something like this might work:
set_time_limit(0);
ini_set('memory_limit', '2048M');

Just add them at the top of the page, right after the session setup if you use sessions.

garethphp · December 24, 2020, 8:45pm

Thanks for the suggestion. I modified the memory_limit variable in php.ini and added the set_time_limit(0) but unfortunately got the same error message.

skawid · December 24, 2020, 10:48pm

It looks like the server you are contacting is issuing the 413 error; that server is saying your request payload is too large. The only thing missing from your code above is the value of $timeout; what is the value of that variable?
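
A minimal sketch of how to check that, assuming a hypothetical fetch_with_diagnostics() variant of get_data() (that name and split_response() are illustrative, not from the original code). It asks cURL to record the request headers it actually sent and to return the response headers along with the body, so you can see what the remote server received and how it answered. The timeout values are arbitrary assumptions.

```php
<?php
// Hypothetical helper (not from the original post): split a raw cURL
// response, fetched with CURLOPT_HEADER enabled, into headers and body.
function split_response(string $raw, int $headerSize): array
{
    return [
        'headers' => (string) substr($raw, 0, $headerSize),
        'body'    => (string) substr($raw, $headerSize),
    ];
}

// Hypothetical diagnostic variant of get_data(): returns the status code,
// the request headers cURL sent, and the response headers and body.
function fetch_with_diagnostics(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,  // prepend response headers to the returned string
        CURLINFO_HEADER_OUT    => true,  // record the outgoing request headers
        CURLOPT_CONNECTTIMEOUT => 10,    // assumed value, in seconds
        CURLOPT_TIMEOUT        => 30,    // assumed value, in seconds
    ]);

    $raw = curl_exec($ch);
    if ($raw === false) {
        // Transport-level failure: no HTTP response was received at all.
        $result = ['error' => curl_error($ch)];
    } else {
        $parts  = split_response($raw, curl_getinfo($ch, CURLINFO_HEADER_SIZE));
        $result = [
            'status'           => curl_getinfo($ch, CURLINFO_HTTP_CODE),
            'request_headers'  => curl_getinfo($ch, CURLINFO_HEADER_OUT),
            'response_headers' => $parts['headers'],
            'body'             => $parts['body'],
        ];
    }
    curl_close($ch);
    return $result;
}
```

Calling fetch_with_diagnostics($url) for the failing page and printing 'status', 'request_headers' and 'response_headers' should show whether the 413 is coming back over the wire from the remote server (or something in front of it) rather than from your local PHP or Apache limits.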

garethphp · December 24, 2020, 11:00pm

Sorry I forgot to include that. $timeout is currently set to 0.

ErnieAlex · December 28, 2020, 9:48pm

I spent most of the day attempting to scrape that page for you. Nothing worked in any way.
One odd thing I got working was to load the data into an iFrame. Then, you can access it thru the DOM.
You would need to probably use Javascript to access the DOM and save the data you want.

It appears their server is blocking some HTTP requests somehow. You might be able to pay them a membership fee and download the data in JSON format!

garethphp · December 28, 2020, 10:18pm

Thanks for spending some time and the input @ErnieAlex.

It is very weird, as 99% of pages on that domain work when I scrape them, but a handful, like the one in the example, don’t. Glad it isn’t just me pulling my hair out wondering what is going on in the above example.

I’ve never played around with iframes and pulling data from them before. After a quick search, it appears it might not be possible. Most articles say “Cross site contents can not be read by javascript.”

ErnieAlex · December 28, 2020, 11:35pm

Well, I did load the entire page into an iFrame. No issues there. I did not attempt to play with that as yet.
I will look at it further later on tonight. I am surprised by this one, as I scrape a lot of data for various websites that I play with and have never failed up till now… So, it is a puzzle for me to look at in a few hours… I will update you on what I find out.

ErnieAlex · December 29, 2020, 12:00am

Waiting for food to cook, had a few minutes…

Something loosely like this would get you started:

<div><iframe id="ifrm2" src=''></iframe></div>
 
<script>
// reference to the iframe with id 'ifrm2'
var ifrm = document.getElementById("ifrm2");
ifrm.style.width = '1024px'; // set width
ifrm.style.height = '1024px';
ifrm.src = "https://www.oddschecker.com/golf/open-championship/2021-open-championship/winner";  // set src to new url

</script>

This should load the page into an iFrame. Then, if you right-click on it, you should see frame or this-frame and you can view the source of the frame. I suspect you could do a simple temporary file creation using this source code and then switch to another PHP page that would actually do the real scraping using the temp file created from the Javascript… Might work…

ErnieAlex · December 29, 2020, 1:38am

Well, you can not do it with an iFrame. Well, unless you handle all the scraping in Javascript.
But, then, you can’t save the data on the server since JS is only client-side! I forgot about that issue.
So, I did a lot of further tests and all of them failed. But, I did locate the error causing it:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource

After looking at the code on the site and the many other error messages while debugging, it appears the cause is their site calling libraries and fonts from other sites. Therefore, the cross-domain problems kick in and throw out errors. I did find there is one solution, but it needs to be handled at their site, not yours. They can add a line at the top of the problem pages that would allow you access to them. I did not take the time to research on their site how they feel about scraping their data. You might know about that. Perhaps you should email them and see if they will help.
Sorry I could not fix this for you…

ErnieAlex · December 29, 2020, 1:17pm

I was looking at their site a bit more and if you are a member, you might be able to scrape this using an SSL login for their site. In other words, you read the site using cURL with login information. Do you currently have an account with them?
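
Something loosely like the following is what I mean by reading the site with login information. It is only a sketch under assumptions: the login URL, the form field names, and the helper names (build_login_options, build_fetch_options, run_request) are all hypothetical, and the real values would have to be taken from the site's own login form. The idea is to post the login form once with a cookie file, then reuse that cookie file when fetching the page.

```php
<?php
// Hypothetical helper: cURL options for a form login that stores the
// session cookies in $cookieFile. URL and field names are assumptions.
function build_login_options(string $loginUrl, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved to this file
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from this file
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ];
}

// Hypothetical helper: cURL options for fetching a page with the saved cookies.
function build_fetch_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_RETURNTRANSFER => true,
    ];
}

// Runs one request; returns the body as a string, or false on failure.
function run_request(array $options)
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch);
    curl_close($ch); // the cookie jar is written once the handle is released
    return $body;
}

// Usage sketch (placeholder URL and field names, not the site's real ones):
// $cookieFile = tempnam(sys_get_temp_dir(), 'cookies');
// run_request(build_login_options('https://example.com/login',
//     ['email' => 'me@example.com', 'password' => 'secret'], $cookieFile));
// $html = run_request(build_fetch_options($url, $cookieFile));
```

Whether a logged-in session changes the 413 response is something you would have to test; this only shows the mechanics of carrying cookies between requests.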

garethphp · December 29, 2020, 3:21pm

@ErnieAlex No worries with the javascript. The client-side issue was my understanding as well, from having a quick search on the iframe subject.

Regarding the site calling libraries and fonts from other sites: did you find they only make this call on certain pages, like the one in the example? Otherwise, it probably wouldn’t explain why the majority of pages work for my scraping.
For example, this page works. Do they not make the call to external information on it?
https://www.oddschecker.com/horse-racing/southwell/18:00/winner
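
To compare the two pages side by side, something like this sketch could help. The helper names compare_pages and curl_fetcher are made up for illustration; the fetcher reuses the user agent from my get_data() function, and the timeout values are arbitrary. It just collects the HTTP status and body size per URL so the working and failing pages can be compared under identical request settings.

```php
<?php
// Hypothetical helper: run the same fetcher over several URLs and collect
// the status code and body size of each for comparison.
// $fetch is any callable returning ['status' => int, 'body' => string].
function compare_pages(array $urls, callable $fetch): array
{
    $rows = [];
    foreach ($urls as $url) {
        $result = $fetch($url);
        $rows[$url] = [
            'status' => $result['status'],
            'bytes'  => strlen($result['body']),
        ];
    }
    return $rows;
}

// Hypothetical cURL-backed fetcher with the same user agent as get_data().
function curl_fetcher(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0",
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10, // assumed value, in seconds
        CURLOPT_TIMEOUT        => 30, // assumed value, in seconds
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response arrived
    curl_close($ch);
    return ['status' => $status, 'body' => $body === false ? '' : $body];
}

// Usage sketch:
// print_r(compare_pages([
//     "https://www.oddschecker.com/golf/open-championship/2021-open-championship/winner",
//     "https://www.oddschecker.com/horse-racing/southwell/18:00/winner",
// ], 'curl_fetcher'));
```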

Yes I do currently have an account with them. It is free to sign-up and just needs an email.

ErnieAlex · December 29, 2020, 3:54pm

Well, they might be calling a library only on that page that is different from the others.
I have many scraper systems I created. I will try some more later today and see if I can get it to work.
I might have to create an account so I can test it that way, too. Busy for a few hours though. Later…