Grab HTML code from a specific website and parse it

Hi community,
I registered because of a special request I have gotten, and I am struggling to resolve it. Let me explain the need:

I need to create an intranet page that shows a value taken from our company Salesforce. There is a dashboard in Salesforce that I need to access and parse some values from. Dashboards in Salesforce have a static URL, like "https://companyname.my.salesforce.com/01Z60000000ipt3". When I download the website locally, e.g. to http://localhost/simulation/index.html, my script works perfectly:

[php]<?php

//--- Variables you may use after including this php ---//
/*
$taccasecount Displays the number of Backup Cases in Tac EMEA
*/

//--- Fetching website and searching for the string above ---//
try {
	$content = file_get_contents("http://localhost/simulation/index.html"); // get HTML code from website
	$search = '#Backup:\s*([0-9]+)#'; // search pattern; captures the whole number, not just one digit

	//--- Error handling ---//
	if ($content === false) {
		// file_get_contents() signals failure via warnings, not exceptions
		$error = error_get_last();
		echo "Error: Unable to fetch website." . PHP_EOL;
		echo "Debugging error: " . ($error['message'] ?? 'unknown') . PHP_EOL;

	//--- If no error: run the search ---//
	} else {
		if (!preg_match($search, $content, $casecountmatch)) {
			echo "Error: Unable to find search string." . PHP_EOL;
		} else {
			$taccasecount = $casecountmatch[1];
			print_r("<p>Backup Cases: $taccasecount</p>"); // test output
		}
	}

//--- If boolean <> false, but return is still bad ---//
} catch (Exception $e) {
	echo "Error: Unknown Exception" . PHP_EOL;
}

?>[/php]

I could even use PHP's file_get_contents function. However, when I try to access the webpage directly, I get stuck at a redirect - I assume it is JavaScript. The plaintext output from file_get_contents for https://companyname.my.salesforce.com/01Z60000000ipt3 is:

[code]

[/code]

I read that file_get_contents is probably not the right way to grab the HTML. (Strictly speaking it does follow plain HTTP redirects by default, but it cannot execute JavaScript ones.) So I tried working with cURL. No matter what I did, I kept getting the same output. I know the website is HTTPS and uses authentication, so I tried both: cURL with and without posting credentials. However, as long as I work from a browser that still has the Salesforce cookies, this should work without auth - it does when I link to pictures from the same dashboard on my intranet page.

I really tried many things that can be found with Google, so I do not have a list of everything I tried. It's also quite funny that I need to set header('Content-Type: text/plain'); before my content output to even see the above - otherwise I am redirected to the correct site.

My last attempt was installing Guzzle and using this snippet:
[php]<?php

require '…/…/…/vendor/autoload.php';
use GuzzleHttp\Client;

$client = new Client([
	// Base URI is used with relative requests
	// MUST BE a fully qualified URL, e.g. http://
	'base_uri' => 'https://companyname.my.salesforce.com',
	// You can set any number of default request options, e.g. a timeout
	'timeout' => 2.0,
]);

$response = $client->request('GET', 'https://companyname.my.salesforce.com/01Z60000000ipt3');

header('Content-Type: text/plain');

// getBody() returns a GuzzleHttp\Psr7\Stream
$body = $response->getBody();
// Implicitly cast the body to a string and echo it
echo $body;
echo $response->getStatusCode();

?>[/php]

Another attempt was this one: http://at1.php.net/manual/en/ref.curl.php#93163

Every time the same: the script gets stuck at the redirect. Does anyone have an idea how I could get the data from the page? I am also open to completely different solutions. The only thing I cannot do at this time is get permission to use any Salesforce API (REST or similar). I can only use my personal login and access to the data that is already there.

Cheers
RamsesV

If your company HAS salesforce, how do you not have access to their API?

Look into cURL.

Because I am not in the department that is able to use the API. And since I only need a few values, no one will allocate manpower to my small request.

I thought I did :slight_smile: Do you have any advice or keyword?

PHP cURL

http://codular.com/curl-with-php

I thought I had posted that too, but I didn't. One of the cURL ideas I worked with:
[php]/*==================================
Get URL content and response headers (given a URL, follows all redirects and returns the content and response headers of the final URL)

@return array[0] content
        array[1] array of response headers
==================================*/
function get_url( $url, $javascript_loop = 0, $timeout = 10 )
{
$url = str_replace( "&amp;", "&", urldecode(trim($url)) ); // turn HTML-escaped ampersands back into plain ones

$cookie = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # required for https urls
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
$content = curl_exec( $ch );
$response = curl_getinfo( $ch );
curl_close ( $ch );

if ($response['http_code'] == 301 || $response['http_code'] == 302)
{
    ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

    if ( $headers = get_headers($response['url']) )
    {
        foreach( $headers as $value )
        {
            if ( substr( strtolower($value), 0, 9 ) == "location:" )
                return get_url( trim( substr( $value, 9, strlen($value) ) ) );
        }
    }
}

if (    ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) &&
        $javascript_loop < 5
)
{
    return get_url( $value[1], $javascript_loop+1 );
}
else
{
    return array( $content, $response );
}

}

header('Content-Type: text/plain');
print_r(get_url("https://companyname.my.salesforce.com/01Z60000000ipt3")); [/php]

The above is not from me, the below is:

[php]<?php

$loginUrl = 'https://companyname.my.salesforce.com?ec=302&startURL=%2F01Z60000000ipt3'; // action from the login form
$loginFields = array('username' => 'myusername', 'password' => 'mypassword'); // login form field names and values
$remotePageUrl = 'https://companyname.my.salesforce.com/01Z60000000ipt3'; // url of the page you want to save

#$remotePageUrl = urlencode($remotePageUrl);
print_r($remotePageUrl);

$login = getUrl($loginUrl, 'post', $loginFields); // log in to the site

$remotePage = getUrl($remotePageUrl); // get the remote page

function getUrl($url, $method = '', $vars = '') {
	$ch = curl_init();
	#curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: deflate, gzip'));
	if ($method == 'post') {
		curl_setopt($ch, CURLOPT_POST, 1);
		curl_setopt($ch, CURLOPT_POSTFIELDS, $vars);
	}
	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
	curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies/cookies.txt');
	curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies/cookies.txt');
	$buffer = curl_exec($ch);
	curl_close($ch);
	return $buffer;
}

$content = $remotePage;

#$curl_exec = curl_exec($ch);
#$content = @gzdecode($curl_exec);
#return $content !== false ? $content : $curl_exec;
#$content = json_decode($content, true);

print_r($content);

?>[/php]

In the second example I also tried to work with gzip decompression, but that was not the issue.

Edit: When trying to get the page with curl on the command line with -L, so that it follows any redirects, it gives me the same output as the PHP script.

A workmate made a suggestion which I will probably follow too. Are there ways in Python to get the HTML? I guess so, and I will also dig in this direction.

The Beautiful Soup library in Python is numero uno.

Thanks for the input so far. Beautiful Soup is awesome - as soon as you have the content to parse :slight_smile: But that is really something I will use in the future, thanks for that!
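Just to sketch that future parsing step: once the HTML is actually in hand, even the Python standard library can pull the value out. A minimal sketch, assuming the dashboard HTML contains a literal "Backup: <number>" somewhere (the sample string below is made up):

```python
import re

def extract_backup_count(html):
    """Pull the number after 'Backup:' out of the dashboard HTML."""
    match = re.search(r"Backup:\s*([0-9]+)", html)
    return int(match.group(1)) if match else None

# Made-up sample standing in for the downloaded dashboard HTML
sample = "<td>Backup: 42</td>"
print(extract_backup_count(sample))  # 42
```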

I also had a look at Requests, another awesome Python lib. However, with both of them I failed to get more content from the webpage than I posted above.
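For what it's worth, here is roughly what reusing the browser session looks like with only the standard library - a sketch, assuming Salesforce's session cookie is called sid (its value would be copied from the browser's dev tools while logged in; the hostname and record ID are the ones from above):

```python
import urllib.request

DASHBOARD_URL = "https://companyname.my.salesforce.com/01Z60000000ipt3"

def build_request(url, session_id):
    """Attach an existing Salesforce session cookie to a request.

    'sid' is assumed to be the session cookie name; the value comes from
    the browser's dev tools and expires, so this only suits quick tests.
    """
    return urllib.request.Request(
        url,
        headers={
            "Cookie": "sid=" + session_id,
            "User-Agent": "Mozilla/5.0",  # some sites reject Python's default UA
        },
    )

req = build_request(DASHBOARD_URL, "PASTE_SID_VALUE_HERE")
# urlopen() follows plain HTTP redirects; run this once a real sid is pasted in:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
#     print(resp.read(500).decode("utf-8", "replace"))
print(req.get_header("Cookie"))  # sid=PASTE_SID_VALUE_HERE
```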

I tried to follow: https://blog.hartleybrody.com/web-scraping-cheat-sheet/

Contrary to popular belief, you do not need any special tools to scrape websites that load their content via Javascript. In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere.
I am pretty stuck here. I tried so much that I can't even post everything here. I am sure there is a solution, but I don't see it.
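Following that advice, the angle I still want to try is watching the browser's Network tab while the dashboard loads, finding the request that actually carries the numbers (it is often JSON), and replaying that request with the session cookie. A sketch of the parsing half only - the payload structure below is entirely made up, and the real field names have to come from the recorded response:

```python
import json

def backup_count_from_payload(payload_text):
    """Pull the metric out of a JSON payload like the one the dashboard's
    own XHR request might return. The field names here are hypothetical;
    inspect the real response in the browser's Network tab."""
    data = json.loads(payload_text)
    return data["components"][0]["value"]

# Hypothetical payload standing in for the real XHR response
sample = '{"components": [{"label": "Backup", "value": 42}]}'
print(backup_count_from_payload(sample))  # 42
```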

All in all, my issue summarized is that no matter which script I use to access the webpage, it gets stuck at some JS or AJAX redirect that a browser can follow but my scripts can't, even when I use my cookies or my login directly.
