Ha! I have been asked about so very many scraping projects lately that I am amazed! Well, let’s explain… there is a lot to cover, so I hope you can keep up…
Websites display images by placing them on the page with <img> tags. The browser downloads each image separately and draws it on the screen. So you can see the images, and you can locate them if you right-click on one and view its info. An <img> tag looks something like <img src="imgs/beetle.jpg"> (a made-up example).
PHP’s copy() function copies one file at a time. Pointing it at a webpage will not copy the page’s images, mostly because the images do not exist inside the webpage at all; the page just contains <img> links pointing at where each image lives.
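If you want to prove this to yourself, here is a minimal sketch (using example.com as a stand-in address) that fetches a page and dumps the start of what comes back. You will see markup, not image bytes:
<?php
// Fetch a page the same way you might try to "copy" it...
$html = file_get_contents("https://example.com/");
// What comes back is HTML text, not image data, so the first
// few hundred characters are just tags and text
echo htmlspecialchars(substr($html, 0, 300));
?>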
There are several ways to scrape pictures out of webpages. You could pull them out of your browser’s cache, since they must be on your computer for you to view them, but that is extremely tricky. Instead, since you can capture the entire webpage using PHP’s file_get_contents() function, you can access all of the images in the page using the DOM functions. This is the easiest way to handle this type of scraping project.
Before I show you possible code for this, you should know that it can be illegal to copy images off of web pages without permission. Most websites have notices that state whether their content is copyrighted. If the images are copyrighted, you cannot legally copy or use them except inside the webpage itself. Okay, now I am covered…
The way to do this is to load the page into a variable, then process it with the DOM functions. (I attempted to view the page with the images on it, but you must be logged in to view it, so I could not.)
The DOM functions are XML-oriented. This means you can pull out all of the <img> tags and then, from each one, read its src attribute, which is the pointer to the image’s real location on the site. Finally, you can load each image and store it locally. This sounds like a lot of work, but it is actually quite easy in PHP. Here is a sample to get you started…
<?php
// Set your URL here...
$url="https://www.insectidentification.org/";
// Load the entire page
$html = file_get_contents($url);
// Create DOM object from the HTML
$doc = new DOMDocument();
// Load webpage into the DOM object
@$doc->loadHTML($html);
// Locate all of the image tags
$tags = $doc->getElementsByTagName('img');
// Loop thru them and display them
foreach ($tags as $tag) {
echo $tag->getAttribute('src')."<br>";
}
?>
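One note on the @ in front of loadHTML(): real-world HTML is almost never perfectly valid, and loadHTML() complains loudly about every little problem. The @ just hides those warnings. If you prefer not to use error suppression, one alternative (a sketch, using PHP’s built-in libxml functions) is to have libxml collect the complaints internally instead:
<?php
// Load the page as before
$html = file_get_contents("https://www.insectidentification.org/");
// Tell libxml to collect parse warnings instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
// Throw away the collected warnings
libxml_clear_errors();
?>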
It’s that simple. Now, run this test script as-is and you will see a problem: the image list is not fully formed. The links show that the insect images are stored in a folder on their server named “imgs”, but that is not the full site address. If you paste one of these links into a browser, nothing will show; you need the www.sitename.com/imgs part in front. (Just an example.) To fix that, you need to process each image src to make sure it is exactly formed as it should be. This type of link adjusting depends on the web page being scraped. In my example, the insect image links need adjusting: I pull out the base URL and add it to the front of each link. Here is that example:
<?php
// Set your URL here...
$url = "https://www.insectidentification.org/";
// Grab the base URL (the 9 skips past the "https://" so we find
// the first slash after the domain name)
$base_end = strpos($url, "/", 9);
// Keep everything up to, but not including, that slash so we do
// not end up with a double slash when we glue the links together
$base = substr($url, 0, $base_end);
echo "URL: " . $url . "<br>Base url: " . $base . "<br><br>";
// Load the entire page
$html = file_get_contents($url);
// Create DOM object from the HTML
$doc = new DOMDocument();
// Load webpage into the DOM object
@$doc->loadHTML($html);
// Locate all of the image tags
$tags = $doc->getElementsByTagName('img');
// Loop thru them and display them
foreach ($tags as $tag) {
    $link = $tag->getAttribute('src');
    // Adjust link if missing base at front
    if (substr($link, 0, 1) == "/") $link = $base . $link;
    echo $link . "<br>";
}
?>
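A quick side note: the strpos()/substr() trick above assumes the URL starts with “https://” and has a slash after the domain name. If you want something that works for any address you throw at it, here is a sketch using PHP’s built-in parse_url() function instead (base_url() is just a helper name I made up):
<?php
// A more general way to grab the base URL from any address
function base_url($url) {
    $parts = parse_url($url);
    // Rebuild just the scheme and host
    return $parts['scheme'] . "://" . $parts['host'];
}
echo base_url("https://www.insectidentification.org/some/page.php");
// Prints: https://www.insectidentification.org
?>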
Now you have a full list of correct image links, as you can see if you run this script, and each of them can be loaded and saved. To do that, you take the filename from the end of each link, load the image, and save it in a folder of your choosing. I used “images” in this final test:
<?php
// Set your URL here...
$url = "https://www.insectidentification.org/";
// Grab the base URL (again, the 9 skips past the "https://")
$base_end = strpos($url, "/", 9);
$base = substr($url, 0, $base_end);
echo "URL: " . $url . "<br>Base url: " . $base . "<br><br>";
// Load the entire page
$html = file_get_contents($url);
// Create DOM object from the HTML
$doc = new DOMDocument();
// Load webpage into the DOM object
@$doc->loadHTML($html);
// Locate all of the image tags
$tags = $doc->getElementsByTagName('img');
// Make sure the save folder exists before we write into it
if (!is_dir("images")) mkdir("images", 0755, true);
// Loop thru them, display them, and save each one
foreach ($tags as $tag) {
    $link = $tag->getAttribute('src');
    // Adjust link if missing base at front
    if (substr($link, 0, 1) == "/") $link = $base . $link;
    echo $link . "<br>";
    // Now pull out the filename for saving
    $filename_start = strrpos($link, "/") + 1;
    $filename = substr($link, $filename_start);
    echo " Filename: " . $filename . "<br>";
    // Load image (skip it if the download fails)
    $image = file_get_contents($link);
    if ($image === false) continue;
    // Save it in a folder on the server
    file_put_contents("images/" . $filename, $image);
}
?>
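One more thing to watch for: some sites tack a query string onto their image links (something like photo.jpg?v=2), which would end up in your saved filename. If the page you are scraping does that, the filename grabbing needs a small tweak. A sketch, using a made-up link:
<?php
// A made-up example link with a query string tacked on
$link = "https://www.example.com/imgs/photo.jpg?v=2";
// Drop everything from the "?" onward, then grab just the filename
$clean = strtok($link, "?");
$filename = basename($clean);
echo $filename;  // Prints: photo.jpg
?>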
As you now see, it is not complicated, but it is a process. You can skip the base-URL code if you are 100% sure the links are already well formed. Either way, you should be able to piece together a new version for your own uses. If you need the saved files on your own system, you would zip up the folder on the server and download it; that is quite easy to do also, but that is for another post… Good luck!