Ha! I have been asked about so very many scraping projects lately that I am amazed! Well, let’s explain… there is a lot to cover, so I hope you can keep up…
Websites display images by placing them on the page with <img> tags. The browser downloads each image separately and draws it on the screen. So you can see the images, and you can locate them if you right-click on one and view its info. An <img> tag looks something like <img src="imgs/beetle.jpg"> (a made-up example).
PHP’s copy() function copies one file at a time. Pointing it at a webpage will not copy the page’s images, mostly because the images do not exist inside the webpage at all; the page just contains <img> links pointing at where each image lives.
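If you want to prove this to yourself, here is a minimal sketch (using example.com as a stand-in address) that fetches a page and dumps the start of what comes back. You will see markup, not image bytes:
<?php
// Fetch a page the same way you might try to "copy" it...
$html = file_get_contents("https://example.com/");
// What comes back is HTML text, not image data, so the first
// few hundred characters are just tags and text
echo htmlspecialchars(substr($html, 0, 300));
?>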
There are several ways to scrape pictures out of webpages. You could pull them out of your browser’s cache, since they must be on your computer for you to view them, but that is extremely tricky. Instead, since you can capture the entire webpage using PHP’s file_get_contents() function, you can access all of the images in the page using the DOM functions. This is the easiest way to handle this type of scraping project.
Before I show you possible code for this, you should know that it can be illegal to copy images off of web pages without permission. Most websites have notices that state whether their content is copyrighted. If the images are copyrighted, you cannot legally copy or use them except inside the webpage itself. Okay, now I am covered…
The way to do this is to load the page into a variable, then process it with the DOM functions. (I attempted to view the page with the images on it, but you must be logged in to view it, so I could not.)
The DOM functions are XML-oriented. This means you can pull out all of the <img> tags and then, from each one, read its src attribute, which is the pointer to the image’s real location on the site. Finally, you can load each image and store it locally. This sounds like a lot of work, but it is actually quite easy in PHP. Here is a sample to get you started…
<?php
// Set your URL here...
$url="https://www.insectidentification.org/";
// Load the entire page
$html = file_get_contents($url);
// Create DOM object from the HTML
$doc = new DOMDocument();
// Load webpage into the DOM object
@$doc->loadHTML($html);
// Locate all of the image tags
$tags = $doc->getElementsByTagName('img');
// Loop thru them and display them
foreach ($tags as $tag) {
echo $tag->getAttribute('src')."<br>";
}
?>
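One note on the @ in front of loadHTML(): real-world HTML is almost never perfectly valid, and loadHTML() complains loudly about every little problem. The @ just hides those warnings. If you prefer not to use error suppression, one alternative (a sketch, using PHP’s built-in libxml functions) is to have libxml collect the complaints internally instead:
<?php
// Load the page as before
$html = file_get_contents("https://www.insectidentification.org/");
// Tell libxml to collect parse warnings instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
// Throw away the collected warnings
libxml_clear_errors();
?>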
It’s that simple. Now, run this test script as-is and you will see a problem: the image list is not fully formed. The links show that the insect images are stored in a folder on their server named “imgs”, but that is not the full site address. If you paste one of these links into a browser, nothing will show; you need the www.sitename.com/imgs part in front. (Just an example.) To fix that, you need to process each image src to make sure it is exactly formed as it should be. This type of link adjusting depends on the web page being scraped. In my example, the insect image links need adjusting: I pull out the base URL and add it to the front of each link. Here is that example:
<?php
// Set your URL here...
$url = "https://www.insectidentification.org/";
// Grab the base URL (the 9 skips past the "https://" so we find
// the first slash after the domain name)
$base_end = strpos($url, "/", 9);
// Keep everything up to, but not including, that slash so we do
// not end up with a double slash when we glue the links together
$base = substr($url, 0, $base_end);
echo "URL: " . $url . "<br>Base url: " . $base . "<br><br>";
// Load the entire page
$html = file_get_contents($url);
// Create DOM object from the HTML
$doc = new DOMDocument();
// Load webpage into the DOM object
@$doc->loadHTML($html);
// Locate all of the image tags
$tags = $doc->getElementsByTagName('img');
// Loop thru them and display them
foreach ($tags as $tag) {
    $link = $tag->getAttribute('src');
    // Adjust link if missing base at front
    if (substr($link, 0, 1) == "/") $link = $base . $link;
    echo $link . "<br>";
}
?>
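A quick side note: the strpos()/substr() trick above assumes the URL starts with “https://” and has a slash after the domain name. If you want something that works for any address you throw at it, here is a sketch using PHP’s built-in parse_url() function instead (base_url() is just a helper name I made up):
<?php
// A more general way to grab the base URL from any address
function base_url($url) {
    $parts = parse_url($url);
    // Rebuild just the scheme and host
    return $parts['scheme'] . "://" . $parts['host'];
}
echo base_url("https://www.insectidentification.org/some/page.php");
// Prints: https://www.insectidentification.org
?>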
Now you have a full list of correct image links, as you can see if you run this script, and each of them can be loaded and saved. To do that, you take the filename from the end of each link, load the image, and save it in a folder of your choosing. I used “images” in this final test:
<?php
// Set your URL here...
$url = "https://www.insectidentification.org/";
// Grab the base URL (again, the 9 skips past the "https://")
$base_end = strpos($url, "/", 9);
$base = substr($url, 0, $base_end);
echo "URL: " . $url . "<br>Base url: " . $base . "<br><br>";
// Load the entire page
$html = file_get_contents($url);
// Create DOM object from the HTML
$doc = new DOMDocument();
// Load webpage into the DOM object
@$doc->loadHTML($html);
// Locate all of the image tags
$tags = $doc->getElementsByTagName('img');
// Make sure the save folder exists before we write into it
if (!is_dir("images")) mkdir("images", 0755, true);
// Loop thru them, display them, and save each one
foreach ($tags as $tag) {
    $link = $tag->getAttribute('src');
    // Adjust link if missing base at front
    if (substr($link, 0, 1) == "/") $link = $base . $link;
    echo $link . "<br>";
    // Now pull out the filename for saving
    $filename_start = strrpos($link, "/") + 1;
    $filename = substr($link, $filename_start);
    echo " Filename: " . $filename . "<br>";
    // Load image (skip it if the download fails)
    $image = file_get_contents($link);
    if ($image === false) continue;
    // Save it in a folder on the server
    file_put_contents("images/" . $filename, $image);
}
?>
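One more thing to watch for: some sites tack a query string onto their image links (something like photo.jpg?v=2), which would end up in your saved filename. If the page you are scraping does that, the filename grabbing needs a small tweak. A sketch, using a made-up link:
<?php
// A made-up example link with a query string tacked on
$link = "https://www.example.com/imgs/photo.jpg?v=2";
// Drop everything from the "?" onward, then grab just the filename
$clean = strtok($link, "?");
$filename = basename($clean);
echo $filename;  // Prints: photo.jpg
?>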
As you now see, it is not complicated, but it is a process. You can skip the base-URL code if you are 100% sure the links are already well formed. Either way, you should be able to piece together a new version for your own uses. If you need the saved files on your own system, you would zip up the folder on the server and download it; that is quite easy to do also, but that is for another post… Good luck!