Need help writing a script to automate something


#1

Hello everyone. I used to tinker with PHP and HTML but have been away from that for years.

I have a simple task that requires editing an HTML file, and I would like to automate it, but have no clue where to begin.

Just downloaded the free text-editor called Atom, and am thinking that might help, but again, not sure where to begin.

I realize this is a PHP forum, but I suspect that what I am trying to do would be a cakewalk for all of you PHP experts!

If someone could help me figure out how to tackle things, that would be appreciated.


#2

For an IDE, in other words a program editor, you might want to look at NetBeans. It is free, and you can download the PHP version, which works well. It points out PHP and HTML errors to you, so it helps with edits.

Next, PHP is built for working with text and variables. If you have data acquired from anywhere, such as another website’s HTML page or your MySQL database, you can display it in another HTML page. It is quite easy to do.

It does depend on how you want that data to be displayed. Sometimes it is just a name that you want shown. Or sometimes it is a table of data. All is easily done in PHP. You just need to help us by telling us what your data is and how you want it displayed.

We do not care about the “content” of your data, just what type of data it is and how you want to display it. So, if it is the name of the logged-in user, and it is stored in a variable like $username, then you can display it in HTML something like this:

 <h2>Welcome back <?php echo $username; ?>!</h2>

This will display Welcome back ErnieAlex! on the page.
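One small addition worth making to that snippet (my suggestion, not something the post above requires): if $username ever comes from user input, escape it with htmlspecialchars() before echoing, so stray markup in the name cannot break the page.

```php
<?php
// Escape user-supplied text before dropping it into HTML output.
$username = 'ErnieAlex';
$safe = htmlspecialchars($username, ENT_QUOTES, 'UTF-8');
echo "<h2>Welcome back {$safe}!</h2>";
// → <h2>Welcome back ErnieAlex!</h2>
```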

Not sure if that is what you are looking for, but it starts you off…


#3

Please add some info on what you’re trying to achieve. A sample of the code with a before/after would be helpful. Does it change the HTML structure or is it just changing text/values?


#4

Hi Ernie,

Funny you should mention NetBeans, because that is what I used to do PHP development in! :grin:

Let me explain what I am trying to do, because people probably need to know that in order to help me more.

Whenever I try to save web pages from online (e.g. an article, a how-to guide, etc.), it seems like the JavaScript always breaks things. Some of the news websites in particular are a real PITA in how their JavaScript breaks an otherwise good saved page.

So I have learned that if I comment out all the JavaScript in the page source, it pretty much addresses the issues I have been having.

After discovering this solution, when I find an article I want to save, I view the page source in my browser, select all, and copy and paste it into an HTML file in NetBeans. From there, I do a “Find & Replace” on all the JavaScript tags and insert HTML comments, thus turning off the JavaScript. Then I save this HTML file and name it by the article title.

The problem is that, while powerful, NetBeans is overkill for this procedure. Plus, it makes me create the HTML file inside a NetBeans project, and then I have to go into Finder and move the edited HTML file, and all of this is a real hassle.

Here is what I ideally would like…

Let’s say I am reading an article in the New York Times about Cryptocurrencies and I want to save it for future reference.

As a part 1 solution, I would like a way to open up a simple text editor like Atom, paste the web page’s source code into a blank file, [b]click a button which runs a Find & Replace to swap out JavaScript tags with HTML comment tags[/b], then save the file into my “Crypto Articles” folder on my hard drive.

So basically I want to write a simple “macro” in Atom (or a similar simple, free editor) that automates the Find & Replace process down to one click.

(I have some ideas to make things fancier in a Part 2 solution, but if I can get Part 1 working that would be a huge help.)

Does that make sense?


#5

LOL, well, what you are talking about is really called “scraping”. You scrape a webpage and save it after you adjust it the way you want, such as by removing scripts.

This can be done with the simple text-replacement functions built into PHP. But you cannot do it easily the way you are discussing; for that, you would need to create a macro. You can Google “NetBeans macros” and learn how to do it, but it would be tricky.

It might be best to just write a simple script to do this instead. I have tons of scraping experience; it is easy. You would need one input for the website you want to save. You would enter the URL and run a simple script to load that page into a variable. Then, have it locate all of the JS scripts and remove them. And, lastly, have it save the page to a folder or force it to download to you. This can be done in just a few lines of code.

If you want to do this, I can help you. Quickly a few ideas to start you off…

To load an entire website page into a variable…
$url = 'https://www.saranaclake.com';
$webpage = file_get_contents($url);

That’s it! How easy can that be? You would of course need to input the webpage and use that instead.
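One caveat to add here (my note, not from the post): file_get_contents() returns false when the fetch fails, so it is worth checking that before doing anything else. A small wrapper sketch, with a hypothetical load_page() name:

```php
<?php
// Load a URL (or any readable path) into a string, failing loudly on error.
function load_page(string $url): string
{
    $page = @file_get_contents($url);   // "@" hides the warning; we check ourselves
    if ($page === false) {
        throw new RuntimeException("Could not load {$url}");
    }
    return $page;
}
```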

To remove all scripts you just run a short loop removing ALL of them. There are more complicated ways to do this, such as using a DOM library, but you can do it with simple PHP commands like the ones below, which remove one script tag. You would need to loop to remove all of them, as a lot of webpages have more than one set of scripts. Quite often they have JavaScript, jQuery scripts, and others.

$script_start_pos = strpos($webpage, "<script");
$script_end_pos = strpos($webpage, "</script>", $script_start_pos);
$webpage = substr($webpage, 0, $script_start_pos) . substr($webpage, $script_end_pos+9);

This will locate the start of the first script tag, then locate the end of that tag, then remove it. It is removed by taking the page up to the start of the tag and adding the page after the end of the tag.
( Just off the top of my head, not really tested. ) Hope this gives you a starting place…

You could write a PHP script to do this for all files in a folder. It would basically use the glob() function to go through each filename in the folder and then remove the scripts from each of them.
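A rough sketch of that batch idea, assuming the saved pages sit in a folder named articles/ (the folder name, and the strip_page_scripts() helper wrapping the earlier removal loop, are my own placeholders):

```php
<?php
// Remove every <script ...>...</script> block from one page string.
function strip_page_scripts(string $webpage): string
{
    while (($start = strpos($webpage, '<script')) !== false) {
        $end = strpos($webpage, '</script>', $start);
        if ($end === false) {
            break;  // unmatched tag: stop rather than loop forever
        }
        $webpage = substr($webpage, 0, $start) . substr($webpage, $end + 9);
    }
    return $webpage;
}

// Run it over every .html file in the folder, rewriting each one in place.
foreach (glob('articles/*.html') as $file) {
    file_put_contents($file, strip_page_scripts(file_get_contents($file)));
}
```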

But if you want to do it with macros, there are a lot of ways to do that. I think in Atom you can set up a macro that uses a regular-expression replace (the equivalent of PHP’s preg_replace() function) to remove everything between the tags. I do not use Atom, so I am not really sure how to code it. You might ask on the Atom forum about how to handle it.
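For what it is worth, the regex idea can be sketched in PHP itself; a single preg_replace() call drops every script block. The pattern below is my own assumption and will miss pathological markup (e.g. a literal "&lt;/script&gt;" inside a string):

```php
<?php
// Remove every <script ...>...</script> block in one pass.
// "i" = case-insensitive, "s" = let "." match newlines, ".*?" = lazy match
// so each script block is matched (and removed) separately.
function strip_scripts_regex(string $webpage): string
{
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $webpage);
}

echo strip_scripts_regex('<body><script>alert(1);</script><p>Hi</p></body>');
// → <body><p>Hi</p></body>
```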


#6

Ernie,

Looks like I came to the right person - or more like the right person found me!

Do you think for what I want to do it would be easier to do this in Atom or PHP?

I just want something handy that, as I read articles on the Internet, will allow me to save them in a usable format for later. So think of what I am asking for as a handy utility.


#7

Here is an example of a nearly ideal “workflow” that I am looking for…

  • I am on the Internet surfing, and come across an excellent article that I want to save for reference.
  • I launch my new mini article “utility”
  • There is a form with the following: Article Name field, URL field, Save To field, and Process button
  • I copy & paste the article title (or type my own) into the “Article Name” field.
  • I copy & paste the URL into the “URL” field.
  • I click on a “Browse” button and navigate to where I want to save this article.
  • I click the “Process” button and it grabs all source code for the web page I was reading, strips out any Javascript, and then saves things as a new .html file with the name and location specified above.

The end goal is to automate this enough so it is a few clicks, versus me having to save the .html file and then edit it in a text editor, or having to fiddle with NetBeans or whatever else.

Does that make sense?

What is the best way to accomplish that?


#8

After dinner (which is nearly ready), I can create a simple PHP script for this. It is really extremely simple to do.

There is no need at all to use copy and paste for any HTML. You can do it using just the URL and have the script handle it all for you. I will post an example after I eat. Should not take long to drum something up for you!


#9

What’s for supper?? =)


#10

Sorry, got too many phone calls. But I just created a test script for you. I ran it on a few sites, and it removes the scripts just fine. Now, remember that if you save webpages from other locations, they may not display 100% accurately if the JavaScript was loading displayable text for you.

But, this is one way to strip out tags. It runs a small loop to locate them and, if found, removes them. Then, it forces a download so you have the file. You will have to add in some code to handle naming the new file if you wish. I just stuck in a temp name…

Hope this works for you…

<?php
if (isset($_POST["submit"])) {
    //  Get user's URL from form and load into a variable
    $url = filter_input(INPUT_POST, "url");
    $webpage = file_get_contents($url);
    //  Strip out all <script> tags
    while (($script_start_pos = strpos($webpage, "<script"))!== false) {
        $script_end_pos = strpos($webpage, "</script>");
        $webpage = substr($webpage, 0, $script_start_pos) . substr($webpage, $script_end_pos+9);
    }
    header('Content-Disposition: attachment; filename="scraped_file.html"');
    header('Content-Type: text/html');
    header('Content-Length: ' . strlen($webpage));
    header('Connection: close');
    echo $webpage;
}
?>
<!DOCTYPE html>
<html>
<body>

<h3>Enter your URL and press Download, and the page will download without scripts...</h3>

<form action="#" method="post">
    <div>
        <input type="text" name="url" placeholder="Enter the URL here!">
        <button type="submit" name="submit" value="submit">Download Your URL</button>
    </div>
</form>
</body>
</html>

Just copy this into a file and name it something like scraper_test.php. Post it to your server and test it.
Go to the server and the file, and it will ask you for a URL to remove the scripts from… Good luck!


#11

Chicken Cordon Bleu… Was yummy!


#12

Ernie,

I tried your script but when I open scraped_file.html it just reloads your PHP script…


#13

Really? I will look at the code again… Right now


#14

Also, I looked at your code and I couldn’t follow how your While loop works. (Honestly, the logic seems flakey.)

It doesn’t help that I haven’t done any coding in years. All of this feels like Greek to me.


#15

Sorry, I missed a line. You need to echo the file… Just echo $webpage; Look at the code above.
I did an edit and added in that one line right after the header calls…


#16

Can you help me understand this code…

    $script_start_pos = 0;
    $script_end_pos = 0;
    while (($script_start_pos = strpos($webpage, "<script"))!== false) {
        $script_end_pos = strpos($webpage, "</script>");
        $webpage = substr($webpage, 0, $script_start_pos) . substr($webpage, $script_end_pos+9);
    }

#17
Sorry, I wrote this several times, and the first two lines zeroing out the values are not needed now. So, just the while loop is needed… I will explain that part…

The while clause will keep looping WHILE the comparison inside the clause is not false. The comparison locates the <script> tag in the webpage. But, since some script tags might have attributes in them, or classes or other text, you cannot search for the full tag; you search for just the "<script" part instead.

If the value is false, this means that there are no <script> tags left and we are done removing them.

If there is a <script> found, we need to remove that section. To do that, we need to locate the end of the script tags. So, we do a second strpos to locate the ending. We scan from the start again and locate the ending tag, which is always </script>… Once found, we know the start and end positions of the tags. This gives us a way to break the page into two sections. We can take the first section, which is everything up to the start of the opening tag ( <script> ), and then add to that the ending section starting from the end of the script. This needs adjusting by 9 chars, which is the length of the </script> tag. ( strpos gives you the starting point of the </script> code, but we want the second part to start after that point. ) Using the first and last pieces of the page basically removes the script's code.

Does that make sense to you?   If not, let's say this is the entire page...
<html><body><script>blah blah blah</script></body></html>
The first strpos would locate the <script part.  So, substr($webpage, 0, $script_start_pos) would give us:
<html><body>
The second strpos would locate the </script> part, so substr($webpage, $script_end_pos+9) would give us:
</body></html>
And, using the  " . " for these would give us   <html><body></body></html>   which is what we want.
If there were two scripts in it, it would just repeat this as many times as needed. Some websites have one type of script at the top of the page and others at the end. This depends on what they are needed for and the order they need to be processed. So, the WHILE loop finds them all and drops them.
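The worked example above can be typed in and run as-is to check the arithmetic:

```php
<?php
$webpage = '<html><body><script>blah blah blah</script></body></html>';

$script_start_pos = strpos($webpage, '<script');    // 12
$script_end_pos   = strpos($webpage, '</script>');  // 34

// Keep everything before "<script", plus everything after "</script>" (9 chars).
$webpage = substr($webpage, 0, $script_start_pos)
         . substr($webpage, $script_end_pos + 9);

echo $webpage;  // <html><body></body></html>
```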

Hope that helps!

#18

Ernie,

You made me crack open the PHP Manual for the first time in years?! :open_mouth:

After looking closer at your WHILE loop and making some tweaks, I think it makes sense.

Here is what I came up with…

<?php
if (isset($_POST["submit"])) {
    //  Get user's URL from form and load into a variable
    $url = filter_input(INPUT_POST, "url");
    $webpage = file_get_contents($url);
    
    
    //  Strip out all <script> tags
    $script_start_pos = 0;
    $script_end_pos = 0;
    
    //123456789_123456789_123456789_123456789_123456789_
    //                   20                  40
    //This is some sample<SCRIPT>xxxxxxxxxxxx</SCRIPT>code to demonstrate...
    while ((strpos($webpage, "<script"))!== FALSE) {
      $script_start_pos = strpos($webpage, "<script");    //20
      $script_end_pos = strpos($webpage, "</script>");    //40
      //This is some sample<SCRIPT>xxxxxxxxxxxx</SCRIPT>code to demonstrate...
      //123456789_123456789_123456789_123456789_123456789_
      //                   20                  40
      //*** Below is the text that the above two substr calls pull out. ***
      //This is some sample
      //                                                code to demonstrate...
      //      
      $webpage = substr($webpage, 0, $script_start_pos) . ' ' . substr($webpage, $script_end_pos+9);
    }
    
    /*
    // ORIG CODE
    //  Strip out all <script> tags
    $script_start_pos = 0;
    $script_end_pos = 0;
    while (($script_start_pos = strpos($webpage, "<script"))!== false) {
        $script_end_pos = strpos($webpage, "</script>");
        $webpage = substr($webpage, 0, $script_start_pos) . substr($webpage, $script_end_pos+9);
    }
    */

    
    //  I do not understand what the following lines are doing?!
    header('Content-Disposition: attachment; filename="scraped_file.html"');
    header('Content-Type: text/html');
    header('Content-Length: ' . strlen($webpage));
    header('Connection: close');
    echo $webpage;
}
?>
<!DOCTYPE html>
<html>
<body>

<h3>Enter your URL and press Download, and the page will download without scripts...</h3>

<form action="#" method="post">
    <div>
        <input type="text" name="url" placeholder="Enter the URL here!">
        <button type="submit" name="submit" value="submit">Download Your URL</button>
    </div>
</form>
</body>
</html>

See my question above in the modified code.


#19

These two lines are no longer needed. I edited them out in my post… They were from a previous test for you. Yes, the comments you added do explain how it all works. I have done a lot of scraping code, and I have found that a lot of people do not understand it clearly. The most important part is to remember that a webpage is only a string of text. Nothing more. It does point to images and outside items, but it is still just text. So, these kinds of routines do work; they just have to be coded exactly…

Now, the part you did not understand is a bit hard to explain… Let’s see if I can do it for you…

All webpages are sent to the browser in a format that most people never see. First, a header is sent out. Not a webpage header, but a communication header. The header tells the browser what is being sent to it. In our code, we need to tell the browser that we are sending it an attachment. We told it the filename of the attachment and the file type. When you do this, you should also tell the browser how many bytes it is. Then, we close the attachment’s connection, which closes the header. At this point, the browser knows what we are sending it, and it treats the content sent to it as a downloadable file. Then, the <!DOCTYPE html> and the code after it start another header for the page itself.

There are hundreds of ways to make a file download. I use it often when I am creating a PDF or Excel sheet in PHP and then force it to download so the user can save and use it. For those, the code is different, as they are other types of files. The type might be application/pdf or others as needed…

If you force a script to download a file, you need to tell the browser everything it needs to process it.
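As one more hedged example of the same pattern (the filename and data here are made up): forcing a small generated CSV to download only changes the Content-Type and filename; the header calls stay the same.

```php
<?php
// Build a tiny CSV in memory, then tell the browser it is a downloadable file.
$csv = "title,url\n\"Some article\",https://example.com/article\n";

header('Content-Disposition: attachment; filename="articles.csv"');
header('Content-Type: text/csv');              // the browser saves instead of rendering
header('Content-Length: ' . strlen($csv));     // size in bytes
echo $csv;
```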

Hope that helps! If you have more questions, ask away, but it’s 11:30 PM here, so I am not sure how long I will be up to check in again…


#20

Ernie,

I am going back and running this script on all of the web pages that I saved over the last few months, trying to get clean copies. I am also doing lots of testing to make sure they actually work. One problem I was running into is that web pages I saved showed up on my HDD with a file size, yet later, when I opened them, they would literally disappear before my eyes and get deleted. As you can imagine, I am very untrusting these days, but it looks like your script will fix things for me, so thank you!!