Need help writing a script to automate something


#21

I do not understand why your files would disappear. But, it might be that you were loading files that had strange code in them that was causing other issues.

As you can see from how little code we now have working, you can do a lot if you just think out the logic ahead of time. We only use about a dozen lines of code to do this entire process. I have code that scrapes things off of sites, like NFL scores and NFL schedules for my football pool. I have pulled weather reports out for others who needed just a certain area. Scraping is hard sometimes, but for something like this, it is easy enough… Let us know the results of your testing… We can help if it does not work!


#22

It’s because the Internet is full of crap-coded pages and way too much JavaScript!

In all my years working with computers, I have never seen a case where you save a file to your hdd, it has a file size, and then later when you open it the file disappears, or the page opens but when you quit out of your browser the .html file disappears.

At any rate, I now have a decent solution to strip out all the JavaScript that causes issues when I try to read saved web pages.
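
For anyone following along, here is a minimal sketch of the idea in Python. It is not the exact script from this thread, and the name `strip_scripts` is made up for illustration. It simply removes every `<script>...</script>` block with a regular expression, which is crude but is often enough for saved pages; inline event handlers such as `onclick` attributes are left alone.

```python
import re

# Matches a whole <script ...> ... </script> block, across lines and in any case.
# A regex is a blunt tool for HTML; this is an assumption that the saved pages
# have ordinary, well-closed script tags.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html: str) -> str:
    """Return the HTML with all <script> blocks removed."""
    return SCRIPT_BLOCK.sub("", html)


if __name__ == "__main__":
    sample = '<p>Hi</p><script src="a.js"></script><p>Bye</p>'
    print(strip_scripts(sample))
```

If a page still misbehaves after this, a real HTML parser would be the safer next step.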

I added an “Article name” field to this page, so that when your script runs, the filename is already named to the article name which saves me a few key strokes.
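
Turning the article name into a filename can be sketched roughly like this. The function name `article_filename` and the lowercase, hyphenated style are assumptions for illustration, not what the actual script does.

```python
import re


def article_filename(article_name: str, extension: str = ".html") -> str:
    """Build a filesystem-friendly filename from a free-text article name."""
    # Collapse every run of characters other than letters and digits into one hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "-", article_name).strip("-").lower()
    # Fall back to a placeholder if nothing usable is left.
    return (cleaned or "untitled") + extension


if __name__ == "__main__":
    print(article_filename("How China Walled Off the Internet"))
```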

I’m sure there is more that I can do to pretty things up, but for now I think this solution will do.

Now I have about 200 web pages I have to resave… :rage:

But THANK YOU for all of your help, and for saving me lots of wasted time!! (I was supposed to spend today getting my head back into a coding project from a few years ago, but I needed to get this fixed first because every day I save 5-10 articles and I am soooo backed up right now!!)

If you are willing, I definitely could use some more help as I get back into web development.

Never thought I’d see the day where I would even forget HTML coding, but my mind is getting old and brittle!!!

Hopefully I can get back to where I was 3-4 years ago, when I think I was pretty good at things.

Thanks again!! :slight_smile:


#23

You know, I was thinking about the files disappearing… It could be because it was saved as a temporary file.
Those are erased after use. But, I would have to see the code you used for it. But, kind of not important now, I guess…

No problem! Glad to help you! And, there are several others on this site that are aces, a few much better than me. So, when you get your next issue, create a new post and one of us will help you. I am here a lot as I just love helping everyone with their programming puzzles…


#24

Ernie,

[quote=“ErnieAlex, post:23, topic:27724, full:true”]
You know, I was thinking about the files disappearing… It could be because it was saved as a temporary file. Those are erased after use. But, I would have to see the code you used for it. But, kind of not important now, I guess…[/quote]

No…

I went into Firefox and surfed to a web page, decided to save it, chose File > Save Page As, gave it a name, and saved it to my computer. Then I went to Finder, and the file was there with a file size. There was a my-article-name.html file and a my-article-name_files folder. I would double-click on the .html file and either it would instantly disappear, or the page would open okay, but once I quit out of Firefox the .html would disappear before my eyes.

Either way, I have a better solution now as the files don’t seem to be disappearing, PLUS I am now stripping out JavaScript that was breaking a lot of these saved web pages.

You know, I am listening to some good rock-n-roll now late at night, working away on my code, and forgot how much I missed doing this stuff on late Saturday nights!

It will take me some time to get my head back into LAMP, but the good thing is that I have a kick-ass code base I wrote several years ago, plus I document the s*** out of my code, so that means I have a super-duper guide to getting my head back into LAMP and my website!!

At the same time, my brain is getting old and brittle, and it would be nice to have some coding buddies online who are happy to help. (Sadly, a lot of people on these online coding forums can get rather nasty if you ask for help, but clearly you are of a different mold, and that is good to know!)

:slight_smile:


#25

If this is only for reading later I strongly suggest you just use one of the tools available for it. They do a much better job at cleaning up the source for later reading, and there are apps and browser extensions for them so you can just click one button to save pages for later.

Raindrop is a very nice service for this

Or if you want to host one yourself I’d suggest giving wallabag a go


#26

JimL,

Thanks for the suggestion, but here is the issue… I am trying to get accurate local copies so I can use them for future research. Not saying you, but people who think you simply bookmark what you like and it will be there in the future are naive.

The Internet has become all about $$$, and companies are not motivated to leave knowledge out there without paying a price.

I see information come and go on a daily basis, so I do what I can to save the original so it is there when I need it a week, a month, or years later.

Until recently, a simple File > Save Page As worked, but for some sites like the New York Times, forget about it!!

Using a combination of different browsers, different approaches, saving as HTML, web archives, PDF and screenshots, I think I have a good enough approach for my needs, but it certainly is much harder than in the past.

And that is too bad, because I also try to save copies of things for “inspiration” - the New York Times for some gorgeous news articles online, but unfortunately they are a real bit to store locally without an enormous amount of work!


#27

@ErnieAlex,

If you are looking for a challenge, I would love to see if you can find a practical way to save an article like this without losing any of the images or web design…

NYT: How China Walled Off the Internet

In order to truly benefit from that article, you need to see all of the images and graphics and at least know that there are videos as well.

I used Firefox’s built-in screen-capture and it appeared to capture a WYSIWYG copy as far as layout goes, but then you have the issue of the text being unselectable.

That is my end goal for anything I try and capture - I want the flexibility and readability of a PDF with the pixel-perfect capturing of the web design like a photograph.

Definitely a tall order on the modern Internet, but then again, maybe an expert screen-scraper like you knows of some tricks to accomplish this?


#28

Wallabag and the other “read later” apps I’ve tried do save a local copy. If you run Wallabag locally (I have an unraid server I run lots of dockerized apps on) you get a local database of all the content which you can back up.

Wallabag does NOT give you the entire source though by default. So it might not fit your use case. But I’d expect some of the “read later” apps to.


#29

Is that true for popular apps or whatever like Pocket?

I just assumed that all they did was bookmark what you wanted to read on a server, but that if someone like the NYT ever yanked an article then you would have a dead link.

Would be interested to see what you experience with a complex web page like what I referenced above and here again: NYT: How China Walled Off the Internet

So how do you quantify what you do and don’t get?

It’s not that hard for me to find browser plug-ins that strip out ads and just give you readable text. But as mentioned before, oftentimes what I want to save includes lots of photos and infographics that further explain what you are reading about. Without that supporting material, the article falls flat.

Think about famous magazines like Time and Life and how “a picture is worth a thousand words”. (It was one thing to read about Vietnam, and quite another to watch a man shot in the head at point blank range in a photograph, right?)

But thanks for sharing your thoughts! :slight_smile:


#30

My intentions are just to “read later” so I seem to have way more leeway than you who appear to be more datahoarder/archiver driven.

But let’s give it a spin to see what it comes up with!

Screenshot_642

Screenshot_643


Database entry:

    {
    "content": "<header class=\"story-header interactive... shortened because forum",
    "created_at": "2018-12-02 20:45:21",
    "domain_name": "www.nytimes.com",
    "headers": "a:24:{s:10:\"connection\";s:10:\"keep-alive\";s:14:\"content-length\";s:6:\"137925\";s:6:\"server\";s:5:\"nginx\";s:12:\"content-type\";s:24:\"text/html; charset=utf-8\";s:13:\"cache-control\";s:8:\"no-cache\";s:24:\"x-nyt-data-last-modified\";s:29:\"Sun, 02 Dec 2018 20:41:54 GMT\";s:13:\"last-modified\";s:29:\"Sun, 02 Dec 2018 20:41:54 GMT\";s:10:\"x-pagetype\";s:19:\"vi-interactive-nyt5\";s:18:\"x-vi-compatibility\";s:10:\"Compatible\";s:11:\"x-nyt-route\";s:14:\"vi-interactive\";s:13:\"x-origin-time\";s:23:\"2018-12-02 20:41:54 UTC\";s:13:\"accept-ranges\";s:5:\"bytes\";s:4:\"date\";s:29:\"Sun, 02 Dec 2018 20:45:21 GMT\";s:3:\"age\";s:3:\"207\";s:11:\"x-served-by\";s:36:\"cache-jfk8128-JFK, cache-bma1637-BMA\";s:7:\"x-cache\";s:9:\"MISS, HIT\";s:12:\"x-cache-hits\";s:4:\"0, 1\";s:7:\"x-timer\";s:26:\"S1543783521.362322,VS0,VE1\";s:4:\"vary\";s:27:\"Accept-Encoding, Fastly-SSL\";s:10:\"set-cookie\";s:176:\"nyt-a=gUCjbdjEfHKYFKjlTJc56r; Expires=Mon, 02 Dec 2019 20:45:21 GMT; Path=/; Domain=.nytimes.com, nyt-gdpr=1; Expires=Mon, 03 Dec 2018 02:45:21 GMT; Path=/; Domain=.nytimes.com\";s:15:\"x-frame-options\";s:4:\"DENY\";s:6:\"x-gdpr\";s:1:\"1\";s:13:\"x-api-version\";s:6:\"F-F-VI\";s:23:\"content-security-policy\";s:356:\"default-src data: 'unsafe-inline' 'unsafe-eval' https:; script-src data: 'unsafe-inline' 'unsafe-eval' https: blob:; style-src data: 'unsafe-inline' https:; img-src data: https: blob:; font-src data: https:; connect-src https: wss: blob:; media-src https: blob:; object-src https:; child-src https: data: blob:; form-action https:; block-all-mixed-content;\";}",
    "http_status": "200",
    "id": "3",
    "is_archived": "0",
    "is_starred": "0",
    "language": "",
    "mimetype": "text/html",
    "origin_url": "",
    "preview_picture": "https://my-wallabag-url/assets/images/0/2/020b8668/4466c846.png",
    "published_at": "2018-11-18 11:00:01",
    "published_by": "N;",
    "reading_time": "5",
    "starred_at": "",
    "title": "How China Walled Off the Internet",
    "uid": "",
    "updated_at": "2018-12-02 20:45:31",
    "url": "https://www.nytimes.com/interactive/2018/11/18/world/asia/china-internet.html",
    "user_id": "1"
    }
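
A side note on the `headers` field: it looks like PHP `serialize()` output (`a:24:{s:10:"connection";...}`), and `"N;"` in `published_by` is a serialized null. If you want to read it outside PHP, here is a rough Python sketch that handles only the pieces seen above (arrays, strings, integers and null). The function names are hypothetical and this is not a full unserializer.

```python
def php_unserialize(data: bytes):
    """Decode a small subset of PHP serialize() output: a, s, i and N."""
    value, _ = _parse(data, 0)
    return value


def _parse(data: bytes, pos: int):
    kind = data[pos:pos + 1]
    if kind == b"N":  # N;
        return None, pos + 2
    if kind == b"i":  # i:123;
        end = data.index(b";", pos)
        return int(data[pos + 2:end]), end + 1
    if kind == b"s":  # s:5:"nginx";  (the length is counted in bytes)
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2  # skip the colon and the opening quote
        text = data[start:start + length].decode("utf-8")
        return text, start + length + 2  # skip the closing quote and semicolon
    if kind == b"a":  # a:2:{key value key value}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2  # skip the colon and the opening brace
        result = {}
        for _ in range(count):
            key, pos = _parse(data, pos)
            val, pos = _parse(data, pos)
            result[key] = val
        return result, pos + 1  # skip the closing brace
    raise ValueError(f"unsupported type marker {kind!r} at offset {pos}")


if __name__ == "__main__":
    sample = b'a:2:{s:6:"server";s:5:"nginx";s:3:"age";s:3:"207";}'
    print(php_unserialize(sample))
```

Feed it the `headers` string encoded as UTF-8 bytes (after JSON decoding, so the backslash escapes are gone) and you should get a plain dictionary of header names to values.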

Content block added to Pastebin because of the forum's 30k character limit:

https://pastebin.com/SQGHmfbL


Images saved:

total 2.6M
-rw-r--r-- 1 65534 65534  36K Dec  2 21:45 015b7930.jpeg
-rw-r--r-- 1 65534 65534 109K Dec  2 21:45 0b93bcd4.gif
-rw-r--r-- 1 65534 65534  34K Dec  2 21:45 10a4da14.jpeg
-rw-r--r-- 1 65534 65534 753K Dec  2 21:45 1f756bf0.png
-rw-r--r-- 1 65534 65534 3.9K Dec  2 21:45 21ce1e63.gif
-rw-r--r-- 1 65534 65534 3.9K Dec  2 21:45 31a7b1a0.png
-rw-r--r-- 1 65534 65534 2.5K Dec  2 21:45 35d84db2.png
-rw-r--r-- 1 65534 65534 2.5K Dec  2 21:45 3743bc12.png
-rw-r--r-- 1 65534 65534  33K Dec  2 21:45 3f8f0d04.jpeg
-rw-r--r-- 1 65534 65534 753K Dec  2 21:45 4466c846.png
-rw-r--r-- 1 65534 65534  146 Dec  2 21:45 4ac6761d.png
-rw-r--r-- 1 65534 65534  95K Dec  2 21:45 5703eea6.png
-rw-r--r-- 1 65534 65534 3.0K Dec  2 21:45 58369b70.png
-rw-r--r-- 1 65534 65534  36K Dec  2 21:45 5e23ccae.jpeg
-rw-r--r-- 1 65534 65534 109K Dec  2 21:45 5f2d5382.gif
-rw-r--r-- 1 65534 65534 3.9K Dec  2 21:45 65030993.gif
-rw-r--r-- 1 65534 65534 102K Dec  2 21:45 6873db3c.jpeg
-rw-r--r-- 1 65534 65534 106K Dec  2 21:45 6ab2e857.png
-rw-r--r-- 1 65534 65534  146 Dec  2 21:45 830e9993.png
-rw-r--r-- 1 65534 65534 3.0K Dec  2 21:45 89008781.png
-rw-r--r-- 1 65534 65534 3.9K Dec  2 21:45 9beab66b.png
-rw-r--r-- 1 65534 65534 102K Dec  2 21:45 ac03ac03.jpeg
-rw-r--r-- 1 65534 65534  95K Dec  2 21:45 af694d63.png
-rw-r--r-- 1 65534 65534 106K Dec  2 21:45 bc77fdab.png
-rw-r--r-- 1 65534 65534  33K Dec  2 21:45 c6670354.jpeg
-rw-r--r-- 1 65534 65534  34K Dec  2 21:45 e007ea01.jpeg

#31

It’s definitely not perfect though

Screenshot_644

vs

Screenshot_645