How to check a large list of indexed URLs in a search engine using a Java program

Hi Guys,
I am trying to check thousands of URLs against a search engine index to see whether they exist or not. I need a Java program that can send these URLs one at a time to the search engine and record whether each one exists. I have already asked this question on some Java question-and-answer forums, but I am still waiting for a complete solution.
Is there someone here who would be willing to spend some time helping me solve this problem?

Thanks

Does this really need to be in Java? There are many ways to do this in PHP, and I would guess that Java
is similar. In PHP you can use get_headers() or cURL to check whether a URL responds, so you can see
if there is a real page at the URL. Website “scraping” is a common topic on the web.

There are thousands of sites that have free Java code to do this. Here is one sample library that is used a
lot and might help you with the code processes. http://jaunt-api.com/

Hope that helps…

Hi, I agree with ErnieAlex, this is just not the place for Java questions :) !

That being said, have a look at http://jsoup.org/, it is a really useful tool to use for scraping, I’d be happy to
assist but off the forums.

PM me.

Well, actually, Java questions are fine here and he did place his post under the Java heading.

But, since this is mostly a PHP forum, he might get more responses for PHP. I did some further
surfing for him and found tons of Java samples for scraping and pulling Google info, so we do know
it can be done with ease…

Apologies!

So I can perhaps give some Java support here as well? The important thing is to start with a good Java stack and IDE for the least pain!

I am currently using Gradle+Spring-Boot as a base in IntelliJ IDE.

Yes, we love to have anyone helping anyone here. HTML, JS, jQuery, PHP, just about anything programming!
It just helps a lot if it is placed under the correct heading, so the people interested in one language more
than others can look at their preferred language first. Seems to work well so far…

Everyone has their own favorite IDE and libraries for their programming. Glad you can help Harish, as I am a
bit weak in Java. I have scraped Google search with PHP and found it easy to pull out the URLs. I have a
work in progress that takes the URLs, scrapes the actual pages and pulls all the info from them. I am using this for
a self-searching knowledge system. It works, but slowly at the moment…

As far as his question…
Harish, you can check a URL to see if it exists, but it gets tricky, as you need to check for 404, 400, 403, etc.
responses to see what the server sends you back. Another way I found for you is to request just
the “header” (a HEAD request) instead. If you get a successful response, the URL exists; if not, it is a no-go. Some code
to do this would be something like:
[php]
// needs: import java.net.URL; import java.net.HttpURLConnection;
URL url = new URL("http://www.example.com");
HttpURLConnection huc = (HttpURLConnection) url.openConnection();
huc.setRequestMethod("HEAD");
int responseCode = huc.getResponseCode();

if (responseCode == HttpURLConnection.HTTP_OK) {
    System.out.println("GOOD");
} else {
    System.out.println("BAD");
}
[/php]
This is not tested, just the basic code for you to try… (I do little with Java, but, found this online for you.)
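To extend that idea to the original question (thousands of URLs), here is a minimal sketch in plain Java, assuming the URLs sit one per line in a file called urls.txt (the filename, the timeouts, and the choice to treat any 2xx/3xx status as “exists” are my assumptions, not anything from the tutorial links above):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class UrlChecker {

    // Treat any 2xx or 3xx status as "the URL exists"; everything else as missing.
    static boolean exists(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    // Issue a HEAD request and return the HTTP status, or -1 on a connection error.
    static int headStatus(String urlString) {
        try {
            HttpURLConnection huc = (HttpURLConnection) new URL(urlString).openConnection();
            huc.setRequestMethod("HEAD");
            huc.setConnectTimeout(5000);
            huc.setReadTimeout(5000);
            return huc.getResponseCode();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        // One URL per line; "urls.txt" is an assumed filename.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
        for (String u : urls) {
            int code = headStatus(u);
            System.out.println(u + " -> " + (exists(code) ? "GOOD" : "BAD") + " (" + code + ")");
        }
    }
}
```

If you really have thousands of URLs, you would also want to add a small delay between requests (or the target server may start refusing you), but the loop above is the basic shape.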

Just a point of contention: Java != Javascript!!! ;D

Yes, Andrev, there is a difference? Did I mention Javascript?

No but this question is posted under
PHP Help Forum »
Client Side Coding »
Javascript & Ajax

Oh, yes, I see! The heading of CLIENT-SIDE coding is a catch-all really. It covers CSS, JS, Java, jQuery and about
anything you do inside the browser, not on the server side. Thanks for pointing that out.

CYA in the bitstream…

I know it’s a little late, but I found a Java tutorial that uses Jsoup for scraping the Google search results. Maybe this helps you out or gives you a good starting point.

You can find it here: http://mph-web.de/web-scraping-with-java-top-10-google-search-results/
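For anyone landing here later, the Jsoup part of that tutorial boils down to something like the sketch below. This is not tested against live Google results (Google changes its markup and blocks automated queries, so any selector you use against it is an assumption); parsing a static HTML string here just shows the shape of the API:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) {
        // Parse a small HTML snippet; against a live site you would use
        // Jsoup.connect(url).get() instead of Jsoup.parse(html).
        String html = "<html><body><a href='http://www.example.com'>Example</a></body></html>";
        Document doc = Jsoup.parse(html);
        // Select every anchor that has an href attribute and print it.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " : " + link.text());
        }
    }
}
```

You need the jsoup jar on your classpath for this to compile; with Gradle (mentioned above) that is one dependency line.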
