Need Help Scraping Data From HTML

Hi, I'm trying to extract the locations of jobs from a website so that I can overlay them on Google Maps. I have the scraper working, but it only seems to grab the first location. I have minimal PHP knowledge, but I have experience with other languages, so I'm not new to coding. :)

This is the small PHP scraper that I wrote:
[php]<?php
$scraped_page = file_get_contents("http://huntersandgatherers.jobamatic.com/a/jobs/find-jobs/l-07410/sb-pd/pn-1");

// scraping location
// the "/" in "</td>" is escaped because "/" is also the regex delimiter
$regex_location = '/<td class="location">(.+?)<\/td>/';
preg_match($regex_location, $scraped_page, $scraped_location_data);
var_dump($scraped_location_data);
echo $scraped_location_data[1];
?>[/php]

When it executes var_dump($scraped_location_data), it outputs array(2) { [0]=> string(38) "New York, NY" [1]=> string(12) "New York, NY" }
so it's obviously working. I see that there are two slots in the array that both contain "New York, NY" even though it only appears once on the page.

How can I get it to store the 25 locations on the first page into an array that can easily be iterated through later?

My initial thought is that you could explode $scraped_page wherever "location" appears and loop through each piece:

[php]$parts = explode("location", $scraped_page);
foreach ($parts as $part)
{
    // run your scrape code here
}[/php]

Thanks for the response. I tried it, but it still produced the same results. Just to clarify, the whole thing should look like this, right?

[php]$scraped_page = file_get_contents("http://huntersandgatherers.jobamatic.com/a/jobs/find-jobs/l-07410/sb-pd/pn-1");

// scraping location
$parts = explode("location", $scraped_page);
foreach ($parts as $part)
{
    $regex_location = '/<td class="location">(.+?)<\/td>/';
    preg_match($regex_location, $scraped_page, $scraped_location_data);
}[/php]

output: Array ( [0] => West Orange, NJ [1] => West Orange, NJ )
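For anyone reading along, here is a rough sketch of why the loop keeps returning a single location. It assumes a tiny made-up HTML string in place of the real page (the two rows below are hypothetical), and it runs the same pattern both ways: against the whole page, as in the attempt above, and against each exploded piece.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$scraped_page = '<tr><td class="location">New York, NY</td></tr>'
              . '<tr><td class="location">West Orange, NJ</td></tr>';
$regex_location = '/<td class="location">(.+?)<\/td>/';

$parts = explode('location', $scraped_page);

$from_page = [];   // what the loop in the thread effectively collects
$part_hits = 0;    // how many exploded pieces match on their own

foreach ($parts as $part) {
    // Matching the whole page: every pass finds the same first location.
    preg_match($regex_location, $scraped_page, $m);
    $from_page[] = $m[1];

    // Matching the piece instead: explode() removed the word "location",
    // so class="location" never appears inside any single piece.
    $part_hits += preg_match($regex_location, $part);
}

print_r($from_page);
echo $part_hits, "\n";
```

With this sample, $from_page ends up holding the first location once per piece and $part_hits stays at 0, so swapping $scraped_page for $part inside the loop would not rescue the explode approach either.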

You're right, that didn't work. I just ran a few tests and managed to extract all the locations into a single array:

[php]
$scraped_page = file_get_contents("http://huntersandgatherers.jobamatic.com/a/jobs/find-jobs/l-07410/sb-pd/pn-1");

// scraping location
$regex_location = '/<td class="location">(.+?)<\/td>/';
preg_match_all($regex_location, $scraped_page, $scraped_location_data);

// index 1 holds the text captured by the (.+?) group for every match
print_r($scraped_location_data[1]);
[/php]

Awesome! Thanks a lot Dave!

The confusing thing to me is that it looks exactly the same as mine, except that you used the second index in the print_r statement. What else did you change to get it to continue?

I also used preg_match_all to catch all matches rather than just one.
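To make the difference concrete, here is a minimal sketch comparing the two functions. It assumes a small made-up HTML string instead of the live page (the sample rows are hypothetical). It also shows why the earlier var_dump had two slots: index 0 is the whole matched text including the td tags (38 characters, which the browser renders as plain "New York, NY"), and index 1 is only the captured group.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$scraped_page = '<tr><td class="location">New York, NY</td></tr>'
              . '<tr><td class="location">West Orange, NJ</td></tr>';
$regex_location = '/<td class="location">(.+?)<\/td>/';

// preg_match stops after the first match.
// $first[0] is the full matched text, $first[1] is capture group 1.
preg_match($regex_location, $scraped_page, $first);

// preg_match_all keeps going through the whole string.
// $all[0] lists every full match, $all[1] lists every group-1 capture.
preg_match_all($regex_location, $scraped_page, $all);

$locations = $all[1];

var_dump(strlen($first[0]), $first[1]);
print_r($locations);
```

So the only functional change is the function name; printing index 1 just skips the copies that still carry the tags.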

Ah, that little _all was invisible to me :D

Haha, there are so many functions available in PHP that it's hard to remember them all, lol.

Isn't it like that with every language? lol. I learned Java last semester, and I remember reading a statistic that said most Java programmers only know about 5% of the total available classes.

Yeah, I think it is. Luckily, the PHP manual makes it easy to look up the various functions.

One more problem I just realized: in order to put these locations into the Google Maps Geocoder API, the words in each string have to be separated by plus signs. Any idea how I could achieve this?

Here’s an example: http://maps.googleapis.com/maps/api/geocode/json?address=New+York,+NY&sensor=false

So somehow New York, NY from $scraped_location_data[1] needs to become New+York,+NY

It sucks that there isn't an edit button, lol. I'm assuming that I would need to run a for loop to extract each element out of the array into a single string variable and then feed that string variable into the Geocoder API string, correct?

I think you can pass the address as a plain string, with no need to add the + signs.

sounds good to me

Interesting, that seemed to work when I typed in “New York, NY”. Thanks I’ll try it out.
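If the plus signs are ever needed when building the URL in code, here is a rough sketch of two ways to get them. It uses the endpoint and the sensor parameter from the example URL earlier in the thread; $location is just a stand-in value, and nothing here is claimed about what the API currently accepts. str_replace gives the literal New+York,+NY form, while http_build_query also percent-encodes the comma.

```php
<?php
// Stand-in for one scraped value.
$location = 'New York, NY';

// Literal form from the thread: only spaces are swapped for "+".
$plus_form = str_replace(' ', '+', $location);

// Safer form: http_build_query encodes every special character,
// turning spaces into "+" and the comma into "%2C".
$query = http_build_query(
    ['address' => $location, 'sensor' => 'false'],
    '',
    '&'
);
$url = 'http://maps.googleapis.com/maps/api/geocode/json?' . $query;

echo $plus_form, "\n";
echo $url, "\n";
```

Letting PHP do the encoding avoids surprises with locations that contain characters such as & or #.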

I'm trying to extract the individual locations out of the array $scraped_location_data, but it seems to me that all 25 locations are stored in array slot 1. How can I get at each location individually? If I type echo "$scraped_location_data[5]", it spits back "Notice: Undefined offset: 5 in scraper.php on line 11".

Never mind, I figured it out. I had to access it as a multi-dimensional array.
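For anyone who lands here with the same notice, a minimal sketch of what that access looks like. The array below is hand-written to mimic the shape preg_match_all returns by default (the two entries are hypothetical): slot 0 holds the full matches, slot 1 holds the captured locations, so a single location needs two indexes.

```php
<?php
// Hand-written stand-in shaped like the preg_match_all result.
$scraped_location_data = [
    [   // slot 0: full matches, tags included
        '<td class="location">New York, NY</td>',
        '<td class="location">West Orange, NJ</td>',
    ],
    [   // slot 1: text captured by the (.+?) group
        'New York, NY',
        'West Orange, NJ',
    ],
];

// One location: first index picks the capture list, second picks the entry.
echo $scraped_location_data[1][1], "\n";

// All locations: copy slot 1 into its own flat array and loop over it.
$locations = $scraped_location_data[1];
foreach ($locations as $i => $location) {
    echo $i, ': ', $location, "\n";
}
```

Copying slot 1 into $locations once keeps the rest of the script working with a plain one-dimensional array.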
