Web Site Scraping with PHP

system · March 28, 2011, 4:14pm

I am new to PHP and am working on a site-scraping project to obtain all of the following fields:

Test Year, Test Name, Grade Level, Question #, Question Type, Reporting Category, Standard #, Standard description, Example Question (with image)

from this site, which has a separate page for each question:

http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=36

I would love it if you could point me in the right direction. The page uses tables, and I need to extract the data from the body of the page as well as some of the information from the URL, then insert it all into a MySQL database.

Thank you so much for your help.

phphelp · March 28, 2011, 10:46pm

There are two ways to parse content from a web page: a) using the DOM, b) using regular expressions.
Here is a quick example of how you can use regular expressions to extract the contents of the table cells in your sample page:
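[php]<?php
// Download the raw HTML of the question page.
$s = file_get_contents('http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=36');

// Lazily capture the text after each class='lg' marker and the contents of the
// following <td colspan="2"> cell (the closing </td> anchor is assumed here).
if (preg_match_all("~class='lg'>([\s\S]*?)<td colspan=\"2\">([\s\S]*?)</td>~i", $s, $matches)) {
    echo '<pre>';
    // Escape the captured HTML so the browser shows it as text.
    echo htmlspecialchars(print_r($matches, true));
    echo '</pre>';
}
?>[/php]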
Basically, you need to analyze the HTML code of the source page and create a regular expression pattern that matches it.
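For comparison, here is a minimal sketch of the DOM route mentioned above, using PHP's built-in DOMDocument and DOMXPath classes. The sample markup, the field values in it, and the helper name extractFields() are all made up for illustration; it assumes each label sits in a td with class "lg" followed by a value cell, which you would need to confirm against the real page source.

```php
<?php
// Hypothetical sample markup; check the real page's structure by hand.
$html = <<<HTML
<table>
  <tr><td class="lg">Question Type:</td><td colspan="2">Multiple Choice</td></tr>
  <tr><td class="lg">Reporting Category:</td><td colspan="2">Geometry</td></tr>
</table>
HTML;

// Hypothetical helper: returns an array of label => value pairs.
function extractFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = [];
    // Assumed layout: a label cell with class "lg", then the value cell.
    foreach ($xpath->query('//td[@class="lg"]') as $label) {
        $value = $xpath->query('following-sibling::td[1]', $label)->item(0);
        if ($value !== null) {
            $key = rtrim(trim($label->textContent), ':');
            $fields[$key] = trim($value->textContent);
        }
    }
    return $fields;
}

print_r(extractFields($html));
```

To try it on the live page, you could pass the result of file_get_contents() to extractFields() instead of the sample string. The XPath expressions will likely need adjusting once you have looked at the actual markup.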
For the regular expression route, here is a useful summary of the syntax: http://www.cs.tut.fi/~jkorpela/perl/regexp.html
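As for the values that live in the URL and the MySQL insert, something along these lines may help as a starting point. It uses parse_url() and parse_str() to read the query string, and a PDO prepared statement for the insert. The function names, the questions table, and its column names are hypothetical, so adapt them to your own schema; $fields stands for whatever array of scraped values you end up with.

```php
<?php
// Pull the query-string parameters out of a question URL.
function questionParams(string $url): array
{
    // Casting covers the null/false results for URLs without a query string.
    $query = (string) parse_url($url, PHP_URL_QUERY);
    $params = [];
    parse_str($query, $params);
    return $params;
}

// Store one question. Table and column names are hypothetical.
function saveQuestion(PDO $db, array $params, array $fields): void
{
    $stmt = $db->prepare(
        'INSERT INTO questions
             (test_year, grade, subject_code, question_number, question_type)
         VALUES (:year, :grade, :subject, :number, :type)'
    );
    $stmt->execute([
        ':year'    => (int) ($params['mcasyear'] ?? 0),
        ':grade'   => (int) ($params['grade'] ?? 0),
        ':subject' => $params['subjectcode'] ?? '',
        ':number'  => (int) ($params['questionnumber'] ?? 0),
        ':type'    => $fields['Question Type'] ?? '',
    ]);
}

$url = 'http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=36';
print_r(questionParams($url));

// Example connection (placeholder DSN and credentials):
// $db = new PDO('mysql:host=localhost;dbname=mcas;charset=utf8mb4', $user, $pass);
// saveQuestion($db, questionParams($url), $fields);
```

Prepared statements keep the scraped text from being interpreted as SQL, which matters here because the values come from a page you do not control.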