preg_match_all HELP!

allen652 · July 31, 2012, 7:49pm

I have tried this but can’t seem to figure out the regexp. Too complex for me I suppose.
The string is several hundred questions with answers and details that I would like to extract. The ID varies in length and sometimes contains letters and is always followed by some amount of white space then “Point”… The questions and answers are always contained in << >>. I would like to capture everything, not just what I attempted below. I will then save each question with its available info to a database to be used within an online quiz.

[php]preg_match_all('#D:\s(.)[\s\S]Points[\s\S]<<([\s\S])>>[\s\S]<<([\s\S])>>[\s\S]<<([\s\S])>>[\s\S]<<([\s\S])>>[\s\S]<<([\s\S])>>[\s\S]Answer:[\s\S]<<([\s\S]*)>>#sU', $fc, $webids);[/php]

Here is a section of questions in pdf format…
cliffcazes.com/samplequestions.pdf

designer · August 1, 2012, 12:47am

Sorry, I can’t help. I am not extremely experienced with preg_match_all

ErnieAlex · August 1, 2012, 12:48am

Well, there are several ways to do this, but I think preg_match may be hard to sort out.
That is due to the complex regex it would take to do what you want. Fortunately, there are other ways to handle
this type of decoding.

So, first, let’s take the first question for testing:

Page: 38 of 185 09/15/11 EXAMINATION ANSWER KEY <> 28 ID: 2992000102-001.04 Points: 1.00 <> A. <> B. <> C. <> D. <> Answer: <>

Is all of this needed? I mean, do you need all of the qq-numbers and everything else? It is VERY easy to divide up the questions. (One line of code.) And, very easy to break these up into questions and answers. Tell me what you need above and I will show you a way to accomplish it. Should be just a few lines...

allen652 · August 1, 2012, 12:59am

The QQ followed by some numbers appears to be junk and is not needed. The only number (which sometimes contains letters) I need for each question is what follows "ID: ". After each question there is a details section as well. It would be nice to extract that info as well. Thanks for your help.

ErnieAlex · August 1, 2012, 1:31am

Okay, I will look at it.

But, another question: how large is the file?

And, does the LOR at the beginning of some of the page numbers mean anything?

allen652 · August 1, 2012, 1:47am

Pretty big file. At least 800+ questions. “lor” doesn’t matter for this application. Thanks again.

ErnieAlex · August 1, 2012, 3:33am

Well, after a lot of playing around with the test data, I came up with the following code.
It basically just loads all of the data into a simple array. Next, we would need to create a parsing routine.
That routine would be a simple “for” loop that goes down through the individual lines and pulls the data you
need for each question. Since most of the questions start with a page number, we can break the whole
file into separate questions using “Page:” as a hint of where to start. Then, you would have to look for
the questions and then the answers. And, of course, the “ID:” to capture the ID number.
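
A rough sketch of that splitting step, assuming the text really does contain a “Page:” marker before most questions. The `$text` sample and the `$chunks` name are made up for illustration and are not taken from the real file:

```php
<?php
// Hypothetical sample text; the real layout of the file may differ.
$text = "Page: 1 of 3\n1 ID: A1 Points: 1.00\nfirst question\n"
      . "Page: 2 of 3\n2 ID: B2 Points: 2.00\nsecond question\n";

// Split just before every "Page:". The lookahead (?=...) consumes nothing,
// so the marker stays at the start of each chunk.
$chunks = preg_split('/(?=Page:)/', $text, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chunks as $i => $chunk) {
    echo "--- chunk $i ---\n", $chunk;
}
```

Questions that do not start on a new page would end up sharing a chunk with the one before them, so this only narrows things down.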

Here is the starting point. It assumes that the file is a text file. I took your PDF sample, copied the 3 pages of data, and placed them into a text file that I called test.txt. Next, I ran this code in the same folder as the text file. It just lists the lines from the test data. If you look at the output, you will understand where I am heading with this code. You would have to loop through the $lines array and pull your data as needed. So, if a line starts with A. or C., then it is an answer. You can test these lines using PHP string functions.

So, here is a start for you:
[php]

<?php
// Load the entire file into an array (called $lines)
$lines = array();
$fd = fopen("test.txt", "r");
// Stop as soon as fgets() returns false, so no empty entry is added at end of file
while (($buffer = fgets($fd, 4096)) !== false) {
    $lines[] = $buffer;
}
fclose($fd);

// Loop through our array...
foreach ($lines as $line) {
    echo $line . "\n"; // just display the lines for now... (replace this with parsing code)
}
?>

[/php]
I know this is not a solution, but it starts you off. It gets the data into a more usable form.
If you need help with the parsing code, let us know…
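
As a rough illustration of the parsing loop mentioned above, here is a minimal sketch. It assumes each piece of data sits on its own line; the sample lines and the `$question` array are invented for the example and will not match the real file exactly:

```php
<?php
// Hypothetical lines, standing in for the $lines array loaded from test.txt.
$lines = array(
    "28 ID: 2992000102-001.04 Points: 1.00\n",
    "Sample question text\n",
    "A. first choice\n",
    "B. second choice\n",
    "Answer: B\n",
);

$question = array('id' => null, 'choices' => array(), 'answer' => null);

foreach ($lines as $line) {
    $line = trim($line);
    if (preg_match('/ID:\s*(\S+)\s+Points:/', $line, $m)) {
        // The ID may contain letters; it ends at the whitespace before "Points:"
        $question['id'] = $m[1];
    } elseif (preg_match('/^([A-D])\.\s*(.*)$/', $line, $m)) {
        // Lines starting with "A." through "D." are answer choices
        $question['choices'][$m[1]] = $m[2];
    } elseif (strpos($line, 'Answer:') === 0) {
        // Everything after "Answer:" is the correct answer
        $question['answer'] = trim(substr($line, strlen('Answer:')));
    }
}

print_r($question);
```

For a whole file you would start a fresh `$question` each time a new ID line shows up and store the finished one first.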

allen652 · August 1, 2012, 10:03am

I have already used this recursive approach on a similar project. Yes, it does work. But I was hoping that there was a more direct and cleaner approach.

Thanks for your help though.

ErnieAlex · August 1, 2012, 9:35pm

Well, you can use preg_match, but the regex would be crazy. You would have to put the question format into a regular expression, and that format can NOT vary. Looking down your question sampler, there is no exact, non-varying format. There are many exceptions. You could remove the exceptions and then use a regex, but that is complicated. preg_match works best on data that always has the same layout. Compare any two of your questions and they have different formats.

But, just my opinion. Perhaps some other member could come up with something else to try.

Good luck…

allen652 · August 1, 2012, 11:37pm

SOLVED.

[php]preg_match_all('/([1234567890]+) ID: ([\s\S]+) Points: ([1234567890.]+)\n<QQ.+><<([\s\S]+)>>\nA. <QQ.+><<([\s\S]+)>>\nB. <QQ.+><<([\s\S]+)>>\nC. <QQ.+><<([\s\S]+)>>\nD. <QQ.+><<([\s\S]+)>>\nAnswer: <QQ.+><<(.)>>/Ui',$string,$matches);[/php]
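
For anyone reading later, a sketch of how the result of that call could be turned into one record per question before saving to a database. The sample `$string` is invented (the real `<QQ...>` tags and wording will differ), and the `$questions` array and its keys are only illustrative names:

```php
<?php
// Hypothetical two-question sample. The real <QQ...> tags and wording will differ.
$string = "28 ID: 2992000102-001.04 Points: 1.00\n"
        . "<QQ1><<Sample question one>>\n"
        . "A. <QQ2><<first>>\nB. <QQ3><<second>>\n"
        . "C. <QQ4><<third>>\nD. <QQ5><<fourth>>\n"
        . "Answer: <QQ6><<B>>\n"
        . "29 ID: 29AB-7 Points: 2.00\n"
        . "<QQ7><<Sample question two>>\n"
        . "A. <QQ8><<one>>\nB. <QQ9><<two>>\n"
        . "C. <QQ10><<three>>\nD. <QQ11><<four>>\n"
        . "Answer: <QQ12><<D>>\n";

// Same pattern as in the post, only split across lines for readability.
$pattern = '/([1234567890]+) ID: ([\s\S]+) Points: ([1234567890.]+)\n'
         . '<QQ.+><<([\s\S]+)>>\n'
         . 'A. <QQ.+><<([\s\S]+)>>\nB. <QQ.+><<([\s\S]+)>>\n'
         . 'C. <QQ.+><<([\s\S]+)>>\nD. <QQ.+><<([\s\S]+)>>\n'
         . 'Answer: <QQ.+><<(.)>>/Ui';

preg_match_all($pattern, $string, $matches);

// With the default ordering, $matches[N] holds capture group N for every question.
$questions = array();
foreach ($matches[2] as $i => $id) {
    $questions[] = array(
        'id'       => $id,
        'points'   => $matches[3][$i],
        'question' => $matches[4][$i],
        'choices'  => array(
            'A' => $matches[5][$i],
            'B' => $matches[6][$i],
            'C' => $matches[7][$i],
            'D' => $matches[8][$i],
        ),
        'answer'   => $matches[9][$i],
    );
}

print_r($questions);
```

The `U` modifier makes every quantifier lazy, which is what stops each `[\s\S]+` from running on past the next `>>`.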

ErnieAlex · August 2, 2012, 4:49pm

WOW! Guess I have to study up on preg_match/regex’s…

So it works well for you and strips out what you wanted. Sorry, I tried to send you another way…

CONGRATS! Hmmm, more to study for me… LOL, seems like that never ends in programming…