Very difficult regex questions

system · December 29, 2011, 4:03pm

Please read: This question is about regex, but I’m not sure regex is the best solution. I haven’t posted an example of the code I have so far, but I want the reader to know that I am working very hard on solving this myself. I don’t want to put the idea into people’s heads that regex is the best way to go, because it may not be. Some programmers have already told me that regex is not the best place to start. Other programmers have told me that my other solutions are not good, either. So, if regex is wrong, please show me what is right.

Example of input

vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch
mat (box)
pedistal; blockade; pilar
animal belly (oval)
old style: naval
jackal's belly; jester
slope of hill (arch)
key; visible; enlightened

Basically, I’m having trouble with some more complicated regex commands. Most of the code I’m finding that uses regex is very simple, but I could use it in so many places if I could get good with it. Would you look at the kind of stuff I’m trying to do and see if you can convert any of it?

1. Arrayize the word or words between the braces, “(” and “)”.
2. Arrayize the first words following a new line ending xor four spaces and then a closing brace, “)”, and a space and an open brace " (" AND the first words in the document up until a space and an open brace " (".
3. On any line with semicolons, arrayize the words which are separated by semicolons. Get the word or words after the last semicolon, but do not get the words after a line break or four consecutive spaces. Words from lines that begin with the string “tabulations:” should not be included in this array, even though those lines have semicolons on them. If a new line ending in a close brace, “)”, comes before a line containing semicolons and not starting with “tabulations:”, add “no alternates” to the array, instead.
4. Get the word or words following the colon and preceding the line break on a line that begins with the string “old style:”. If a new line ending in a close brace, “)”, comes before an “old style:”-starting line, add “no old style” to the array, instead.
5. The same as 3, except only for lines that begin with the string “tabulations:”. If a new line ending in a close brace, “)”, comes before a “tabulations:”-starting line, add “no tabulations” to the array, instead.

I am trying to figure out how to do this via PHP because I use the data in a PHP form, and because PHP and Javascript are the only scripting languages I am familiar with and PHP seems most viable in this situation.

Laffin · December 30, 2011, 9:52pm

I really don’t think you will get a response on this one, because of your description.
But I will give you a little guidance. Think in linear steps instead of in terms of the overall output.
If you had a flat tire on your car, your overall goal would be to replace the tire.
Step by step, that is: get the tire iron, remove the tire, pull the spare from the trunk, put on the spare tire.

Once you have your steps organized, you will be better prepared. What you will be making is called a parser; later you will get into lexers. But yes, this isn’t a “regex to the rescue” question.
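
To make the step-by-step idea concrete, here is a minimal sketch of a line-by-line parser. It assumes the input has one item per line, with a "name (tag)" header starting each entry; that layout, the function name parse_entries(), and the array keys are all assumptions made for illustration, not part of the original question.

```php
<?php
// Hypothetical helper: build one record per "name (tag)" header line.
function parse_entries(string $text): array
{
    $entries = [];
    $current = null;
    // Step 1: split on LF or CRLF and walk the lines one at a time.
    foreach (preg_split('/\r?\n/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^(.+?)\s*\(([^)]+)\)$/', $line, $m)) {
            // Step 2: a header such as "vulture (wing)" closes the previous record.
            if ($current !== null) {
                $entries[] = $current;
            }
            $current = ['name' => $m[1], 'tag' => $m[2], 'tabulations' => null,
                        'old_style' => null, 'alternates' => null];
        } elseif ($current === null) {
            continue; // ignore anything before the first header
        } elseif (preg_match('/^(tabulations|old style):\s*(.*)$/', $line, $m)) {
            // Step 3: labelled lines go into their own slot.
            $key = $m[1] === 'old style' ? 'old_style' : 'tabulations';
            $current[$key] = preg_split('/\s*;\s*/', $m[2]);
        } else {
            // Step 4: any other line is treated as the semicolon-separated list.
            $current['alternates'] = preg_split('/\s*;\s*/', $line);
        }
    }
    if ($current !== null) {
        $entries[] = $current;
    }
    return $entries;
}

$sample = "vulture (wing)\ntabulations: one leg; two legs; flying\n"
        . "father; master; patriarch\nmat (box)\npedistal; blockade; pilar";
$entries = parse_entries($sample);
print_r($entries);
```

Slots that never appear for an entry stay null, which is one way to get the "null" placeholders without a separate regex per case.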

Fishtankalpha · January 1, 2012, 6:18pm

Someone said the same thing in another forum to me. So I put desired outputs and rewording:

1. Arrayize the word or words between the braces, “(” and “)”.
DESIRED OUTPUT EXAMPLE 1: $array = wing, box, oval, arch.
2. Arrayize the words that begin at the start of the example, after a new line, or after four spaces, running from the first alpha character up to the space+brace " (".
DESIRED OUTPUT EXAMPLE 2: $array2 = vulture, mat, animal belly, slope of hill.
3. Arrayize words separated by semicolons on lines without colons.
DESIRED OUTPUT EXAMPLE 3: $array3 = $subarray0 = father, master, patriarch; $subarray1 = pedistal, blockade, pilar; $subarray2 = jackal's belly, jester; $subarray3 = key, visible, enlightened.
4. Arrayize lines separated by semicolons that follow the string "old style: ". If a line with a bracket, “(”, appears before the next "old style: "-starting line, add a “null” subarray to the array.
DESIRED OUTPUT EXAMPLE 4: $array4 = $subarray0 = null; $subarray1 = null; $subarray2 = naval; $subarray3 = null.
5. The same as 3, except the desired lines begin with the string "tabulations: ".
DESIRED OUTPUT EXAMPLE 5: $array5 = $subarray0 = one leg, two legs; $subarray1 = null; $subarray2 = null; $subarray3 = null.

People gave me many generic solutions, but I want to learn to use PHP’s PCRE functions in these ways, as much as possible, before using some other method.

Fishtankalpha · January 2, 2012, 3:32am

I got a solution for 1 and two solutions for 2:

Solution for 1:

[code]<?php
$myFile = file_get_contents('fakexample.txt'); //Load the file using file_get_contents() because I don’t need to find any line starts or ends.
function get_between($startString, $endString, $myFile){ //Create a function, so I can get a result with any start or end string.

//Escaping start and end strings (e.g., adding \'s before any special characters in the start and end strings):
$startStringSafe = preg_quote($startString, '/');
$endStringSafe = preg_quote($endString, '/');

//non-greedy matching of any characters between the start and end strings:
//the s modifier, after the regex command’s closing / should make it possible to match newlines, but I don’t see why this is useful, and I don’t know if it’s true. In the answer to array2, I had to use the m modifier.
preg_match_all("/$startStringSafe(.*?)$endStringSafe/s", $myFile, $matches); //The function that takes a regex argument and identifies the results. (.*?) means ‘any number of characters followed by’, it seems. Still not sure what the ? character does, though.
return $matches; //Gets the result of the preg_match_all function back to the superfunction, get_between().
}
$list = get_between("(", ")", $myFile); //Puts the results of the get_between() function with the desired start and end strings into an array variable.

//Iterating through the results ($list[1] holds the captured text; a separate loop variable avoids overwriting $list):
foreach($list[1] as $item){
echo $item."\n";
}
?>[/code]
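
The two open questions in the comments above (what the ? after .* does, and what the s modifier is for) can be checked with a small experiment. This is only a sketch on a made-up string, not a change to the solution itself.

```php
<?php
$text = "vulture (wing) mat (box)";

// Greedy: .* grabs as much as it can, so the match runs to the LAST ")".
preg_match_all('/\((.*)\)/', $text, $greedy);

// Lazy: the ? after * makes it grab as little as it can, so it stops at the FIRST ")".
preg_match_all('/\((.*?)\)/', $text, $lazy);

print_r($greedy[1]); // a single element: "wing) mat (box"
print_r($lazy[1]);   // two elements: "wing" and "box"

// The s modifier lets the dot match a newline as well.
$multiline = "(one\ntwo)";
$withoutS = preg_match('/\((.*?)\)/', $multiline);  // 0: the dot stops at the newline
$withS    = preg_match('/\((.*?)\)/s', $multiline); // 1: the dot crosses the newline
var_dump($withoutS, $withS);
```

So the s modifier only matters when the text between the start and end strings can span more than one line.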

Fishtankalpha · January 2, 2012, 10:14am

The solution for 2 is posted here, http://forums.devshed.com/regex-programming-147/help-solve-these-expert-level-regex-puzzles-873536.html#post2729304. I haven’t solved 3, yet. Someone there is helping me.

Laffin · January 2, 2012, 10:49pm

array(5) {
  [0]=>
  array(4) {
    [0]=>
    string(4) "wing"
    [1]=>
    string(3) "box"
    [2]=>
    string(4) "oval"
    [3]=>
    string(4) "arch"
  }
  [1]=>
  array(3) {
    [0]=>
    string(7) "vulture"
    [1]=>
    string(3) "mat"
    [2]=>
    string(12) "animal belly"
  }
  [2]=>
  array(4) {
    [0]=>
    string(25) "father; master; patriarch"
    [1]=>
    string(25) "pedistal; blockade; pilar"
    [2]=>
    string(22) "jackal's belly; jester"
    [3]=>
    string(25) "key; visible; enlightened"
  }
  [3]=>
  array(5) {
    [0]=>
    NULL
    [1]=>
    NULL
    [2]=>
    NULL
    [3]=>
    string(5) "naval"
    [4]=>
    NULL
  }
  [4]=>
  array(4) {
    [0]=>
    array(1) {
      [0]=>
      string(25) "one leg; two legs; flying"
    }
    [1]=>
    NULL
    [2]=>
    NULL
    [3]=>
    NULL
  }
}

Looks like my code is good, done mostly with preg

<?php
$text=<<<EOF
vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch
mat (box)
pedistal; blockade; pilar
animal belly (oval)
old style: naval
jackal's belly; jester
slope of hill (arch)
key; visible; enlightened
EOF;
$data=array();
$offsets=array();
header('Content-Type: text/plain');

// Test 1: text inside the parens, plus the offset of each match (reused by tests 4 and 5).
preg_match_all('@\(([^)]+?)\)@',$text,$matches,PREG_OFFSET_CAPTURE);
foreach($matches[1] as $info)
{
  $data[0][]=$info[0];
  $offsets[]=$info[1];
}

// Test 2: the words in front of " (".
preg_match_all('@(?:^|\x20{4}(?<=[a-zA-Z]))([\w\s]+\w)\s\(@m',$text,$matches);
$data[1]=$matches[1];

// Test 3: semicolon-separated lines that contain no colon.
preg_match_all('@^(?:^|\x20{4})((?:(?:\w[^;\r\n:]+\w);\s*)+(?:(?:\w[^\r\n]+\w)+))(?:$|[\r\n]+|\x20{4})@m',$text,$matches);
$data[2]=$matches[1];

// Test 4: "old style:" lines; add a NULL for every paren that comes before the match.
// \r?$ is used so the pattern works with both LF and CRLF line endings.
preg_match_all('@(?:^|\x20{4})old style: (.*?)\r?$@m',$text,$matches,PREG_OFFSET_CAPTURE);
$coff=0;
$offc=count($offsets);
foreach($matches[1] as $idx=>$info)
{
  while($coff<$offc && $offsets[$coff]<$info[1])
  {
    $data[3][]=NULL;
    $coff++;
  }
  $data[3][]=$info[0];
}
while($coff++<$offc)
{
  $data[3][]=NULL;
}

// Test 5: the same offset comparison for "tabulations:" lines.
preg_match_all('@(?:^|\x20{4})tabulations: (.*?)\r?$@m',$text,$matches,PREG_OFFSET_CAPTURE);
$coff=0;
foreach($matches[1] as $idx=>$info)
{
  while($coff<$offc && $offsets[$coff]<$info[1])
  {
    $data[4][]=NULL;
    $coff++;
  }
  $data[4][$idx][]=$info[0];
}
while($coff++<$offc)
{
  $data[4][]=NULL;
}
var_dump($data);
?>

Laffin · January 2, 2012, 11:22pm

Reworked #3

preg_match_all('@^(?:^|\x20{4})((?:(?:\w[^;\r\n:]+\w);\s*)+(?:(?:\w[^\r\n]+\w)+))(?:$|[\r\n]+|\x20{4})@m',$text,$matches);
foreach($matches[1] as $info)
{
  $sep=explode(';',$info);
  $data[2][]=array_map('trim',$sep);
}

now outputs


  [2]=>
  array(4) {
    [0]=>
    array(3) {
      [0]=>
      string(6) "father"
      [1]=>
      string(6) "master"
      [2]=>
      string(9) "patriarch"
    }
    [1]=>
    array(3) {
      [0]=>
      string(8) "pedistal"
      [1]=>
      string(8) "blockade"
      [2]=>
      string(5) "pilar"
    }
    [2]=>
    array(2) {
      [0]=>
      string(14) "jackal's belly"
      [1]=>
      string(6) "jester"
    }
    [3]=>
    array(3) {
      [0]=>
      string(3) "key"
      [1]=>
      string(7) "visible"
      [2]=>
      string(11) "enlightened"
    }
  }

Fishtankalpha · January 3, 2012, 4:57am

I want to report your response to a moderator as “most excellent”. ;D

Thanks. You rock.

Fishtankalpha · January 3, 2012, 5:04am

So, I tried this code:

[php]<?php
$filename = "fakexample.txt";
$file = fopen($filename, "rb");
$myFile = fread($file, filesize($filename));
preg_match_all('@^(?:^|\x20{4})((?:(?:\w[^;\r\n:]+\w);\s*)+(?:(?:\w[^\r\n]+\w)+))(?:$|[\r\n]+|\x20{4})@m',$myFile,$matches);
foreach($matches[1] as $info)
{
$sep=explode(';',$info);
$data[2][]=array_map('trim',$sep);
}
?>[/php]

Of course, the result is blank. What do you echo? :-\

Also, this is some very interesting code. Could you explain what’s going on?

Laffin · January 3, 2012, 5:21pm

The preg_match_all just isolates lines with alphanumeric words/phrases with semicolon separators.
As in my first post, it returns a result of “father; master; patriarch”. I did try getting all results as an array with a preg but found it way too complicated a task, as the pattern string would require two matches with a semicolon and one with end of string/line. So instead I used explode to separate the list items, and used array_map to iterate through each element applying trim (which removes whitespace from the beginning/end of a string), and put the result into the $data array.
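
As a small side-by-side sketch, the explode-plus-trim step can also be written as one preg_split() call that treats the whitespace around each semicolon as part of the separator. The variable names here are made up for illustration.

```php
<?php
$line = "father; master; patriarch";

// explode() cuts on the literal ";", then array_map('trim', ...) strips the leftover spaces.
$viaExplode = array_map('trim', explode(';', $line));

// preg_split() does both at once: optional whitespace, a semicolon, optional whitespace.
$viaSplit = preg_split('/\s*;\s*/', $line);

print_r($viaExplode);
print_r($viaSplit);
```

Both calls give the same three items for this line; which one reads better is mostly a matter of taste.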

I think you will find test cases 4/5 more interesting, since they use a special preg flag. I realized while writing the routine for those that the braces were already checked by the first test case, so instead of making a complicated preg, I opted to reuse the first test case and had it return the offsets of the paren matches. With offsets in hand, it’s just a comparison to check whether we hit a paren before or after the current match.
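
Here is a stripped-down sketch of the offset idea, using a shortened sample string. It is a simplified variant rather than the exact routine from the earlier post: each paren match marks the start of an entry, and an "old style" match belongs to an entry if its offset falls between that paren and the next one. The variable names are assumptions for illustration.

```php
<?php
$text = "vulture (wing)\nfather; master\nmat (box)\nold style: naval\n";

// With PREG_OFFSET_CAPTURE every match becomes [matched text, byte offset in $text].
preg_match_all('/\(([^)]+)\)/', $text, $parens, PREG_OFFSET_CAPTURE);
preg_match_all('/^old style: (.*)$/m', $text, $old, PREG_OFFSET_CAPTURE);

$result = [];
$next = 0; // index of the next "old style" match that has not been used yet
foreach ($parens[1] as $i => [$tag, $start]) {
    // An entry runs from its paren up to the next paren, or to the end of the text.
    $end = $parens[1][$i + 1][1] ?? strlen($text);
    $result[$tag] = null;
    if (isset($old[1][$next]) && $old[1][$next][1] > $start && $old[1][$next][1] < $end) {
        $result[$tag] = $old[1][$next][0];
        $next++;
    }
}
print_r($result);
```

Entries with no "old style" line in their range simply keep their null.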

It was interesting regex pattern development. Regex takes time to learn and may be daunting at first. But you can find some helpful tools; my favorite is Expresso Regex Builder (you can use Google to find and download it). Using this tool, you can plug in any regex pattern and it will give you a breakdown of what the pattern is doing in plain English (well, in a tree form explaining the pattern step by step).

Fishtankalpha · January 4, 2012, 3:40am

Laffin, I’ll check the tool out. Thank you. I also found a good tutorial site, which is c_regex.html @ asiteaboutnothing. 8) It explained lookbehind in a way that I could understand. I’m not completely through it, but so far I have understood everything. So, it beats everything else, especially the php manual. >:( I’ve been finding regex patterns on StackOverflow and trying to read them before the context to test myself. I’ve also spoken with some friends about regex syntax, which is used in vi. I’m ready to change editors. Working on the built-in vi tutorial, also.

In order to see the result of your code, I used the [php]print_r()[/php] function. Using [php]print()[/php] only gives me the word Array. Using [php]foreach($array[0] as $array)print($array);[/php] or a similar block of code that uses echo, I get ArrayArray. [php]print_r()[/php] is the function that I was looking for. :’( And I found it! Ow. My eyes!~ lol
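
A short sketch of the difference, on a made-up nested array: echo and print() turn an array into the literal word "Array" (with a notice or warning, depending on the PHP version), while print_r() walks the nesting, and implode() turns one level into a string that echo can handle.

```php
<?php
$data = [['father', 'master'], ['key', 'visible']];

// print_r() prints nested arrays; passing true as the second argument returns the text instead.
$dump = print_r($data, true);
echo $dump;

// implode() joins one level into a plain string, one line per group.
$lines = [];
foreach ($data as $group) {
    $lines[] = implode(', ', $group);
}
echo implode("\n", $lines), "\n";
```

var_dump() is the third option; it is the one used earlier in the thread and also shows types and string lengths.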

There’s also some code that works almost exclusively with [php]preg[/php] commands. It splits the lines with [php]preg_split()[/php]. It then removes any line that contains a colon. Then it uses this regex string: [php],(?x)(?:\s*([^;]+);)|(?<=;)\s*\b(\w+)\b[/php]. God knows I don’t understand it though. “Science knows I don’t understand it, though.”

Still not sure what that comma means… I thought [php],(?x)[/php] meant ‘allow Ruby syntax’, but I see that there is a lookbehind. I didn’t think they were allowed in Ruby.

[php](?:\s*([^;]+)[/php] - Don’t capture strings that begin with a colon and one or more white space characters that are not semicolons. This actually confuses me because I don’t think that white space characters are semicolons, anyway.

[php];)|[/php] - And ends with a semicolon or… Actually, I don’t see how we were ready for an or, here. Everything before the pipe, including that semicolon, is still a bit of a mystery to me! Talk about some crazy regex!

[php](?<=;)[/php] - For words that immediately precede a semicolon…

[php]\s*\b(\w+)\b[/php] - There’s some amount of white space, then there’s a word surrounded by non-word boundaries. Capture that. Not sure what the + means, entirely. I think it means ‘as many times as possible’ or greedy.
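
For anyone reading along, here is a hedged sketch of how the pieces fit together, rewritten with the x modifier so each part can carry a comment. A few facts about the syntax: (?x) is an inline option that lets the pattern contain whitespace and comments, (?: ... ) is a group that does not capture, (?<=;) is a lookbehind meaning "a semicolon sits just before this position", \b is a word boundary, and + means "one or more". The leading comma is most likely just the pattern delimiter, but that is an assumption since the surrounding code is not shown. The ~ delimiter and the variable names below are choices made for this example.

```php
<?php
$line = "father; master; patriarch";

$pattern = '~
    (?: \s* ([^;]+) ; )      # branch 1: capture text up to a semicolon; the semicolon is matched but not captured
    |                        # or
    (?<=;) \s* \b (\w+) \b   # branch 2: capture one word that comes right after a semicolon
~x';

preg_match_all($pattern, $line, $m, PREG_SET_ORDER);

$words = [];
foreach ($m as $match) {
    // Branch 2 fills group 2; otherwise the text is in group 1.
    $words[] = ($match[2] ?? '') !== '' ? $match[2] : $match[1];
}
print_r($words);
```

One thing to watch: branch 2 captures a single \w+ word, so a multi-word last item such as "two legs" would come back as "two" only.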