PHP array performance


#1

Since I am already using PHP for my web pages, I reused it for a little offline file parsing/processing (I am running PHP from the Win10 command prompt). I am now wondering if it is just the wrong language for the job or if it is only slow due to some dumb mistake I may have made.

What my script does is this: it reads a flat file listing (each line contains the full path to one specific file) and incrementally builds a nested array tree, one array per folder, that represents the complete folder/file structure. For each line read, the existing tree is traversed using the “exploded” path info and missing nodes are added as needed. In the end the tree is dumped as a 65MB JSON file.
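The approach described above could be sketched roughly as follows. This is a simplified illustration, not the actual script: the function name addPath, the '/' separator, and the choice to store files as null leaves are all assumptions.

```php
<?php
// Hypothetical helper: inserts one path into a nested array tree.
// Folders become nested arrays keyed by name; files are stored as null leaves.
// Assumes '/' as the path separator (the real listing may use '\').
function addPath(array &$tree, $path)
{
    $parts = explode('/', trim($path, '/'));
    $file  = array_pop($parts);       // last segment is the file name

    $node =& $tree;                   // cursor that walks down the tree by reference
    foreach ($parts as $folder) {
        if (!isset($node[$folder])) {
            $node[$folder] = array(); // create the missing folder node
        }
        $node =& $node[$folder];      // descend one level
    }
    $node[$file] = null;              // add the file as a leaf
}

$tree = array();
foreach (array('a/b/one.mod', 'a/b/two.mod', 'a/c/three.mod') as $line) {
    addPath($tree, $line);
}
echo json_encode($tree), "\n";
```

With this structure, each line costs one array lookup per path segment, so the total work should grow roughly linearly with the number of lines as long as nothing forces the arrays to be copied.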

The input file is about 38MB and contains about 450’000 lines ("(sub-)folders" in the resulting data structure may contain anywhere from 1 to 8000 child nodes). To my surprise the script takes more than 30 minutes to build the tree, and it seems to get slower the bigger the existing tree already is (maybe some array reallocation/growing issue?).

Any ideas?

PS: I am currently using PHP 5.2.17 - since that is what is used on my web server.


#2

How often do you run this? Is the input file different each time so it has to run through everything?

I’d strongly suggest making a much smaller test input file and running it with a profiler (Xdebug is easy to set up). This will tell you which parts of your script take the most time/resources. A lot could be learned from this.
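Even before setting up a profiler, a crude timing loop can show whether the time per line really grows with the size of the tree. The following is only a sketch: timeBatches and processLine are made-up names, and processLine is a stand-in for whatever the real script does per line.

```php
<?php
// Hypothetical helper: calls $callback for every line and records how long
// each full batch of $batchSize lines took, so a slowdown becomes visible.
function timeBatches(array $lines, $callback, $batchSize)
{
    $timings = array();
    $start   = microtime(true);
    $count   = 0;
    foreach ($lines as $line) {
        call_user_func($callback, $line);
        $count++;
        if ($count % $batchSize === 0) {
            $now       = microtime(true);
            $timings[] = $now - $start; // seconds spent on this batch
            $start     = $now;
        }
    }
    return $timings;
}

// Stand-in for the real per-line work (an assumption for this example).
function processLine($line)
{
    return strlen($line);
}

$lines   = array_fill(0, 1000, 'a/b/file.mod');
$timings = timeBatches($lines, 'processLine', 250);
foreach ($timings as $i => $seconds) {
    printf("batch %d: %.4f s\n", $i + 1, $seconds);
}
```

If later batches take clearly longer than earlier ones, the cost per line depends on the size of the existing tree, which points to copying rather than to the parsing itself.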

Off the top of my head, I’d try running with several workers, i.e. starting X workers that each take their part of the input file and iterate over its lines.


#3

To answer your question: in principle this is a one-time load (see http://www.wothke.ch/playmod/ : it provides what is behind the “MODLAND” folder), but while I am still tinkering it is a pain in the elbow to wait half an hour each time I find that I want to make some change in the generated output…

Meanwhile I found a workaround for the problem, which I had apparently triggered by trying to structure my code as if I were using a real programming language… It must have been some kind of effect of PHP’s reference counting/handling… I have now ditched any attempts to make the code more readable by using reference variables for intermediate results and/or using function parameters. Instead I put my tree structure in the global context, and every access is done directly via the global context.
This sucks but PHP is happy and the script now needs 3 minutes rather than 30…
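The post does not show the original code, so the exact cause stays unknown. One plausible explanation (an assumption, not something confirmed in this thread) is PHP's copy-on-write handling of arrays: an array passed to a function by value is copied as soon as the function writes to it, and with folders holding thousands of entries that copy gets more expensive as the tree grows. The hypothetical helpers below only illustrate the difference in behavior between by-value and by-reference parameters.

```php
<?php
// Hypothetical helpers to contrast by-value and by-reference array parameters.

// By value: the write inside the function triggers a copy (copy-on-write),
// so the caller's array stays untouched and the modified copy is returned.
function addChildByValue(array $node, $name)
{
    $node[$name] = array();
    return $node;
}

// By reference: the caller's array is modified in place.
function addChildByReference(array &$node, $name)
{
    $node[$name] = array();
}

$original = array('a' => array());
$copy     = addChildByValue($original, 'b');
var_dump(count($original)); // the caller's array still has one entry
var_dump(count($copy));     // the returned copy has two entries

addChildByReference($original, 'b');
var_dump(count($original)); // now two entries, changed in place
```

Keeping the tree in one place and never handing it to a function by value avoids such copies, which would be consistent with the speedup reported above.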