Mass Find-and-Replace, and How to Get REGEXP_SUBSTR to do anything


#1

Fair warning: There’s a pretty lengthy lead-in here.

After migrating a forum from one site to another, I’ve set out to try to replace old links with their equivalents on the new site.

The trouble is that none of the IDs can be expected to match. So, I have to replace every ID in every in-site link. I have a large table of the old links and the current IDs that corresponds to them, so that’s not an issue. What is an issue is working out how to do the conversion.

There’s over 15,000 topics and 460k+ posts, so searching through every post for one topic ID, and then doing that again 14,999+ times, isn’t a time-efficient or reasonable solution.

What I thought would be a decent solution would be to look for posts where there’s content that fits the pattern of the link. After finding the first such post, get the link (although there might be more than one!), get its ID, find the matching ID in the link table, get the current replacement from that row, and then replace the link in the original post with the new one.

As usual in programming, the trouble’s in the details. I decided regular expressions might be the best way to make this happen, but I can’t get them to produce anything even remotely useful.

Here’s what the current code looks like, with genericized names. Right now it’s only single-use since I’m trying to get it to work at all, and verify that it does what I want it to do before I scale it up.

[php] <?php
session_start();
$servername = “localhost”;
$username = “username”;
$password = “password”;
$link = new PDO(“mysql:host=$servername;dbname=dbname”, $username, $password);
$link->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
$total_targets = $link->query(“SELECT id_msg, body FROM dbname.smf_messages WHERE (CONVERT(body USING utf8) LIKE ‘%7.genericizedlink.com/sitename/index.php?showtopic=%’);”);
$row = $total_targets->fetch(PDO::FETCH_ASSOC);//) {
$ebert = $row[‘body’];
$reggie = $link->query(“SELECT REGEXP_SUBSTR(’$ebert’,’(7.genericizedlink.com/sitename/index.php?showtopic=\d+)’)”);
$fetch = $reggie->fetch(PDO::FETCH_OBJ);
var_dump($fetch);
?>[/php]

And the results:

object(stdClass)#4 (1) { ["REGEXP_SUBSTR('[url=http://b7.genericizedlink.com/sitename/index.php?showtopic=239]http://b7.genericizedlink.com/sitename/i...p?showtopic=239[/url]

Insert text from the relevant post here. -_- string(0) “” }

I don’t think Regexp_substr is doing anything at all, and I’m not sure why not. From what I could gather from the syntax in the examples in the manual, this is the correct format, but obviously that isn’t actually the case.

Any insight that anyone can offer here would be greatly appreciated!


#2

I can help later, but I just wanted to drop a line to see how you imported the data into the new system? A data dump would have included the old id’s used, so it wouldn’t have replaced them with a new auto-incremented value.


#3

It’s a good question. In this case, the data for the migration had to be acquired through web crawling. Also, the forum had gone through another move before this one, which had the side effect of messing up all of the active IDs on the site at that time. The original links still worked, redirecting to the proper places after that move – hence it being possible to create a table ‘converting’ all of those links – but the IDs were, and still are, on that version of the site, pretty well unusable by anything except for its current software.


#4

Ah… And you don’t have access to the original DB?

I think it would just be a matter of fixing your query, What are you expecting out of this?
[php]SELECT REGEXP_SUBSTR(’$ebert’,’(7.genericizedlink.com/sitename/index.php?showtopic=\d+)’)[/php]


#5

That’s right. It’d have saved the forum a lot of time and trouble if it were otherwise!

At that point, the idea was that it should select the link from the body of the text (represented by $ebert), with the ID number at the end. So returning something like “7.genericizedlink.com/sitename/index.php?showtopic=5301”. I only actually need the 5301, but the use of the stuff before that was meant to try to help keep it from accidentally selecting something else.

One problem here is that if there are other links in the same post, I wouldn’t expect it to fix those, but running the script several times to make sure it gets it all right is not the biggest problem I have right now. :wink:


#6

This post disappeared from my pending list for some reason.

Since I don’t have anything to test so, I can be a bit more helpful if I have some data to play with to work on queries.


#7

http://sqlfiddle.com/#!9/261649/12

That should be more what you are looking for!


#8

I’m sorry for taking so long to get back to you. My mind’s been occupied, and the more time I spend near this problem the more sick I am of thinking about it, almost to the point of thinking about trying the stupid way of doing this, which should work even if it takes forever and wastes an inordinate amount of bandwidth and processing time (although at least with some kind of filter, you “only” need to search for 15,000 links 1.6k times instead of 460k times). I’ve grown after only a short time with them to despise Regexps.

The fiddle is a step in the right direction, but unfortunately at present it’s kind of reinventing the wheel in this case since it returns the entire post (“body”), not just the target string, which is the link itself. The LIKE select in the original snippet does that adequately, if maybe not so elegantly.

I apparently didn’t sleep well and my brain’s discombobulated to the point where I don’t think I can be much help here right now, but here’s my reply fiddle. The regexp needs to return just the link. Also, the “O’Hara” example works fine here because it’s escaped, but a real row might not have that – it’s something I need to account for in some way. I’d use prepared statements, but doing that with Regexp is apparently … complicated.

Thanks for putting up with me.


#9

Not sure where you need the help, but I am a bit discombobulated as well for the time being.


#10

I ended up doing it locally; there’s apparently a Windows app that can import a spreadsheet, and do this kind of operation, and it worked well enough in Wine. I’m still interested in the problem of doing it automatically, with regexp or a similar method, just because it seems like it ought to be easier than it is, but I’m willing to label the case closed at this point. Thanks for putting time into it, anyway.