Read large tab-delimited text file & put in MySQL database table

We already have a table set up, and we have a large txt file that we are FTPing to our server. The file has over one million rows of data with 27 fields, all tab-delimited. I have never processed such a large file and was wondering about the best method to buffer it so that it does not bog down my server. Could someone point me to a good tutorial or provide information on how best to do this?

Read up on the MySQL statement LOAD DATA INFILE. Here is a complete working script I did for a client. It uses the old mysql_* calls; you should update it to use PDO or mysqli.
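As a rough sketch of the modernized version (the helper function and column names here are illustrative, not part of the original script), you could build the statement once and run it through PDO. Note that LOAD DATA LOCAL INFILE requires the `local_infile` option to be enabled on both client and server:

```php
<?php
// Hypothetical helper: builds a LOAD DATA LOCAL INFILE statement from parts.
// Substitute your own table and column names.
function build_load_data_sql(string $file, string $table, array $columns): string
{
    $file = addslashes($file);      // escape the path for the SQL string literal
    $cols = implode(',', $columns);
    return "LOAD DATA LOCAL INFILE '$file'"
         . " IGNORE INTO TABLE $table"
         . " FIELDS TERMINATED BY '\\t'"
         . " LINES TERMINATED BY '\\n'"
         . " IGNORE 1 LINES"
         . " ($cols)";
}

// Sketch of running it through PDO (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass',
//                [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);
// $rows = $pdo->exec(build_load_data_sql('/path/to/scenes.txt', 'scenes',
//                                        ['partner_id', 'scene_network']));
```

PDO::exec() returns the number of affected rows, which replaces the mysql_affected_rows() call at the end of the old script.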

[php]<?php
/*
Filename: bulk_insert_csv_scenes
Author: Kevin Rubio
Last Modified: Friday, July 01, 2011 9:05 AM

This file takes a csv file on the server and updates a Mysql database. It then deletes
the csv file so a user cannot reload the page and re-insert the same data creating
duplicate records.

Database should have unique keys for the ultimate protection of duplicates being added.
Deleting the csv file just adds an additional layer of protection in case database table was
not set up with unique keys.
*/

$csvfile = 'scenes.txt';
$csv_upload_directory_name = 'uploads_csv';
$table = 'scenes';

//------------------------------------------------------------------
// Full system path to the directory with the csv file
//------------------------------------------------------------------

$realpath = realpath('.');

//------------------------------------------------------------------
// Make sure our upload directory exists
//------------------------------------------------------------------

if (!file_exists("./$csv_upload_directory_name")) {
echo "<div class=\"error\">The required directory $csv_upload_directory_name does not exist at $realpath</div>\n";
include("page_footer.php");
exit;
}

//------------------------------------------------------------------
// Get our directory permissions
//------------------------------------------------------------------

$directory_permissions = substr(sprintf('%o', fileperms("$csv_upload_directory_name")), -4);

//------------------------------------------------------------------
// Directory needs enough permissions to delete our csv file after database update
//------------------------------------------------------------------

if ($directory_permissions != '0777') {
echo "<div class=\"error\">Incorrect permissions on folder $realpath/$csv_upload_directory_name. Permissions are $directory_permissions. They must be 0777.</div>\n";
include("page_footer.php");
exit;
}

//------------------------------------------------------------------
// Make sure our csv file exists
//------------------------------------------------------------------

if (!file_exists("$realpath/$csv_upload_directory_name/$csvfile")) {
echo "<div class=\"error\">The csv file $csvfile does not exist in the directory \"$csv_upload_directory_name\" at $realpath/$csv_upload_directory_name/</div>\n";
include("page_footer.php");
exit;
}

//------------------------------------------------------------------
// Open csv file and make sure row header matches db columns
//------------------------------------------------------------------

$fh = fopen("$realpath/$csv_upload_directory_name/$csvfile", "r");
$csv_header = fgets($fh);
fclose($fh);

$csv_header = rtrim($csv_header);

// Tab delimited version
$valid_header = "partner_id	scene_network	scene_website	scene_url_gallery	scene_description_short	scene_description_long	scene_url_thumb	scene_number_of_videos	scene_number_of_pictures	scene_flash	scene_tube	scene_mobile	scene_date_updated	scene_model	scene_other_model	scene_featured	scene_status";

// Pipe delimited version
//$valid_header = "partner_id|scene_network|scene_website|scene_url_gallery|scene_description_short|scene_description_long|scene_url_thumb|scene_number_of_videos|scene_number_of_pictures|scene_flash|scene_tube|scene_mobile|scene_date_updated|scene_model|scene_other_model|scene_featured|scene_status";

if ($csv_header !=$valid_header) {

echo "<div class=\"error\">The row headers on your file $csvfile are named incorrectly.<br /><br />Required headers:<br />$valid_header</div>";

echo "<div class=\"warning\">Your Header:<br />$csv_header</div>\n";

unlink("$realpath/$csv_upload_directory_name/$csvfile");// Delete bad file
include("page_footer.php");
exit;

}

//------------------------------------------------------------------
// Update DB
//------------------------------------------------------------------

$sql = " LOAD DATA LOCAL INFILE ‘$realpath/$csv_upload_directory_name/$csvfile’ “;
$sql .= " IGNORE”;
$sql .= " INTO TABLE $table";
$sql .= " FIELDS TERMINATED BY ‘\t’ ";
$sql .= " LINES TERMINATED BY ‘\n’ ";
$sql .= " IGNORE 1 LINES "; // Ignores the first X amount of records. Example: 1 would remove header row

$sql .= “(partner_id,scene_network,scene_website,scene_url_gallery,scene_description_short,scene_description_long,scene_url_thumb,scene_number_of_videos,scene_number_of_pictures,scene_flash,scene_tube,scene_mobile,scene_date_updated,scene_model,scene_other_model,scene_featured,scene_status)”;

mysql_query($sql) or die(mysql_error());

//------------------------------------------------------------------
// Delete CSV file
//------------------------------------------------------------------

unlink("$realpath/$csv_upload_directory_name/$csvfile");// Uncomment this for final

//------------------------------------------------------------------
// Show how many records added
//------------------------------------------------------------------

printf("Records updated: %d\n", mysql_affected_rows());
?> [/php]

Here is what I did, but I am concerned about running this process on 1.5 million rows. Is there a good way to buffer it? I am afraid of running it on the server and bogging it down. Any help is appreciated.

[php]<?php
$con = mysql_connect("localhost", "root", "");
mysql_select_db("cars", $con);
$csvfile = 'usedcars.txt';
$csv_upload_directory_name = 'upload';
$table = 'cars';
//------------------------------------------------------------------
// Full system path to the directory with the csv file
//------------------------------------------------------------------
$realpath = realpath('.');
//------------------------------------------------------------------
// Make sure our upload directory exists
//------------------------------------------------------------------

if (!file_exists("./$csv_upload_directory_name")) {
	echo "<div class=\"error\">The required directory $csv_upload_directory_name does not exist at $realpath</div>\n";
	//include("page_footer.php");
	exit;
}

//------------------------------------------------------------------
// Get our directory permissions
//------------------------------------------------------------------
$directory_permissions = substr(sprintf('%o', fileperms("$csv_upload_directory_name")), -4);
//------------------------------------------------------------------
// Directory needs enough permissions to delete our csv file after database update
//------------------------------------------------------------------
//echo substr(sprintf('%o', fileperms('/tmp')), -4);
//echo substr(sprintf('%o', fileperms('/etc/passwd')), -4);  

if ($directory_permissions != '0777') {
	echo "<div class=\"error\">Incorrect permissions on folder $realpath/$csv_upload_directory_name. Permissions are $directory_permissions. They must be 0777.</div>\n";
	//  include("page_footer.php");
	exit;
}

//------------------------------------------------------------------
// Make sure our csv file exists
//------------------------------------------------------------------

if (!file_exists("$realpath/$csv_upload_directory_name/$csvfile")) {
	echo "<div class=\"error\">The csv file $csvfile does not exist in the directory \"$csv_upload_directory_name\" at $realpath/$csv_upload_directory_name/</div>\n";
	//include("page_footer.php");
	//  echo $realpath;
	exit;
}

//------------------------------------------------------------------
// Open csv file and make sure row header matches db columns
//------------------------------------------------------------------
$fh = fopen("$realpath/$csv_upload_directory_name/$csvfile", "r");
$csv_header = fgets($fh);
fclose($fh);
$csv_header = rtrim($csv_header);
//------------------------------------------------------------------
// Update DB
//------------------------------------------------------------------

echo "$realpath/$csv_upload_directory_name/$csvfile";
//initialize the buffer
$buffer = '';
//fopen modes: r = read, rb = read binary, w = write, r+ = read/write, a = append, x = create and write
$handle = fopen("$realpath/$csv_upload_directory_name/$csvfile", 'r');
if ($handle === false) {
	trigger_error('Fatal Error: Could not open file', E_USER_ERROR); //checking the handle prevents an infinite loop with feof()
}
//iterate through the file until EOF (end of file) is reached
while (!feof($handle)) {
	//read 4096 bytes (4 KB) of the file into memory at a time until EOF
	$buffer .= fread($handle, 4096);
}
//close the handle to free resources
fclose($handle);

//do something with the buffer
print "File is " . strlen($buffer) . " bytes.";
exit; // remove this exit to run the LOAD DATA below

$sql  = "LOAD DATA CONCURRENT LOCAL INFILE '$csv_upload_directory_name/$csvfile'";
$sql .= " INTO TABLE ripples";
$sql .= " FIELDS TERMINATED BY '\t'";
$sql .= " LINES TERMINATED BY '\n'";
$sql .= " IGNORE 1 LINES";
$sql .= " (InventoryID, VehicleConditionDescription, VIN, StockNumber, Year, Make, Model, Trim, SellingPrice, Mileage, ExteriorColor, InteriorColor, NumberOfDoors, Engine, Features, Comments, DealerName, DealerPhone, DealerAddress, DealerCity, DealerState, DealerZipCode, DealerRadius, CertificationName, NumberOfImages, AllImages, PayoutPrice)";

mysql_query($sql) or die(mysql_error());
//mysqli_query($con,$sql);
//echo $con."<br>";
//exit;
//------------------------------------------------------------------
// Delete CSV file
//------------------------------------------------------------------
//unlink("$realpath/$csv_upload_directory_name/$csvfile");// Uncomment this for final
//------------------------------------------------------------------
// Show how many records added
//------------------------------------------------------------------
//printf("Records updated: %d\n", mysql_affected_rows());
mysql_close($con);
?>

[/php]
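A side note on the buffering code above: it accumulates the entire file into one string, so on a 1.5-million-row file `$buffer` will use as much memory as the file is large. If you ever do need to process a file of this size in PHP itself (rather than handing it to LOAD DATA INFILE), stream it one line at a time so memory use stays flat. A minimal sketch, with an illustrative helper function:

```php
<?php
// Illustrative: stream a large tab-delimited file line by line instead of
// buffering the whole file into memory. Returns the number of data rows.
function count_data_rows(string $path): int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        trigger_error('Could not open file', E_USER_ERROR);
    }
    fgets($handle); // skip the header row
    $rows = 0;
    while (($line = fgets($handle)) !== false) {
        // Each row could be split here with explode("\t", $line) and
        // batched into multi-row INSERTs if LOAD DATA INFILE is unavailable.
        $rows++;
    }
    fclose($handle);
    return $rows;
}
```

Because fgets() reads one line per call, peak memory is the length of the longest line, not the size of the file.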

1.5 million rows is not a problem for MySQL. Notice that you are using a MySQL statement, LOAD DATA INFILE, and not PHP to do the insert. Also note that the insert is done from a file that is already on the server, so the script is not going to have any issues with timing out.

From the Manual:
When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements.

The most likely issue you could have is if your data is not "clean" and the load errors out. There are options you can set to ignore errors, but why would you want to put bad data in your table? You can read more about the statement here: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

What you can do is install a local LAMP (Linux, Apache, MySQL, PHP) stack on your computer and test on that. Then you will know if you have any data errors and get an idea of how long the load will take as well. The simplest and most robust local server you can install on your PC is Zend Server Community Edition, which is free. You can download it here: http://www.zend.com/en/products/server/downloads

Another thing you can try is running LOAD DATA INFILE directly from the MySQL command line.
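From the mysql prompt, that would look something like this (paths and names are illustrative; adapt them to your own setup):

```sql
-- Start the client with LOCAL enabled: mysql --local-infile=1 -u user -p cars
LOAD DATA LOCAL INFILE '/path/to/usedcars.txt'
INTO TABLE cars
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```

Running it from the client also prints a summary (records read, skipped, warnings), and SHOW WARNINGS afterwards will list any rows MySQL had trouble with.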
