help on self-updating database

[size=10pt]What steps and functions should I use if I want to make a database and have access to the data in an online database (UniProt), so that whenever there is a new release of entries, they would automatically show up and be stored in my database? There would be no more file uploading, and a notification would be shown in my database that a new release of entries is present.[/size]

Hmmm, not clear on what you are asking…

You have an online database that has data in it.

You have a local database that you want to keep updated with data from the online version?

Or, you have an online database that fills with data from other sources and you want to be notified about this?

Give a slightly clearer layout of what you want to do. I am sure we can help.

[size=10pt]
Sorry for that, let me state it more clearly…

I have a database and want to have access to the data in another online database (UniProt - http://www.uniprot.org/ ), so that whenever there is a new release of entries in that database, they would automatically show up and be stored in my database, meaning there will be no more file uploading. Also, a notification will be shown in my database that a new release of entries is present.

For example,
when about 100 entries are released, they will automatically be stored in my database so that there is no need for me to upload all the entries one by one. In my database, the entries will auto-increment.

Does it have something to do with a task scheduler? Windows is my operating system. Do you have some simple samples on this, or links, so that I would be able to work on it? Thanks[/size]

Well, first, did you create the “http://www.uniprot.org/” database and site?
Can you alter their code?

If not, you would have to alter THAT site to count new entries and when 100 new ones appear, it would have to open your second database and do an update to it.

If you do not have access to the code on that site, you would have to create a query to pull all the new entries from their site and parse them to get the data to store in your database. This could be tricky.

I am not familiar with that site. The answer to your question depends on whether you coded the UniProt site.

[size=10pt]I did not create that database. It is a freely accessible resource of protein sequence and functional information.
Those entries (sequences) are my interest…

How would I go about pulling the entries from UniProt and storing them in my database without uploading the entries anymore? Because if I upload them one by one and there are thousands of entries, then it will be hard for me.[/size]

Hmmm, a 30 second search on their site found a link about external linking to their database.
It appears that you can retrieve their data from external databases.

Therefore, you would have to create a SQL or MySQL query to load data from their database into yours.
Basically, you could do this on a timer or manually tell it when to pull in the changes. You would have to
create a QUERY that fits your uses and loads the data you want into your database.

This is really a question you would send to them with details of the data you want to retrieve.
The link on their site for this info is: http://www.uniprot.org/help/mapping

It appears that they list all of their fields and you can find which are which at the link I gave you.
Then, once you figure out the fields for the data you want to retrieve, you would have to create
a SQL query and build it so it INSERTs the data you require into your database. Then, you would
have to have a way to check your database for new info. For instance, you could add a timestamp
to the INSERTed data, and each time your front-end program checked the database, it would indicate
what is new since your last visit.
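
For instance, here is a very rough sketch of that timestamp idea (the database name, table name, column names, and credentials below are just placeholders I made up; swap in your own schema):

[php]
<?php
// Rough sketch only - all names and credentials here are placeholders.
$db = new PDO("mysql:host=localhost;dbname=mydb", "user", "password");

// Example values standing in for one entry pulled from UniProt.
$accession = "P12345";
$sequence  = "MKTAYIAKQR...";

// Store each new entry with a timestamp of when it was added.
$insert = $db->prepare("INSERT INTO entries (accession, sequence, added_on) VALUES (?, ?, NOW())");
$insert->execute(array($accession, $sequence));

// Later, the front-end can ask how many rows were added since the last visit.
$lastVisit = "2012-01-01 00:00:00";
$check = $db->prepare("SELECT COUNT(*) FROM entries WHERE added_on > ?");
$check->execute(array($lastVisit));
if ($check->fetchColumn() > 0) {
    echo "New entries have been added since your last visit.";
}
?>
[/php]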

Well, a lot to think about with this project. Hope this helped…

I don’t get the link on http://www.uniprot.org/help/mapping

So, you mean that I should ask/contact them on how to retrieve their data in order to insert into my database?
I already have the entries of my interest, and they are at this link http://www.uniprot.org/uniprot/?query=annotation%3A(type%3Asignal+confidence%3Aexperimental)+reviewed%3Ayes&sort=score

Well, that makes it easy. You can just web-scrape that page and get your data. What that means is that you read that page into a variable and pull the data from the text of the page.

It involves just passing the URL you posted to me into a process that gets the page and then decodes what is on it.

First, is that page exactly the data you want put into your database? Or is there other data you need from the links on that page? If it is just exactly that data, we can do it.

Yes, that page of data is what I want to put in my database…
Actually all of the entries on it, and I also want that when you click an entry on that page, which leads to this link http://www.uniprot.org/uniprot/?query=annotation%3A(type%3Asignal+confidence%3Aexperimental)+reviewed%3Ayes&sort=score ,
it will also be in my database, not in exactly that format but as a text format of the entry…

So, how do I start this one?

Well, first you must learn about webpage scraping. It is simple…

All webpages can be viewed. Simple. And all of the source code for the general HTML, general JavaScript, general VBScript, etc. can be viewed. Simple. (Just RIGHT-CLICK on ANY webpage and select VIEW-SOURCE to see what I am talking about.)

So, if you can view this page info, you can “capture” it. You do this by what is called “scraping” the page into a text string. Once it is saved in text format, you can use a vast number of PHP tools to edit and alter the data you have “scraped” from the page. This is very helpful in pulling Wiki descriptions or Google search data and manipulating the data for your own use. I use it in an AI project to get words from dictionary.com.

So, that is the simple basics. Here is a short sample of code which pulls the page you wanted and then simply displays it. I tested it and it works well.

file: getsequences.php
[php]
<?php
// Note: the URL must be all on one line with no spaces in it
$url = "http://www.uniprot.org/uniprot/?query=annotation%3A%28type%3Asignal+confidence%3Aexperimental%29+reviewed%3Ayes&sort=score";

// Read the entire page into one string and display it
$webpagedata = file_get_contents($url);
echo $webpagedata;
?>
[/php]

That’s all there is to it…
Now, put this file on your site. Just use it by itself right now for testing. You will see the data pulled from the other site. It will not be as “pretty” since you do not have their CSS codes for colors, etc. You don’t need them!

Next step is a bit trickier. You must “scrape” the data you now have to pull what you want out of it. The easiest way to start with that section of the project is to review how “THEY” create the data you need.
First, go to the page you are scraping. RIGHT-CLICK on a blank area and select VIEW-SOURCE. This will open up a new window with the source code of their entire page. You must study this source code and find what you need. Then, you have to write code using PHP’s extensive string functions, such as preg_match.
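Just to give you a rough idea of what that step can look like, here is an untested sketch (the regular expression is only a guess; adjust it to whatever markup you actually find in their VIEW-SOURCE):

[php]
<?php
// Untested sketch - the pattern below is a guess and must be adapted to the
// real markup you find when you VIEW-SOURCE their results page.
$url = "http://www.uniprot.org/uniprot/?query=annotation%3A%28type%3Asignal+confidence%3Aexperimental%29+reviewed%3Ayes&sort=score";
$webpagedata = file_get_contents($url);

// Pull the contents of every table cell, then strip any leftover tags.
if (preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $webpagedata, $matches)) {
    foreach ($matches[1] as $cell) {
        echo strip_tags($cell) . "\n";
    }
}
?>
[/php]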
Here is a link on web scraping I just found for you, explaining how to get data off of MSN or whatever. It will explain how to search a string a little. It should keep you busy for a while. I am gone most of this afternoon. I will check back tonight to see how you are doing on your project. Good luck…
http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php

Thank you so much

You are welcome. We will leave this post open as you may have further questions… Good luck

[php]
require("WebGet.php");

$taxon = $_GET[0];
$agent = new WebGet();
$agent->useCache = true;
$agent->cacheLocation = '';
$agent->cookieFile = 'cookie.txt';

$query = "http://www.uniprot.org/uniprot/?query=organism:$taxon&format=fasta&include=yes";
$file = $taxon . '.fasta';

$agent->requestContent($query);

if ($agent->responseStatusCode == 200) {
    $results = $agent->responseHeaders[strtoupper('X-Total-Results')];
    $release = $agent->responseHeaders[strtoupper('X-UniProt-Release')];
    $date = date("Y-m-d", strtotime($agent->responseHeaders[strtoupper('Last-Modified')]));
    print "Downloaded $results entries of UniProt release $release ($date) to file $file\n";
} elseif ($agent->responseStatusCode == 304) { // 304 Not Modified
    print "Data for taxon $taxon is up-to-date.\n";
} else {
    die('Failed, got ' . $agent->responseStatusLine .
        " for uniprot/?query=organism:$taxon&format=fasta&include=yes\n");
}
[/php]

I have this file, which should download all UniProt sequences for a given organism in FASTA format once per release, but it still doesn’t work… and it gives the following error:

Notice: Undefined offset: 0
Failed, got for uniprot/?query=organism:&format=fasta&include=yes

Can you help me with this?

This is the WebGet class -> https://github.com/shiplu/dxtool/blob/master/WebGet.php

Well, I gave you code to pull the data in three lines, and you sent me back some weird code dealing with pulling and decoding headers from pages. Also, you sent me a different URL to pull than before. All of this without any explanation of why things changed in your quest.

The new link you gave:
$query = "http://www.uniprot.org/uniprot/?query=organism:$taxon&format=fasta&include=yes";
shows all the data you need in a nice, simple-to-read format.

If you simply use the code I gave you and plug this query into it, you will get all of your data in one string variable. Then you can simply parse for the “>sp” or “>tr” header that starts each entry, decode the data, and save it into your database. I am not sure where you got the “WebGet()” code and do not understand why you feel the need to decode every possible webpage header that can be on a webpage when all you really want is the data for that organism. This data is already displayed on the page and all you need to do is read it and load it into your database.
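
As a rough, untested sketch of that idea (the taxon 9606 below is just an example value, and the array you end up with still has to be INSERTed into your own table layout):

[php]
<?php
// Rough sketch only. In FASTA format, each entry is one ">" header line
// (e.g. "sp|P12345|..." or "tr|Q9XYZ1|...") followed by its sequence lines.
$taxon = "9606"; // example organism id - use whatever taxon you need
$url   = "http://www.uniprot.org/uniprot/?query=organism:$taxon&format=fasta&include=yes";
$fasta = file_get_contents($url);

$entries = array();
foreach (explode(">", $fasta) as $block) {
    if (trim($block) == "") { continue; }   // skip the empty piece before the first ">"
    $lines    = explode("\n", trim($block));
    $header   = array_shift($lines);        // the "sp|..." / "tr|..." description line
    $sequence = implode("", $lines);         // join the wrapped sequence lines back together
    $entries[] = array('header' => $header, 'sequence' => $sequence);
}

// Each element of $entries can now be INSERTed into your database as one row.
print count($entries) . " entries parsed\n";
?>
[/php]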

Also, $_GET[0] is trying to read a value passed to the page in its URL (the query string). Nothing named “0” is being passed, which is why you get the “Undefined offset: 0” notice and why $taxon ends up empty in your query. It would need to be $_GET['nameofvariable'], matching whatever your form or link actually passes, to make sense.
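
Something like this instead (I am only guessing “taxon” as the parameter name; use whatever name your form or link really passes):

[php]
<?php
// Guessing at the parameter name - call the script as getsequences.php?taxon=9606
if (!isset($_GET['taxon']) || $_GET['taxon'] == "") {
    die("No taxon given.\n");
}
$taxon = $_GET['taxon'];
?>
[/php]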

Sorry, but I am not really sure what all these new changes are doing for you. Please explain in more detail. Thanks, and good luck…
