cURL Scraping Data

system · September 1, 2014, 1:14am

I’m using cURL to crawl and scrape data from a website. The site contains tables with rows of data. When I send a cURL POST for the underlying data of a specific row (A), it returns the expected data. But when I move to the second row (B), the data comes back blank, or more specifically as a ton of spaces (or &nbsp; entities). When I open the same POST location in a browser, I can see (B)'s data. The only difference between the two POSTs is the location ID of the data. I don’t think it’s a problem with JavaScript, since I can successfully return data from row (A), as I mentioned.
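To tell a "blank" response like (B) apart from a real one inside a crawler, a check along these lines could help. This is only a rough sketch: `is_effectively_blank` is a hypothetical helper, and stripping tags with a regular expression is a simplification, not a full HTML parser.

```python
import html
import re


def is_effectively_blank(fragment):
    """Return True if an HTML fragment holds only whitespace or &nbsp; entities."""
    # Drop tags with a simple pattern (good enough for a quick check).
    text = re.sub(r"<[^>]+>", "", fragment)
    # Turn entities such as &nbsp; into their characters (U+00A0 for &nbsp;).
    text = html.unescape(text)
    # Treat non-breaking spaces like ordinary spaces before stripping.
    return text.replace("\u00a0", " ").strip() == ""
```

Running this over each returned detail cell would flag rows that come back as padding only.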

Website I’m trying to crawl: https://mycpa.cpa.state.tx.us/up/Search.jsp

Working POST URL(A): https://mycpa.cpa.state.tx.us/up/searchresults.do?d-49216-p=&d-49216-s=&how=&last=bales&other=&d-49216-o=&zip=&_chk=74170700611986R2ZZZZ26&which=View+Details
Non-working POST URL(B): https://mycpa.cpa.state.tx.us/up/searchresults.do?d-49216-p=&d-49216-s=&how=&last=bales&other=&d-49216-o=&zip=&_chk=74600015611995R1AC081084&which=View+Details

Interestingly, you can combine the data location IDs to show more than one set of data per page. When I try this, the first set of data (A) is displayed and the second (B) is shown as spaces (or &nbsp; entities).
Combined POST URL: https://mycpa.cpa.state.tx.us/up/searchresults.do?d-49216-p=&d-49216-s=&how=&last=bales&other=&d-49216-o=&zip=&_chk=74170700611986R2ZZZZ26&_chk=74600015611995R1AC081084&which=View+Details
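For reference, the combined query string above can be built programmatically by repeating the `_chk` field once per ID. A minimal Python sketch, where the field names are copied from the URLs in this post and `build_details_query` is a hypothetical helper name:

```python
from urllib.parse import urlencode

BASE = "https://mycpa.cpa.state.tx.us/up/searchresults.do"


def build_details_query(last_name, chk_ids):
    """Return the URL-encoded fields for a 'View Details' request."""
    # A list of pairs keeps the field order and allows repeated keys.
    fields = [
        ("d-49216-p", ""),
        ("d-49216-s", ""),
        ("how", ""),
        ("last", last_name),
        ("other", ""),
        ("d-49216-o", ""),
        ("zip", ""),
    ]
    # One _chk entry per data location ID.
    fields += [("_chk", chk) for chk in chk_ids]
    fields.append(("which", "View Details"))
    return urlencode(fields)


query = build_details_query(
    "bales", ["74170700611986R2ZZZZ26", "74600015611995R1AC081084"]
)
print(BASE + "?" + query)
```

The same string can be sent as the POST body instead of being appended to the URL.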

astonecipher · September 1, 2014, 3:35am

Private websites. Links to private commercial and noncommercial websites will only be allowed if there is a public purpose for establishing the link, and the link is approved by Management. Considerations in determining whether to link to a Private Web site. Where there is a legitimate business need (public purpose) to establish a link from an agency Web page to a private Web site, the proposed link will be evaluated on a case-by-case basis. If there is a public purpose for including such links (e.g. to facilitate the state fulfilling a specific statutory duty or to provide information related to the agency's programs, activities, or functions), then the link may be allowed, subject to approval by Management.

If you can pm me proof of this or anything stating you are not violating their TOS, I will unlock this. Until then…

ErnieAlex · September 8, 2014, 1:58pm

Not sure what Astone is talking about as the Texas unclaimed property site is a public site…

But investigating the site’s code shows that the link you are using has a session number inside it.
This is most likely how the page saves your current search info. Therefore, you may get a site error
such as a 404 back because your session value is incorrect.

Sites, even open public ones, sometimes use sessions to keep people from reading the data openly.
Where you got the arguments from is the question. Did you copy the URL from your browser
and use that? If so, that is most likely the issue. I doubt that you can “crawl” that site.
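If the session guess is right, one thing to try is keeping cookies between requests, so the details POST runs in the same session as a visit to the search page. A rough Python sketch follows. It assumes the site tracks its session with cookies and that loading Search.jsp first is enough to start one; neither is confirmed in this thread. The function names are made up for illustration.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

SEARCH_PAGE = "https://mycpa.cpa.state.tx.us/up/Search.jsp"
RESULTS_URL = "https://mycpa.cpa.state.tx.us/up/searchresults.do"


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_details_request(last_name, chk_id):
    """Build (but do not send) the POST request for one row's details."""
    fields = [
        ("d-49216-p", ""), ("d-49216-s", ""), ("how", ""),
        ("last", last_name), ("other", ""), ("d-49216-o", ""),
        ("zip", ""), ("_chk", chk_id), ("which", "View Details"),
    ]
    body = urlencode(fields).encode("ascii")
    return urllib.request.Request(RESULTS_URL, data=body, method="POST")


def fetch_details(opener, last_name, chk_id, timeout=30):
    """Visit the search page first (assumed to set the session), then POST."""
    opener.open(SEARCH_PAGE, timeout=timeout).close()
    request = build_details_request(last_name, chk_id)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_details(opener, "bales", "74600015611995R1AC081084")` with the opener from `make_session()` would then repeat the request for row (B) inside one cookie-backed session. If the site also expects the search itself to be submitted before "View Details", that step would need to be added in between.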

You could simply embed the site in a window and pass your search entries into the embedded page, and
that might work for you. Then you would have to read the results from the embedded page instead of
calling the actual page with arguments. Hope that made sense…