I am trying to build a simple web scraper using Request and Cheerio.
The goal right now is to scrape the destination page (in this case http://bukk.it), grab the text from the target selectors on the page, and push it to an array that I can use in other functions.
I understand that request() executes asynchronously, but I don't know how to make the scraped data visible outside the callback.
example.js
// dependencies
var request = require('request')
  , cheerio = require('cheerio');
// variables
var url = 'http://bukk.it/'; // url to scrape
var bukkits = []; // hold our scraped data
request(url, function(err, resp, body){
  if (err) {
    return;
  }
  var $ = cheerio.load(body);
  // for each of our targets (within the request body)...
  $('td a').each(function(){
    var content = $(this).text();
    // I would love to populate the bukkits array for use later...
    bukkits.push(content);
  });
  console.log(bukkits.length); // everything is working inside request
});

console.log(bukkits.length); // 0, because request is asynchronous?
// that's cool but... how do I actually get the data from the request into bukkits[] ?
Essentially, your entire program must now take place inside the callback. Code that runs after the request() call returns, but before the callback fires, will never see the data that is delivered asynchronously to that callback.
This isn't as bad as it sounds. You can use named functions, like so:
request(url, onRequestDone);

function onRequestDone(err, resp, body) {
  var bukkits = []; // not global, make it local
  // as above
  doMoreWork(bukkits);
}

function doMoreWork(bukkits) {
  // stuff after the parsing goes here.
}
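For completeness, here is a minimal runnable sketch of that pattern applied to your scraper (same url and selector as in the question; the console.log in doMoreWork is just a placeholder for whatever you do next):

var request = require('request')
  , cheerio = require('cheerio');

var url = 'http://bukk.it/';

request(url, onRequestDone);

function onRequestDone(err, resp, body) {
  if (err) {
    return console.error(err);
  }
  var bukkits = []; // local to the callback, passed along explicitly
  var $ = cheerio.load(body);
  $('td a').each(function(){
    bukkits.push($(this).text());
  });
  doMoreWork(bukkits);
}

function doMoreWork(bukkits) {
  // everything that needs the scraped data starts from here
  console.log(bukkits.length + ' items scraped');
}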
Your code ends before the request completes.
Use the forever version of the agent:

request = require('request').forever;

or use setTimeout to keep your program running:

setTimeout(function(){}, 1000000);
To use the values later on, that work also needs to happen after the request call completes.
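As a minimal illustration of the ordering involved (the numbered logs are only there to show when each line runs):

var request = require('request');
var url = 'http://bukk.it/';

console.log('1: request issued');

request(url, function(err, resp, body){
  console.log('3: response received; the scraped data can be used from here');
});

console.log('2: this runs before the response arrives');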