I can do this in Python and Ruby, but I wanted to give Node.js a shot, and so far the whole scraping process is confusing. I am having trouble with the POST request that logs into a site so I can scrape the data. Here is the code:
var request = require('request');
var cheerio = require('cheerio');
var querystring = require('querystring');
var credentials = {
  username: 'kevin',
  password: 'secret'
};
// POST the credentials as a URL-encoded form body
request.post({
  uri: 'http://yourwebsite.com/login',
  headers: { 'content-type': 'application/x-www-form-urlencoded' },
  body: querystring.stringify(credentials)
}, function(err, res, body){
  if(err) {
    // The request itself failed (for example, a network error)
    console.error(new Error('Login failed'));
    return;
  }
});
Say I want to scrape after I have logged in. Do I replace username and password under credentials with the field id, or with the field name? Also, where is the part where I hit the submit button on the form?
Edit: Here is the full code for another site I tried:
var cheerio = require('cheerio');
var request = require('request');
var querystring = require('querystring');
var credentials = {
  acct: '....',
  pw: '.....'
};
// Step 1: POST the login form
request.post({
  uri: 'https://news.ycombinator.com/login?whence=news',
  headers: { 'content-type': 'application/x-www-form-urlencoded' },
  body: querystring.stringify(credentials)
}, function(err, res, body){
  if(err) {
    console.error(new Error('Login failed'));
    return;
  }
  // Step 2: fetch the page to scrape
  request('https://news.ycombinator.com', function(err, res, body) {
    if(err) {
      console.error(new Error('Request failed'));
      return;
    }
    // Parse the HTML and print the text of the top bar
    var $ = cheerio.load(body);
    var text = $('.pagetop').text();
    console.log(text);
  });
});
Say I want to scrape after I have logged in. Do I replace username and password under credentials with the field id, or with the field name?
Use the field name, not the id. A form submits each input as a name=value pair, and the id is never sent to the server. So if the HTML form has text inputs named username and password, your credentials object should have the keys username and password, just as it does now.
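To make that concrete, here is a minimal sketch of what actually goes over the wire. The form markup in the comment is hypothetical (the ids user-box and pass-box are made up for illustration); the point is only that the keys of credentials line up with the name attributes.

```javascript
var querystring = require('querystring');

// Hypothetical login form, for illustration only:
//   <input type="text"     id="user-box" name="username">
//   <input type="password" id="pass-box" name="password">
// The keys below match the name attributes, not the ids.
var credentials = {
  username: 'kevin',
  password: 'secret'
};

// This string is what ends up as the body of the POST request.
var encodedBody = querystring.stringify(credentials);
console.log(encodedBody); // username=kevin&password=secret
```

If the site's inputs were named acct and pw instead, the keys would have to be acct and pw.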
Also, where is the part where I hit the submit button on the form?
There isn't one. You're making the HTTP request yourself, which is what the browser does after the button is clicked. The submit button often has a name and value just like any other form input, and some servers check for it. If you would like to include it, add it to your credentials object.
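As a rough sketch of including the button, assume the form contains a submit input with name go and value Login. Both are invented for this example, and buildLoginBody is a hypothetical helper, so check the real form's markup for the actual name and value.

```javascript
var querystring = require('querystring');

// Hypothetical markup: <input type="submit" name="go" value="Login">
// buildLoginBody is an illustrative helper, not part of any library.
function buildLoginBody(fields, submitName, submitValue) {
  var payload = {};
  // Copy the regular form fields first
  Object.keys(fields).forEach(function (key) {
    payload[key] = fields[key];
  });
  // Add the submit button as one more name=value pair, if given
  if (submitName) {
    payload[submitName] = submitValue;
  }
  return querystring.stringify(payload);
}

var loginBody = buildLoginBody(
  { username: 'kevin', password: 'secret' },
  'go',
  'Login'
);
console.log(loginBody); // username=kevin&password=secret&go=Login
```

The resulting string can be passed as body in the request.post call, in place of the plain stringified credentials.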