I'm looking to build a feature into an AngularJS web app that allows a user to paste a URL to an eCommerce site like Amazon or Zappos and retrieve the main product image from that page. My plan is to post the URL to my Express API and handle the image retrieval on the server.
My initial plan was to download the raw HTML, parse it with htmlparser, select all the img elements with soupselect and retrieve their src attributes. Ideally I would like a solution that works across any site, rather than hardcoding values for a particular retailer's site (such as specific known CSS class names). One assumption I made was that the largest image on the page would likely be the main product image, so I decided to sort the images by file size. My idea was to make an HTTP HEAD request to the src URL of each image and read its size from the Content-Length header. So far this approach has worked well, but I would really like to avoid making so many HTTP requests, even if they are only HEAD requests.
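For reference, the HEAD-request ranking described above could be sketched roughly as follows. This is only an illustration, not my actual code: the names SizedImage, parseContentLength, largestFirst and rankByContentLength are made up for the example, and it assumes a runtime with a global fetch (Node 18+ or a browser). Servers that omit Content-Length are treated as size 0.

```typescript
// Hypothetical shape for an image URL paired with its reported byte size.
interface SizedImage {
  url: string;
  bytes: number;
}

// Turn a Content-Length header value into a number; missing or invalid values become 0.
function parseContentLength(header: string | null): number {
  const n = header === null ? NaN : Number.parseInt(header, 10);
  return Number.isFinite(n) && n >= 0 ? n : 0;
}

// Return a copy of the list sorted from largest to smallest byte size.
function largestFirst(images: SizedImage[]): SizedImage[] {
  return [...images].sort((a, b) => b.bytes - a.bytes);
}

// Issue one HEAD request per unique URL in parallel and rank the results.
// Assumption: a global fetch is available; a failed request counts as size 0.
async function rankByContentLength(urls: string[]): Promise<SizedImage[]> {
  const unique = Array.from(new Set(urls));
  const sized = await Promise.all(
    unique.map(async (url): Promise<SizedImage> => {
      try {
        const res = await fetch(url, { method: "HEAD" });
        return { url, bytes: parseContentLength(res.headers.get("content-length")) };
      } catch {
        return { url, bytes: 0 };
      }
    })
  );
  return largestFirst(sized);
}
```

The requests run concurrently, so the total latency is roughly that of the slowest response, but the number of requests still grows with the number of images on the page.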
I feel there is a better way of doing this. Would it be easier to use something like PhantomJS to load the entire page and parse it that way? I was trying to make this work as quickly as possible, which is why I wanted to avoid downloading all of the images. Does anyone have any suggestions?
I would think the best image to use isn't the one with the largest file size, but the one that is displayed largest on the page. PhantomJS might be able to help you determine that. Load the page, but instruct PhantomJS not to load images. Then pick the img element whose computed dimensions are biggest. This will only work if the page uses CSS or width and height attributes on the img elements to give them dimensions, since an image that is never loaded cannot contribute its own size.
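As a rough sketch of the selection step only: assume the rendered width and height of each img element have already been collected (from PhantomJS or any other source) into plain objects. The ImageBox type and the largestDisplayed function are hypothetical names for this example. Images with no usable dimensions are skipped, which is exactly the case where this approach cannot help.

```typescript
// Hypothetical record of one img element and its displayed size in pixels.
interface ImageBox {
  src: string;
  width: number;
  height: number;
}

// Pick the image covering the largest area on the page.
// Entries whose width or height is not a positive number are ignored,
// since they carry no layout information to compare.
function largestDisplayed(boxes: ImageBox[]): ImageBox | undefined {
  let best: ImageBox | undefined;
  let bestArea = 0;
  for (const box of boxes) {
    if (!(box.width > 0) || !(box.height > 0)) continue;
    const area = box.width * box.height;
    if (area > bestArea) {
      best = box;
      bestArea = area;
    }
  }
  return best;
}
```

It returns undefined when no image has usable dimensions, so the caller can fall back to another strategy such as comparing file sizes.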
Alternatively, you could send the image URLs back to the client, and have the client fetch the images and figure out which is biggest. That limits the number of requests your server has to make, and it allows the user to quickly pick a different image if the largest isn't the best.