I have an HTML file that contains various tags, including a number of tables. I am processing this file with Python. How do I find out the size (width x height in pixels) of each table when it is rendered by a browser (preferably Chrome or Firefox)?
I am essentially looking for the information you get from "inspect element" in a browser, where you can see the size of the various elements. I want to access this size in my Python code.
I am using lxml to parse my html and can use selenium if needed.
edit: added #node.js in case I can use it to print the size of all the tables from a shell script and then grab that output in Python.
You're going to want to use Selenium WebDriver to open the HTML file in an actual browser installed on the computer that your Python code is running on.
I'm not sure how you'd use the Selenium WebDriver API to find out how tall a rendered table is, but the value_of_css_property method might do it.
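For what it's worth, here is a rough sketch of how the Selenium route could look. It asks the browser for each table's `getBoundingClientRect()` via `execute_script`, which should be close to the box that "inspect element" reports. The function names `table_sizes` and `measure_file` are made up for illustration, and the sketch assumes Chrome plus a matching driver are available on the machine. Selenium's `WebElement` also exposes `size` and `rect` properties if you would rather look elements up one at a time.

```python
from pathlib import Path

# JavaScript that runs inside the rendered page and returns one
# {width, height} object per <table>, in document order.
TABLE_SIZES_JS = """
return Array.from(document.querySelectorAll('table')).map(function (t) {
    var r = t.getBoundingClientRect();
    return {width: r.width, height: r.height};
});
"""


def table_sizes(driver):
    """Return [(width, height), ...] in CSS pixels for every table on the page."""
    rows = driver.execute_script(TABLE_SIZES_JS)
    return [(row["width"], row["height"]) for row in rows]


def measure_file(html_path):
    """Open a local HTML file in Chrome and measure its tables.

    Assumption: selenium is installed and Chrome can be started on this machine.
    """
    from selenium import webdriver  # imported lazily so table_sizes works without it

    driver = webdriver.Chrome()
    try:
        # Browsers need a file:// URL, not a bare filesystem path.
        driver.get(Path(html_path).resolve().as_uri())
        return table_sizes(driver)
    finally:
        driver.quit()
```

Calling `measure_file("report.html")` (a hypothetical file name) would then give you one `(width, height)` tuple per table. Keep in mind the numbers depend on the browser window size, since table layout usually adapts to the viewport width.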
If you can shell out to a script, and you can use Node.js, I'm assuming you could also install and use PhantomJS, which is a headless WebKit port (i.e. an actual honest-to-goodness WebKit renderer that just doesn't require a window to work). This will let you use JavaScript and the familiar web libraries to manipulate the document. As an example, the following gets you the width of the logo element at the upper left of the Stack Overflow site:
var page = require('webpage').create(); // create a new "browser"
page.open('http://stackoverflow.com/', function() {
    // callback when loading completes
    var logoWidth = page.evaluate(function() {
        // This runs in the rendered page and uses the version of jQuery that SO loads.
        return $('#hlogo').width();
    });
    console.log(logoWidth); // prints 250, the same as Chrome.
    phantom.exit(); // the script keeps running until you exit explicitly
});
The documentation for PhantomJS will tell you more about what you can do with it and how.
One caveat, however, is that loading a page takes a while, since it needs to fetch CSS and scripts and generally do everything a browser does. I'm not sure whether or how PhantomJS does any caching. If it does, it might make sense to reuse the same process for multiple scrapes of the same site.
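To get the numbers back into Python, as the question asks, one option is to have the PhantomJS script print JSON and read its standard output with `subprocess`. The sketch below is only an illustration under some assumptions: `phantomjs` is on your PATH, and you have written a script (hypothetical here, passed in as `script_path`) that takes a URL argument, collects the table sizes inside `page.evaluate`, and prints a single JSON array such as `[{"width": 300, "height": 120}]` before exiting. The helper names `parse_sizes` and `phantom_table_sizes` are made up.

```python
import json
import subprocess


def parse_sizes(stdout_text):
    """Turn the script's JSON output into a list of (width, height) tuples."""
    return [(item["width"], item["height"]) for item in json.loads(stdout_text)]


def phantom_table_sizes(script_path, url, timeout=60):
    """Run a PhantomJS script against `url` and parse what it prints.

    Assumptions: `phantomjs` is on PATH, and the script at `script_path`
    (hypothetical) prints exactly one JSON array of {width, height} objects.
    """
    result = subprocess.run(
        ["phantomjs", script_path, url],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
        timeout=timeout,      # guard against a script that never calls phantom.exit()
    )
    return parse_sizes(result.stdout)
```

The timeout is there because a PhantomJS script that never reaches `phantom.exit()` will otherwise hang the Python process waiting for it.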