python - Web scraping a website that blocks access for scripts
There is a website that I used to scrape with a Python script (urllib). It seems that the website is now blocking my requests: whenever I request a page from a script, I get HTML containing some JavaScript but none of the normal data. Accessing the website from my browser works fine. I tried changing the 'user-agent' to match the one my browser sends, but it did not help. One strange behavior I noticed is that after accessing a page from my browser, I can access it from the script too.
So my questions are:
- How can the server detect that the client is not a browser (even though I change the user-agent)?
- What mechanism could explain the strange behavior where the script can access a page after the page has been loaded in the browser? Is it caching? If so, where does the caching happen?
- Any idea how to move forward? (I have a not very elegant solution of opening every page in my browser before loading it from the script, but it takes a lot of time.)
Thanks!
Without going into too much detail, it seems the site has been updated to include JavaScript loaders. urllib cannot process JavaScript, so it is unable to proceed (pure speculation here).
There are many ways a site can prevent a scraper from reaching it, including having some JavaScript set or update cookies, or otherwise modify the session in a similar manner. It is entirely up to the site, so you will have to check it manually.
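If the check happens to rely on cookies that the server sets through HTTP headers, a urllib opener that keeps a cookie jar and sends a browser-like User-Agent may be worth trying first. This is only a minimal sketch under that assumption: the User-Agent string and the function names are illustrative, and cookies set by JavaScript will not be captured this way.

```python
import http.cookiejar
import urllib.request

# Illustrative value only; copy the string your own browser sends.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_browser_like_opener(user_agent=USER_AGENT):
    """Return (opener, cookie_jar).

    The opener sends a browser-like User-Agent and stores and re-sends
    any cookies the server sets via Set-Cookie headers.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Fetch url with the given opener and return the decoded body."""
    with opener.open(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch` twice with the same opener sends back whatever cookies the first response set. If the second response still contains only the JavaScript stub, the site probably requires the script itself to run.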
The general solution is to use a JavaScript-capable scraper like selenium, which actually opens the page in a locally installed Firefox, Chrome, or IE and simulates clicks and other actions on the loaded page. You can also use PhantomJS.
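A rough sketch of the Selenium approach might look like the following. It assumes the third-party `selenium` package and a working Firefox driver are installed; the function names are made up for illustration, and pages that load data asynchronously may need explicit waits before the page source is read.

```python
def make_firefox_driver():
    """Start a locally installed Firefox through Selenium.

    Assumes the third-party `selenium` package and a Firefox driver
    are available on this machine.
    """
    # Imported lazily so this module loads even without Selenium installed.
    from selenium import webdriver
    return webdriver.Firefox()


def fetch_rendered_html(url, driver_factory=make_firefox_driver):
    """Open url in a real browser and return the HTML after scripts ran."""
    driver = driver_factory()
    try:
        driver.get(url)            # the browser executes the page's JavaScript
        return driver.page_source  # serialized DOM at the time of the call
    finally:
        driver.quit()              # always close the browser window
```

Passing the driver in through `driver_factory` makes it easy to swap in Chrome or another browser without touching the fetching logic.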
There are a lot of posts about this, but here's one that can give you a starting point: