python - Beautiful Soup error - not "seeing" entire web page?


I want to scrape some simple web links. I want to get all the links that belong to the "indent-3" class and are of type li. I thought the code for this would be:

    from bs4 import BeautifulSoup, SoupStrainer
    import httplib2

    # Stats Canada webpage
    base_page = (
        "http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/"
        "Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=&FREE=0"
        "&GC=0&GID=0&G=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0"
        "&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF="
        "&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
    )
    http = httplib2.Http()
    # http.request returns a (response headers, body) pair
    status, response = http.request(base_page)
    soup = BeautifulSoup(response)
    link = soup.find_all("li", class_="indent-3")

But when I run this code, I get a list of 13 links when there should be 288. And when I run

    soup.get_text()

soup only picks up text from a very small part of the webpage. Entry number 428 on the page is Brakeley.

Why am I not getting most of the webpage?

Edit: Since it looks like this is not a Beautiful Soup problem, I tried saving the page's HTML as webfile.html and then reading that file directly in Python.

    # Read the saved copy of the page instead of downloading it
    f = open("webfile.html", "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)
    link = soup.find_all("li", class_="indent-3")

I still get only 13 links. I do not know what I'm doing wrong...
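
A quick way to tell whether the missing items are lost in the download or in the parsing is to count how often the class name appears in the raw HTML and compare that with an independent parse. The sketch below does this with Python 3's standard-library html.parser. The ClassCounter class, the count_li_with_class helper, and the inline sample_html (a stand-in for the contents of webfile.html) are illustrative assumptions, not part of the original code.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <li> start tags whose class attribute contains a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        # A valueless class attribute arrives as None, so fall back to "".
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.count += 1


def count_li_with_class(html, wanted_class):
    """Hypothetical helper: parse html and return the number of matching <li> tags."""
    parser = ClassCounter(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Assumption: a tiny stand-in for the real page saved as webfile.html.
sample_html = """
<ul>
  <li class="indent-3"><a href="/a">A</a></li>
  <li class="indent-2"><a href="/b">B</a></li>
  <li class="indent-3 last"><a href="/c">C</a></li>
</ul>
"""

raw_hits = sample_html.count("indent-3")          # plain substring count
parsed_hits = count_li_with_class(sample_html, "indent-3")
print(raw_hits, parsed_hits)
```

If the raw count on the real file is already close to 13, the content never arrived; if it is much higher than what the parser finds, the parser is the more likely place to look. The substring count is only a rough upper bound, since it also matches longer class names such as "indent-30".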

This is not a BeautifulSoup problem; it is about the request you are making.

Use requests and provide a User-Agent header:

    from bs4 import BeautifulSoup
    import requests

    # Stats Canada webpage
    base_page = (
        "http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/"
        "Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=&FREE=0"
        "&GC=0&GID=0&G=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0"
        "&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF="
        "&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
    )
    # Send a browser-like User-Agent so the server returns the full page
    response = requests.get(base_page, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/38.0.2125.111 Safari/537.36'
    })
    soup = BeautifulSoup(response.content)
    link = soup.find_all("li", class_="indent-3")
    print(len(link))  # prints 288
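
For anyone who cannot install requests, the same header can be attached using only the standard library. This is a minimal sketch assuming Python 3 and urllib.request.Request in place of requests; build_request and the example.com URL are hypothetical placeholders, and nothing is sent over the network until the request is passed to urllib.request.urlopen.

```python
from urllib.request import Request

# Browser-like User-Agent string, as used in the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/38.0.2125.111 Safari/537.36"
)

# Assumption: placeholder URL; substitute the real page address.
PAGE_URL = "http://example.com/some-page?TABID=5&LANG=E"


def build_request(url, user_agent):
    """Hypothetical helper: return a Request carrying a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(PAGE_URL, BROWSER_UA)

# Request normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Calling urllib.request.urlopen(req).read() would then return the page bytes, which can be handed to BeautifulSoup in the same way as response.content.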
