python - Beautiful Soup error - not "seeing" entire web page?
I want to scrape some simple web links. I want to find all the links which belong to the "indent-3" class and print them. I thought the code for this would be:
from bs4 import BeautifulSoup, SoupStrainer
import httplib2

# Stats Canada webpage
base_page = "http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=&FREE=0&GC=0&GID=0&G=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"

http = httplib2.Http()
status, response = http.request(base_page)
soup = BeautifulSoup(response)
links = soup.find_all("li", class_="indent-3")
But when I run this code, I get a list of only 13 links when it should be 288. And when I call
soup.get_text()
the soup only contains the text from a very small part of the webpage; the last entry it picks up is number 428, Brakeley.
Why am I not getting most of the webpage?
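One way to narrow this down is to check how much HTML the request actually returns before any parsing happens. A minimal diagnostic sketch, using the same httplib2 setup and base_page as above (the status/length checks are an illustration added here, not part of the original code):

# Diagnostic sketch: see whether the full page is being downloaded at all,
# before blaming the parser.
http = httplib2.Http()
status, response = http.request(base_page)

print status.status               # HTTP status code returned by the server
print len(response)               # size of the HTML actually received
print response.count("indent-3")  # raw occurrences of the class name in the HTML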
Edit: Since it looked like this might not be a Beautiful Soup problem, I tried saving the website's HTML as webfile.html and then reading it directly in Python:
f = open("webfile.html", "r")
page = f.read()
soup = BeautifulSoup(page)
links = soup.find_all("li", class_="indent-3")
I still get only 13 links. I don't know what I'm doing wrong...
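A quick check on the saved copy can show whether the file itself is incomplete or whether content is being lost during parsing. A small sketch, assuming webfile.html is the saved page described above:

# Sketch: compare raw occurrences of the class name in the saved file
# against what BeautifulSoup finds after parsing.
page = open("webfile.html", "r").read()

print page.count("indent-3")                        # occurrences in the raw HTML
soup = BeautifulSoup(page)
print len(soup.find_all("li", class_="indent-3"))   # items the parser actually sees

If the first number is already small, the saved file never contained the full page; if it is much larger than the second, the problem is in the parsing step.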
This isn't a Beautiful Soup problem; it's about the request you are making.
from bs4 import BeautifulSoup
import requests

# Stats Canada webpage
base_page = "http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=&FREE=0&GC=0&GID=0&G=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"

response = requests.get(base_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
soup = BeautifulSoup(response.content)
links = soup.find_all("li", class_="indent-3")
print len(links)  # prints 288
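To print the links themselves rather than just the count, something like the following should work. It assumes each matching li wraps an a tag with an href, which is an assumption about the page markup rather than something shown above:

# Sketch: print the text and target of each matching list item,
# assuming every <li class="indent-3"> contains an <a href="..."> element.
for li in links:
    a = li.find("a")
    if a is not None:
        print a.get_text(strip=True), a.get("href")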