web - Crawler to extract only the content, excluding photos, ads, etc.
Does anyone know a good open source crawler that I can use to extract only the content of a page, meaning just the text and photos, without menus and so on?
If you know the pattern for how ads appear in the HTML, then you can do this with Norconex HTTP Collector. It is a very flexible open source web crawler. When you configure its Importer module, you can ask it to strip sections of text before and after certain tags, or to strip whatever is found between known tags.
To give you an idea, suppose you know that a certain site always displays its ads between these tags:
<div class="myAdd"> ... ad ... </div>
The relevant Importer configuration section would then look like this:
<transformer class="com.norconex.importer.transformer.impl.StripBetweenTransformer" inclusive="true">
  <stripBetween>
    <start><![CDATA[<div class="myAdd">]]></start>
    <end><![CDATA[</div>]]></end>
  </stripBetween>
</transformer>
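To make the "strip between known tags" idea concrete, here is a minimal Python sketch of the same principle. It is not the crawler's own code: the function name `strip_between` and the sample HTML are made up for illustration, and the sketch treats the start and end markers as literal strings.

```python
import re


def strip_between(text, start, end, inclusive=True):
    """Remove every span that runs from `start` to the nearest following `end`.

    Hypothetical helper for illustration only. With inclusive=True the
    markers are removed too; otherwise only the text between them is dropped.
    """
    # re.escape makes the markers match literally; ".*?" is non-greedy so each
    # match stops at the first `end` after a `start`; DOTALL lets it span lines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    if inclusive:
        return re.sub(pattern, "", text, flags=re.DOTALL)
    # Keep the markers themselves and drop only what sits between them.
    return re.sub(pattern, lambda m: start + end, text, flags=re.DOTALL)


# Made-up sample page for the example.
html = '<p>Intro</p><div class="myAdd">Buy now!</div><p>Body</p>'
print(strip_between(html, '<div class="myAdd">', '</div>'))
# prints: <p>Intro</p><p>Body</p>
```

One caveat of this simple approach: because the match stops at the first closing tag, an ad block that contains nested div elements would only be partly removed by this sketch. In that case a more specific end marker is needed.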
You can use the same principle to strip headers and footers. If you do not want to crawl images, you can easily filter them out.
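As a rough illustration of filtering out images, the sketch below skips URLs by file extension before they are fetched. This is only one possible approach and not the crawler's own configuration; the names `IMAGE_EXTENSIONS`, `is_image_url` and `filter_crawl_queue`, the extension list, and the example URLs are all assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse

# Assumed list of extensions to skip; adjust to the sites being crawled.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def is_image_url(url):
    """Return True if the URL path ends in one of the image extensions."""
    # Look only at the path, so query strings such as ?v=2 are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


def filter_crawl_queue(urls):
    """Keep only the URLs that do not look like images."""
    return [url for url in urls if not is_image_url(url)]


queue = [
    "http://example.com/article.html",
    "http://example.com/img/logo.PNG?v=2",
    "http://example.com/",
]
print(filter_crawl_queue(queue))
# prints: ['http://example.com/article.html', 'http://example.com/']
```

Filtering on the extension is cheap because it needs no request, but it misses images served from URLs without an extension. Checking the Content-Type of the response would catch those.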