web - Crawler to extract only content, excluding photos/ads, etc.

- February 15, 2010

Does anyone know a good open source crawler that I can use to extract only the content of a page, meaning just the text/photos, without menus, etc.?

If you know how the ads appear in the HTML, you can do this with the Norconex crawler, whose Importer module is configured below. It is a very flexible open source web crawler: when you configure its Importer module, you can ask it to strip sections of text before and after certain tags, or to strip whatever is between known tags.

To give you an idea, if you know that a certain site displays its ads between these tags:

  <div class="myAdd"> ... ads go here ... </div>

The related Importer section will then look like this:

  <transformer class="com.norconex.importer.transformer.impl.StripBetweenTransformer" inclusive="true">
    <stripBetween>
      <start><![CDATA[<div class="myAdd">]]></start>
      <end><![CDATA[</div>]]></end>
    </stripBetween>
  </transformer>
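
To make the "strip between" idea concrete, here is a minimal Python sketch of the same technique. It is not the Norconex implementation: the function name `strip_between` and its `inclusive` parameter are hypothetical, chosen only to mirror the configuration above, and the start and end markers are treated as literal strings.

```python
import re


def strip_between(text, start, end, inclusive=True):
    """Remove each span running from `start` to the next `end`.

    Hypothetical helper for illustration only. `start` and `end`
    are literal strings. With inclusive=True the markers are removed
    too; otherwise only the text between them is removed.
    """
    s, e = re.escape(start), re.escape(end)
    # Non-greedy match so each start marker pairs with the nearest end marker.
    pattern = re.compile("(" + s + ").*?(" + e + ")", re.DOTALL)
    if inclusive:
        return pattern.sub("", text)
    return pattern.sub(lambda m: m.group(1) + m.group(2), text)


html = '<p>Article text.</p><div class="myAdd">Buy now!</div><p>More text.</p>'
cleaned = strip_between(html, '<div class="myAdd">', '</div>')
print(cleaned)
```

Because the match stops at the first closing marker, an ad block that itself contains nested `div` elements would be cut short. In that case a more specific end marker is needed.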

You can use the same principle to strip headers and footers. If you do not want to crawl images, you can filter them out easily.
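
As a rough illustration of filtering out images before they are fetched, the sketch below drops URLs whose path ends in a common image extension. The helper name `is_image_url` and the extension list are assumptions made for this example. They are not part of any crawler's API, and a real crawler would normally offer its own filter configuration for this.

```python
import posixpath
from urllib.parse import urlparse

# Assumed list of extensions to skip; adjust to the sites being crawled.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"}


def is_image_url(url):
    """Return True if the URL path ends with a known image extension."""
    path = urlparse(url).path  # ignores the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in IMAGE_EXTENSIONS


urls = [
    "http://example.com/article.html",
    "http://example.com/img/photo.JPG?size=large",
    "http://example.com/",
]
to_crawl = [u for u in urls if not is_image_url(u)]
print(to_crawl)
```

Matching on the extension is only a heuristic: images served from URLs without an extension would still need to be filtered by content type after the response headers arrive.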
