My WebSPHINX crawler is running out of RAM. How can I control its memory use?
By default, WebSPHINX retains all the pages and links it has crawled until you clear the crawler. This can exhaust memory quickly, especially if you’re crawling more than a few hundred pages. Here are some tricks for changing the defaults and keeping memory under control. (Note that these tricks apply only when you’re writing your own crawler in Java, not when you’re using the Crawler Workbench.)

• Use Page.discardContent() to throw away (stop referencing) a page’s content when you’re done with it, so that the garbage collector can reclaim it. This method preserves the page’s array of outgoing Links, however, so you’ll still have the crawl graph if you need it.

• Disconnect the crawl graph entirely by breaking the references between links and pages, so that every Page and Link object can be reclaimed once the crawler has finished visiting it. To do this, call page.getOrigin().setPage(null) whenever you’re done processing a page.

• Another kind of memory bloat is caused by the
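The two techniques above can be combined in a crawler subclass. This is a minimal sketch assuming the standard WebSPHINX API (Crawler, Page.discardContent(), Page.getOrigin(), and Link.setPage() are WebSPHINX calls); the class name LeanCrawler and the process() helper are illustrative, not part of the library:

```java
import websphinx.Crawler;
import websphinx.Page;

// Sketch of a memory-conscious WebSPHINX crawler. After handling each page,
// it discards the page content and severs the link-to-page reference so the
// garbage collector can reclaim both objects.
public class LeanCrawler extends Crawler {

    @Override
    public void visit(Page page) {
        process(page); // your per-page logic goes here

        // Drop the downloaded content but keep the page's outgoing Links,
        // so the crawl graph survives if you still need it.
        page.discardContent();

        // Optionally disconnect the crawl graph entirely: once no Link
        // references this Page, both become garbage-collectable.
        if (page.getOrigin() != null) {
            page.getOrigin().setPage(null);
        }
    }

    // Illustrative placeholder for whatever work you do on each page.
    private void process(Page page) {
        System.out.println(page.getURL());
    }
}
```

If you only need the link structure, stop after discardContent(); call setPage(null) as well only when you are finished with the page object itself.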