My WebSPHINX crawler is running out of RAM. How can I control its memory use?
By default, WebSPHINX retains all the pages and links it has crawled until you clear the crawler. This can exhaust memory quickly, especially if you’re crawling more than a few hundred pages. Here are some tricks for changing the defaults and keeping memory under control. (Note that these tricks apply only when you’re writing your own crawler in Java, not when you’re using the Crawler Workbench.)

• Use Page.discardContent() to throw away (stop referencing) a page’s content when you’re done with it, so that the garbage collector can reclaim it. This method preserves the page’s array of outgoing Links, however, so you’ll still have the crawl graph if you need it.

• Disconnect the crawl graph entirely by breaking the references between links and pages, so that every Page and Link object can be reclaimed once the crawler has finished visiting it. To do this, call page.getOrigin().setPage(null) whenever you’re done processing a page.

• Another kind of memory bloat is caused by the
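The two techniques above can be combined in a crawler subclass. This is a minimal sketch assuming the standard WebSPHINX API (Crawler, Page.discardContent(), Page.getOrigin(), and Link.setPage() are WebSPHINX calls); the class name LeanCrawler and the process() helper are illustrative, not part of the library:

```java
import websphinx.Crawler;
import websphinx.Page;

// Sketch of a memory-conscious WebSPHINX crawler. After handling each page,
// it discards the page content and severs the link-to-page reference so the
// garbage collector can reclaim both objects.
public class LeanCrawler extends Crawler {

    @Override
    public void visit(Page page) {
        process(page); // your per-page logic goes here

        // Drop the downloaded content but keep the page's outgoing Links,
        // so the crawl graph survives if you still need it.
        page.discardContent();

        // Optionally disconnect the crawl graph entirely: once no Link
        // references this Page, both become garbage-collectable.
        if (page.getOrigin() != null) {
            page.getOrigin().setPage(null);
        }
    }

    // Illustrative placeholder for whatever work you do on each page.
    private void process(Page page) {
        System.out.println(page.getURL());
    }
}
```

If you only need the link structure, stop after discardContent(); call setPage(null) as well only when you are finished with the page object itself.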