• Apache Nutch
    Apache Nutch is a highly extensible and scalable open source web crawler software project. Nutch is coded entirely in the Java programming language, but...
    13 KB (625 words) - 20:19, 5 January 2025
  • Simplified Data Processing on Large Clusters". Development started on the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006....
    48 KB (4,947 words) - 02:29, 8 June 2025
  • Apache Tika
    from other programming languages. The project originated as part of the Apache Nutch codebase, to provide content identification and extraction when crawling...
    6 KB (503 words) - 09:30, 1 August 2024
  • as Lucene.NET, Mahout, Tika and Nutch. These three are now independent top-level projects. In March 2010, the Apache Solr search server joined as a Lucene...
    15 KB (1,258 words) - 13:31, 1 May 2025
  • Software Yandex Data Factory Yaoota Shopping Engine Yebol Zedge Apache Lucene Apache Nutch Apache Solr Datafari Community Edition DocFetcher Gigablast Grub...
    2 KB (116 words) - 22:05, 1 April 2025
  • StormCrawler (category Software using the Apache license)
    StormCrawler. InfoQ ran one in December 2016. A comparative benchmark with Apache Nutch was published in January 2017 on dzone.com. Several research papers mentioned...
    5 KB (405 words) - 09:53, 5 January 2025
  • Doug Cutting
    and Nutch, with Mike Cafarella. The Apache Software Foundation now manages both projects. Cutting and Cafarella were also co-founders of Apache Hadoop...
    8 KB (686 words) - 15:33, 27 July 2024
  • This list of Apache Software Foundation projects contains the software development projects of The Apache Software Foundation (ASF). Besides the projects...
    38 KB (4,300 words) - 16:50, 29 May 2025
  • Dynamics, Coveo for Commerce, and Coveo for Sitecore. Apache Lucene Apache Solr Elasticsearch Apache Nutch Algolia Lucidworks "Coveo". Craft.co. 2020-06-01...
    5 KB (467 words) - 23:25, 29 May 2025
  • started to list WACZ as an acceptable format. ArchiveBox ArchiveWeb.page Apache Nutch Conifer har2warc Heritrix web archiver in Java libarchive ReplayWeb.page...
    7 KB (466 words) - 00:29, 15 April 2025
  • Web crawler
    scalability Apache Nutch is a highly extensible and scalable web crawler written in Java and released under an Apache License. It is based on Apache Hadoop...
    53 KB (6,958 words) - 13:41, 12 June 2025
  • emerging efforts in Apache Nutch and Hadoop which Mattmann participated in, OODT was given an overhaul making it more amenable towards Apache Software Foundation...
    9 KB (960 words) - 19:19, 12 November 2023
  • Name Details Apache Nutch Nutch is a well matured, production ready Web crawler. AppFuse open-source Java EE web application framework. Drools Business...
    17 KB (12 words) - 20:19, 10 December 2024
  • ht://Dig Isearch Lemur Toolkit & Indri Search Engine Lucene mnoGoSearch Nutch Openverse Recoll Searchdaimon SearXNG Seeks Sphinx SWISH-E Terrier Search...
    24 KB (848 words) - 13:09, 14 June 2025
  • extraction Terminology extraction Mining, crawling, scraping, and recognition Apache Nutch, web crawler Concept mining Named entity recognition Textmining Web scraping...
    21 KB (2,541 words) - 00:01, 23 April 2025
  • List of Web archiving initiatives
    2021 ReplayWeb.page 1 Ghost Archive Common Crawl United States 2008 Apache Nutch, Apache Tika, pywb, in-house tools 3 3 GFNDC United States (global nodes...
    118 KB (2,238 words) - 21:51, 14 June 2025
  • Lucene in Action, the founder of Simpy, and committer on Lucene, Solr, Nutch, Apache Mahout, and Open Relevance projects) founded Sematext. Sematext is headquartered...
    3 KB (145 words) - 14:23, 31 May 2025
  • Chris Mattmann
    create other projects including Apache Nutch an open source web crawler and the predecessor to the big data platform Apache Hadoop, in May 2013 Mattmann...
    8 KB (679 words) - 17:43, 17 June 2024
  • of excessive SEO." In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler. Common Crawl switched...
    14 KB (956 words) - 19:24, 26 May 2025
  • from Hadoop nodes Nutch - An effort to build an open source search engine based on Lucene and Hadoop, also created by Doug Cutting Apache Accumulo - Secure...
    8 KB (955 words) - 20:45, 10 October 2024
  • FIPS (computer program) TestDisk ApexKB, formerly known as Jumper Lucene Nutch Solr Xapian Konstanz Information Miner (KNIME) Pentaho PeaZip 7-Zip OpenAFS...
    75 KB (5,415 words) - 12:31, 15 June 2025
  • Pentaho
    software portal Nutch - an effort to build an open source search engine based on Lucene and Hadoop, also created by Doug Cutting Apache Accumulo - Secure...
    26 KB (958 words) - 21:43, 5 April 2025
  • Heritrix
    Retrieved 2006-06-23. Tools by Internet Archive: Heretrix 3 Documentation NutchWAX Archived 2011-09-28 at the Wayback Machine - search web archive collections...
    10 KB (991 words) - 20:44, 5 April 2025