searchhut

Author	SHA1	Message	Date
Umar Getagazov	b3a5803c0c	crawler: ignore 405 responses for HEAD requests It also skips header checks because they usually don't have headers we need (for example, most omit the Content-Type header). Fixes: https://todo.sr.ht/~sircmpwn/searchhut/41	2022-07-15 08:58:21 +02:00
Drew DeVault	731950a326	crawler: trim excerpt Fixes: https://todo.sr.ht/~sircmpwn/searchhut/38	2022-07-13 10:26:22 +02:00
Umar Getagazov	cbd3732deb	Store page size in the database Implements: https://todo.sr.ht/~sircmpwn/searchhut/33	2022-07-13 10:14:37 +02:00
Drew DeVault	53eefd6787	crawler: fix log message	2022-07-11 21:31:20 +02:00
Drew DeVault	19a9a3a3b5	sh-index: add -u flag to add URLs to schedule This is useful for indexing parts of sites which are not reachable from the index page.	2022-07-11 20:57:59 +02:00
Umar Getagazov	fde8b75efd	Drop crawl schedule-related fields They were unused.	2022-07-11 17:50:44 +02:00
Umar Getagazov	a7e6fba60f	Rank authoritative websites and index pages higher Implements: https://todo.sr.ht/~sircmpwn/searchhut/23	2022-07-11 17:49:19 +02:00
Umar Getagazov	2971603710	Put domain labels minus eTLD into the text index Before, only the hostname (say, harelang.org) was indexed, and no results appeared for a "harelang" query. Now, all domain labels (minus the eTLD) are indexed separately (for example, "docs" and "harelang" for "docs.harelang.org"), and such queries work. eTLD is removed using the data from Mozilla's Public Suffix List (https://publicsuffix.org).	2022-07-11 17:48:46 +02:00
Umar Getagazov	5471687556	Add per-domain page exclusion mechanism	2022-07-11 13:20:31 +02:00
Drew DeVault	e44770b9b7	schema: add "source" column to page	2022-07-10 10:13:11 +02:00
Drew DeVault	d30cdbf52e	crawler: fix interval input	2022-07-10 09:55:30 +02:00
Drew DeVault	01b2b1349b	crawler: compute checksum and make unique Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30	2022-07-10 09:36:07 +02:00
Drew DeVault	9790813a55	Track pages with JavaScript and total crawl time	2022-07-10 09:12:07 +02:00
Drew DeVault	e15dffd86b	Handle Retry-After as timestamp	2022-07-09 19:16:48 +02:00
Drew DeVault	c15f968a28	crawler: re-schedule after HTTP 429 Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5	2022-07-09 19:14:55 +02:00
Drew DeVault	6978b602f4	Handle canonical URLs Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11	2022-07-09 19:06:28 +02:00
Drew DeVault	baf82f9bb8	crawler: perform HEAD before GET Implements: https://todo.sr.ht/~sircmpwn/searchhut/8	2022-07-09 18:59:23 +02:00
Drew DeVault	759ad758af	crawler: improve index settings	2022-07-09 18:57:39 +02:00
Drew DeVault	35a4faa05b	sh-index: fetch user agent from config	2022-07-09 18:14:06 +02:00
Drew DeVault	a8069bb73b	Increase default delay to 5 seconds	2022-07-08 20:56:00 +02:00
Drew DeVault	d6bc032d24	crawler: respect robots.txt	2022-07-08 20:30:09 +02:00
Drew DeVault	eb6769c904	crawler: follow links regardless of readability	2022-07-08 20:13:32 +02:00
Drew DeVault	fbd0492ef1	cmd/sh-search: initial commit	2022-07-08 20:04:37 +02:00
Drew DeVault	050694c4f2	Initial commit	2022-07-08 19:46:11 +02:00

24 commits
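
Note on 2971603710: the message describes indexing every domain label except the eTLD. Below is a minimal sketch of that idea, not the code from the commit. The function name index_labels and the three-entry PUBLIC_SUFFIXES set are hypothetical stand-ins; the commit itself says it uses the data from Mozilla's Public Suffix List, and this sketch ignores that list's wildcard and exception rules.

```python
# Hypothetical stand-in for the Public Suffix List (https://publicsuffix.org).
# The real list is far larger and also has wildcard and exception rules,
# which this sketch does not handle.
PUBLIC_SUFFIXES = {"org", "com", "co.uk"}


def index_labels(hostname: str) -> list[str]:
    """Return the domain labels of hostname with the eTLD removed."""
    labels = hostname.lower().rstrip(".").split(".")
    # Scan from the left so the longest matching suffix is found first.
    for start in range(len(labels)):
        if ".".join(labels[start:]) in PUBLIC_SUFFIXES:
            return labels[:start]
    # No known suffix: keep every label.
    return labels


print(index_labels("docs.harelang.org"))  # ['docs', 'harelang']
```

With labels split out this way, a query for "harelang" can match docs.harelang.org, which is the behaviour the commit message describes.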