searchhut

Author	SHA1	Message	Date
Umar Getagazov	fde8b75efd	Drop crawl schedule-related fields. They were unused.	2022-07-11 17:50:44 +02:00
Umar Getagazov	a7e6fba60f	Rank authoritative websites and index pages higher Implements: https://todo.sr.ht/~sircmpwn/searchhut/23	2022-07-11 17:49:19 +02:00
Umar Getagazov	2971603710	Put domain labels minus eTLD into the text index. Before, only the hostname (say, harelang.org) was indexed, and no results appeared for a "harelang" query. Now, all domain labels (minus the eTLD) are indexed separately (for example, "docs" and "harelang" for "docs.harelang.org"), and such queries work. The eTLD is removed using data from Mozilla's Public Suffix List (https://publicsuffix.org).	2022-07-11 17:48:46 +02:00
Drew DeVault	e44770b9b7	schema: add "source" column to page	2022-07-10 10:13:11 +02:00
Drew DeVault	01b2b1349b	crawler: compute checksum and make unique Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30	2022-07-10 09:36:07 +02:00
Drew DeVault	9790813a55	Track pages with JavaScript and total crawl time	2022-07-10 09:12:07 +02:00
Drew DeVault	e15dffd86b	Handle Retry-After as timestamp	2022-07-09 19:16:48 +02:00
Drew DeVault	c15f968a28	crawler: re-schedule after HTTP 429 Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5	2022-07-09 19:14:55 +02:00
Drew DeVault	6978b602f4	Handle canonical URLs Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11	2022-07-09 19:06:28 +02:00
Drew DeVault	baf82f9bb8	crawler: perform HEAD before GET Implements: https://todo.sr.ht/~sircmpwn/searchhut/8	2022-07-09 18:59:23 +02:00
Drew DeVault	759ad758af	crawler: improve index settings	2022-07-09 18:57:39 +02:00
Drew DeVault	d6bc032d24	crawler: respect robots.txt	2022-07-08 20:30:09 +02:00
Drew DeVault	eb6769c904	crawler: follow links regardless of readability	2022-07-08 20:13:32 +02:00
Drew DeVault	050694c4f2	Initial commit	2022-07-08 19:46:11 +02:00

14 commits
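
The label splitting described in commit 2971603710 can be illustrated with a minimal sketch. This is not the code from that commit: the function name `index_labels` and the tiny `SAMPLE_SUFFIXES` set are assumptions made for illustration, and the real Public Suffix List also has wildcard and exception rules that are ignored here.

```python
# Hypothetical stand-in for the Public Suffix List; the real list has
# thousands of entries plus wildcard and exception rules (ignored here).
SAMPLE_SUFFIXES = frozenset({"org", "com", "co.uk", "uk"})


def index_labels(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the labels of hostname with its longest known suffix removed."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try candidate suffixes from longest to shortest. The loop starts at 1
    # so the whole hostname is never treated as a suffix.
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in suffixes:
            return labels[:start]
    # No known suffix: drop only the last label, or keep a lone label as-is.
    return labels[:-1] if len(labels) > 1 else labels


if __name__ == "__main__":
    print(index_labels("docs.harelang.org"))
    print(index_labels("example.co.uk"))
```

Under these assumptions, `docs.harelang.org` yields the labels `docs` and `harelang`, so a "harelang" query has something to match. Checking the longest suffix first matters for multi-label suffixes such as `co.uk`, where stripping only `uk` would leave `co` in the index.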