Commit graph

15 commits

Author SHA1 Message Date
Drew DeVault
53eefd6787 crawler: fix log message 2022-07-11 21:31:20 +02:00
Drew DeVault
19a9a3a3b5 sh-index: add -u flag to add URLs to schedule
This is useful for indexing parts of sites which are not reachable from
the index page.
2022-07-11 20:57:59 +02:00
Umar Getagazov
a7e6fba60f Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
2971603710 Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
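The label splitting described in commit 2971603710 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `knownSuffixes` map is a tiny hypothetical stand-in for Mozilla's Public Suffix List, and the `domainLabels` function name is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// knownSuffixes is a tiny, hypothetical stand-in for the Public Suffix List.
// A real implementation would load the full list from https://publicsuffix.org.
var knownSuffixes = map[string]bool{
	"org":   true,
	"com":   true,
	"co.uk": true,
}

// domainLabels (hypothetical name) returns the labels of host that remain
// after the longest matching suffix in knownSuffixes has been removed.
func domainLabels(host string) []string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	labels := strings.Split(host, ".")

	// Candidate suffixes are checked from longest to shortest, so that
	// "co.uk" wins over a shorter match where both would apply.
	for i := 0; i < len(labels); i++ {
		if knownSuffixes[strings.Join(labels[i:], ".")] {
			return labels[:i]
		}
	}

	// Assumption: when no suffix is known, drop only the last label.
	if len(labels) > 1 {
		return labels[:len(labels)-1]
	}
	return labels
}

func main() {
	// Each returned label would be added to the text index separately.
	fmt.Println(domainLabels("docs.harelang.org")) // prints [docs harelang]
}
```

With labels indexed this way, a query for "harelang" can match pages hosted on docs.harelang.org, which is the behaviour the commit message describes.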
Umar Getagazov
5471687556 Add per-domain page exclusion mechanism 2022-07-11 13:20:31 +02:00
Drew DeVault
d30cdbf52e crawler: fix interval input 2022-07-10 09:55:30 +02:00
Drew DeVault
01b2b1349b crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
Drew DeVault
9790813a55 Track pages with JavaScript and total crawl time 2022-07-10 09:12:07 +02:00
Drew DeVault
c15f968a28 crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
Drew DeVault
baf82f9bb8 crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
35a4faa05b sh-index: fetch user agent from config 2022-07-09 18:14:06 +02:00
Drew DeVault
a8069bb73b Increase default delay to 5 seconds 2022-07-08 20:56:00 +02:00
Drew DeVault
d6bc032d24 crawler: respect robots.txt 2022-07-08 20:30:09 +02:00
Drew DeVault
fbd0492ef1 cmd/sh-search: initial commit 2022-07-08 20:04:37 +02:00
Drew DeVault
050694c4f2 Initial commit 2022-07-08 19:46:11 +02:00