Drew DeVault
53eefd6787
crawler: fix log message
2022-07-11 21:31:20 +02:00
Drew DeVault
19a9a3a3b5
sh-index: add -u flag to add URLs to schedule
This is useful for indexing parts of sites which are not reachable from
the index page.
2022-07-11 20:57:59 +02:00
Umar Getagazov
a7e6fba60f
Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
2971603710
Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
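The label splitting described in commit 2971603710 can be illustrated with a minimal sketch. This is not the code from that commit: `domainLabels` and `knownSuffixes` are hypothetical names, and the three-entry suffix table is a stand-in for the full Public Suffix List, which a real implementation would consult instead (in Go, for instance, through the golang.org/x/net/publicsuffix package).

```go
package main

import (
	"fmt"
	"strings"
)

// knownSuffixes is a tiny, hypothetical stand-in for the Public Suffix
// List. It exists only to keep this sketch self-contained.
var knownSuffixes = map[string]bool{
	"org":   true,
	"com":   true,
	"co.uk": true,
}

// domainLabels splits host into labels and drops the longest trailing
// suffix found in knownSuffixes. If no suffix matches, every label is
// returned unchanged (the real list has a default rule that this
// sketch does not model).
func domainLabels(host string) []string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	labels := strings.Split(host, ".")
	// Walk from the longest candidate suffix to the shortest, so that
	// "co.uk" is preferred over a shorter match.
	for i := 0; i < len(labels); i++ {
		if knownSuffixes[strings.Join(labels[i:], ".")] {
			return labels[:i]
		}
	}
	return labels
}

func main() {
	fmt.Println(domainLabels("docs.harelang.org")) // [docs harelang]
	fmt.Println(domainLabels("www.example.co.uk")) // [www example]
}
```

Each returned label would then be added to the text index as its own term, which is why a "harelang" query can match pages on docs.harelang.org.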
Umar Getagazov
5471687556
Add per-domain page exclusion mechanism
2022-07-11 13:20:31 +02:00
Drew DeVault
d30cdbf52e
crawler: fix interval input
2022-07-10 09:55:30 +02:00
Drew DeVault
01b2b1349b
crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
Drew DeVault
9790813a55
Track pages with JavaScript and total crawl time
2022-07-10 09:12:07 +02:00
Drew DeVault
c15f968a28
crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
Drew DeVault
baf82f9bb8
crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
35a4faa05b
sh-index: fetch user agent from config
2022-07-09 18:14:06 +02:00
Drew DeVault
a8069bb73b
Increase default delay to 5 seconds
2022-07-08 20:56:00 +02:00
Drew DeVault
d6bc032d24
crawler: respect robots.txt
2022-07-08 20:30:09 +02:00
Drew DeVault
fbd0492ef1
cmd/sh-search: initial commit
2022-07-08 20:04:37 +02:00
Drew DeVault
050694c4f2
Initial commit
2022-07-08 19:46:11 +02:00