Commit graph

24 commits

Author SHA1 Message Date
Umar Getagazov
b3a5803c0c crawler: ignore 405 responses for HEAD requests
Header checks are also skipped for such responses, because they usually
lack the headers we need (for example, most omit the Content-Type header).

Fixes: https://todo.sr.ht/~sircmpwn/searchhut/41
2022-07-15 08:58:21 +02:00
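
The behaviour described in b3a5803c0c can be sketched roughly as follows. This is only an illustration in Python, not the crawler's actual code: the helper name `should_get_after_head`, the treatment of other non-2xx statuses, and the "HTML only" Content-Type check are all assumptions made for the example.

```python
from typing import Mapping


def should_get_after_head(status: int, headers: Mapping[str, str]) -> bool:
    """Decide whether a HEAD probe should be followed by a full GET.

    Hypothetical helper; the name and the exact checks are assumptions.
    """
    # 405 Method Not Allowed: the server rejects HEAD itself, so the response
    # says nothing about the page. Skip the header checks and fall back to GET.
    if status == 405:
        return True
    # Any other non-2xx status is treated as a failed probe (assumed policy).
    if not 200 <= status < 300:
        return False
    # Assumed header check: only HTML pages are worth downloading.
    content_type = headers.get("Content-Type", "")
    return content_type.split(";")[0].strip().lower() == "text/html"
```

With this sketch, a 405 reply with no headers at all still leads to a GET, while a 200 reply advertising `application/pdf` does not.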
Drew DeVault
731950a326 crawler: trim excerpt
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/38
2022-07-13 10:26:22 +02:00
Umar Getagazov
cbd3732deb Store page size in the database
Implements: https://todo.sr.ht/~sircmpwn/searchhut/33
2022-07-13 10:14:37 +02:00
Drew DeVault
53eefd6787 crawler: fix log message
2022-07-11 21:31:20 +02:00
Drew DeVault
19a9a3a3b5 sh-index: add -u flag to add URLs to schedule
This is useful for indexing parts of sites which are not reachable from
the index page.
2022-07-11 20:57:59 +02:00
Umar Getagazov
fde8b75efd Drop crawl schedule-related fields
They were unused.
2022-07-11 17:50:44 +02:00
Umar Getagazov
a7e6fba60f Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
2971603710 Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), so such queries work. The eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
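
As an illustration of the label splitting described in 2971603710, here is a minimal Python sketch. It is not the project's implementation: the function name `index_labels` is hypothetical, and the tiny hard-coded `SUFFIXES` set merely stands in for the real Public Suffix List (wildcard and exception rules are ignored).

```python
# Tiny stand-in for the Public Suffix List (assumption: the real list is
# far larger and also contains wildcard and exception rules).
SUFFIXES = {"org", "com", "uk", "co.uk"}


def index_labels(hostname: str, suffixes: set = SUFFIXES) -> list:
    """Return the domain labels of hostname with the eTLD removed."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return labels[:i]
    # No rule matched: treat the last label as the suffix, which mirrors
    # the list's default rule.
    return labels[:-1]
```

Under these assumptions, `index_labels("docs.harelang.org")` yields `["docs", "harelang"]`, which matches the example in the commit message.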
Umar Getagazov
5471687556 Add per-domain page exclusion mechanism
2022-07-11 13:20:31 +02:00
Drew DeVault
e44770b9b7 schema: add "source" column to page
2022-07-10 10:13:11 +02:00
Drew DeVault
d30cdbf52e crawler: fix interval input
2022-07-10 09:55:30 +02:00
Drew DeVault
01b2b1349b crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
Drew DeVault
9790813a55 Track pages with JavaScript and total crawl time
2022-07-10 09:12:07 +02:00
Drew DeVault
e15dffd86b Handle Retry-After as timestamp
2022-07-09 19:16:48 +02:00
Drew DeVault
c15f968a28 crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
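
Re-scheduling after HTTP 429 (c15f968a28), together with the timestamp form of Retry-After (e15dffd86b), might look roughly like the Python sketch below. The function name `retry_delay` and the fallback `default` parameter are assumptions made for the example; the crawler's own logic may differ. HTTP allows Retry-After to carry either a number of seconds or an HTTP date, and the sketch handles both.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def retry_delay(value: str, now: datetime, default: timedelta) -> timedelta:
    """Turn a Retry-After header value into a delay before the next attempt.

    Hypothetical helper; `default` is used when the value cannot be parsed.
    """
    value = value.strip()
    # Form 1: delay in seconds, e.g. "120".
    if value.isascii() and value.isdigit():
        return timedelta(seconds=int(value))
    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    # Dates without zone information are assumed to be UTC.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means the retry may happen immediately.
    return max(when - now, timedelta(0))
```

Passing `now` explicitly (as a timezone-aware datetime) keeps the function deterministic and easy to test; a caller would typically supply `datetime.now(timezone.utc)`.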
Drew DeVault
6978b602f4 Handle canonical URLs
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11
2022-07-09 19:06:28 +02:00
Drew DeVault
baf82f9bb8 crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
759ad758af crawler: improve index settings
2022-07-09 18:57:39 +02:00
Drew DeVault
35a4faa05b sh-index: fetch user agent from config
2022-07-09 18:14:06 +02:00
Drew DeVault
a8069bb73b Increase default delay to 5 seconds
2022-07-08 20:56:00 +02:00
Drew DeVault
d6bc032d24 crawler: respect robots.txt
2022-07-08 20:30:09 +02:00
Drew DeVault
eb6769c904 crawler: follow links regardless of readability
2022-07-08 20:13:32 +02:00
Drew DeVault
fbd0492ef1 cmd/sh-search: initial commit
2022-07-08 20:04:37 +02:00
Drew DeVault
050694c4f2 Initial commit
2022-07-08 19:46:11 +02:00