14 commits

Author SHA1 Message Date
Umar Getagazov
fde8b75efd Drop crawl schedule-related fields
They were unused.
2022-07-11 17:50:44 +02:00
Umar Getagazov
a7e6fba60f Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
2971603710 Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
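
The label splitting described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `sampleSuffixes` is a tiny hypothetical stand-in for Mozilla's Public Suffix List, and `indexableLabels` is an assumed name.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleSuffixes is a hypothetical, hard-coded stand-in for the Public
// Suffix List. A real crawler would load the full list instead.
var sampleSuffixes = map[string]bool{
	"org":   true,
	"com":   true,
	"co.uk": true,
}

// indexableLabels returns the labels of host that remain once the longest
// matching suffix from sampleSuffixes has been stripped.
func indexableLabels(host string) []string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	labels := strings.Split(host, ".")
	// Check candidate suffixes from longest to shortest, so "co.uk" is
	// matched before a bare "uk" would be. The full host is never treated
	// as a suffix (i starts at 1).
	for i := 1; i < len(labels); i++ {
		if sampleSuffixes[strings.Join(labels[i:], ".")] {
			return labels[:i]
		}
	}
	// Unknown suffix: fall back to dropping only the last label.
	if len(labels) > 1 {
		return labels[:len(labels)-1]
	}
	return labels
}

func main() {
	fmt.Println(indexableLabels("docs.harelang.org")) // [docs harelang]
	fmt.Println(indexableLabels("harelang.org"))      // [harelang]
}
```

Each returned label would then be added to the text index as its own term, which is what lets a "harelang" query match.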
Drew DeVault
e44770b9b7 schema: add "source" column to page 2022-07-10 10:13:11 +02:00
Drew DeVault
01b2b1349b crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
Drew DeVault
9790813a55 Track pages with JavaScript and total crawl time 2022-07-10 09:12:07 +02:00
Drew DeVault
e15dffd86b Handle Retry-After as timestamp 2022-07-09 19:16:48 +02:00
Drew DeVault
c15f968a28 crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
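
For context on the two Retry-After commits above: the header may carry either a delay in seconds or an HTTP-date timestamp. A minimal sketch of handling both forms, assuming a hypothetical helper named `retryDelay` (the crawler's actual code may differ):

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// exampleNow is a fixed reference time used only for this illustration.
var exampleNow = time.Date(2022, 7, 9, 17, 0, 0, 0, time.UTC)

// retryDelay interprets a Retry-After header value relative to now.
// The second result is false when the value matches neither form.
func retryDelay(value string, now time.Time) (time.Duration, bool) {
	// Form 1: a non-negative number of seconds.
	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	// Form 2: an HTTP-date timestamp.
	if when, err := http.ParseTime(value); err == nil {
		if d := when.Sub(now); d > 0 {
			return d, true
		}
		// A timestamp in the past means the request can be retried now.
		return 0, true
	}
	return 0, false
}

func main() {
	fmt.Println(retryDelay("120", exampleNow))                           // 2m0s true
	fmt.Println(retryDelay("Sat, 09 Jul 2022 17:05:00 GMT", exampleNow)) // 5m0s true
}
```

A crawler could use the returned duration to decide when to re-schedule a URL after an HTTP 429 response.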
Drew DeVault
6978b602f4 Handle canonical URLs
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11
2022-07-09 19:06:28 +02:00
Drew DeVault
baf82f9bb8 crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
759ad758af crawler: improve index settings 2022-07-09 18:57:39 +02:00
Drew DeVault
d6bc032d24 crawler: respect robots.txt 2022-07-08 20:30:09 +02:00
Drew DeVault
eb6769c904 crawler: follow links regardless of readability 2022-07-08 20:13:32 +02:00
Drew DeVault
050694c4f2 Initial commit 2022-07-08 19:46:11 +02:00