Drew DeVault | e44770b9b7 | schema: add "source" column to page | 2022-07-10 10:13:11 +02:00
Drew DeVault | d30cdbf52e | crawler: fix interval input | 2022-07-10 09:55:30 +02:00
Drew DeVault | 01b2b1349b | crawler: compute checksum and make unique (Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30) | 2022-07-10 09:36:07 +02:00
Drew DeVault | 9790813a55 | Track pages with JavaScript and total crawl time | 2022-07-10 09:12:07 +02:00
Drew DeVault | e15dffd86b | Handle Retry-After as timestamp | 2022-07-09 19:16:48 +02:00
Drew DeVault | c15f968a28 | crawler: re-schedule after HTTP 429 (Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5) | 2022-07-09 19:14:55 +02:00
Drew DeVault | 6978b602f4 | Handle canonical URLs (Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11) | 2022-07-09 19:06:28 +02:00
Drew DeVault | baf82f9bb8 | crawler: perform HEAD before GET (Implements: https://todo.sr.ht/~sircmpwn/searchhut/8) | 2022-07-09 18:59:23 +02:00
Drew DeVault | 759ad758af | crawler: improve index settings | 2022-07-09 18:57:39 +02:00
Drew DeVault | 35a4faa05b | sh-index: fetch user agent from config | 2022-07-09 18:14:06 +02:00
Drew DeVault | a8069bb73b | Increase default delay to 5 seconds | 2022-07-08 20:56:00 +02:00
Drew DeVault | d6bc032d24 | crawler: respect robots.txt | 2022-07-08 20:30:09 +02:00
Drew DeVault | eb6769c904 | crawler: follow links regardless of readability | 2022-07-08 20:13:32 +02:00
Drew DeVault | fbd0492ef1 | cmd/sh-search: initial commit | 2022-07-08 20:04:37 +02:00
Drew DeVault | 050694c4f2 | Initial commit | 2022-07-08 19:46:11 +02:00