Commit graph

53 commits

Author SHA1 Message Date
Drew DeVault
13d5f95eab import/mediawiki: drop File: pages 2022-07-11 20:22:35 +02:00
Drew DeVault
74b26cecfa import/mediawiki: more improvements 2022-07-11 19:30:57 +02:00
Haelwenn (lanodan) Monnier
5689b79e13 import/cve.org: truncate content for excerpt 2022-07-11 19:11:37 +02:00
Haelwenn (lanodan) Monnier
062e63437a import/cve.org: New importer 2022-07-11 17:53:58 +02:00
Umar Getagazov
fde8b75efd Drop crawl schedule-related fields
They were unused.
2022-07-11 17:50:44 +02:00
Umar Getagazov
a7e6fba60f Rank authoritative websites and index pages higher
Implements: https://todo.sr.ht/~sircmpwn/searchhut/23
2022-07-11 17:49:19 +02:00
Umar Getagazov
72649f0f0e Truncate page titles and URLs in search results
Implements: https://todo.sr.ht/~sircmpwn/searchhut/25
2022-07-11 17:48:50 +02:00
Umar Getagazov
2971603710 Put domain labels minus eTLD into the text index
Before, only the hostname (say, harelang.org) was indexed, and no
results appeared for a "harelang" query. Now, all domain labels (minus
the eTLD) are indexed separately (for example, "docs" and "harelang" for
"docs.harelang.org"), and such queries work. eTLD is removed using the
data from Mozilla's Public Suffix List (https://publicsuffix.org).
2022-07-11 17:48:46 +02:00
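The label splitting described in the commit above can be sketched roughly as follows. This is an illustration only, not the project's code: `PUBLIC_SUFFIXES` is a tiny hypothetical stand-in for the Public Suffix List, `index_labels` is a made-up name, and the list's wildcard and exception rules are ignored.

```python
# Hypothetical, heavily reduced stand-in for Mozilla's Public Suffix List.
# A real implementation would load the full list from https://publicsuffix.org
# and also honour its wildcard ("*.") and exception ("!") rules.
PUBLIC_SUFFIXES = {"org", "com", "co.uk"}


def index_labels(hostname: str) -> list[str]:
    """Return the domain labels of hostname with the eTLD removed."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try candidate suffixes from longest to shortest, so that "co.uk"
    # is preferred over a shorter suffix when both could match.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return labels[:i]
    # Assumption: if no known suffix matches, keep every label.
    return labels


if __name__ == "__main__":
    print(index_labels("docs.harelang.org"))
```

With this split, "docs" and "harelang" would each be added to the text index, so a query for "harelang" can match pages on docs.harelang.org.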
Drew DeVault
c6777e21a7 schema.sql: set default exclusion list to {} 2022-07-11 17:48:36 +02:00
Drew DeVault
5848adfea0 mediawiki: don't parse until we know we want it 2022-07-11 14:35:22 +02:00
Drew DeVault
4567044626 import/mediawiki: delete elements when done
To avoid blowing up memory usage
2022-07-11 14:27:21 +02:00
Umar Getagazov
5471687556 Add per-domain page exclusion mechanism 2022-07-11 13:20:31 +02:00
Umar Getagazov
ef32533b75 Fix searchut typo in the config file path 2022-07-11 13:17:16 +02:00
Umar Getagazov
3b056cc0b4 Dark theme
Colors taken from the dark theme of SourceHut services; some of them
tweaked for contrast.

Implements: https://todo.sr.ht/~sircmpwn/searchhut/24
2022-07-11 13:17:02 +02:00
Drew DeVault
50fd2562f5 Highlight result title in bold 2022-07-11 13:16:47 +02:00
Umar Getagazov
dda780c694 UI fixups for f449fe8
Mostly returning the look to the previous state, code formatting, and
adjusting the look of the search results label.
2022-07-11 13:13:09 +02:00
Umar Getagazov
67c60ef5c1 Use the real crawler UA at /about 2022-07-11 13:13:05 +02:00
Umar Getagazov
3bc5cd9689 Responsive UI
Implements: https://todo.sr.ht/~sircmpwn/searchhut/20
2022-07-11 13:08:37 +02:00
Rohan Kumar
f449fe8a32 Semantic/a11y markup improvements
- Make search results an <ol> with an ARIA label. If more elements are
  ever present on the SERP (e.g. settings), the <ol> should be placed
  inside a <section> and its label should move to that section too.
- Remove list-style and padding from the <ol> in the stylesheet
- Add the "search" ARIA role to the search form.
- Make search result titles headings. This is established convention
  that assistive-technology users are already familiar with from other
  engines.
- Add an indicator for "N search results found". This is where the list
  label comes from.
- Exclude the brand name from machine translation.
2022-07-10 15:03:04 +02:00
Drew DeVault
76bc26d639 Adding missing /about bits 2022-07-10 15:02:55 +02:00
Umar Getagazov
7a67438e9c Add favicon 2022-07-10 15:02:28 +02:00
Drew DeVault
c367bbddd3 Add about page 2022-07-10 13:07:00 +02:00
Drew DeVault
c8762965ac import/mediawiki: initial commit 2022-07-10 11:11:18 +02:00
Drew DeVault
e44770b9b7 schema: add "source" column to page 2022-07-10 10:13:11 +02:00
Drew DeVault
d30cdbf52e crawler: fix interval input 2022-07-10 09:55:30 +02:00
Drew DeVault
01b2b1349b crawler: compute checksum and make unique
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/30
2022-07-10 09:36:07 +02:00
Drew DeVault
9790813a55 Track pages with JavaScript and total crawl time 2022-07-10 09:12:07 +02:00
Drew DeVault
e15dffd86b Handle Retry-After as timestamp 2022-07-09 19:16:48 +02:00
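For context on the commit above: HTTP allows a Retry-After header to carry either a delay in seconds or an HTTP-date timestamp. A minimal sketch of handling both forms is given below. The function name `retry_after_seconds` and the choice to clamp past dates to zero are assumptions for illustration, not taken from the crawler.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value: str, now: datetime) -> float:
    """Convert a Retry-After header value into a delay in seconds.

    `now` must be a timezone-aware datetime.
    """
    value = value.strip()
    # Form 1: delay-seconds, a non-negative decimal integer.
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP-date such as "Sat, 09 Jul 2022 17:20:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        # Assumption: treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    # Assumption: a timestamp in the past means "retry immediately".
    return max(0.0, (when - now).total_seconds())
```

A crawler could use the returned delay to decide when to re-schedule a request after an HTTP 429 response.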
Drew DeVault
c15f968a28 crawler: re-schedule after HTTP 429
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/5
2022-07-09 19:14:55 +02:00
Drew DeVault
6978b602f4 Handle canonical URLs
Fixes: https://todo.sr.ht/~sircmpwn/searchhut/11
2022-07-09 19:06:28 +02:00
Drew DeVault
baf82f9bb8 crawler: perform HEAD before GET
Implements: https://todo.sr.ht/~sircmpwn/searchhut/8
2022-07-09 18:59:23 +02:00
Drew DeVault
759ad758af crawler: improve index settings 2022-07-09 18:57:39 +02:00
Drew DeVault
35a4faa05b sh-index: fetch user agent from config 2022-07-09 18:14:06 +02:00
Drew DeVault
2ec534d63a Add Makefile 2022-07-09 18:14:00 +02:00
Drew DeVault
3535309004 web: add link to index from search page 2022-07-09 18:07:46 +02:00
Drew DeVault
b41abd9376 main.css: change URL color in results 2022-07-09 17:51:05 +02:00
Drew DeVault
7140d0e2e5 web: add search results page 2022-07-09 17:48:52 +02:00
Drew DeVault
6e5deed8f4 web: add .index to html tag 2022-07-09 17:14:00 +02:00
Drew DeVault
738a9430cb web: autofocus search box 2022-07-09 17:12:23 +02:00
Drew DeVault
ad9dd2701e web: move infolinks to bottom of page 2022-07-09 17:02:58 +02:00
Drew DeVault
a1f6b8c8de sh-web: initial commit 2022-07-09 16:56:25 +02:00
Drew DeVault
8cf92fa220 API: Implement search resolver 2022-07-09 15:48:03 +02:00
Drew DeVault
c1f917efb4 sh-api: expand top-level server riggings 2022-07-09 15:39:04 +02:00
Drew DeVault
0d32cf49d7 Implement configuration loader
Implements: https://todo.sr.ht/~sircmpwn/searchhut/18
2022-07-09 15:31:16 +02:00
Drew DeVault
09f762ca53 Add config.example.ini
References: https://todo.sr.ht/~sircmpwn/searchhut/18
2022-07-09 13:53:02 +02:00
Drew DeVault
b5656c9a1e database: add middleware 2022-07-09 13:52:55 +02:00
Drew DeVault
208f766963 Initial GraphQL API riggings 2022-07-09 13:25:27 +02:00
Drew DeVault
a8069bb73b Increase default delay to 5 seconds 2022-07-08 20:56:00 +02:00
Drew DeVault
92ca0ecf22 Add README.md 2022-07-08 20:55:55 +02:00
Drew DeVault
d6bc032d24 crawler: respect robots.txt 2022-07-08 20:30:09 +02:00