-progszy is a hard-caching HTTP(S) proxy server (with programmatic cache management), designed for use as part of a data-scraping pipeline.
+Progszy is a hard-caching HTTP(S) proxy server (with programmatic cache management), designed for use as part of a data-scraping pipeline.
- Brings stable reproducibility to web data-scraping pipelines.
- Improves web scraper development workflow, via fast controlled caching of HTTP responses.
@@ -18,7 +18,7 @@ It is both a standalone executable CLI program, and a Go package.
It is **not** suitable for use as a regular HTTP(S) caching proxy for humans surfing with web browsers.
-progszy should work with any HTTP client, but currently has only been tested with Go's http.Client.
+Progszy should work with any HTTP client, but currently has only been tested with Go's http.Client.
## Caching
@@ -32,7 +32,7 @@ We may review/change this binning/naming strategy at a later date.
### Caching Strategy
-progszy *intentionally* makes **no** use of HTTP headers relating to cached content control that are normally utilised by browsers and other caching proxies.
+Progszy *intentionally* makes **no** use of HTTP headers relating to cached content control that are normally utilised by browsers and other caching proxies.
The body content and appropriate headers for all `200 Ok` responses are hard-cached — unless the body matches a given filter (see `X-Cache-Reject`, below).
@@ -42,19 +42,19 @@ Cache eviction/management is manual-only at present. Later we will add a REST AP
## HTTP(S) Proxy
-The CLI version of progszy operates as a standalone HTTP(S) proxy server. By default it listens on port 5595, for which the client's proxy configuration URL would be `http://127.0.0.1:5595`. It should be noted that currently progszy binds only to IP 127.0.0.1, which is not suitable for access from a remote IP (without the use of an SSH tunnel).
+The CLI version of Progszy operates as a standalone HTTP(S) proxy server. By default it listens on port 5595, for which the client's proxy configuration URL would be `http://127.0.0.1:5595`. It should be noted that currently Progszy binds only to IP 127.0.0.1, which is not suitable for access from a remote IP (without the use of an SSH tunnel).
Incoming requests can be either vanilla HTTP, or can be HTTPS (using `CONNECT` protocol).
-When proxying HTTPS requests, the connection is intercepted by a man-in-the-middle (MITM) hijack, to allow both caching and the application of rules, and the resulting outbound stream is then re-encrypted using a private certificate, before being passed to the client. Note that clients wishing to proxy HTTPS requests using progszy will need specific configuration to prevent/ignore the resulting certificate mismatch errors caused by this process. See tests for an example of how this is done in Go.
+When proxying HTTPS requests, the connection is intercepted by a man-in-the-middle (MITM) hijack, to allow both caching and the application of rules, and the resulting outbound stream is then re-encrypted using a private certificate, before being passed to the client. Note that clients wishing to proxy HTTPS requests using Progszy will need specific configuration to prevent/ignore the resulting certificate mismatch errors caused by this process. See tests for an example of how this is done in Go.
Outgoing HTTP requests utilise automatic retries with exponential backoff. Internal HTTP clients use a shared transport with pooling, and support upstream proxy chaining. Connections are not explicitly rate-limited.
-Currently, progszy only supports HTTP `GET`, `HEAD` and `CONNECT` methods. Note that support for the `HEAD` method is not actually particularly useful in this context, and really only exists for spec compliance.
+Currently, Progszy only supports HTTP `GET`, `HEAD` and `CONNECT` methods. Note that support for the `HEAD` method is not actually particularly useful in this context, and really only exists for spec compliance.
### HTTP Headers
-progszy makes use of custom HTTP `X-*` headers to both control features and report status to the client.
+Progszy makes use of custom HTTP `X-*` headers to both control features and report status to the client.
#### Request Headers
@@ -84,19 +84,19 @@ First, ensure you have a working Go environment. See [Go 'Getting Started' docum
Then fetch the code, build and install the binary:
-
```bash
+
```text
go get github.com/jimsmart/progszy/cmd/progszy
```
By default, the resulting binary executable will be `~/go/bin/progszy` (assuming no customisation has been made to `$GOPATH` or `$GOBIN`).
## Usage Examples
-Once built/installed, progszy can be invoked via the command line, as follows...
+Once built/installed, Progszy can be invoked via the command line, as follows...