
Nginx Usage, Configuration, and Error Handling (Ubuntu)


Table of Contents

Configuring nginx to block crawlers

How to Debug an HTTP 400 Bad Request Error in Nginx

502 Bad Gateway

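Some common port numbers and their uses:

Port 21: FTP, file transfer service
Port 22: SSH
Port 23: TELNET, terminal emulation service
Port 25: SMTP, Simple Mail Transfer Protocol
Port 53: DNS, domain name resolution
Port 80: HTTP, hypertext transfer service
Port 110: POP3 (Post Office Protocol version 3)
Port 443: HTTPS, encrypted hypertext transfer service
Port 1433: MS SQL Server database, default port
Port 1521: Oracle database service
Port 1863: MSN Messenger, used for its file transfer feature
Port 3306: MySQL, default port
Port 3389: Microsoft RDP, Microsoft Remote Desktop
Port 5631: Symantec pcAnywhere, remote control data transfer
Port 5632: Symantec pcAnywhere, used by the host to scan for controlled machines
Port 5000: listed in some references as used by MS SQL Server (its default port is 1433)
Port 8000: Tencent QQ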

Nginx directory layout

$ cd /etc/nginx
$ ls -l
total 60
drwx------ 2 ubuntu ubuntu 4096 Jun 16 09:27 cert    ## SSL certificate directory
drwxr-xr-x 2 root   root   4096 Jul 12  2017 conf.d
-rw-r--r-- 1 root   root   1077 Feb 11  2017 fastcgi.conf
-rw-r--r-- 1 root   root   1007 Feb 11  2017 fastcgi_params
-rw-r--r-- 1 root   root   2837 Feb 11  2017 koi-utf
-rw-r--r-- 1 root   root   2223 Feb 11  2017 koi-win
-rw-r--r-- 1 root   root   3957 Feb 11  2017 mime.types
-rw-r--r-- 1 root   root   1501 Aug 31 07:42 nginx.conf    ## main configuration file
-rw-r--r-- 1 root   root    180 Feb 11  2017 proxy_params
-rw-r--r-- 1 root   root    636 Feb 11  2017 scgi_params
drwxr-xr-x 2 root   root   4096 Aug 31 09:42 sites-available  ## available virtual host (proxy) configurations
drwxr-xr-x 2 root   root   4096 Jun 15 06:39 sites-enabled    ## enabled virtual host (proxy) configurations
drwxr-xr-x 2 root   root   4096 Jun  4 06:03 snippets
-rw-r--r-- 1 root   root    664 Feb 11  2017 uwsgi_params
-rw-r--r-- 1 root   root   3071 Feb 11  2017 win-utf

 

To check the Status:
$ sudo systemctl status nginx

To start Nginx:
$ sudo systemctl start nginx

To stop Nginx:
$ sudo systemctl stop nginx

To enable Nginx at boot:
$ sudo systemctl enable nginx

To disable Nginx at boot:
$ sudo systemctl disable nginx

To reload the Nginx service (used to apply configuration changes):
$ sudo systemctl reload nginx

To do a full restart of Nginx:
$ sudo systemctl restart nginx

 

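Configuring nginx to block crawlers

Option 1: Place a robots.txt file in the site root directory.

Option 2:

Nginx can filter requests by User-Agent. A simple regular expression at the URL entry point is enough to reject crawler requests that do not meet your requirements: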

location / {
    if ($http_user_agent ~* "python|curl|java|wget|httpclient|okhttp") {
        return 503;
    }
    # normal request handling
    ...
}

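$http_user_agent is an Nginx variable that can be referenced directly inside a location block. The ~* operator performs a case-insensitive regular-expression match. Matching on python alone filters out roughly 80% of Python crawlers.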
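Before deploying such a rule, it can help to preview which User-Agent strings it would catch. The following is a minimal Python sketch, not part of nginx: it assumes Python's re module behaves like nginx's PCRE for this simple alternation, and the sample User-Agent strings and the is_blocked helper are made up for illustration.

```python
import re

# Same alternation as the nginx rule; re.IGNORECASE mirrors the ~* operator.
BLOCKED_UA = re.compile(r"python|curl|java|wget|httpclient|okhttp", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-Agent string."""
    return BLOCKED_UA.search(user_agent) is not None


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to exercise the pattern.
    samples = [
        "python-requests/2.31.0",
        "Wget/1.21",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

Because the match is unanchored, any User-Agent that merely contains one of the words (for example "java" inside a longer token) is rejected too, so the word list is worth reviewing against real traffic.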

 

Step 1: Create an anti-crawler configuration file that contains the blocking rules:

# Block scraping by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
     return 403;
}

# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) {
     return 403;
}

# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

# Block a single IP:
#deny 123.45.6.7;
# Block the whole range from 123.0.0.1 to 123.255.255.254:
#deny 123.0.0.0/8;
# Block the range from 123.45.0.1 to 123.45.255.254:
#deny 123.45.0.0/16;
# Block the range from 123.45.6.1 to 123.45.6.254:
#deny 123.45.6.0/24;
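To double-check which addresses a CIDR block in a deny rule actually covers, the ranges can be computed with Python's standard ipaddress module. This is a small sketch for verification only; the describe and is_denied helpers are illustrative names, not nginx features.

```python
import ipaddress


def describe(cidr: str) -> str:
    """Return the first and last address and the size of a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return f"{cidr}: {net[0]} - {net[-1]} ({net.num_addresses} addresses)"


def is_denied(ip: str, cidr: str) -> bool:
    """Return True if the IP falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    for cidr in ("123.0.0.0/8", "123.45.0.0/16", "123.45.6.0/24"):
        print(describe(cidr))
    print(is_denied("123.45.6.7", "123.45.6.0/24"))
```

Note that a block includes its network and broadcast addresses (for example 123.45.6.0 and 123.45.6.255), so it is slightly wider than the .1 to .254 host range quoted in the comments.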

 


To test the rules, simulate a crawler with curl -A, for example:

curl -I -A 'YYSpider' www.haoeasy.cn

Output:

$ curl -I -A 'YYSpider' www.haoeasy.cn
HTTP/1.1 403 Forbidden
Server: nginx/1.12.0
Date: Wed, 24 Apr 2019 11:35:21 GMT
Content-Type: text/html
Content-Length: 169
Connection: keep-alive

 


 

Step 2: Include the anti-crawler rules in the nginx configuration, in the server blocks for both port 80 and port 443.

server {
    listen 80;
    server_name www.wulaoer.org wulaoer.org;
    index index.html index.htm index.php;
    ...................;
    include enable-php.conf;
    include /usr/local/nginx/conf/anti_spider.conf;  # anti-crawler rules
}

server {
    listen 443 ssl;
    server_name www.wulaoer.org wulaoer.org;
    index index.html index.htm index.php;
    ..................;
    include enable-php.conf;
    include /usr/local/nginx/conf/anti_spider.conf;  # anti-crawler rules
}

Step 3: Restart nginx, then verify:

$ curl -I -A "Scrapy" www.wulaoer.org
HTTP/1.1 403 Forbidden
Server: nginx
Date: Tue, 16 Mar 2021 03:09:17 GMT
Content-Type: text/html
Content-Length: 146
Connection: keep-alive

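User-Agent types

FeedDemon             content scraping
BOT/0.1 (BOT for JCE) SQL injection
CrawlDaddy            SQL injection
Java                  content scraping
Jullo                 content scraping
Feedly                content scraping
UniversalFeedParser   content scraping
ApacheBench           CC attack tool
Swiftbot              useless crawler
YandexBot             useless crawler
AhrefsBot             useless crawler
YisouSpider           useless crawler (acquired by UC Shenma Search; this spider can be allowed)
jikeSpider            useless crawler
MJ12bot               useless crawler
ZmEu                  phpMyAdmin vulnerability scanning
WinHttp               scraping, CC attacks
EasouSpider           useless crawler
HttpClient            TCP attacks
Microsoft URL Control scanning
YYSpider              useless crawler
jaunty                WordPress brute-force scanner
oBot                  useless crawler
Python-urllib         content scraping
Indy Library          scanning
FlightDeckReports Bot useless crawler
Linguee Bot           useless crawler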

 

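The request line can be matched the same way through $request:

location / {
    if ($request ~* (Scrapy|Curl|blogspot)) {
        return 403;
    }
}

A blocked request appears in the access log with status 403, for example:

"GET /common/SetLanguage?culture=en&returnUrl=http://b23finaciali.blogspot.com/ HTTP/1.1" 403 656 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"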

nginx HTTP Return Codes

 

Constants
HTTP Return Codes
These constants are an easy way to reference the HTTP return codes in your modules

Option	Value
NGX_HTTP_CONTINUE	100
NGX_HTTP_SWITCHING_PROTOCOLS	101
...

For more detail, see:

https://www.nginx.com/resources/wiki/extending/api/http/

 

How to Debug in Nginx

Once the client's IP address was identified, I enabled debug mode for that IP only. Nginx can put a specific IP address or range into debug mode with the debug_connection directive in the events context (this requires an nginx binary built with --with-debug). The events context is usually found in /etc/nginx/nginx.conf:

events {
    # Debugging a certain IP
    debug_connection 192.168.55.12; # client getting http 400 errors
}

detail: https://www.iaspnetcore.com/Blog/BlogPost/5e92d6fafc60383531a48a8e/how-to-debugging-a-http-400-bad-request-error-in-nginx

 

https://nginx.org/en/docs/ngx_core_module.html#debug_connection

How to fix nginx returning 400 Bad Request?

telnet serverip 80

This produces a 400 error.

https://yq.aliyun.com/articles/483987

When nginx returns 400 (Bad Request) it will log the reason into error log, at "info" level.

To see it, raise the error_log level to debug (edit /etc/nginx/nginx.conf).

The second parameter of the error_log directive sets the logging level and can be one of the following: debug, info, notice, warn, error, crit, alert, or emerg.

In /etc/nginx/nginx.conf, add the following line at the beginning of the file:

error_log /var/log/nginx/error.log debug;

And then restart nginx:

sudo service nginx restart

That way you can see in detail what nginx is doing and why it returns status code 400.

2017/02/08 22:32:24 [debug] 1322#1322: *1 connect to unix:///run/uwsgi/app/socket, fd:20 #2
2017/02/08 22:32:24 [debug] 1322#1322: *1 connected
2017/02/08 22:32:24 [debug] 1322#1322: *1 http upstream connect: 0
2017/02/08 22:32:24 [debug] 1322#1322: *1 posix_memalign: 0000560E1F25A2A0:128 @16
2017/02/08 22:32:24 [debug] 1322#1322: *1 http upstream send request
2017/02/08 22:32:24 [debug] 1322#1322: *1 http upstream send request body
2017/02/08 22:32:24 [debug] 1322#1322: *1 chain writer buf fl:0 s:454
2017/02/08 22:32:24 [debug] 1322#1322: *1 chain writer in: 0000560E1F2A0928
2017/02/08 22:32:24 [debug] 1322#1322: *1 writev: 454 of 454
2017/02/08 22:32:24 [debug] 1322#1322: *1 chain writer out: 0000000000000000
2017/02/08 22:32:24 [debug] 1322#1322: *1 event timer add: 20: 60000:1486593204249
2017/02/08 22:32:24 [debug] 1322#1322: *1 http finalize request: -4, "/?" a:1, c:2
2017/02/08 22:32:24 [debug] 1322#1322: *1 http request count:2 blk:0
2017/02/08 22:32:24 [debug] 1322#1322: *1 post event 0000560E1F2E5DE0
2017/02/08 22:32:24 [debug] 1322#1322: *1 post event 0000560E1F2E5E40
2017/02/08 22:32:24 [debug] 1322#1322: *1 delete posted event 0000560E1F2E5DE0
2017/02/08 22:32:24 [debug] 1322#1322: *1 http run request: "/?"
2017/02/08 22:32:24 [debug] 1322#1322: *1 http upstream check client, write event:1, "/"
2017/02/08 22:32:24 [debug] 1322#1322: *1 http upstream recv(): -1 (11: Resource temporarily unavailable)
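Debug-level output is verbose, so it is often easier to filter it by level or by connection number. The sketch below parses lines shaped like the excerpt above with a regular expression. The pattern is inferred from that excerpt only (date, time, level, pid#tid, optional *connection, message) and is an assumption, not an official grammar; parse_error_line is an illustrative helper name.

```python
import re

# Pattern inferred from the log excerpt; other nginx builds may log extra fields.
ERROR_LOG_RE = re.compile(
    r"(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<pid>\d+)#(?P<tid>\d+): "
    r"(?:\*(?P<connection>\d+) )?(?P<message>.*)"
)


def parse_error_line(line: str):
    """Return a dict of fields for one error-log line, or None if it does not match."""
    match = ERROR_LOG_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "2017/02/08 22:32:24 [debug] 1322#1322: *1 http upstream connect: 0"
    fields = parse_error_line(sample)
    print(fields["level"], fields["connection"], fields["message"])
```

With the fields split out, selecting every line for one connection (for example connection 1) is a simple comparison on the connection value.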

 

How to Customize Nginx Web Logs

Setting up the CLF on Nginx: make sure to place your CLF at the beginning of the http {} block:

/etc/nginx/nginx.conf

http {
    log_format myclf '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" "$gzip_ratio"';
    ...
    ...
}

$remote_addr is the IP address of the visitor
$remote_user is the authenticated user (if any)
$time_local is the local time of the request
$request is the first line of the request
$status is the HTTP status of the response
$body_bytes_sent is the size (in bytes) of the server's response body
$http_referer is the referrer URL
$http_user_agent is the User-Agent string sent by the client
$gzip_ratio is the compression ratio achieved for the response

A real life request logged by this configuration would look like this:

201.217.xx.xx - - [01/Oct/2015:08:46:48 -0400] "HEAD /wp-login.php HTTP/1.1" 200 0 "http://wordpress.com/wp-login.php?redirect_to=http%3A%2F%2Fwordpress.com%2Fwp-admin%2F&reauth=1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36"
201.217.xx.xx is the IP address of the visitor.
[01/Oct/2015:08:46:48 -0400] is the time of the request.
HEAD /wp-login.php HTTP/1.1 is the first line of the request.
200 is the HTTP status code, which is OK in this case.
http://wordpress.com/wp-login… is the referrer URL.
Mozilla/5.0 (X11; Linux x86_64… is the User-Agent of the client that made the request.

In this case, the format was named “myclf” in the log configuration; that name is needed in the next step.

Finally, inside the server {} block, define the access log as usual and add the name of the CLF you created at the end of the directive:

server {
    access_log /spool/logs/nginx-access.log myclf;
    ...
}

Restart Nginx to apply the changes:

service nginx restart
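Once the access log is written in the myclf format, each line can be split back into its fields for analysis. The following is a rough Python sketch under the assumption that the log lines follow the log_format shown earlier; the trailing "$gzip_ratio" field is treated as optional because the sample line in this post does not show it. parse_line and the sample line in the code are illustrative.

```python
import re

# Field names follow the nginx variables used in the myclf log_format.
LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
    r'(?: "(?P<gzip_ratio>[^"]*)")?'
)


def parse_line(line: str):
    """Return a dict of myclf fields for one access-log line, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical line in the myclf layout, shortened for readability.
    sample = ('201.217.xx.xx - - [01/Oct/2015:08:46:48 -0400] '
              '"HEAD /wp-login.php HTTP/1.1" 200 0 "-" "curl/8.5.0" "-"')
    entry = parse_line(sample)
    print(entry["remote_addr"], entry["status"], entry["request"])
```

All values come back as strings; convert status and body_bytes_sent to integers if you need to aggregate them.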

 

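How to Modify the Default “Welcome to nginx!” Page

When you access nginx at http://IP:Port/, the “Welcome to nginx” page appears. This is the normal default behavior. The page is served from /usr/share/nginx/html/index.html.

To change the message on the default page, edit:

/usr/share/nginx/html/index.html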

Welcome to nginx
If you see this page, the nginx web server is successfully installed and working. Further configuration is required.
For online documentation and support please refer to nginx.org.
Commercial support is available at nginx.com.
Thank you for using nginx.

 

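If the default welcome page appears when you visit your domain, forwarding for that domain has not been configured.

nginx looks for the requested domain name in the configured server blocks. If no block matches, the request is handled by the default server, which returns that page: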

# Default server configuration
#
server {
	listen 80 default_server;
	listen [::]:80 default_server;
..
}

 

Default page for port 80:

/var/www/html/index.nginx-debian.html

 

 

 

502 Bad Gateway

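If domain forwarding is configured but the message below appears, nginx could not reach the backend site. The two usual causes are:

1. The port nginx forwards to does not match the port the backend web server listens on.

2. http://0.0.0.0 versus http://localhost: the former goes through the external network and may be blocked by a firewall, while the latter stays on the local machine.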

502 Bad Gateway



nginx/1.14.0 (Ubuntu)
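A quick way to test the first cause is to check, from the nginx host, whether anything accepts TCP connections on the port nginx forwards to. The sketch below uses only Python's standard socket module; the can_connect helper is an illustrative name, and the host and port in the example are placeholders to replace with the values from your own proxy configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder backend address; use the one nginx is configured to forward to.
    print(can_connect("127.0.0.1", 9100))
```

If this prints False while the backend is supposedly running, compare the backend's listening port and bind address with the address nginx forwards to.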

 

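PUT requests return a 400 error

The production nginx configuration uses a domain name, while the pre-production environment uses IP + port. Apart from that, the two are identical.

I'm running an ASP.NET Core 3.x Web API behind Nginx.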

When nginx returns 400 (Bad Request) it will log the reason into error log, at "info" level.

 

Step 1: Check the log

/var/log/nginx/api.ggg.com.log

h-"api.ggg.com" -106.156.193.134 for - - - [28/Aug/2020:06:31:21 +0800] "GET /api/EmployeeRole/5f45306a50b21d800c2f0367 HTTP/1.1" "api.usdotnet.com" https:443 200 316 "-" Upstream ["127.0.0.1:9100" (0.020) 200 : -] "-" "-" - -
h-"api.ggg.com" -106.156.193.134 for - - - [28/Aug/2020:06:31:25 +0800] "PUT /api/EmployeeRole HTTP/1.1" "api.usdotnet.com" https:443 400 0 "-" Upstream ["127.0.0.1:9100" (0.004) 400 : -] "-" "-" - -

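The GET request succeeds with 200, but the PUT request returns 400. The Upstream field suggests that the 400 was returned by the backend at 127.0.0.1:9100 rather than generated by nginx itself.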

 

 

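HTTP 400 errors

Google Chrome issue

How to resolve HTTP 400 errors
For a long time, Yecao often ran into HTTP 400 errors when visiting the Yecao blog, even though Yecao's personal portal and other people's blogs loaded normally. Today the same HTTP 400 error appeared when opening the blog in Google Chrome, while other browsers loaded it without problems. The error message was: 400 Bad Request nginx/0.8.15
Earlier searches suggested that many people attribute this error to DNS, so Yecao reconfigured DNS several times: switching to Google's DNS, then to the telecom provider's DNS, then back to obtaining the DNS server address automatically. The HTTP 400 error persisted each time, which was extremely frustrating.
Out of options, Yecao asked the Hexun blog administrators for help, suspecting a problem with the Hexun server hosting the blog. Their reply contained the fix shared in this post:
First delete the browser's cookies: open “Internet Options” in the browser and click to delete cookies. Then try logging in again from the blog home page.
Following this advice, Yecao cleared Chrome's browsing data, and the recurring HTTP 400 error disappeared completely.
It was that simple. After all that effort, it turned out to be a client-side problem. Hopefully this method helps others resolve HTTP 400 errors too.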

 

https://nginx.org/en/docs/

 

Troubleshooting and resolving large numbers of TIME_WAIT connections in nginx