Add multiple columns to a data.table by matching groups with a regex
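I'm new to R, so this may be obvious, but I haven't found the right search terms.
I'm parsing a web server log into a data.table, and I want to create a set of columns by extracting parts of the request string. My source data looks like this: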
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1" 200 26294 "https://bela.com/home/amazeballs" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 2.031 2.031 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1" 200 4485 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.173 0.173 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/css/gzip_2073017426/bundles/lib.css HTTP/1.1" 200 4851 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.168 0.168 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/js/gzip_1764696599/bundles/app.js HTTP/1.1" 200 7499 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.290 0.290 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930] "GET /silly/jawr/js/gzip_N1319387470/bundles/lib.js HTTP/1.1" 200 132880 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.366 0.366 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930] "GET /silly/js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 1386 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.233 0.233 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930] "GET /silly/styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 2121 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.108 0.108 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930] "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 3230 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.174 0.174 .
So I knocked up the following code:
alog <- fread('cat sample.log | grep -v "GET /junk" | cut -f 4,6- -d " " ')
setnames(alog, c("ip","remote_user","datetime","timezone","request","status","bytes","referer","user_agent","http_x_forwarded_for","request_time","upstream_response_time","pipe"))
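request_parts <- function(x) {
  # Capture groups: method, webapp, page, optional query string, HTTP version.
  # Backslashes must be doubled inside an R string literal ("\?" is a parse error).
  m <- regexec("^([A-Z]+) /([^/]+)/([^?]+)(\\?[^ ]+)? HTTP/(.*)", x)
  # Element 1 of each match is the full match, so keep elements 2 to 6; one row per request
  parts <- do.call(rbind, lapply(regmatches(x, m), `[`, c(2, 3, 4, 5, 6)))
  colnames(parts) <- c("method","webapp","page","query_string", "http_version")
  parts
}
parts <- request_parts(alog$request)
# The assignment that produces the warnings quoted further down
alog[, colnames(parts) := parts]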
It seems to work up to a point:
> alog$request
[1] "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1" "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1"
[3] "GET /silly/jawr/css/gzip_2073017426/bundles/lib.css HTTP/1.1" "GET /silly/jawr/js/gzip_1764696599/bundles/app.js HTTP/1.1"
[5] "GET /silly/jawr/js/gzip_N1319387470/bundles/lib.js HTTP/1.1" "GET /silly/js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
[7] "GET /silly/styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
> parts
method webapp page query_string http_version
[1,] "GET" "silly" "sales/1234567890" "?amazeballsTask=Y" "1.1"
[2,] "GET" "silly" "jawr/css/gzip_N676825985/bundles/app.css" "" "1.1"
[3,] "GET" "silly" "jawr/css/gzip_2073017426/bundles/lib.css" "" "1.1"
[4,] "GET" "silly" "jawr/js/gzip_1764696599/bundles/app.js" "" "1.1"
[5,] "GET" "silly" "jawr/js/gzip_N1319387470/bundles/lib.js" "" "1.1"
[6,] "GET" "silly" "js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558" "" "1.1"
[7,] "GET" "silly" "styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558" "" "1.1"
[8,] "GET" "silly" "js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558" "" "1.1"
But the part I actually want, adding all the columns of parts to alog, does not work:
> alog$method
[1] "GET" "GET" "GET" "GET" "GET" "GET" "GET" "GET"
> # yay!
> alog$webapp
[1] "GET" "GET" "GET" "GET" "GET" "GET" "GET" "GET"
> # dismay :(
What am I doing wrong? There are lots of warnings like the ones below, but I don't really understand what they are trying to tell me.
1: In `[.data.table`(alog, , `:=`(colnames(parts), parts)) :
5 column matrix RHS of := will be treated as one vector
2: In `[.data.table`(alog, , `:=`(colnames(parts), parts)) :
Supplied 40 items to be assigned to 8 items of column 'method' (32 unused)
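parts is a matrix, and := treats a matrix on the right-hand side as one long vector rather than as a set of columns. You have to convert it to a data.table for this to work. Here is an example: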
m <- matrix(1:25, ncol=5)
colnames(m) <- LETTERS[1:5]
library(data.table)
dt <- data.table(x=1:5)
dt[,colnames(m):=m]
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(colnames(m), m)) :
# 5 column matrix RHS of := will be treated as one vector
# ...
dt # not what you want...
# x A B C D E
# 1: 1 1 1 1 1 1
# 2: 2 2 2 2 2 2
# 3: 3 3 3 3 3 3
# 4: 4 4 4 4 4 4
# 5: 5 5 5 5 5 5
dt[,colnames(m):=as.data.table(m)]
dt # better
# x A B C D E
# 1: 1 1 6 11 16 21
# 2: 2 2 7 12 17 22
# 3: 3 3 8 13 18 23
# 4: 4 4 9 14 19 24
# 5: 5 5 10 15 20 25
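Applied to the log data from the question, the same fix might look like the sketch below. It assumes the data.table package is installed, and it uses a small hand-built alog holding only a request column (three of the sample requests) as a stand-in for the table that fread would produce. The regex is the one from the question with the backslash doubled for the R string literal.

```r
library(data.table)

# Hypothetical stand-in for the parsed log: only the `request` column
alog <- data.table(request = c(
  "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1",
  "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1",
  "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
))

request_parts <- function(x) {
  # Groups: method, webapp, page, optional query string, HTTP version
  m <- regexec("^([A-Z]+) /([^/]+)/([^?]+)(\\?[^ ]+)? HTTP/(.*)", x)
  # Drop element 1 (the full match) and stack the groups into a character matrix
  parts <- do.call(rbind, lapply(regmatches(x, m), `[`, 2:6))
  colnames(parts) <- c("method", "webapp", "page", "query_string", "http_version")
  parts
}

parts <- request_parts(alog$request)

# A matrix is a single atomic vector with a dim attribute, not a list of columns
is.list(parts)
length(parts) == nrow(parts) * ncol(parts)

# as.data.table() turns each matrix column into its own list element,
# so every name on the left-hand side gets a matching column on the right
alog[, colnames(parts) := as.data.table(parts)]

alog[, .(method, webapp, page, query_string, http_version)]
```

The length check also accounts for the second warning in the question: 8 rows times 5 columns is the 40 items that were supplied for the 8 rows of 'method'.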