Parse HTTP request without using regexp
I am using regular expressions to separate the fields of an HTTP request:
GET /index.asp?param1=hello&param2=128 HTTP/1.1
like this:
smatch m;
try
{
    regex re1("(GET|POST) (.+) HTTP");
    regex_search(query, m, re1);
}
catch (regex_error e)
{
    printf("Regex 1 Error: %d\n", e.code());
}
string method = m[1];
string path = m[2];
try
{
    regex re2("/(.+)?\?(.+)?");
    if (regex_search(path, m, re2))
    {
        document = m[1];
        querystring = m[2];
    }
}
catch (regex_error e)
{
    printf("Regex 2 Error: %d\n", e.code());
}
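Unfortunately, this code works with MSVC but not with GCC 4.8.2 (I am on Ubuntu Server 14.04). Can you suggest a different way to split that string using plain std::string operations?
I don't know how to split the URL into its different elements, since the query string separator '?' may or may not be present in the string.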
If you want to avoid regular expressions, you can use standard string operations:
#include <iostream>
#include <stdexcept>
#include <string>
using namespace std;

int main()
{
    string query = "GET /index.asp?param1=hello&param2=128 HTTP/1.1";
    string method, path, document, querystring;
    try {
        if (query.substr(0, 5) == "GET /")          // first check the method at the beginning
            method = "GET";
        else if (query.substr(0, 6) == "POST /")
            method = "POST";
        else                                        // std::exception takes a string only on MSVC, so use runtime_error
            throw std::runtime_error("Regex 1 Error: no valid method or missing /");
        path = query.substr(method.length() + 2);   // take the rest, skipping the space and the slash
        size_t ph = path.find(" HTTP");             // find the end of the URL
        if (ph == string::npos)                     // if it's not found => error
            throw std::runtime_error("Regex 2 Error: no HTTP version found");
        else
            path.resize(ph);                        // otherwise get rid of the end of the string
        size_t pq = path.find("?");                 // look for the ?
        if (pq == string::npos) {                   // if it's absent, document is the whole string
            document = path;
            querystring = "";
        }
        else {                                      // otherwise cut into 2 parts
            document = path.substr(0, pq);
            querystring = path.substr(pq + 1);
        }
        cout << "method: " << method << endl
             << "document: " << document << endl
             << "querystring:" << querystring << endl;
    }
    catch (const std::exception &e) {
        cout << e.what() << endl;
    }
}
Of course, this code is not as nice as your original regex-based code, so treat it as a workaround if you cannot use a more recent version of the compiler.
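The same find/substr idea can also be packaged as a reusable function. The following is only a sketch under a few assumptions: the names RequestLine and split_request_line are made up for illustration, any method token is accepted (not just GET and POST), and the leading slash is kept in the document part, unlike the snippet above.

```cpp
#include <string>

// Hypothetical container for the pieces of a request line (illustration only).
struct RequestLine {
    std::string method;
    std::string document;
    std::string querystring;
    std::string protocol;
};

// Splits "METHOD target PROTOCOL" at the first and last space,
// then splits the target at the first '?', if there is one.
// Returns false when fewer than three non-empty fields are present.
bool split_request_line(const std::string& line, RequestLine& out)
{
    const std::string::size_type first = line.find(' ');
    const std::string::size_type last = line.rfind(' ');
    if (first == std::string::npos || first == last)
        return false;  // zero or one space: not a complete request line

    out.method = line.substr(0, first);
    out.protocol = line.substr(last + 1);
    const std::string target = line.substr(first + 1, last - first - 1);
    if (out.method.empty() || target.empty() || out.protocol.empty())
        return false;

    const std::string::size_type q = target.find('?');
    if (q == std::string::npos) {
        // No separator: the whole target is the document.
        out.document = target;
        out.querystring.clear();
    } else {
        out.document = target.substr(0, q);
        out.querystring = target.substr(q + 1);
    }
    return true;
}
```

Calling split_request_line with the request line from the question and a RequestLine object fills in the four fields. It uses nothing beyond C++98 std::string members, so it should not depend on the compiler's regex support.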
You can use std::istringstream to parse it:
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main()
{
    std::string request = "GET /index.asp?param1=hello&param2=128 HTTP/1.1";
    // separate the 3 main parts
    std::istringstream iss(request);
    std::string method;
    std::string query;
    std::string protocol;
    if(!(iss >> method >> query >> protocol))
    {
        std::cout << "ERROR: parsing request\n";
        return 1;
    }
    // reset the std::istringstream with the query string
    iss.clear();
    iss.str(query);
    std::string url;
    if(!std::getline(iss, url, '?')) // remove the URL part
    {
        std::cout << "ERROR: parsing request url\n";
        return 1;
    }
    // store query key/value pairs in a map
    std::map<std::string, std::string> params;
    std::string keyval, key, val;
    while(std::getline(iss, keyval, '&')) // split each term
    {
        std::istringstream kv(keyval); // separate stream, so the outer iss is not shadowed
        // split key/value pairs
        if(std::getline(std::getline(kv, key, '='), val))
            params[key] = val;
    }
    std::cout << "protocol: " << protocol << '\n';
    std::cout << "method : " << method << '\n';
    std::cout << "url : " << url << '\n';
    for(auto const& param: params)
        std::cout << "param : " << param.first << " = " << param.second << '\n';
}
Output:
protocol: HTTP/1.1
method : GET
url : /index.asp
param : param1 = hello
param : param2 = 128
The reason it does not work with gcc 4.8.2 is that regex_search is not implemented in libstdc++. If you look inside regex.h you will find:
template<typename _Bi_iter, typename _Alloc,
         typename _Ch_type, typename _Rx_traits>
  inline bool
  regex_search(_Bi_iter __first, _Bi_iter __last,
               match_results<_Bi_iter, _Alloc>& __m,
               const basic_regex<_Ch_type, _Rx_traits>& __re,
               regex_constants::match_flag_type __flags
               = regex_constants::match_default)
  { return false; }
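Use regex_match instead, which is implemented. You will have to modify the regular expressions (for example, by adding .* before and after them), because regex_match matches the entire string.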
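As a rough illustration of that change, here is a sketch of the question's two patterns rewritten so that they cover the whole input and can be used with regex_match. The function name match_request_line and the exact patterns are assumptions made for this example, and I have not verified how it behaves on GCC 4.8.2 specifically; it is written against a standard-conforming <regex>.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original posts): parses a request line
// with regex_match, whose patterns must match the entire string.
bool match_request_line(const std::string& line,
                        std::string& method,
                        std::string& document,
                        std::string& querystring)
{
    // The trailing ".*" lets the pattern consume the "/1.1" after "HTTP".
    static const std::regex request_re("(GET|POST) (.+) HTTP.*");
    // Everything up to the first '?', then an optional query string.
    static const std::regex target_re("([^?]*)(?:\\?(.*))?");

    std::smatch m;
    if (!std::regex_match(line, m, request_re))
        return false;
    method = m[1].str();
    const std::string target = m[2].str();

    if (!std::regex_match(target, m, target_re))
        return false;
    document = m[1].str();
    querystring = m[2].str();  // empty when the target has no '?'
    return true;
}
```

Because the query part sits in an optional group, a target without '?' still matches and simply yields an empty query string, which covers the case the question was worried about.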
Alternatives:
- Upgrade to gcc 4.9
- Use boost::regex instead
- Switch to LLVM and libc++ (my preference).
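If the same source has to build with both old and new toolchains, one possible approach is to choose between <regex> and a plain string fallback at compile time. The sketch below assumes that the GCC version is a good enough proxy for the libstdc++ version (this is not true for Clang building against an old libstdc++), and the macro name REQUEST_PARSER_USE_REGEX and the function extract_method are invented for illustration.

```cpp
#include <string>

// Assumption: GCC older than 4.9 ships a libstdc++ whose <regex> is unusable.
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 9))
#  define REQUEST_PARSER_USE_REGEX 0  // hypothetical macro name
#else
#  define REQUEST_PARSER_USE_REGEX 1
#  include <regex>
#endif

// Hypothetical helper: extracts the method of a request line,
// accepting only GET and POST as in the question.
bool extract_method(const std::string& line, std::string& method)
{
#if REQUEST_PARSER_USE_REGEX
    // regex_match must cover the whole line, hence the trailing ".*".
    static const std::regex re("(GET|POST) .*");
    std::smatch m;
    if (!std::regex_match(line, m, re))
        return false;
    method = m[1].str();
    return true;
#else
    // Fallback without <regex>: take the text before the first space.
    const std::string::size_type sp = line.find(' ');
    if (sp == std::string::npos)
        return false;
    method = line.substr(0, sp);
    return method == "GET" || method == "POST";
#endif
}
```

Both branches are meant to give the same answer for well-formed request lines, so callers do not need to know which one was compiled in.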