How to return the original sentence instead of the lowercase sentence
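I have a short piece of code, and I want to print the extracted sentences in their original form rather than in lower case. The code is as follows: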
import re
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
def foo():
    txt = "Risk factors for breast cancer have been well characterized. Breast cancer is 100 times more frequent in women than in men.\
Factors associated with an increased exposure to estrogen have also been elucidated including early menarche, late menopause, later age\
at first pregnancy, or nulliparity. The use of hormone replacement therapy has been confirmed as a risk factor, although mostly limited to \
the combined use of estrogen and progesterone, as demonstrated in the WHI (2). Analysis showed that the risk of breast cancer among women using \
estrogen and progesterone was increased by 24% compared to placebo. A separate arm of the WHI randomized women with a prior hysterectomy to \
conjugated equine estrogen (CEE) versus placebo, and in that study, the use of CEE was not associated with an increased risk of breast cancer (3).\
Unlike hormone replacement therapy, there is no evidence that oral contraceptive (OCP) use increases risk. A large population-based case-control study \
examining the risk of breast cancer among women who previously used or were currently using OCPs included over 9,000 women aged 35 to 64 \
(half of whom had breast cancer) (4). The reported relative risk was 1.0 (95% CI, 0.8 to 1.3) among women currently using OCPs and 0.9 \
(95% CI, 0.8 to 1.0) among prior users. In addition, neither race nor family history was associated with a greater risk of breast cancer among OCP users."
    # Split into words so that " ".join(words) rebuilds the text
    words = txt.split()
    corpus = " ".join(words).lower()
    sentences1 = sent_tokenize(corpus)
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if 'risk' in word_tokenize(j)]
    for i in a:
        print(i, '\n', '\n')
foo()
What I keep getting is this (for example):
>>risk factors for breast cancer have been well characterized
instead of this:
>>Risk factors for breast cancer have been well characterized.
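    corpus = " ".join(words).lower()
It looks like you call .lower() on the whole string so that you can easily compare it with risk later. As you noticed, this lowercases the entire string, and there is no simple way to reverse that operation.
To avoid this, lowercase each sentence only at the point of comparison and look for risk in word_tokenize(j.lower()) instead. (Calling .lower() on the result of word_tokenize(j) would fail, because word_tokenize returns a list, not a string.) Change these lines
    corpus = " ".join(words).lower()
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if 'risk' in word_tokenize(j)]
to
    corpus = " ".join(words)
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if 'risk' in word_tokenize(j.lower())]
This keeps the string in its original form while still making it easy to compare against risk.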
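To see the idea in isolation, here is a minimal sketch that lowercases only a throwaway copy of each sentence for the comparison and returns the untouched original. It is not the asker's exact code: the function name `sentences_mentioning` is hypothetical, the sample text is shortened, and simple regular expressions stand in for NLTK's `sent_tokenize` and `word_tokenize` so the snippet runs without downloading tokenizer models. For brevity it returns only the matching sentences, without the preceding one.

```python
import re


def sentences_mentioning(text, keyword):
    """Return the sentences that contain `keyword`, matched
    case-insensitively, with their original casing preserved."""
    # Stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    keyword = keyword.lower()
    matches = []
    for sentence in sentences:
        # Stand-in for word_tokenize: lowercase only this temporary token list.
        tokens = re.findall(r"\w+", sentence.lower())
        if keyword in tokens:
            matches.append(sentence)  # append the unmodified original sentence
    return matches


text = (
    "Risk factors for breast cancer have been well characterized. "
    "Breast cancer is 100 times more frequent in women than in men. "
    "The use of hormone replacement therapy has been confirmed as a risk factor."
)

for sentence in sentences_mentioning(text, "risk"):
    print(sentence)
```

With the NLTK tokenizers available, the same pattern should carry over by swapping the two regular expressions for `sent_tokenize(text)` and `word_tokenize(sentence.lower())`.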