Stata: remove an entire word from a string

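I have a string variable from which I want to remove certain words, but many other words may be partial matches that I do not want removed. I want to remove a word if and only if it matches exactly.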

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* I create a copy to compare before/after the strip
gen strip_words = words

* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each of these words, remove the complete word from the string
foreach w of local removs {
    replace strip_words = subinstr(strip_words, "`w'","", .) 
}

list
     +---------------------------------------------------------------+
     | index                            words            strip_words |
     |---------------------------------------------------------------|
  1. |     1              more mor morph test            e ph test   |
  2. |     2   ten tennis tenner tenth keeper     nis ner th keeper  |
  3. |     3           badder baddy bad other         der dy other   |
     +---------------------------------------------------------------+

I tried padding with spaces using replace strip_words = " " + strip_words + " ", but that also removed the spaces separating the other words. My desired output is:

     +-------------------------------------------------------------------------+
     | index                            words                      strip_words |
     |-------------------------------------------------------------------------|
  1. |     1              more mor morph test              more  morph test    |
  2. |     2   ten tennis tenner tenth keeper    tennis tenner tenth keeper    |
  3. |     3           badder baddy bad other           badder baddy  other    |
     +-------------------------------------------------------------------------+

See help string functions, in particular subinword().

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* I create a copy to compare before/after the strip
gen strip_words = words

* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each of these words, remove the complete word from the string
foreach w of local removs {
    replace strip_words = subinword(strip_words, "`w'","", .) 
}

* Collapse repeated internal blanks into one (itrim() is the older name of stritrim())
replace strip_words = itrim(strip_words)
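
As a quick way to see the difference without building a dataset, the two functions can be tried on literal strings with display. This is only a sketch; the third string (with a comma) is a made-up example, not from the question, added to illustrate that subinword() treats only space-separated tokens as words.

```stata
* Quick check on literal strings; brackets make leftover blanks visible
display "[" subinstr("more mor morph test", "mor", "", .) "]"
display "[" subinword("more mor morph test", "mor", "", .) "]"

* Hypothetical input with punctuation: "bad," is a different
* space-separated token from "bad", so only the final "bad" is removed
display "[" subinword("bad, badder bad", "bad", "", .) "]"
```

If the real data contain punctuation attached to words, subinword() alone may leave those tokens untouched, which may or may not be what is wanted.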

Using your example, but with subinword() instead of subinstr(), you get the desired output.

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* I create copies to compare before/after the strip
gen strip_words = words
gen strip_words_2 = words

* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each of these words, compare subinstr() (any match) with subinword() (whole words only)
foreach w of local removs {
    replace strip_words   = subinstr(strip_words, "`w'", "", .)
    replace strip_words_2 = subinword(strip_words_2, "`w'", "", .)
}

list
     

     +-------------------------------------------------------------------------------------------+
     | index                            words          strip_words                 strip_words_2 |
     |-------------------------------------------------------------------------------------------|
  1. |     1              more mor morph test           e  ph test              more  morph test |
  2. |     2   ten tennis tenner tenth keeper    nis ner th keeper    tennis tenner tenth keeper |
  3. |     3           badder baddy bad other        der dy  other           badder baddy  other |
     +-------------------------------------------------------------------------------------------+

This can also be handled with regular expressions. Introduction: link

Stata's Unicode regular expression functions, such as ustrregexra(), support \b to indicate a word boundary.

clear
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

local rmv "(mor|ten|bad)"
gen wanted = ustrregexra(words, "\b`rmv'\b", "")
list

     +----------------------------------------------------------------------+
     | index                            words                        wanted |
     |----------------------------------------------------------------------|
  1. |     1              more mor morph test              more  morph test |
  2. |     2   ten tennis tenner tenth keeper    tennis tenner tenth keeper |
  3. |     3           badder baddy bad other           badder baddy  other |
     +----------------------------------------------------------------------+

From your example, it looks as if you want to keep the spaces as shown above. Otherwise, you can use strtrim() and stritrim() to remove them.
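
Since the question mentions that the real word list is fairly long, one possible sketch builds the alternation from a space-separated local instead of typing the pattern by hand, and then trims the leftover blanks. It assumes the words contain no regular expression metacharacters (such as . or +); the macro name rmv and the variable name wanted follow the example above.

```stata
clear
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* Word list kept as a space-separated local, as in the question
local removs "mor ten bad"

* Turn "mor ten bad" into "mor|ten|bad" for use as a regex alternation
local rmv : subinstr local removs " " "|", all

* Remove whole words only (\b marks a word boundary)
gen wanted = ustrregexra(words, "\b(`rmv')\b", "")

* Drop leading/trailing blanks, then collapse repeated internal blanks
replace wanted = stritrim(strtrim(wanted))
list
```

Unlike subinword(), \b also treats punctuation as a boundary, so the two approaches can differ on strings where words are followed by commas or periods.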