Filtering Values in Pandas Dataframe or Series

Question

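I am trying to filter values from a column in a pandas dataframe, but I seem to be getting booleans back instead of the actual values. I want to filter our data by month and year. In the code below I only filter by year, but I have tried month and year several times in different ways: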

    In [1]: import requests

    In [2]: import pandas as pd # pandas

    In [3]: import datetime as dt # module for manipulating dates and times

    In [4]: url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"

    In [5]: source = requests.get(url).text

    In [6]: from io import StringIO, BytesIO

    In [7]: s = StringIO(source)

    In [8]: election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(convert_dates="coerce", convert_numeric=True)


    In [9]: election_data.head(n=3)
    Out[9]:
                Pollster Start Date   End Date Entry Date/Time (ET)  \
    0  Politico/GWU/Battleground 2012-11-04 2012-11-05  2012-11-06 08:40:26
    1           YouGov/Economist 2012-11-03 2012-11-05  2012-11-26 15:31:23
    2           Gravis Marketing 2012-11-03 2012-11-05  2012-11-06 09:22:02

       Number of Observations     Population             Mode  Obama  Romney  \
    0                  1000.0  Likely Voters       Live Phone   47.0    47.0
    1                   740.0  Likely Voters         Internet   49.0    47.0
    2                   872.0  Likely Voters  Automated Phone   48.0    48.0

       Undecided  Other                                       Pollster URL  \
    0        6.0    NaN  http://elections.huffingtonpost.com/pollster/p...
    1        3.0    NaN  http://elections.huffingtonpost.com/pollster/p...
    2        4.0    NaN  http://elections.huffingtonpost.com/pollster/p...

                                              Source URL     Partisan Affiliation  \
    0  http://www.politico.com/news/stories/1112/8338...  Nonpartisan        None
    1  http://cdn.yougov.com/cumulus_uploads/document...  Nonpartisan        None
    2  http://www.gravispolls.com/2012/11/gravis-mark...  Nonpartisan        None

       Question Text  Question Iteration
    0            NaN                   1
    1            NaN                   1
    2            NaN                   1

    In [10]: start_date = pd.Series(election_data["Start Date"])
        ...: start_date.head(n=3)
        ...:
    Out[10]:
    0   2012-11-04
    1   2012-11-03
    2   2012-11-03
    Name: Start Date, dtype: datetime64[ns]

    In [11]: filtered = start_date.map(lambda x: x.year == 2012)

    In [12]: filtered
    Out[12]:
    0       True
    1       True
    2       True
    ...
    587    False
    588    False
    589    False
    Name: Start Date, dtype: bool

Answer 1

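I think you need to call read_csv with the URL first (parsing the date columns), and then use boolean indexing with a mask built from the year and month. The map call in the question already produces such a mask, which is why it prints True/False; pass the mask back into the DataFrame to get the matching rows: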

    election_data = pd.read_csv('http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv', parse_dates=[1,2,3])

    print (election_data.head(3))
                        Pollster Start Date   End Date Entry Date/Time (ET)  \
    0  Politico/GWU/Battleground 2012-11-04 2012-11-05  2012-11-06 08:40:26
    1           YouGov/Economist 2012-11-03 2012-11-05  2012-11-26 15:31:23
    2           Gravis Marketing 2012-11-03 2012-11-05  2012-11-06 09:22:02

       Number of Observations     Population             Mode  Obama  Romney  \
    0                  1000.0  Likely Voters       Live Phone   47.0    47.0
    1                   740.0  Likely Voters         Internet   49.0    47.0
    2                   872.0  Likely Voters  Automated Phone   48.0    48.0

       Undecided  Other                                       Pollster URL  \
    0        6.0    NaN  http://elections.huffingtonpost.com/pollster/p...
    1        3.0    NaN  http://elections.huffingtonpost.com/pollster/p...
    2        4.0    NaN  http://elections.huffingtonpost.com/pollster/p...

                                              Source URL     Partisan Affiliation  \
    0  http://www.politico.com/news/stories/1112/8338...  Nonpartisan        None
    1  http://cdn.yougov.com/cumulus_uploads/document...  Nonpartisan        None
    2  http://www.gravispolls.com/2012/11/gravis-mark...  Nonpartisan        None

       Question Text  Question Iteration
    0            NaN                   1
    1            NaN                   1
    2            NaN                   1

    print (election_data.dtypes)
    Pollster                          object
    Start Date                datetime64[ns]
    End Date                  datetime64[ns]
    Entry Date/Time (ET)      datetime64[ns]
    Number of Observations           float64
    Population                        object
    Mode                              object
    Obama                            float64
    Romney                           float64
    Undecided                        float64
    Other                            float64
    Pollster URL                      object
    Source URL                        object
    Partisan                          object
    Affiliation                       object
    Question Text                    float64
    Question Iteration                 int64
    dtype: object
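
The URL above may not always be reachable, so the parsing step can also be tried offline. This is only a minimal sketch: the CSV text is made up (just the Start Date and End Date column names mirror the real file), and it picks the date columns by name rather than by position.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample rows; these are NOT real poll results.
csv_text = """Pollster,Start Date,End Date,Obama,Romney
Pollster A,2012-11-04,2012-11-05,47,47
Pollster B,2012-10-20,2012-10-22,49,47
Pollster C,2011-12-01,2011-12-03,45,44
"""

# parse_dates accepts column names as well as column positions.
sample = pd.read_csv(StringIO(csv_text), parse_dates=["Start Date", "End Date"])

# The two date columns should now show a datetime64 dtype.
print(sample.dtypes)
```

Once the columns are real datetimes, the .dt accessor used in the filters becomes available.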


    # rows whose Start Date falls in 2012
    election_data[election_data["Start Date"].dt.year == 2012]

    # rows whose Start Date falls in October 2012
    election_data[(election_data["Start Date"].dt.year == 2012) & (election_data["Start Date"].dt.month == 10)]
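
To see the difference between the boolean mask and the filtered rows, here is a small self-contained sketch. The frame and its values are invented for illustration; only the Start Date column name comes from the question. The last line shows one possible alternative for a month-level test, using periods.

```python
import pandas as pd

# Hypothetical data standing in for the poll table.
polls = pd.DataFrame(
    {
        "Pollster": ["A", "B", "C", "D"],
        "Start Date": pd.to_datetime(
            ["2012-11-04", "2012-10-20", "2012-10-02", "2011-10-15"]
        ),
    }
)

dates = polls["Start Date"]

# The comparison yields a boolean Series (what the question printed) ...
mask = (dates.dt.year == 2012) & (dates.dt.month == 10)

# ... and indexing the frame with that Series returns the matching rows.
october_2012 = polls[mask]
print(october_2012)

# Alternative month-level test: compare monthly periods in one step.
same_rows = polls[dates.dt.to_period("M") == pd.Period("2012-10", freq="M")]
```

Note that each comparison needs its own parentheses when combined with &, because & binds more tightly than ==.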

Answer 2

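If you set Start Date as the index, you can use pandas date-string filtering. The examples use .loc, since recent pandas versions no longer select rows from a plain [] lookup with a date string.

Get all of 2012:

    election_data.set_index('Start Date').loc['2012']

Get all of Jan, 2012:

    election_data.set_index('Start Date').loc['2012-01']

Get everything between Jan 1, 2012 and Jan 13, 2012:

    election_data.set_index('Start Date').loc['2012-01-01':'2012-01-13']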
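
The index-based approach can be tried on a toy frame as well. This is a sketch under assumptions: the rows are invented, and the sort_index call is an addition of this example (not part of the answer) so that the date-range slice runs over an ordered index.

```python
import pandas as pd

# Hypothetical data; only the "Start Date" column name comes from the question.
polls = pd.DataFrame(
    {
        "Pollster": ["A", "B", "C", "D"],
        "Start Date": pd.to_datetime(
            ["2012-01-05", "2012-01-20", "2012-03-02", "2011-12-15"]
        ),
    }
)

# Move the dates into the index and sort them (sorting is this sketch's choice).
by_date = polls.set_index("Start Date").sort_index()

all_2012 = by_date.loc["2012"]                            # every row in 2012
jan_2012 = by_date.loc["2012-01"]                         # every row in Jan 2012
first_weeks = by_date.loc["2012-01-01":"2012-01-13"]      # inclusive date range

print(first_weeks)
```

Unlike positional slices, a date-string slice on a DatetimeIndex includes both endpoints.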

Tags: python, series, dataframe, python-3.x, pandas