What is the easiest way in Pandas to find what proportion of rows have a particular label?


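In a table giving demographic data for some population, I want to find the proportion of German citizens. I would like to know whether Pandas has a function that finds how many rows have a certain label or, in this case, how many rows have "Germany" in the "native-country" column.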

The data comes from mlcourse.ai: https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/

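I tried using the value_counts function to see whether I could print the number of rows that have "Germany" under the "native-country" column. With normalize set to True, I would only need to multiply the result by 100 to get the percentage of people whose native country is Germany.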

data[data["native-country"]=="Germany"].value_counts(normalize=True)

Output:

age  workclass         fnlwgt  education   education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary
18   ?                 85154   12th        8              Never-married       ?                Own-child      Asian-Pac-Islander  Female  0             0             24              Germany         <=50K     0.007299
46   Private           35961   Assoc-acdm  12             Divorced            Sales            Not-in-family  White               Female  0             0             25              Germany         <=50K     0.007299
45   Private           161954  Bachelors   13             Never-married       Prof-specialty   Not-in-family  White               Female  0             0             40              Germany         <=50K     0.007299
                       174794  Bachelors   13             Separated           Prof-specialty   Unmarried      White               Female  0             0             56              Germany         <=50K     0.007299
                       204057  Bachelors   13             Divorced            Adm-clerical     Unmarried      White               Female  0             0             40              Germany         <=50K     0.007299
                                                                                                                                                                                                                  ...   
30   Private           318749  Assoc-voc   11             Married-civ-spouse  Tech-support     Wife           White               Female  0             0             35              Germany         <=50K     0.007299
                       116508  HS-grad     9              Married-civ-spouse  Craft-repair     Husband        White               Male    0             0             40              Germany         <=50K     0.007299
                       111415  HS-grad     9              Married-civ-spouse  Other-service    Husband        White               Male    0             0             55              Germany         <=50K     0.007299
                       77143   Bachelors   13             Never-married       Exec-managerial  Own-child      Black               Male    0             0             40              Germany         <=50K     0.007299
74   Self-emp-not-inc  199136  Bachelors   13             Widowed             Craft-repair     Not-in-family  White               Male    15831         0             8               Germany         >50K      0.007299
Length: 137, dtype: float64

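This doesn't seem to work: it returns the part of the table made up of people whose native country is Germany. I could use it to get my answer, but I'm looking for an easier way, if there is one.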

Original link: https://stackoverflow.com//questions/71686103/what-is-the-easiest-way-in-pandas-to-find-what-proportion-of-rows-have-a-particu

Answers
  • Rayan Hatout commented

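    What is happening is that you are currently filtering for the rows where data["native-country"] == "Germany" and then running value_counts on the entire resulting DataFrame. That gives every row a count of 1, because each row is unique once all attributes are taken into account. With normalize=True, each of the 137 rows therefore shows up as 1/137 ≈ 0.007299.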

    What you should do instead is isolate the native-country column and then count based on it.

    The code looks like this:

    native_countries = data["native-country"]
    native_countries_count = native_countries.value_counts(normalize=True)
    print(native_countries_count["Germany"])
    

    I created the following toy dataset to demonstrate:

    import pandas as pd
    
    df = pd.DataFrame({
        "age": [12, 23, 34, 45],
        "native-country": ["Germany", "Germany", "Germany", "France"],
    })
    
    print(df)
    #    age native-country
    # 0   12        Germany
    # 1   23        Germany
    # 2   34        Germany
    # 3   45         France
    
    native_countries = df["native-country"]
    
    print(native_countries)
    # 0    Germany
    # 1    Germany
    # 2    Germany
    # 3     France
    # Name: native-country, dtype: object
    
    native_countries_count = native_countries.value_counts(normalize=True)
    print(native_countries_count)
    # Germany    0.75
    # France     0.25
    # Name: native-country, dtype: float64
    
    print(native_countries_count["Germany"])
    # 0.75
    
    2 years ago, 0 comments
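
The difference between the two calls discussed in the answer above can be sketched side by side. This is only an illustration on a small made-up DataFrame (the names df, per_row, per_label and german_share are assumptions for this example, not part of the mlcourse.ai data): value_counts on a filtered DataFrame counts distinct whole rows, while value_counts on the single column counts labels.

```python
import pandas as pd

# Hypothetical toy data, not the mlcourse.ai dataset.
df = pd.DataFrame({
    "age": [12, 23, 34, 45],
    "native-country": ["Germany", "Germany", "Germany", "France"],
})

# DataFrame.value_counts counts unique combinations of ALL columns,
# so each of the three distinct German rows gets a share of 1/3.
per_row = df[df["native-country"] == "Germany"].value_counts(normalize=True)

# Series.value_counts on the isolated column counts labels over all rows.
per_label = df["native-country"].value_counts(normalize=True)
german_share = per_label["Germany"]

print(per_row)
print(german_share)
```

On the question's data the same effect produces 137 entries of 1/137 each, which is the 0.007299 seen in the output.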
  • constantstranger commented

    Try this:

    import pandas as pd
    data = pd.DataFrame(
        {'native-country':['France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany'],
        'age':[21,22,23,24,25,26,27,28,29,30,29,28,27,26,25,24]})
    
    print(data[data["native-country"]=="Germany"].shape[0] / data.shape[0])
    

    Output:

    0.25
    
    2 years ago, 0 comments
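
A possible shorter variant of the row-count ratio above, offered as a sketch rather than as either answerer's method: comparing the column to the label yields a boolean Series, and the mean of a boolean Series is the fraction of True values. The data below is made up, and the names is_german, proportion, percentage and missing_share are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same four countries repeated four times.
data = pd.DataFrame({
    "native-country": ["France", "Australia", "South Africa", "Germany"] * 4,
})

# True where the label matches, False elsewhere.
is_german = data["native-country"] == "Germany"

# The mean of a boolean Series equals matching rows / total rows.
proportion = is_german.mean()
percentage = 100 * proportion

# With value_counts, Series.get avoids a KeyError for a label that never occurs.
shares = data["native-country"].value_counts(normalize=True)
missing_share = shares.get("Italy", 0.0)

print(proportion, percentage, missing_share)
```

The boolean mean returns 0 for a label that is absent from the column, whereas indexing the value_counts result with that label would raise a KeyError, which is why the sketch uses get with a default.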