Drop Duplicate Rows From a Pandas DataFrame
Suraj Joshi
January 30, 2023
- DataFrame.drop_duplicates() Syntax
- Remove Duplicate Rows Using the DataFrame.drop_duplicates() Method
- Set keep='last' in the drop_duplicates() Method
This tutorial explains how to remove duplicate rows from a Pandas DataFrame using the DataFrame.drop_duplicates() method.
DataFrame.drop_duplicates() Syntax
DataFrame.drop_duplicates(subset=None, keep="first", inplace=False, ignore_index=False)
It returns a new DataFrame with the duplicate rows removed. With inplace=True, the original DataFrame is modified instead.
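The examples in this tutorial do not exercise the inplace and ignore_index parameters from the signature. The following is a minimal sketch of how they behave, assuming pandas 1.0 or later (where ignore_index is available). The DataFrame df and its values are made up for illustration and are not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical sample data: the row at index 1 repeats the row at index 0.
df = pd.DataFrame({"Id": [1, 1, 2], "Name": ["A", "A", "B"]})

# Default call: returns a new DataFrame and keeps the original index labels.
deduped = df.drop_duplicates()
print(list(deduped.index))  # [0, 2]

# ignore_index=True relabels the surviving rows 0, 1, ..., n-1.
renumbered = df.drop_duplicates(ignore_index=True)
print(list(renumbered.index))  # [0, 1]

# inplace=True modifies df itself instead of building a new DataFrame.
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```

The default call leaves a gap in the index where the duplicate was removed, which is why ignore_index=True can be convenient when the original labels carry no meaning.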
Remove Duplicate Rows Using the DataFrame.drop_duplicates() Method
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
# With no arguments, rows are compared across all columns.
df_without_duplicates = df_with_duplicates.drop_duplicates()
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
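It removes rows whose values match an earlier row in every column. By default, a row counts as a duplicate only if it has the same value as another row in all columns. In df_with_duplicates, the first and sixth rows (index 0 and 5) are identical in every column, so the sixth row is removed. The fifth row (index 4) is kept because its Id, 303, differs.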
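To check which rows pandas treats as duplicates before dropping them, one option is DataFrame.duplicated(), which this tutorial does not otherwise cover. The sketch below reuses the tutorial's data; the variable names mask and same are illustrative.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

# duplicated() marks each row that repeats an earlier row across all columns.
mask = df_with_duplicates.duplicated()
print(mask.tolist())  # [False, False, False, False, False, True]

# Filtering with the inverted mask gives the same rows as drop_duplicates().
same = df_with_duplicates[~mask].equals(df_with_duplicates.drop_duplicates())
print(same)  # True
```

Only the row at index 5 is flagged, which matches the output above.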
Set the subset Parameter to Remove Duplicates Based on Specific Columns Only
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
# Compare rows on the Name column only.
df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"])
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
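Here we pass Name as the subset argument to drop_duplicates(), so rows are compared on the Name column only. The fifth and sixth rows (index 4 and 5) are removed because their Name value, Watch, matches that of the first row.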
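The subset argument can also list several columns, in which case a row is a duplicate only when it matches an earlier row in all of them. The sketch below uses a hypothetical variant of the tutorial's data, in which one Watch row has a different Cost, to make the difference visible.

```python
import pandas as pd

# Hypothetical data: the Watch at index 2 has a different Cost from the others.
df = pd.DataFrame(
    {
        "Id": [302, 504, 303, 302],
        "Name": ["Watch", "Camera", "Watch", "Watch"],
        "Cost": ["300", "400", "250", "300"],
    }
)

# Comparing on Name alone drops every later Watch row.
by_name = df.drop_duplicates(subset=["Name"])
print(by_name.index.tolist())  # [0, 1]

# Comparing on Name and Cost keeps the Watch with a different Cost.
by_name_and_cost = df.drop_duplicates(subset=["Name", "Cost"])
print(by_name_and_cost.index.tolist())  # [0, 1, 2]
```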
Set keep='last' in the drop_duplicates() Method
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
# Keep the last occurrence of each Name instead of the first.
df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep="last")
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
5 302 Watch 300
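With keep='last', only the last row for each Name value is kept. The rows at index 0 and 4 are removed because a later row, at index 5, has the same Name.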
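One situation where keep='last' may be useful is keeping the row with the largest value per group: sort first, then keep the last occurrence. This is a sketch of that idea rather than something the tutorial prescribes. The data is hypothetical and uses integer costs so that sorting is numeric.

```python
import pandas as pd

# Hypothetical data with numeric Cost values.
df = pd.DataFrame(
    {
        "Name": ["Watch", "Camera", "Watch"],
        "Cost": [300, 400, 350],
    }
)

# After sorting by Cost, the last row for each Name is the most expensive one.
most_expensive = df.sort_values("Cost").drop_duplicates(subset=["Name"], keep="last")
print(most_expensive["Cost"].tolist())  # [350, 400]
print(most_expensive.index.tolist())  # [2, 1]
```

The result keeps the sorted order and the original index labels, so the cheaper Watch at index 0 is the row that disappears.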
Setting keep=False removes every row whose subset value, here Name, occurs more than once, keeping none of them.
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
# keep=False drops every row whose Name appears more than once.
df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep=False)
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
It removes the first, fifth, and sixth rows (index 0, 4, and 5) because they all share the same Name value.
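To compare the three keep options side by side, the sketch below runs each one on the tutorial's data and prints the index labels that survive. The dictionary name kept is illustrative.

```python
import pandas as pd

df_with_duplicates = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303, 302],
        "Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
        "Cost": ["300", "400", "350", "100", "300", "300"],
    }
)

# Collect the surviving index labels for each keep option.
kept = {
    str(keep): df_with_duplicates.drop_duplicates(subset=["Name"], keep=keep).index.tolist()
    for keep in ("first", "last", False)
}
for option, labels in kept.items():
    print(option, labels)
# first [0, 1, 2, 3]
# last [1, 2, 3, 5]
# False [1, 2, 3]
```

The labels match the three outputs shown earlier: 'first' keeps index 0, 'last' keeps index 5, and False keeps no Watch row at all.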
Author: Suraj Joshi
Suraj Joshi is a backend software engineer at Matrice.ai.