
Time: 2025/7/11 11:24:50 Source: https://blog.csdn.net/u013565133/article/details/147144497

Example code:

import os
import json
import nltk
from tqdm import tqdm


def wr_dict(filename, dic):
    if not os.path.isfile(filename):
        data = []
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)


def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)


with open('datasource/news_filter_token.json', 'r') as file:
    data = json.load(file)

save_path = 'datasource/news_filter_dup.json'
count = 0
print(f"Before: {len(data)}")
rm_file(save_path)  # clear any previous output so reruns do not append to it

doc_list = []
for d in tqdm(data):
    if d['body'] not in doc_list:
        doc_list.append(d['body'])
        wr_dict(save_path, d)

print(f"After: {len(doc_list)}")

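✅ 1. Overview:

🧠 Input file:

datasource/news_filter_token.json
👉 It contains a list of dictionaries (news items), each with at least a 'body' field.

🎯 Goal:

Remove the items whose body text duplicates an earlier item, keep only the first occurrence, and write the result to a new file: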
datasource/news_filter_dup.json

🧩 2. Detailed code walkthrough

import os
import json
import nltk
from tqdm import tqdm

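Import the modules:

os is used to check for and delete files
json is used to read and save JSON data
tqdm is used to display a progress bar
nltk is imported but not used anywhere in this script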

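👇 Define the function that saves a dictionary

def wr_dict(filename, dic):
    if not os.path.isfile(filename):  # file does not exist yet: create it
        data = []
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:  # file already exists: read it, append, write it back
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)

This function appends a single dictionary (dic) to a JSON file that holds a list of records.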
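
To make the append behaviour concrete, here is a minimal sketch that calls `wr_dict` twice on a scratch file. The temporary path and the two sample records are illustrative assumptions, not part of the original script.

```python
import json
import os
import tempfile


def wr_dict(filename, dic):
    """Append one dict to a JSON list file, creating the file if needed."""
    if not os.path.isfile(filename):
        data = [dic]
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
    with open(filename, 'w') as f:
        json.dump(data, f)


# Hypothetical scratch file, used only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.json')
    wr_dict(demo_path, {'body': 'first'})   # creates the file
    wr_dict(demo_path, {'body': 'second'})  # reloads, appends, rewrites
    with open(demo_path, 'r') as f:
        saved = json.load(f)

print(saved)
```

Note that every call reloads and rewrites the entire file, so saving many records one at a time gets slower as the output grows.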

👇 Delete an existing output file to avoid appending to old results

def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)
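
As an optional alternative, the same "delete if present" behaviour can be sketched with `pathlib`. This assumes Python 3.8 or later, where `Path.unlink` accepts `missing_ok`; the file name below is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def rm_file(file_path):
    """Delete file_path if it exists; do nothing otherwise."""
    # missing_ok=True (Python 3.8+) suppresses FileNotFoundError.
    Path(file_path).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'old_output.json')  # hypothetical file name
    Path(target).write_text('[]')
    rm_file(target)                       # removes the file
    removed = not os.path.exists(target)
    rm_file(target)                       # second call is a harmless no-op

print(removed)
```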

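👇 Load the original data file (before deduplication)

with open('datasource/news_filter_token.json', 'r') as file:
    data = json.load(file)

👇 Delete any old file at the output path (otherwise it keeps growing with every run)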

save_path = 'datasource/news_filter_dup.json'
count = 0
print(f"Before: {len(data)}")
rm_file(save_path)

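👇 Start the deduplication logic

doc_list = []  # body texts seen so far
for d in tqdm(data):  # iterate over every news item
    if d['body'] not in doc_list:  # body has not been seen before
        doc_list.append(d['body'])  # record it as seen
        wr_dict(save_path, d)       # save this item to the output file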
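
The membership test against a list scans every stored body on each iteration, which becomes slow on large datasets. Below is a minimal sketch of the same first-occurrence rule using a set. The function name `dedup_by_body` and the sample records are made up for illustration, and it assumes each 'body' is a string (or any hashable value).

```python
def dedup_by_body(records):
    """Return records with duplicate 'body' values dropped, keeping the first."""
    seen = set()   # bodies already encountered (average O(1) lookup)
    unique = []    # surviving records, in their original order
    for rec in records:
        body = rec['body']
        if body not in seen:
            seen.add(body)
            unique.append(rec)
    return unique


sample = [
    {'title': 'A', 'body': 'same text'},
    {'title': 'B', 'body': 'same text'},
    {'title': 'C', 'body': 'other text'},
]
kept = dedup_by_body(sample)
print([r['title'] for r in kept])  # ['A', 'C']
```

Returning the list instead of writing inside the loop also lets the caller save the whole result with a single `json.dump`.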

👇 Print the result

print(f"After: {len(doc_list)}")  # number of items left after deduplication

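🧪 3. Example input and output format

✅ Input: `news_filter_token.json`

[
  {"title": "News A", "body": "Something big happened today, and many people are following it.", "date": "2025-04-10"},
  {"title": "News B", "body": "Something big happened today, and many people are following it.", "date": "2025-04-11"},
  {"title": "News C", "body": "This is a unique news story.", "date": "2025-04-11"}
]

Notice that the titles differ, but the body of the first two items is identical.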

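✅ Output: `news_filter_dup.json`

[
  {"title": "News A", "body": "Something big happened today, and many people are following it.", "date": "2025-04-10"},
  {"title": "News C", "body": "This is a unique news story.", "date": "2025-04-11"}
]

Only the first item with the repeated body is kept; the later duplicate is discarded.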
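
The example can be reproduced end to end with the sketch below. It is an assumption-laden variant rather than the original script: it works in a temporary directory, uses records shaped like the sample input, opens files with an explicit `encoding='utf-8'`, and writes the result once instead of calling `wr_dict` per item.

```python
import json
import os
import tempfile

# Records shaped like the sample input (illustrative data).
records = [
    {"title": "News A", "body": "Something big happened today.", "date": "2025-04-10"},
    {"title": "News B", "body": "Something big happened today.", "date": "2025-04-11"},
    {"title": "News C", "body": "This is a unique news story.", "date": "2025-04-11"},
]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'news_filter_token.json')
    dst = os.path.join(tmp, 'news_filter_dup.json')

    # Create the input file for the demonstration.
    with open(src, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False)

    with open(src, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Keep only the first record for each distinct body.
    seen, unique = set(), []
    for d in data:
        if d['body'] not in seen:
            seen.add(d['body'])
            unique.append(d)

    # One write for the whole result; ensure_ascii=False keeps any
    # non-ASCII text readable in the output file.
    with open(dst, 'w', encoding='utf-8') as f:
        json.dump(unique, f, ensure_ascii=False, indent=2)

    with open(dst, 'r', encoding='utf-8') as f:
        result = json.load(f)

print(f"Before: {len(data)}, After: {len(result)}")
```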

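✅ Summary table

Step	Description
Read input	`news_filter_token.json`
Condition	Drop items whose `body` content is a duplicate
Save	Write only the non-duplicate items to `news_filter_dup.json`
Tools	`tqdm` for the progress bar, `wr_dict` to write JSON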
