
Big Data Analysis 07: Joining Data

news 2025/8/4 17:52:39

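Joining two datasets

Merge
The key columns must be specified explicitly when their names differ between the two tables. A plain (inner) merge omits rows that have no match in the other table.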

pd.merge(students, scores, left_on='ID', right_on='SID')
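
As a minimal sketch of the merge above, the following uses small made-up students and scores tables. The values are assumptions for illustration; only the column names ID, SID, name and score come from the examples in this article. It shows that the default inner join drops a student with no matching score.

```python
import pandas as pd

# Hypothetical sample data; the real tables are not shown in this article.
students = pd.DataFrame({'ID': [1, 2, 3], 'name': ['A', 'B', 'C']})
scores = pd.DataFrame({'SID': [1, 2], 'score': [90, 85]})

# The key columns have different names, so left_on/right_on are needed.
inner = pd.merge(students, scores, left_on='ID', right_on='SID')

# how defaults to 'inner': student 3 has no score and is dropped.
print(inner)
```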

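Inner and outer joins
The how parameter selects the join type. how='left' keeps every row of the left table and fills the unmatched columns with NaN.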

frame = pd.merge(students, scores, left_on="ID", right_on="SID", how='left')
frame

Checking for missing values
pd.isna tests whether a value is missing and returns True if it is; pd.isnull is an alias for it.

frame[pd.isna(frame['score'])]['name']
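
The left join and the missing-value check can be tried together on toy data. This is a sketch under the same assumption as before: the students and scores values are invented, and only the column names follow the article.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
students = pd.DataFrame({'ID': [1, 2, 3], 'name': ['A', 'B', 'C']})
scores = pd.DataFrame({'SID': [1, 2], 'score': [90, 85]})

# Left join: every student is kept; a student without a score gets NaN.
left = pd.merge(students, scores, left_on='ID', right_on='SID', how='left')

# The boolean mask from isna() selects the students with no score.
missing = left[left['score'].isna()]['name']
print(missing.tolist())
```

Comparing the mask with True is unnecessary, since isna() already returns booleans.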

Joining with join
When the two datasets share the same index, join can be used directly.

students.join(scores)
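
A small sketch of an index-based join, assuming two invented tables whose index holds the student ID. It also shows the roughly equivalent pd.merge call with left_index and right_index.

```python
import pandas as pd

# Hypothetical tables indexed by student ID (values are made up).
students = pd.DataFrame({'name': ['A', 'B', 'C']}, index=[1, 2, 3])
scores = pd.DataFrame({'score': [90, 85]}, index=[1, 2])

# join aligns on the index and performs a left join by default.
joined = students.join(scores)

# The same result expressed with merge on both indexes.
same = pd.merge(students, scores, left_index=True, right_index=True, how='left')
print(joined)
```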

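merge is a top-level pandas function (pd.merge); it is also available as a DataFrame method.
join is a DataFrame method.
Filter the data first, then group.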

Reading data

Read the data

import pandas as pd
unames = ['uid', 'age', 'gender', 'occupation', 'zip']
users = pd.read_table('dataSources/MovieLens/u.user', sep='|', header=None, names=unames)
users.head(5)

rnames = ['uid', 'mid', 'rating', 'timestamp']
ratings = pd.read_table('dataSources/MovieLens/u.data', sep='\t', header=None, names=rnames)
ratings[:5]
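
To try the read_table call without the MovieLens files, the sketch below reads two sample rows from an in-memory buffer. The rows are illustrative stand-ins in the same pipe-separated layout, not a substitute for the real u.user file.

```python
import io
import pandas as pd

# Two illustrative rows in the pipe-separated layout used above.
raw = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")

unames = ['uid', 'age', 'gender', 'occupation', 'zip']

# header=None: the file has no header row, so names supplies the columns.
users = pd.read_table(raw, sep='|', header=None, names=unames)
print(users.head(5))
```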

Merge
With no key arguments, merge joins on the columns that both tables have in common (here, uid).

frame = pd.merge(ratings, users)
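
A short sketch on invented data showing that merging without key arguments gives the same result as naming the shared column explicitly:

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and users tables.
ratings = pd.DataFrame({'uid': [1, 1, 2], 'mid': [10, 11, 10], 'rating': [5, 3, 4]})
users = pd.DataFrame({'uid': [1, 2], 'age': [24, 53], 'gender': ['M', 'F']})

# No keys given: pandas joins on the shared column name, uid.
frame = pd.merge(ratings, users)

# Naming the key explicitly is equivalent and easier to read.
explicit = pd.merge(ratings, users, on='uid')
print(frame)
```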

Aggregation
Average rating grouped by gender

frame['rating'].groupby(frame['gender']).mean()

Statistics by age group and gender
round rounds a number; passing -1 rounds to the nearest ten, which buckets ages into decades.

frame['rating'].groupby([frame['age'].apply(round, args=[-1]), frame['gender']]).mean()
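
The age bucketing can be checked on a tiny invented frame. This is only a sketch; the ages and ratings are made up, and ages ending in 5 are avoided because their rounding direction depends on the rounding rule in use.

```python
import pandas as pd

# Hypothetical merged data: four ratings with the rater's age and gender.
frame = pd.DataFrame({
    'age': [21, 24, 36, 38],
    'gender': ['F', 'F', 'M', 'M'],
    'rating': [4, 2, 5, 3],
})

# round(x, -1) rounds each age to the nearest ten.
decade = frame['age'].apply(round, args=[-1])

# Group by the rounded age and gender, then average the ratings.
by_age_gender = frame['rating'].groupby([decade, frame['gender']]).mean()
print(by_age_gender)
```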

Mean and other aggregations at once
Pass the aggregation names to agg as a list:

.agg(['mean', 'count'])
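
As a sketch on invented data, passing a list to agg produces one output column per aggregation:

```python
import pandas as pd

# Hypothetical ratings with the rater's gender.
frame = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M'],
    'rating': [4, 2, 5, 3],
})

# A list of names yields a DataFrame with one column per aggregation.
stats = frame['rating'].groupby(frame['gender']).agg(['mean', 'count'])
print(stats)
```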

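Grouped statistics with sorting
This assumes frame also contains a title column, which requires merging in the movie table as well (not shown here).

pd.set_option('display.max_rows', None)  # display setting: show all rows
result = frame['rating'].groupby([frame['gender'], frame['title']]).agg(['mean', 'count'])
result.sort_values(by=['mean', 'count'], ascending=[False, False])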

Filtering the aggregated result

result[result['count'] > 100].sort_values(by='mean', ascending=False)
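
The filter-then-sort step can be sketched on toy data. The titles and ratings are invented, and the threshold is lowered from 100 to 1 so that the small sample still has groups left after filtering.

```python
import pandas as pd

# Hypothetical merged data with gender, movie title and rating.
frame = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'title': ['X', 'X', 'Y', 'X', 'X'],
    'rating': [5, 3, 2, 4, 2],
})

result = frame['rating'].groupby([frame['gender'], frame['title']]).agg(['mean', 'count'])

# Keep only groups with more than one rating, then rank by mean rating.
popular = result[result['count'] > 1].sort_values(by='mean', ascending=False)
print(popular)
```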

Two-dimensional summary table for multi-key grouping
Average rating by movie and gender, with movies as rows and gender as columns

frame.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
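
A sketch of the same pivot on invented data. A movie that nobody of a given gender has rated shows up as NaN in that column.

```python
import pandas as pd

# Hypothetical merged data; movie Y has no rating from male users.
frame = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'title': ['X', 'X', 'Y', 'X', 'X'],
    'rating': [5, 3, 2, 4, 2],
})

# Rows are titles, columns are genders, cells hold the mean rating.
table = frame.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
print(table)
```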

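Multi-condition statistics
Keep only movies with more than 100 ratings, then sort by the average female rating.

# Number of ratings per movie (assumed definition; ratings_by_title is not defined in the original)
ratings_by_title = frame.groupby('title').size()
result = frame.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
result1 = result.loc[ratings_by_title.index[ratings_by_title > 100]]
result1.sort_values(by='F', ascending=False)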
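
The selection by rating count can be sketched as follows. It assumes ratings_by_title holds the number of ratings per title, computed here with groupby(...).size(); that definition is an assumption, as are the sample data and the lowered threshold of 2.

```python
import pandas as pd

# Hypothetical merged data: movie X has four ratings, movie Y has one.
frame = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'title': ['X', 'X', 'Y', 'X', 'X'],
    'rating': [5, 3, 2, 4, 2],
})

# Assumed definition: number of ratings per title.
ratings_by_title = frame.groupby('title').size()

result = frame.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# Titles with more than 2 ratings, used as row labels into the pivot table.
active = ratings_by_title.index[ratings_by_title > 2]
result1 = result.loc[active].sort_values(by='F', ascending=False)
print(result1)
```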

Rating difference between male and female viewers

result['diff'] = (result['M'] - result['F']).apply(abs)
result.sort_values(by='diff', ascending=False)
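
A minimal sketch of the difference column on an invented pivot table. It uses Series.abs(), which gives the same values as apply(abs).

```python
import pandas as pd

# Hypothetical mean ratings per movie for female (F) and male (M) viewers.
result = pd.DataFrame(
    {'F': [4.0, 2.0, 3.5], 'M': [3.0, 4.5, 3.5]},
    index=['X', 'Y', 'Z'],
)

# Absolute gap between the two means, largest disagreement first.
result['diff'] = (result['M'] - result['F']).abs()
ranked = result.sort_values(by='diff', ascending=False)
print(ranked)
```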

Computing the standard deviation

frame['rating'].groupby([frame['gender'], frame['title']]).std().sort_values(ascending=False)
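
One detail worth checking on toy data: pandas computes the sample standard deviation (ddof=1) by default, so a group with a single rating yields NaN, and sort_values places NaN last. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical data: the (F, Y) group contains only one rating.
frame = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'title': ['X', 'X', 'Y', 'X', 'X'],
    'rating': [5, 3, 2, 4, 2],
})

spread = (
    frame['rating']
    .groupby([frame['gender'], frame['title']])
    .std()                          # sample standard deviation, ddof=1
    .sort_values(ascending=False)   # NaN values are placed at the end
)
print(spread)
```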

Subqueries

Data source

data = {
    'ID': ['000001', '000002', '000003', '000004', '000005', '000006', '000007'],
    'name': ['黎明', '赵怡春', '张富平', '白丽', '牛玉德', '姚华', '李南'],
    'gender': [True, False, True, False, True, False, True],
    'age': [16, 20, 18, 18, 17, 18, 16],
    'height': [1.88, 1.78, 1.81, 1.86, 1.74, 1.75, 1.76],
}
frame = pd.DataFrame(data)
frame

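Finding the maximum height
First compute the maximum height among the rows where gender is False, then select the rows in that group that match it.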

maxHeight = frame[frame['gender'] == False]['height'].max()
frame[(frame['gender'] == False) & (frame['height'] == maxHeight)]
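
The two-step "subquery" pattern can be sketched on invented data with a tie, which shows that comparing against the maximum returns every row that shares it:

```python
import pandas as pd

# Hypothetical data: two students with gender False share the top height.
frame = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'gender': [True, False, False, False],
    'height': [1.88, 1.78, 1.86, 1.86],
})

# Step 1 (inner query): restrict to gender == False; ~ negates the boolean column.
subset = frame[~frame['gender']]

# Step 2 (outer query): keep every row whose height equals the group maximum.
tallest = subset[subset['height'] == subset['height'].max()]
print(tallest)
```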

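Finding the top N heights
With duplicate values, head(2) may return the same height twice, so the result is not the top two distinct heights.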

maxHeight = frame['height'].sort_values(ascending=False).head(2)

Taking the top N distinct values when there are ties

maxHeight = frame['height'].drop_duplicates().sort_values(ascending=False).head(2)
maxHeight


Finding the students whose height is in the result set

frame[frame['height'].isin(maxHeight)]
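
Since the sample data above has no repeated heights, the sketch below uses an invented frame with a tie to show the difference between the naive top 2, the top 2 distinct heights, and the isin lookup.

```python
import pandas as pd

# Hypothetical data: two students share the greatest height.
frame = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'height': [1.88, 1.88, 1.86, 1.81],
})

# Naive top 2: the duplicated maximum fills both slots.
naive = frame['height'].sort_values(ascending=False).head(2)

# Top 2 distinct heights: remove duplicates before sorting.
top2 = frame['height'].drop_duplicates().sort_values(ascending=False).head(2)

# All students whose height is one of those two values.
selected = frame[frame['height'].isin(top2)]
print(selected)
```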
