{ "cells": [ { "cell_type": "markdown", "id": "98ebab25-3712-4f2e-b831-da7df8b60c79", "metadata": {}, "source": [ "# Chapter 3 Practice - Solutions\n", "\n", "[Click here](https://neuronjolt.com/data/downloads/pandas_3_practice_completed.ipynb) to download this chapter as a Jupyter (.ipynb) file.\n", "\n", "These are suggested solutions to the practice exercises for the *Concatenating, Merging, and Reshaping Data* chapter.\n", "\n", "The code below loads pandas and numpy, creates some sample DataFrames for concatenation practice and reads two datasets into DataFrames. The first DataFrame, `boxscores` DataFrame is the box score data from the 2023-24 NBA season. Each row represents a player-game, that is, one player's statistics in one game. The second DataFrame, `teams`, has NBA team records from the 2023-24 NBA season. Following are the column definitions in the `teams` DataFrame:\n", "\n", "* `wins` - number of wins for the season\n", "* `losses` - number of losses for the season\n", "* `3_pt_g` - the team's record (wins-losses) in games decided by 3 points or fewer\n", "* `10_pt_g` - the team's record (wins-losses) in games decided by 10 points or more\n", "* `vs_ge_500` - the team's record (wins-losses) vs teams that won more than half their games\n", "* `vs_lt_500` - the team's record (wins-losses) vs teams that lost more than half their games\n", "* `ot_g` - the team's record (wins-losses) in games that were decided in overtime" ] }, { "cell_type": "code", "execution_count": 1, "id": "246558e4-1b61-46ed-98ac-0c2c57b5ca5b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Create DataFrames for concatenation practice\n", "\n", "east_div_q1 = pd.DataFrame({\n", " 'month': 'Jan Feb Mar'.split(),\n", " 'sales': [433_721.45, 513_720.77, 522_482.31],\n", " 'profit': [47_712.34, 52_936.29, 61_295.22] \n", "})\n", "\n", "west_div_q1 = pd.DataFrame({\n", " 'month': 'Jan Feb Mar'.split(),\n", " 'sales': [633_728.45, 713_780.77, 722_488.35],\n", " 'profit': [77_712.34, 84_976.29, 81_275.28] \n", "})\n", "\n", "# Import the NBA boxscore data\n", "url = \"https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv\"\n", "boxscores = pd.read_csv(url)\n", "\n", "# Convert the date column in the box score data to a datetime type\n", "boxscores['date'] = pd.to_datetime(boxscores['date'], format = '%Y%m%d')\n", "\n", "# Import the NBA team records data\n", "url = \"https://neuronjolt.com/data/nba_team_recs_23-24.csv\"\n", "teams = pd.read_csv(url)" ] }, { "cell_type": "markdown", "id": "a555e73d-fd66-4fdd-b3ba-fd22aeb9b3a0", "metadata": {}, "source": [ "## Practice Exercise 3-1\n", "\n", "The `east_div_q1` DataFrame has the monthly sales and profit for the East division for the first quarter, and the `west_div_q1` DataFrame has the monthly sales and profit for the West division for the first quarter. Concatenate those two DataFrames in a way that preserves the information about what division the monthly sales and profit represent. " ] }, { "cell_type": "code", "execution_count": 2, "id": "f91a897f-9e1e-487f-86f4-4b9cc72d6f33", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
monthsalesprofit
east0Jan433721.4547712.34
1Feb513720.7752936.29
2Mar522482.3161295.22
west0Jan633728.4577712.34
1Feb713780.7784976.29
2Mar722488.3581275.28
\n", "
" ], "text/plain": [ " month sales profit\n", "east 0 Jan 433721.45 47712.34\n", " 1 Feb 513720.77 52936.29\n", " 2 Mar 522482.31 61295.22\n", "west 0 Jan 633728.45 77712.34\n", " 1 Feb 713780.77 84976.29\n", " 2 Mar 722488.35 81275.28" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([east_div_q1, west_div_q1],\n", " keys = ('east', 'west'))" ] }, { "cell_type": "markdown", "id": "70aa1393-0dca-4d23-9bfd-28bbb5d8dc5d", "metadata": {}, "source": [ "## Practice Exercise 3-2\n", "\n", "Concatenate the `east_div_q1` and `west_div_q1` DataFrames and then produce from the result a DataFrame named `combined` that has four columns: `division`, `month`, `sales` and `profit` and an integer row index with no duplicates. " ] }, { "cell_type": "code", "execution_count": 3, "id": "3326444d-fd8a-4064-95d6-eb645d1d3739", "metadata": {}, "outputs": [], "source": [ "combined = pd.concat([east_div_q1, west_div_q1],\n", " keys = ('east', 'west'))" ] }, { "cell_type": "code", "execution_count": 4, "id": "c82ac616-32ba-46fe-a7f8-ad346ce55d62", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
monthsalesprofit
east0Jan433721.4547712.34
1Feb513720.7752936.29
2Mar522482.3161295.22
west0Jan633728.4577712.34
1Feb713780.7784976.29
2Mar722488.3581275.28
\n", "
" ], "text/plain": [ " month sales profit\n", "east 0 Jan 433721.45 47712.34\n", " 1 Feb 513720.77 52936.29\n", " 2 Mar 522482.31 61295.22\n", "west 0 Jan 633728.45 77712.34\n", " 1 Feb 713780.77 84976.29\n", " 2 Mar 722488.35 81275.28" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined" ] }, { "cell_type": "code", "execution_count": 5, "id": "19505588-9559-43c0-b604-b99f317f6f5e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
level_0level_1monthsalesprofit
0east0Jan433721.4547712.34
1east1Feb513720.7752936.29
2east2Mar522482.3161295.22
3west0Jan633728.4577712.34
4west1Feb713780.7784976.29
5west2Mar722488.3581275.28
\n", "
" ], "text/plain": [ " level_0 level_1 month sales profit\n", "0 east 0 Jan 433721.45 47712.34\n", "1 east 1 Feb 513720.77 52936.29\n", "2 east 2 Mar 522482.31 61295.22\n", "3 west 0 Jan 633728.45 77712.34\n", "4 west 1 Feb 713780.77 84976.29\n", "5 west 2 Mar 722488.35 81275.28" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined = combined.reset_index()\n", "\n", "combined" ] }, { "cell_type": "code", "execution_count": 6, "id": "6fb0e83e-5a82-49aa-9d5a-1fa2d006257b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
divisionmonthsalesprofit
0eastJan433721.4547712.34
1eastFeb513720.7752936.29
2eastMar522482.3161295.22
3westJan633728.4577712.34
4westFeb713780.7784976.29
5westMar722488.3581275.28
\n", "
" ], "text/plain": [ " division month sales profit\n", "0 east Jan 433721.45 47712.34\n", "1 east Feb 513720.77 52936.29\n", "2 east Mar 522482.31 61295.22\n", "3 west Jan 633728.45 77712.34\n", "4 west Feb 713780.77 84976.29\n", "5 west Mar 722488.35 81275.28" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined.rename(columns = {'level_0': 'division'}, inplace = True)\n", "combined.drop('level_1', axis = 1, inplace = True)\n", "\n", "combined" ] }, { "cell_type": "markdown", "id": "b3bf87c0-5639-46d2-b003-f0ffb1e16342", "metadata": {}, "source": [ "**Note:** There are many ways to accomplish this, so you might have taken a different approach." ] }, { "cell_type": "markdown", "id": "51c1c200-2c21-48b0-bdd2-abb614fb0954", "metadata": {}, "source": [ "## Practice Exercise 3-3\n", "\n", "Concatenate the `east_div_q1` and `west_div_q1` DataFrames in a way that produces a DataFrame named `east_west` with a MultiIndex for the row index, with the first level of the row index indicating the division and the second level of the row index indicating the month. " ] }, { "cell_type": "code", "execution_count": 7, "id": "1d1737a1-6279-4f3c-bfe8-48af2571f830", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
month
eastJan433721.4547712.34
Feb513720.7752936.29
Mar522482.3161295.22
westJan633728.4577712.34
Feb713780.7784976.29
Mar722488.3581275.28
\n", "
" ], "text/plain": [ " sales profit\n", " month \n", "east Jan 433721.45 47712.34\n", " Feb 513720.77 52936.29\n", " Mar 522482.31 61295.22\n", "west Jan 633728.45 77712.34\n", " Feb 713780.77 84976.29\n", " Mar 722488.35 81275.28" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "east_west = pd.concat([east_div_q1.set_index('month'), \n", " west_div_q1.set_index('month')],\n", " keys = ('east', 'west'))\n", "\n", "east_west" ] }, { "cell_type": "markdown", "id": "d9e1e65e-70b9-4a12-b6c0-5ec6bd1c78f3", "metadata": {}, "source": [ "## Practice Exercise 3-4\n", "\n", "The `east_west` DataFrame that you created for the previous exercise has a MultiIndex for the row index. Manipulate the names for the levels of the row index so that the first level is named 'division' and the second level is named 'month'. " ] }, { "cell_type": "code", "execution_count": 8, "id": "8b585641-45e5-4cb3-9019-d38dfbfae603", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
divisionmonth
eastJan433721.4547712.34
Feb513720.7752936.29
Mar522482.3161295.22
westJan633728.4577712.34
Feb713780.7784976.29
Mar722488.3581275.28
\n", "
" ], "text/plain": [ " sales profit\n", "division month \n", "east Jan 433721.45 47712.34\n", " Feb 513720.77 52936.29\n", " Mar 522482.31 61295.22\n", "west Jan 633728.45 77712.34\n", " Feb 713780.77 84976.29\n", " Mar 722488.35 81275.28" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "east_west.index.set_names('division', level = 0, inplace = True)\n", "\n", "east_west" ] }, { "cell_type": "markdown", "id": "7b040ca1-6a87-469e-9b8b-dce606f8539b", "metadata": {}, "source": [ "**Note:** The index level names can also be specified within the `concat()` function with the `names` parameter." ] }, { "cell_type": "code", "execution_count": 9, "id": "159d0e0a-7a54-441d-bd2b-eb2ff5d67ba9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
divisionmonth
eastJan433721.4547712.34
Feb513720.7752936.29
Mar522482.3161295.22
westJan633728.4577712.34
Feb713780.7784976.29
Mar722488.3581275.28
\n", "
" ], "text/plain": [ " sales profit\n", "division month \n", "east Jan 433721.45 47712.34\n", " Feb 513720.77 52936.29\n", " Mar 522482.31 61295.22\n", "west Jan 633728.45 77712.34\n", " Feb 713780.77 84976.29\n", " Mar 722488.35 81275.28" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "east_west = pd.concat([east_div_q1.set_index('month'), \n", " west_div_q1.set_index('month')],\n", " keys = ('east', 'west'),\n", " names = ('division', 'month'))\n", "\n", "east_west" ] }, { "cell_type": "markdown", "id": "e752b098-6d98-4249-b137-e667be118dd5", "metadata": {}, "source": [ "## Practice Exercise 3-5\n", "\n", "Display from the `east_west` DataFrame just the rows for the East division January sales and profit and the West division February sales and profit." ] }, { "cell_type": "code", "execution_count": 10, "id": "fdb2578d-0182-49a0-937c-17e7667c8348", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
divisionmonth
eastJan433721.4547712.34
westFeb713780.7784976.29
\n", "
" ], "text/plain": [ " sales profit\n", "division month \n", "east Jan 433721.45 47712.34\n", "west Feb 713780.77 84976.29" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "east_west.loc[[('east', 'Jan'), ('west', 'Feb')]]" ] }, { "cell_type": "markdown", "id": "7d44d344-5078-49af-8c9c-105f2e180670", "metadata": {}, "source": [ "## Practice Exercise 3-6\n", "\n", "Display from the `east_west` DataFrame just the rows for January sales and profit for both divisions." ] }, { "cell_type": "code", "execution_count": 11, "id": "8565ea99-11a9-4eb0-8b5d-01b9c2147ec0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
divisionmonth
eastJan433721.4547712.34
westJan633728.4577712.34
\n", "
" ], "text/plain": [ " sales profit\n", "division month \n", "east Jan 433721.45 47712.34\n", "west Jan 633728.45 77712.34" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution using query()\n", "\n", "east_west.query('month == \"Jan\"')" ] }, { "cell_type": "code", "execution_count": 12, "id": "f01d2cca-54a9-46de-8f10-87c3ce2e13d6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
divisionmonth
eastJan433721.4547712.34
westJan633728.4577712.34
\n", "
" ], "text/plain": [ " sales profit\n", "division month \n", "east Jan 433721.45 47712.34\n", "west Jan 633728.45 77712.34" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution using index.get_level_values() with level position specified\n", "\n", "east_west.loc[east_west.index.get_level_values(1) == \"Jan\"]" ] }, { "cell_type": "code", "execution_count": 13, "id": "c605e53c-b587-4ee8-9e54-a971059b6970", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salesprofit
divisionmonth
eastJan433721.4547712.34
westJan633728.4577712.34
\n", "
" ], "text/plain": [ " sales profit\n", "division month \n", "east Jan 433721.45 47712.34\n", "west Jan 633728.45 77712.34" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solution using index.get_level_values() with level label specified\n", "\n", "east_west.loc[east_west.index.get_level_values('month') == \"Jan\"]" ] }, { "cell_type": "markdown", "id": "d1a7d103-8c7d-445c-8b50-c17e8350b30a", "metadata": {}, "source": [ "## Practice Exercise 3-7\n", "\n", "Merge the boxscores data with the teams data in a way that will keep all the rows of the boxscores data, even if the team's record is not found in the team records data. Create a new DataFrame named `boxscores_2` from the merged data. Use the `info()` method to investigate the `boxscores_2` DataFrame." ] }, { "cell_type": "code", "execution_count": 14, "id": "750b6003-791b-4b2a-82fc-8dc4b2aefef1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 26498 entries, 0 to 26497\n", "Data columns (total 29 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 26498 non-null datetime64[ns]\n", " 1 team 26498 non-null object \n", " 2 opponent 26498 non-null object \n", " 3 player 26498 non-null object \n", " 4 position 26498 non-null object \n", " 5 min 26498 non-null int64 \n", " 6 oreb 26498 non-null int64 \n", " 7 dreb 26498 non-null int64 \n", " 8 reb 26498 non-null int64 \n", " 9 ast 26498 non-null int64 \n", " 10 stl 26498 non-null int64 \n", " 11 blk 26498 non-null int64 \n", " 12 to 26498 non-null int64 \n", " 13 pf 26498 non-null int64 \n", " 14 plus_m 26498 non-null int64 \n", " 15 pts 26498 non-null int64 \n", " 16 fg_m 26498 non-null int64 \n", " 17 fg_a 26498 non-null int64 \n", " 18 tres_m 26498 non-null int64 \n", " 19 tres_a 26498 non-null int64 \n", " 20 ft_m 26498 non-null int64 \n", " 21 ft_a 26498 non-null int64 \n", " 22 wins 26498 non-null int64 \n", " 23 losses 26498 non-null int64 \n", " 24 3_pt_g 26498 non-null object \n", " 25 10_pt_g 26498 non-null object \n", " 26 vs_ge_500 26498 non-null object \n", " 27 vs_lt_500 26498 non-null object \n", " 28 ot_g 26498 non-null object \n", "dtypes: datetime64[ns](1), int64(19), object(9)\n", "memory usage: 5.9+ MB\n" ] } ], "source": [ "boxscores_2 = boxscores.merge(teams, on = 'team', how = 'left')\n", "\n", "boxscores_2.info()" ] }, { "cell_type": "markdown", "id": "65b7f009-388b-4c38-80df-eac38fbff21a", "metadata": {}, "source": [ "## Practice Exercise 3-8\n", "\n", "Which player playing for which team had the highest scoring average while playing in at least 50 games for a team with a losing record?" ] }, { "cell_type": "code", "execution_count": 15, "id": "788da358-c02b-49ed-b52e-52393ece20a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
games_for_teamscore_avg
playerteam
Trae YoungATL5425.722222
\n", "
" ], "text/plain": [ " games_for_team score_avg\n", "player team \n", "Trae Young ATL 54 25.722222" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(boxscores_2\n", " .query('losses > wins')\n", " .groupby(['player', 'team'])\n", " .agg(games_for_team = ('date', 'count'),\n", " score_avg = ('pts', 'mean'))\n", " .query('games_for_team >= 50')\n", " .nlargest(n = 1, columns = 'score_avg', keep = 'all')\n", ")" ] }, { "cell_type": "markdown", "id": "db032ebb-b642-4955-a5f8-6b4fdd4abc3e", "metadata": {}, "source": [ "## Practice Exercise 3-9\n", "\n", "**Investigating merge types**\n", "\n", "Create a DataFrame named `bs_sub` by randomly sampling 12 rows from the `boxscores` DataFrame and keeping only the `player`, `team`, and `position` columns. Create another DataFrame named `teams_sub` by randomly sampling 12 rows from the `teams` DataFrame, keeping only the `team`, `wins`, and `losses` columns. \n", "\n", "After creating the two DataFrames, run `value_counts()` on the `team` column of each DataFrame. You should see that the relationship between `bs_sub` and `teams_sub` has the potential to be many-to-one, since there may be multiple rows in `bs_sub` for the same team, while `teams_sub` will only have one row for each team. In a merge each row from one input DataFrame is merged with ***every*** row in the other input DataFrame that has the same key value.\n", "\n", "Next, show the results of four different merges with `bs_sub` as the left DataFrame and `teams_sub` as the right DataFrame: an inner merge, an outer merge, a right merge, and a left merge. Set the `indicator = True` parameter to help you investigate the results of the various merge types.\n", "\n", "For each merge note how many rows are in the result and examine where missing values are created. \n", "\n", "See if you can evidence of a many-to-one relationship between `bs_sub` and `teams_sub`. \n", "\n", "Since the two input DataFrames are created through random sampling you can run your code multiple times to have more chance to study the merges." ] }, { "cell_type": "code", "execution_count": 16, "id": "eb109f28-01bf-4ac3-ac77-4d9ba6925c9c", "metadata": {}, "outputs": [], "source": [ "bs_sub = boxscores.sample(12).loc[:, ['player', 'team', 'position']]\n", "teams_sub = teams.sample(12).loc[:, ['team', 'wins', 'losses']]" ] }, { "cell_type": "code", "execution_count": 17, "id": "43a0373d-3c5c-4860-bb12-88508b35b181", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "team\n", "UTAH 2\n", "GS 1\n", "HOU 1\n", "CLE 1\n", "LAL 1\n", "PHI 1\n", "MEM 1\n", "MIA 1\n", "BKN 1\n", "MIN 1\n", "CHA 1\n", "Name: count, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bs_sub['team'].value_counts()" ] }, { "cell_type": "code", "execution_count": 18, "id": "e5b1a284-27e6-4dd3-ab11-aa2a9d5c88ee", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "team\n", "MIL 1\n", "UTAH 1\n", "LAL 1\n", "NY 1\n", "NO 1\n", "ORL 1\n", "ATL 1\n", "CLE 1\n", "SAC 1\n", "DET 1\n", "HOU 1\n", "OKC 1\n", "Name: count, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "teams_sub['team'].value_counts()" ] }, { "cell_type": "code", "execution_count": 19, "id": "883463d0-116e-4303-9891-36a9bfd4f3ff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerteampositionwinslosses_merge
0A. SengunHOUC4141both
1Donovan MitchellCLESG4834both
2A. ReavesLALSG4735both
3W. KesslerUTAHC3151both
4D. BazleyUTAHF3151both
\n", "
" ], "text/plain": [ " player team position wins losses _merge\n", "0 A. Sengun HOU C 41 41 both\n", "1 Donovan Mitchell CLE SG 48 34 both\n", "2 A. Reaves LAL SG 47 35 both\n", "3 W. Kessler UTAH C 31 51 both\n", "4 D. Bazley UTAH F 31 51 both" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inner merge - only keeps rows for which a matching team is found in both DataFrames\n", "pd.merge(bs_sub, teams_sub, on = 'team', how = 'inner', indicator = True)" ] }, { "cell_type": "code", "execution_count": 20, "id": "2c281719-4fbc-4d7c-a340-99be15b30983", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerteampositionwinslosses_merge
0NaNATLNaN36.046.0right_only
1D. SharpeBKNCNaNNaNleft_only
2B. McGowensCHAGNaNNaNleft_only
3Donovan MitchellCLESG48.034.0both
4NaNDETNaN14.068.0right_only
5K. ThompsonGSSGNaNNaNleft_only
6A. SengunHOUC41.041.0both
7A. ReavesLALSG47.035.0both
8T. JemisonMEMCNaNNaNleft_only
9T. HerroMIAPGNaNNaNleft_only
10NaNMILNaN49.033.0right_only
11A. EdwardsMINSGNaNNaNleft_only
12NaNNONaN49.033.0right_only
13NaNNYNaN50.032.0right_only
14NaNOKCNaN57.025.0right_only
15NaNORLNaN47.035.0right_only
16P. ReedPHIFNaNNaNleft_only
17NaNSACNaN46.036.0right_only
18W. KesslerUTAHC31.051.0both
19D. BazleyUTAHF31.051.0both
\n", "
" ], "text/plain": [ " player team position wins losses _merge\n", "0 NaN ATL NaN 36.0 46.0 right_only\n", "1 D. Sharpe BKN C NaN NaN left_only\n", "2 B. McGowens CHA G NaN NaN left_only\n", "3 Donovan Mitchell CLE SG 48.0 34.0 both\n", "4 NaN DET NaN 14.0 68.0 right_only\n", "5 K. Thompson GS SG NaN NaN left_only\n", "6 A. Sengun HOU C 41.0 41.0 both\n", "7 A. Reaves LAL SG 47.0 35.0 both\n", "8 T. Jemison MEM C NaN NaN left_only\n", "9 T. Herro MIA PG NaN NaN left_only\n", "10 NaN MIL NaN 49.0 33.0 right_only\n", "11 A. Edwards MIN SG NaN NaN left_only\n", "12 NaN NO NaN 49.0 33.0 right_only\n", "13 NaN NY NaN 50.0 32.0 right_only\n", "14 NaN OKC NaN 57.0 25.0 right_only\n", "15 NaN ORL NaN 47.0 35.0 right_only\n", "16 P. Reed PHI F NaN NaN left_only\n", "17 NaN SAC NaN 46.0 36.0 right_only\n", "18 W. Kessler UTAH C 31.0 51.0 both\n", "19 D. Bazley UTAH F 31.0 51.0 both" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Outer merge - all rows are kept from both DataFrames\n", "pd.merge(bs_sub, teams_sub, on = 'team', how = 'outer', indicator = True)" ] }, { "cell_type": "code", "execution_count": 21, "id": "46193e5e-9891-4e99-94a3-98586ba30d87", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerteampositionwinslosses_merge
0K. ThompsonGSSGNaNNaNleft_only
1A. SengunHOUC41.041.0both
2Donovan MitchellCLESG48.034.0both
3A. ReavesLALSG47.035.0both
4P. ReedPHIFNaNNaNleft_only
5T. JemisonMEMCNaNNaNleft_only
6W. KesslerUTAHC31.051.0both
7T. HerroMIAPGNaNNaNleft_only
8D. SharpeBKNCNaNNaNleft_only
9A. EdwardsMINSGNaNNaNleft_only
10B. McGowensCHAGNaNNaNleft_only
11D. BazleyUTAHF31.051.0both
\n", "
" ], "text/plain": [ " player team position wins losses _merge\n", "0 K. Thompson GS SG NaN NaN left_only\n", "1 A. Sengun HOU C 41.0 41.0 both\n", "2 Donovan Mitchell CLE SG 48.0 34.0 both\n", "3 A. Reaves LAL SG 47.0 35.0 both\n", "4 P. Reed PHI F NaN NaN left_only\n", "5 T. Jemison MEM C NaN NaN left_only\n", "6 W. Kessler UTAH C 31.0 51.0 both\n", "7 T. Herro MIA PG NaN NaN left_only\n", "8 D. Sharpe BKN C NaN NaN left_only\n", "9 A. Edwards MIN SG NaN NaN left_only\n", "10 B. McGowens CHA G NaN NaN left_only\n", "11 D. Bazley UTAH F 31.0 51.0 both" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Left merge - Keeps all rows from left DataFrame\n", "pd.merge(bs_sub, teams_sub, on = 'team', how = 'left', indicator = True)" ] }, { "cell_type": "code", "execution_count": 22, "id": "e3f03c04-bea8-4c1d-94b5-ae9e74f7cd44", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerteampositionwinslosses_merge
0NaNMILNaN4933right_only
1W. KesslerUTAHC3151both
2D. BazleyUTAHF3151both
3A. ReavesLALSG4735both
4NaNNYNaN5032right_only
5NaNNONaN4933right_only
6NaNORLNaN4735right_only
7NaNATLNaN3646right_only
8Donovan MitchellCLESG4834both
9NaNSACNaN4636right_only
10NaNDETNaN1468right_only
11A. SengunHOUC4141both
12NaNOKCNaN5725right_only
\n", "
" ], "text/plain": [ " player team position wins losses _merge\n", "0 NaN MIL NaN 49 33 right_only\n", "1 W. Kessler UTAH C 31 51 both\n", "2 D. Bazley UTAH F 31 51 both\n", "3 A. Reaves LAL SG 47 35 both\n", "4 NaN NY NaN 50 32 right_only\n", "5 NaN NO NaN 49 33 right_only\n", "6 NaN ORL NaN 47 35 right_only\n", "7 NaN ATL NaN 36 46 right_only\n", "8 Donovan Mitchell CLE SG 48 34 both\n", "9 NaN SAC NaN 46 36 right_only\n", "10 NaN DET NaN 14 68 right_only\n", "11 A. Sengun HOU C 41 41 both\n", "12 NaN OKC NaN 57 25 right_only" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Right merge - keeps all rows from right DataFrame\n", "pd.merge(bs_sub, teams_sub, on = 'team', how = 'right', indicator = True)" ] }, { "cell_type": "markdown", "id": "75788103-b84f-4836-a5c0-3ac030fc6f5b", "metadata": {}, "source": [ "## Practice Exercise 3-10\n", "\n", "Create a subset of the boxscores DataFrame, named `bucks`, that only has the statlines for players playing for the Milwaukee Bucks (\"MIL\" is the team code), and only includes the following columns: `date`, `player`, `team`, `pts`, `reb`, `ast`, `stl`, `blk`. \n", "\n", "Display a random sample of 5 rows of the `bucks` DataFrame. This format is called \"wide\" format because each statistic has its own column, which makes the DataFrame wide." ] }, { "cell_type": "code", "execution_count": 23, "id": "5f4893af-2ec6-44fe-800d-49e8c43436df", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateplayerteamptsrebaststlblk
65872023-12-07M. BeauchampMIL56001
87812023-12-21M. BeauchampMIL53000
70702023-12-11M. BeasleyMIL194210
51282023-11-26A. GreenMIL01000
253422024-04-09T. Washington Jr.MIL00000
\n", "
" ], "text/plain": [ " date player team pts reb ast stl blk\n", "6587 2023-12-07 M. Beauchamp MIL 5 6 0 0 1\n", "8781 2023-12-21 M. Beauchamp MIL 5 3 0 0 0\n", "7070 2023-12-11 M. Beasley MIL 19 4 2 1 0\n", "5128 2023-11-26 A. Green MIL 0 1 0 0 0\n", "25342 2024-04-09 T. Washington Jr. MIL 0 0 0 0 0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bucks = (boxscores\n", " .query('team == \"MIL\"')\n", " .loc[:, ['date', 'player', 'team', 'pts', 'reb', 'ast', 'stl', 'blk']]\n", " )\n", "\n", "bucks.sample(5)" ] }, { "cell_type": "markdown", "id": "60e15bb1-401c-4fc4-b427-3811bc5cafda", "metadata": {}, "source": [ "Make a new DataFrame named `bucks_long` that is the result of transforming the `bucks` DataFrame to \"long\" format. All the columns that represent statistics should be transformed into two columns: `stat` and `value`. Display several rows of the `bucks_long` DataFrame." ] }, { "cell_type": "code", "execution_count": 24, "id": "1b9e7dbd-5b3d-45bd-a2a3-0deb5bc4d0b6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateplayerteamstatvalue
35612024-04-02D. GallinariMILstl1
33052024-02-08G. AntetokounmpoMILstl1
36922023-11-01B. PortisMILblk0
4662024-01-20P. ConnaughtonMILpts10
38752023-12-05M. BeasleyMILblk0
39292023-12-17K. MiddletonMILblk0
8772024-04-10D. LillardMILpts29
\n", "
" ], "text/plain": [ " date player team stat value\n", "3561 2024-04-02 D. Gallinari MIL stl 1\n", "3305 2024-02-08 G. Antetokounmpo MIL stl 1\n", "3692 2023-11-01 B. Portis MIL blk 0\n", "466 2024-01-20 P. Connaughton MIL pts 10\n", "3875 2023-12-05 M. Beasley MIL blk 0\n", "3929 2023-12-17 K. Middleton MIL blk 0\n", "877 2024-04-10 D. Lillard MIL pts 29" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bucks_long = bucks.melt(id_vars = ['date', 'player', 'team'],\n", " var_name = 'stat')\n", "\n", "bucks_long.sample(7)" ] }, { "cell_type": "markdown", "id": "c727a134-79b7-4513-8365-b03a87988c48", "metadata": {}, "source": [ "## Practice Exercise 3-11\n", "\n", "Make a new DataFrame named `bucks_wide` by transforming the `bucks_long` DataFrame back to wide format. Verify your tranformation by displaying several randomly-selected rows from `bucks_wide`." ] }, { "cell_type": "code", "execution_count": 25, "id": "aa59d37e-0cb1-42ea-942a-fd45a1692666", "metadata": {}, "outputs": [], "source": [ "bucks_wide = pd.pivot(bucks_long, \n", " index = ['date', 'player', 'team'],\n", " columns = 'stat',\n", " values = 'value')" ] }, { "cell_type": "code", "execution_count": 26, "id": "32e2b917-63f6-4fc3-95fc-e104034dd1c4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statastblkptsrebstl
dateplayerteam
2024-02-25G. AntetokounmpoMIL9130120
2024-01-31C. PayneMIL20010
2023-12-23M. BeasleyMIL101960
2024-02-27M. BeauchampMIL10730
2024-01-31D. LillardMIL702561
2024-03-26J. CrowderMIL00521
2023-11-22P. ConnaughtonMIL01330
\n", "
" ], "text/plain": [ "stat ast blk pts reb stl\n", "date player team \n", "2024-02-25 G. Antetokounmpo MIL 9 1 30 12 0\n", "2024-01-31 C. Payne MIL 2 0 0 1 0\n", "2023-12-23 M. Beasley MIL 1 0 19 6 0\n", "2024-02-27 M. Beauchamp MIL 1 0 7 3 0\n", "2024-01-31 D. Lillard MIL 7 0 25 6 1\n", "2024-03-26 J. Crowder MIL 0 0 5 2 1\n", "2023-11-22 P. Connaughton MIL 0 1 3 3 0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bucks_wide.sample(7)" ] }, { "cell_type": "markdown", "id": "39faadd0-77d4-4cb7-bc8c-ce7d2b46bc48", "metadata": {}, "source": [ "## Practice Exercise 3-12\n", "\n", "In the `teams` DataFrame several columns have the teams' records for various contexts. The data is more useful if we split out the wins and losses into their own columns. Based on the `ot_g` column create three new columns in the DataFrame:\n", "\n", "* `ot_win_pct` - win percentage in overtime games\n", "* `ot_wins` - number of overtime games won\n", "* `ot_losses` - number of overtime games lost" ] }, { "cell_type": "code", "execution_count": 27, "id": "ae5887a9-258d-4ea7-8625-cf6d3735c8bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
teamwinslosses3_pt_g10_pt_gvs_ge_500vs_lt_500ot_got_winsot_lossesot_win_pct
6DAL50325-231-2125-2725-51-0101.000000
13LAL473511-420-2526-2921-64-1410.800000
1BKN32504-617-2813-3919-111-4140.200000
24POR21615-65-388-4913-125-3530.625000
10HOU41418-523-1720-3221-92-5250.285714
\n", "
" ], "text/plain": [ " team wins losses 3_pt_g 10_pt_g vs_ge_500 vs_lt_500 ot_g ot_wins \\\n", "6 DAL 50 32 5-2 31-21 25-27 25-5 1-0 1 \n", "13 LAL 47 35 11-4 20-25 26-29 21-6 4-1 4 \n", "1 BKN 32 50 4-6 17-28 13-39 19-11 1-4 1 \n", "24 POR 21 61 5-6 5-38 8-49 13-12 5-3 5 \n", "10 HOU 41 41 8-5 23-17 20-32 21-9 2-5 2 \n", "\n", " ot_losses ot_win_pct \n", "6 0 1.000000 \n", "13 1 0.800000 \n", "1 4 0.200000 \n", "24 3 0.625000 \n", "10 5 0.285714 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "teams['ot_wins'] = teams['ot_g'].str.split('-').str.get(0).astype('int')\n", "teams['ot_losses'] = teams['ot_g'].str.split('-').str.get(1).astype('int')\n", "teams['ot_win_pct'] = teams['ot_wins'] / (teams['ot_wins'] + teams['ot_losses'])\n", "\n", "teams.sample(5)" ] }, { "cell_type": "markdown", "id": "f5eef9d0-72a2-467f-9807-d1914af3e75a", "metadata": {}, "source": [ "## Practice Exercise 3-13\n", "\n", "Load and examine the `batting` and `people` DataFrames. Batting has batting statistics for major league baseball players in the U.S. from 1871 to 2022. The `people` DataFrame has lots of information on major league baseball players. Both DataFrames have a `playerID` column that may be used to merge them if you want to answer a question that requires data from both DataFrames. \n", "\n", "Next, compare the collective batting averages (total hits / total at bats) for players named \"Jim\" vs players named \"Tim.\" You should calculate one batting average for each name that reflects the hitting performance of players with that first name across history." ] }, { "cell_type": "code", "execution_count": 28, "id": "3616d5d0-193a-431f-b4b3-977222675c80", "metadata": {}, "outputs": [], "source": [ "# Get files from server\n", "batting = pd.read_csv('https://neuronjolt.com/data/Batting.csv')\n", "people = pd.read_csv('https://neuronjolt.com/data/People.csv')" ] }, { "cell_type": "code", "execution_count": 29, "id": "371c4685-b02c-4acc-90dc-93af8ea47c3b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 112184 entries, 0 to 112183\n", "Data columns (total 22 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 playerID 112184 non-null object \n", " 1 yearID 112184 non-null int64 \n", " 2 stint 112184 non-null int64 \n", " 3 teamID 112184 non-null object \n", " 4 lgID 111447 non-null object \n", " 5 G 112184 non-null int64 \n", " 6 AB 112184 non-null int64 \n", " 7 R 112184 non-null int64 \n", " 8 H 112184 non-null int64 \n", " 9 2B 112184 non-null int64 \n", " 10 3B 112184 non-null int64 \n", " 11 HR 112184 non-null int64 \n", " 12 RBI 111428 non-null float64\n", " 13 SB 109816 non-null float64\n", " 14 CS 88642 non-null float64\n", " 15 BB 112184 non-null int64 \n", " 16 SO 110084 non-null float64\n", " 17 IBB 75533 non-null float64\n", " 18 HBP 109368 non-null float64\n", " 19 SH 106116 non-null float64\n", " 20 SF 76080 non-null float64\n", " 21 GIDP 86742 non-null float64\n", "dtypes: float64(9), int64(10), object(3)\n", "memory usage: 18.8+ MB\n" ] } ], "source": [ "batting.info()" ] }, { "cell_type": "code", "execution_count": 30, "id": "9126eaaa-8805-4d3a-b7bd-f57da0b3fdd2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 20676 entries, 0 to 20675\n", "Data columns (total 24 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 playerID 20676 non-null object \n", " 1 birthYear 20567 non-null float64\n", " 2 birthMonth 20398 non-null float64\n", " 3 birthDay 20256 non-null float64\n", " 4 birthCountry 20617 non-null object \n", " 5 birthState 20136 non-null object \n", " 6 birthCity 20508 non-null object \n", " 7 deathYear 10098 non-null float64\n", " 8 deathMonth 10097 non-null float64\n", " 9 deathDay 10096 non-null float64\n", " 10 deathCountry 10094 non-null object \n", " 11 deathState 10038 non-null object \n", " 12 deathCity 10089 non-null object \n", " 13 nameFirst 20639 non-null object \n", " 14 nameLast 20676 non-null object \n", " 15 nameGiven 20639 non-null object \n", " 16 weight 19864 non-null float64\n", " 17 height 19944 non-null float64\n", " 18 bats 19498 non-null object \n", " 19 throws 19699 non-null object \n", " 20 debut 20463 non-null object \n", " 21 finalGame 20463 non-null object \n", " 22 retroID 20627 non-null object \n", " 23 bbrefID 20665 non-null object \n", "dtypes: float64(8), object(16)\n", "memory usage: 3.8+ MB\n" ] } ], "source": [ "people.info()" ] }, { "cell_type": "code", "execution_count": 31, "id": "745112bb-2997-4123-a1ac-b84654afa64a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20676" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Confirm that there is one row per playerID in the people DataFrame\n", "# There are 20,676 rows and 20,676 unique player IDs, so there must be one player ID per row.\n", "people['playerID'].nunique()" ] }, { "cell_type": "code", "execution_count": 32, "id": "4f92d296-abc0-4c16-baf8-57aee681e4dd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_hitstotal_at_batsbat_avg
nameFirst
Jim655552606990.251459
Tim19916788680.252523
\n", "
" ], "text/plain": [ " total_hits total_at_bats bat_avg\n", "nameFirst \n", "Jim 65555 260699 0.251459\n", "Tim 19916 78868 0.252523" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(batting\n", " .merge(people, on = 'playerID', how = 'left')\n", " .query('nameFirst == \"Jim\" or nameFirst == \"Tim\"')\n", " .groupby('nameFirst')\n", " .agg(total_hits = ('H', 'sum'),\n", " total_at_bats = ('AB', 'sum'))\n", " .assign(bat_avg = lambda df: df['total_hits'] / df['total_at_bats'])\n", ")" ] }, { "cell_type": "markdown", "id": "da0c1aa7-6ccb-4905-a15c-b9ce576f66cf", "metadata": {}, "source": [ "## Practice Exercise 3-14\n", "\n", "For years 1990 and later, show average height and weight of professional baseball players in the U.S., by year. Be careful not to double-count any players in any one year!" ] }, { "cell_type": "code", "execution_count": 33, "id": "09639172-3dbf-4f1e-8e36-d30aed65d734", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
heightweight
yearID
199073.367961188.076699
199173.328820188.611219
199273.375372189.064548
199373.442935190.302536
199473.453078189.909183
199573.366284190.443954
199673.464505191.544259
199773.399822192.233304
199873.428331193.239460
199973.474773194.637717
200073.471545195.591057
200173.454098197.180328
200273.467980198.798030
200373.473984199.674797
200473.492382201.206095
200573.489086203.257074
200673.586957206.064412
200773.659624208.138498
200873.619675209.405112
200973.676145210.467615
201073.668535212.336269
201173.656371213.244015
201273.663551213.104361
201373.625767213.030675
201473.637121213.452273
201573.674332213.088279
201673.737620213.121951
201773.722386212.988954
201873.727339212.629442
201973.727660212.746809
202073.775795212.696664
202173.777188211.017905
202273.712375208.642809
\n", "
" ], "text/plain": [ " height weight\n", "yearID \n", "1990 73.367961 188.076699\n", "1991 73.328820 188.611219\n", "1992 73.375372 189.064548\n", "1993 73.442935 190.302536\n", "1994 73.453078 189.909183\n", "1995 73.366284 190.443954\n", "1996 73.464505 191.544259\n", "1997 73.399822 192.233304\n", "1998 73.428331 193.239460\n", "1999 73.474773 194.637717\n", "2000 73.471545 195.591057\n", "2001 73.454098 197.180328\n", "2002 73.467980 198.798030\n", "2003 73.473984 199.674797\n", "2004 73.492382 201.206095\n", "2005 73.489086 203.257074\n", "2006 73.586957 206.064412\n", "2007 73.659624 208.138498\n", "2008 73.619675 209.405112\n", "2009 73.676145 210.467615\n", "2010 73.668535 212.336269\n", "2011 73.656371 213.244015\n", "2012 73.663551 213.104361\n", "2013 73.625767 213.030675\n", "2014 73.637121 213.452273\n", "2015 73.674332 213.088279\n", "2016 73.737620 213.121951\n", "2017 73.722386 212.988954\n", "2018 73.727339 212.629442\n", "2019 73.727660 212.746809\n", "2020 73.775795 212.696664\n", "2021 73.777188 211.017905\n", "2022 73.712375 208.642809" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(batting\n", " [['yearID', 'playerID']].drop_duplicates() \n", " .query('yearID >= 1990')\n", " .merge(people, on = 'playerID', how = 'left')\n", " .groupby('yearID')\n", " [['height', 'weight']].mean()\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }