Read and format project data
= "lahmansbaseballdb.sqlite"
sqlite_file = sqlite3.connect(sqlite_file) db
Course DS 250
Nefi Melgar
When you hear the word “relationship” what is the first thing that comes to mind? Probably not baseball. But a relationship is simply a way to describe how two or more objects are connected. There are many relationships in baseball such as those between teams and managers, players and salaries, even stadiums and concession prices. Let’s analyze other relationships in baseball.
HAS ANY PROFESSIONAL BASEBALL PLAYER ATTENDED TO BYU?
In the following table we can see those professional players who attended BYU-I, their salary, the year they played for a professional team and the team’s id.
Does a 4000000 salary look bad? I don’t think so.
WHAT PLAYERS HAVE GOTTEN THE HIGHEST BATTING AVERAGE OVER THEIR ENTIRE CAREERS?
The batting average is the number of hits divided by the number of at-bats of the player. If a player has a high number of at-bats it doesn’t mean he will have a high number of hits, they will vary from player to player, let’s analyz more of this.
# Include code here
# hits = """SELECT h, ab, playerid, yearid FROM batting WHERE ab >= 1"""
# hits_total = pd.read_sql_query(hits, db)
# hits_total["batting_avg"] = hits_total["H"] / hits_total["AB"]
# ab1_filtered = hits_total[["playerID", "yearID", "batting_avg"]]
# ab1_filtered = ab1_filtered.sort_values(
# by=["batting_avg", "playerID"], ascending=[False, True]
# )
# ab1_filtered.head(5)
It seems the amount of players with a batting average of 1 is high.
# Include code here
# hits = """SELECT h, ab, playerid, yearid FROM batting WHERE ab >= 10"""
# hits_total = pd.read_sql_query(hits, db)
# hits_total["batting_avg"] = hits_total["H"] / hits_total["AB"]
# ab10_filtered = hits_total[["playerID", "yearID", "batting_avg"]]
# ab10_filtered = ab10_filtered.sort_values(
# by=["batting_avg", "playerID"], ascending=[False, True]
# )
# ab10_filtered.head(5)
Batting average has decreased as the at bats increased.
# Include code here
# hits = """SELECT h, ab, playerid, yearid FROM batting WHERE ab >= 100"""
# hits_total = pd.read_sql_query(hits, db)
# hits_total["batting_avg"] = hits_total["H"] / hits_total["AB"]
# ab100_filtered = hits_total[["playerID", "batting_avg"]]
# ab100_filtered = ab100_filtered.groupby("playerID").mean()
# ab100_filtered = ab100_filtered.sort_values(
# by=["batting_avg", "playerID"], ascending=[False, True]
# )
# ab100_filtered.head(5)
Batting average seems to be low, but these players have more than 100 at-bats.
IS THERE A BIG DIFFERENCE FOR HOME RUNS NUMBERS WHEN COMPARING BIG AND FAMOUS TEAMS?
There have been many baseball teams since this beautiful sport started, many of them still making history and adding numbers to their statistics, let’s compare 2 of these teams. We will pay special attention to the number of home runs. Team 1 will be Cincinnati Reds, and team 2 will be Pittsburgh Pirates.
Both teams started registering home runs around the same years, before 1880, Pittsburgh started with a high amount of home runs compared to Cincinnati, but for some reason Cincinnati has a higher amount of home runs through the years, they have a clear advantage.
# select all teams
# teams = """SELECT * FROM teams"""
# teams_total = pd.read_sql_query(teams, db)
# teams_total
# # select first team Cincinnati Reds
# select_teams = """SELECT * FROM teams WHERE name = 'Cincinnati Reds' OR name = 'Pittsburgh Pirates'"""
# chosen_teams = pd.read_sql_query(select_teams, db)
# chosen_teams["Year"] = pd.to_datetime(chosen_teams["yearID"], format="%Y")
# chosen_teams
# chart_1 = px.scatter(
# chosen_teams,
# x="Year",
# y="HR",
# color="name",
# color_discrete_sequence=["red", "yellow"],
# labels={"HR": "Home Runs", "name": "Team"},
# title="Home Runs through the years",
# )
# chart_1.show()
This table shows the home runs through the years for both teams.