8/24/2024
I made these in Tableau while working through course guides for exploring Lightning Strike data. They're also hosted on my Tableau Profile!
This bar chart depicts the Number of Lightning Strikes broken down by Quarter and then separated by Year. The two years shown are 2009 and 2018. The top of each bar is labeled with the actual strike count for that quarter.
These Box Plots show the spread and skew of lightning strike data within quartiles, separated by year.
This Geo Map shows the concentration of Lightning Strikes mapped by Latitude and Longitude.
This Histogram shows the distribution of Lightning Strike data.
This Heat Map uses color saturation to indicate the Months by Year with the most Lightning Strikes.
Once you're superficially acquainted with your data, it becomes obvious that you need to start finding the story your data holds. Data analysts typically call this stage of the process EDA, or Exploratory Data Analysis. There are six steps in EDA:
The interesting thing about the six EDA steps is that they are iterative and non-sequential, meaning that during the course of exploring your data, you may perform each step more than once, and you likely won't follow these steps in any specific order. The order is determined by the needs of the project and the dataset. Some projects won't involve joining with another dataset. Others may involve doing so more than once, as well as cleaning and validating the unified dataset each time.
Structuring
Let's examine the structuring step further by looking at five structuring methods described in my analytics course.
Sorting: arranging data in a meaningful order, such as ascending or descending by a particular value.
Extracting: pulling out a specific subset of the data to work with on its own.
Filtering: keeping only the rows or values that meet certain conditions.
Grouping: aggregating rows that share a common value so they can be summarized, for example as totals or averages per category.
Merging: combining two or more datasets into one based on shared fields.
It's clear that learning to use these methods effectively, especially in a coding language like Python, will greatly enhance any search for meaning in data.
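To make these a little more concrete, here is a minimal sketch of what each method might look like in Python with the pandas library. The tiny sales and prices tables (and their column names) are made up purely for illustration; they aren't from the course.

import pandas as pd

# Tiny, made-up tables just for illustration
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "amount": [120.0, 75.5, 200.0, 40.0],
})
prices = pd.DataFrame({"product": ["A", "B"], "price": [10.0, 25.0]})

sorted_sales = sales.sort_values("amount", ascending=False)  # Sorting: order rows by a value
amounts_only = sales[["product", "amount"]]                  # Extracting: pull out a subset of columns
big_sales = sales[sales["amount"] >= 100]                    # Filtering: keep rows that meet a condition
by_region = sales.groupby("region")["amount"].sum()          # Grouping: aggregate by a shared value
merged = sales.merge(prices, on="product", how="left")       # Merging: combine tables on a shared field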
This past week my adventure took another turn when I started learning Python in my advanced analytics course. From what I've read, Python was created to be more user-friendly than earlier programming languages, so I was excited to start. The first few lessons were great: how to use Jupyter Notebooks, object-oriented programming, basic data types and data conversions, clean code, functions, and operators. Most things made perfect analogous sense after working with SQL and R. But then we got to conditional statements and loops. And suddenly, things weren't making as much sense.
Part of that is simply learning to think more logically. To stop and structure ideas step by step before attempting to move forward. It is a work in progress. But that wasn't all. As I sat there stumped, trying to complete programming exercises, I stopped to really think about why I was having so much trouble. And I realized something about the language itself that may have been tripping me up. Python was created to be more user-friendly, to be more like a human language and less like a computer one. When I learned the basics of SQL and R, every word was specifically meaningful. It was a function name, an operator name, an assigned variable, etc. There are no 'just words' when you code in those languages. At least as far as I've seen.
Coding in Python was a surprise because sometimes there are 'just words'. I remember the first time I got completely stumped. I had to get the answer and go back to work through it so I could understand. But when I did, all I could think was "Well, I'd have figured it out if I knew that I could just throw random words in there!". For example:
def purchases_100(sales):
    count = 0
    for customer in sales:
        customer_total = 0
        for purchase in customer:
            customer_total += purchase
        if customer_total >= 100:
            count += 1
    return count
This lovely bit of code defines a function that we are calling 'purchases_100', which takes an argument 'sales'. It calculates the number of customers whose purchases total $100 or more from the list of sales by customer. So you enter the variable name for your sales list, and out pops the desired number of customers. The reason this was so confusing for me initially is that 'customer' and 'purchase' are just words. There is no table with these column names. No variable list of elements is actually assigned to these words. For example, here is the tiny list of sales used in the exercise:
sales = [[2.75], [50.0, 50.0], [150.46, 200.12, 111.30]]
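Calling the function on this list returns 2, since the second and third customers each have purchase totals of at least $100:

purchases_100(sales)  # returns 2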
Each of the internal brackets represents a single customer's activity, and each of the numbers separated by commas inside that bracket represents the individual purchases made by that customer. So the way the function definition is written makes perfect sense, but it doesn't have to be written that way. Nothing in the little list of sales data says that certain information is labeled 'customer' and other information is labeled 'purchase'. I could just as easily write this:
def purchases_100(sales):
    count = 0
    for x in sales:
        x_total = 0
        for y in x:
            x_total += y
        if x_total >= 100:
            count += 1
    return count
And the function works the same way. Now, the first way is best practice because it makes clear exactly what the function does in a way that others can easily understand. You can also come back to it later and not be confused if you've forgotten what x and y were meant to signify. But the point is that Python gives you that flexibility. And with that flexibility, it also grants you the option to be unclear. It is truly more like a human language. I think that is why it's almost harder for me to learn. Just like a non-English speaker rolling their eyes at the constant exceptions to the rules and pronunciation, the relative flexibility of wording in Python compared to SQL or R has me floundering a little.
Ultimately, I can see how the tradeoff is worth it. Python is a powerful programming tool that allows for human flexibility while still communicating in clear ways that computers can understand. I just need to remember to think outside the box a little as I continue my Python journey!
I'm feeling down today. Worried that I didn't accomplish enough this weekend. I didn't spend enough time on my learning course modules or practice long enough on SQL syntax. Worried about the ultimate goal of landing a good job in a tough, competitive market. So I decided to focus on something that jumped out at me after taking these courses: the way I sometimes see myself in career spaces.
During both professional courses I have taken, the instructors have taken the time to mention imposter syndrome and give advice about overcoming its effects. At first, I was surprised, but as I moved forward in my attempts to learn new concepts, set up my portfolio, and begin applying to jobs, I realized it was a very real thing for me. I just never put it into words before.
For anyone who has never heard the phrase before, imposter syndrome is “a psychological condition that is characterized by persistent doubt concerning one's abilities or accomplishments accompanied by the fear of being exposed as a fraud despite evidence of one's ongoing success.”(Link)
I have always been an introvert. I don't really know when I began assuming that being self-deprecating or uncomfortable with strong compliments was simply part of my introversion. It's always been true though. I am a highly motivated individual who strives for excellence, who honestly puts too much pressure on myself in that quest for excellence. But when my managers would praise me or introduce me to someone new with an exaggerated compliment, I would always cringe a little inside, quickly qualifying their comment with something like “I try!". It's not that I wasn't pleased with my accomplishments or happy that they valued me and said so, but I always felt just a bit like…well, like an imposter.
It's helpful to have a word for it. To know that it's really not just me. Especially in my current position, trying to pivot to a new career space. It's important to remember that I do have strong transferable skills. I just need to step back, take a breath, and learn how to talk about them in a new way. I am a quick learner and can still be an asset to any company, even if I'm still working on mastering some new technical skills. And it's not ok to let that nagging self-deprecating voice in my head tell me I'm not good enough to claim my successes.
I went on a side quest recently: the Google AI Essentials course. AI has become such a huge conversation these days, and I wanted to gain a better understanding of the topic. I had already heard things, good and bad, from family, friends, and news articles. My family members who are teachers lamented the use of these tools by students and the ever-greater lengths they needed to go to uncover such usage. News articles and YouTube videos spewed both glowing predictions for the wonders of the AI era and bitter diatribes about how many jobs had been or would be eliminated by AI. It's a controversial topic, for all its ubiquity.
So I admit, I came to the course with a muddle of preconceptions. And honestly, I can admit that I left the course with some questions unanswered, but I did learn a few useful things along the way. One of my biggest takeaways is that I wish more people would take this course or one like it. For every time I sat there listening to the instructors sing the praises of AI and thought, "Wow, someone's been drinking the Kool-Aid!", there was also a moment I thought, "Wow, that's insightful and important." The most important takeaway from the course for me was the idea of human-in-the-loop.
Learning how AI software is built and being introduced to its weaknesses and limitations helps people see what makes this human-in-the-loop approach so profound and important. That is the real benefit of this course. It was fun to be introduced to some conversational AI software. To learn how to engineer prompts and get the most out of these applications. It was even intriguing to learn what others generally consider appropriate use of these applications in the workplace. But human-in-the-loop, that really matters. So what is it?
In broadest terms, AI software is generally powered by a machine learning model. Machine learning models are built by using a huge amount of data to "teach" a model to respond in helpful ways. Can these models process data at tremendous speeds? Absolutely! Can they generate great content in response to your questions? Undoubtedly! But they have limitations. One of the major ones is that machine learning models, and the AI software they power, are only as good as the data that went into them. If that data wasn't beautiful, comprehensive, error-free, bias-free data (and let's face it, is data ever truly that good?), then those omissions, errors, and biases will creep into the answers. In both overt and subtle ways.
Human-in-the-loop is an approach to AI usage that helps ensure AI tools are used responsibly, ethically, and in ways that improve the human condition. It's very, very simple. Whenever AI output is used, there should be a human somewhere in the loop evaluating that output to ensure it is accurate, helpful, and unbiased. Let's be honest, anyone who has actually used these tools knows that AI software does hallucinate. It kicks out things that aren't accurate. It doesn't have the critical thinking skills to wonder if the conclusion it came to reinforced bias or harmed some group of actual human beings. Part of the course detailed the potential harms of AI when not used ethically and responsibly. There were five harm categories: Allocative Harm, Quality-of-Service Harm, Representational Harm, Social System Harm, and Interpersonal Harm. That's why human oversight of AI output is so essential.
When I looked at things that way, it made me realize something even more concerning about the stories my family told me about student AI usage. They were worried, and rightly so, about students actually learning skills instead of covering their lack of learning with AI-generated content. But there is another longer-term, more insidious harm that comes from the availability of these tools to student users. Not only do students then not learn fundamentals, but they also learn to continue using AI tools without the fundamental skills that would allow them to be good humans-in-the-loop. It's one thing for an adult who learned fundamental reading, writing, logic, and critical thinking skills to use these tools to expedite their work process or enhance their skills. Someone like that can adequately and critically evaluate AI output for accuracy and harm. But how do people who grew up using these crutches instead of learning these fundamental skills ever truly operate as successful critics? How many of them will successfully identify and eliminate inaccuracies before they can be spread? How many will notice biases and subtle harms before they disseminate output reinforcing them? How far could that harm spread?
I think as responsible and ethical human beings we need to ensure that courses explaining these tools and their limitations and pitfalls become more common. We need to enforce human-in-the-loop policies wherever AI is used. We need to look critically at what AI tools are available to children and students and make sure that they, too, are taught about the pitfalls. AI is an amazing, powerful, transformative tool. And it's getting away from us. We need to find a way to put training and rules, legislation and policy, awareness and responsibility, into this space before the chance to shape AI's path for the good of humanity gets away from us too.
I was so excited to finish my first case study project and start building my portfolio! It felt simultaneously like a huge accomplishment and the tip of the iceberg. I am looking forward to how much I still have to learn.
I just wasn't expecting to have to learn it right away. Now that I had a case study, I wanted to post it in two places. I wanted it here on my blog as part of my new Portfolio page, but I also wanted to put it up on Kaggle as I hope to work on that platform to keep building my skills. Should be simple, right?
Sadly, not that simple. I'd hoped to be able to just upload my file, but instead the only option I could find was to painstakingly copy and paste it over in separate markdown and code cells. Hey community, if I missed something and it could be easier, please let me know!
Nothing was going to stop me from getting my case study posted, however, so I started in. Thank goodness for split screen capability! It took a little while, but eventually, I was all set. Time to run my notebook.
I may possibly have held my breath as the notebook ran. Only to look down and see one of the saddest words in the English language: failed. And so began the battle. Even before deciding to learn more about data analytics and programming languages, I have always felt the conversation between myself and technology to be a battle. I may pick and choose the battles I dig myself trenches for, but when I choose that battle, I intend to win.
Do I want my music catalog organized the way I like it, not the way the automatic file algorithm sorts it? Well, the battle is on.
Do I want to watch this movie I only have saved on my iPod on a TV? Let's get ready to rumble.
Do I want this case study hosted on my Kaggle profile? I guess it's time to choose our weapons, for I will not back down.
In all honesty, it took me longer than it should have, but I did it. And how I did it might be the best part. One of the absolute best things about learning R has been the community. Every time I got stuck trying to build my case study, there was always someone, somewhere, who had encountered the problem before and written about it. On blogs, on Stack Overflow, everywhere, there is advice and commiseration as we all fight our battles with technology.
So I Googled, and searched, and pieced together the problem. The directory and file path that I had (rather painfully) figured out how to put together to import my CSV files in RStudio did not work in my Kaggle notebook.
Here is my RStudio code for one file:
directory <-"C:/Users/Rebecca/Documents/Google Course/Project_names/Bellabeat_case_study/Primary data/Fitabase Data 3.12.16-4.11.16"
file_path <- file.path(directory, "dailyActivity_merged.csv")
daily_activity <- read_csv(paste0(file_path))
In other words, this is the directory path starting from my computer's C: drive and moving through the relevant folders. Then, this is the file path, which assumes that directory path and adds the file name. Finally, I would like you to take all that path info, read the file it points to, and import it under the name "daily_activity" which I have supplied for it.
After searching for why my file path no longer worked when I copied it into Kaggle, I came to the understanding that Kaggle had a specific working directory. Just making sure that the files were listed under the input tab wasn't enough. I needed the right working directory and file path names for my code.
I think I will spare you how long it took me to figure out a solution. Let's just say I went to bed 0 for 1 in my battle that night. But I polished my armor the next morning and headed back into the fray.
Here is what I eventually found. There is a function in R that returns the current working directory: getwd(). When I ran that code in my Kaggle notebook, it returned '/kaggle/input'. Awesome! Honestly, I already thought that must be right, but now I had it confirmed in black and white. The problem was that when I used it and followed the folder path I could see, it still didn't work. The error told me the file didn't exist.
So I went back to the drawing board and found another function: list.files(). Specifically, for my purposes: list.files(path = "/kaggle/input").
Now this function returned the folders listed under that working directory: 'fitbit' and 'pmdata-fitbit-versa-activity-data'. And now I began to see what went wrong. When I had built that same directory path starting from the Kaggle working directory instead of the C: drive location, I looked at the names of the folders under my Kaggle input tab and typed them into the code just as they appeared. Unfortunately, the actual file path was not exactly the same. For whatever reason, there were differences in capitalization, dashes instead of spaces between words, and other small formatting differences. No wonder it told me the file didn't exist!
So, I went back through and got the correct file path names for each step of the path with list.files(). Then I rewrote my directory and file path code.
directory <-"/kaggle/input/fitbit/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16" file_path <- file.path(directory, "dailyActivity_merged.csv") daily_activity <- read_csv(paste0(file_path))
It's a really strange anxiety. Sitting there at my computer at home on a Sunday morning, holding my breath, as I push that run notebook button one more time.
But it's all worth it, the irritation, the work, the time. Because finally, my notebook successfully runs. I guess that's why we do it. Tilt at windmills searching paragraphs of code for that one stray comma messing things up. Wander around Best Buy certain that there is a cord that converts from iPod to the correct TV input. Spend hours making playlists so that your music shows up the way you want to see it.
In the end, that glow of satisfaction, of knowledge, of victory, makes it all worthwhile.
As one of the practice exercises in my course, I was given a table of sales data for two different products. The fictional company wanted to know how to allocate its marketing dollars for the new year. The goal was to make simple, effective graphics that highlighted sales insights for the company. The first chart I made is a basic but effective bar chart showing monthly sales per product, compared side by side. It was a great way to summarize everything about the data in one visualization, and it might even be enough on its own to answer the company's questions about such a small, focused dataset.
So next I made a pie chart that shows the percentage of total sales for Product A that occurs in a given month. This chart makes an insight about Product A really pop: nearly 70% of annual sales occur in three months, October, November, and December. Sales then drop to low levels for the rest of the year. The chart is a little busy on the right side, though. According to my course advice, pie charts typically work best with fewer than seven segments, so I am looking into what else could represent percentages of the whole this neatly besides just the original bar chart.
It is a pretty image, but it’s rather busy and doesn’t communicate clearly. Again, too many segments can make pie charts less effective. So I went back to the drawing board and made one more.
This bar chart makes it very clear that sales for Product B are fairly even year-round, with a slight uptick in the fourth quarter. The information is technically visible in the first bar chart, but pulling the Product B info out to stand on its own the same way we did with Product A was good practice. I'm excited to see what else I can create as I improve my skills with Data Viz platforms!
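Just for illustration, here's a rough sketch of how a grouped monthly bar chart like the first one could be built in Python with matplotlib rather than Tableau. The numbers below are entirely made up; they are not the exercise data.

import matplotlib.pyplot as plt
import numpy as np

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
product_a = [5, 4, 6, 5, 4, 5, 6, 5, 7, 30, 35, 28]            # invented values
product_b = [12, 11, 13, 12, 12, 13, 12, 13, 12, 15, 16, 15]   # invented values

x = np.arange(len(months))
width = 0.4
plt.bar(x - width / 2, product_a, width, label="Product A")
plt.bar(x + width / 2, product_b, width, label="Product B")
plt.xticks(x, months)
plt.ylabel("Sales")
plt.title("Monthly Sales per Product")
plt.legend()
plt.show()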
One big distinction in data these days is between small data and big data. These two types of data each have their pros and cons, so it's important to understand the attributes, challenges, and benefits of each.
Small data is specific, focused on a particular set of values or a shorter time period. It is usually analyzed in spreadsheets like Google Sheets and Excel with built-in tools. Small data is more likely to be utilized by small and midsize businesses. It is usually simple to collect, store, manage, sort, and visually represent due to its reasonable volume.
This means that small data can be processed and analyzed more quickly, allowing for swift real-time insights. It is more easily understood, interpreted, and managed directly by human beings.
The downside to this ease and speed is that small data, with its specific, focused datasets, is not as comprehensive in its insights. The scope is often limited, which can lead to a lack of precision. Issues with potential bias due to smaller sample sizes can also arise.
Big data, on the other hand, is large and less specific. It can involve many related tables of data over longer time periods. For this reason, it is kept in databases that are queried with programming languages, rather than directly manipulated in a spreadsheet. Big data is more likely to be utilized by large businesses and organizations. It often needs to be broken down in order to be organized and analyzed effectively. It takes a lot of effort to collect, store, manage, sort, and visually represent big data due to its overwhelming volume.
Honestly, that last attribute forms the foundation of why big data can be so challenging. There is often way too much irrelevant, unimportant detail to wade through. Finding the tiny nuggets of relevant, important data can feel like panning for gold. These nuggets are harder to find and use. This means that analyzing big data is a slower process, leading to less efficient time frames for decision-making compared to the turnaround on small data. Even with all the amazing analysis tools we have access to these days, it can still be a struggle to provide meaningful results in a useful timeframe. There can also be some issues with unfair algorithmic bias as even our tools can require shortcuts to wade through the ever-increasing volumes of available big data.
Big data still has a lot of benefits, however, particularly in business settings. The large volume of data can help businesses identify more efficient ways of doing business, saving a lot of time and money. Big data helps spot more comprehensive trends like customer buying patterns and satisfaction levels. These things lead to a much better understanding of current market trends.
Overall, both small data and big data are useful in their own unique ways. They simply need to be utilized correctly, with the right tools and awareness in appropriate situations. If you leverage their benefits while being sensitive to their challenges you can garner meaningful insights for your projects with either one.
One of the most foundational aspects of the course I'm taking is the concept of phases of data analysis. This concept is so pivotal it is used to inform the actual overall structure of the course itself. My course follows the six-phase format. Although there are variations with slightly different numbers of phases, the core concepts are pretty much universal to all versions. These six phases in order are: ASK, PREPARE, PROCESS, ANALYZE, SHARE, and ACT. Each of these stages gets its own section in the overall course.
ASK refers to the stage of analysis where you are working to understand the challenge to be solved or questions to be answered with your project. Now you can definitely fill a book with advice on each of these phases, so this is just an overview of some useful concepts about asking the right questions for data analysis. The first concept is structured thinking: a process of thinking that recognizes the current problem, organizes available info, reveals gaps and opportunities, and identifies your options to resolve the problem. Another way of looking at the ASK phase involves these steps:
Then you can ask yourself:
The answer is to ask SMART questions:
Specific- significant, simple, single topic or few closely related ideas
Measurable- quantifiable
Action-oriented- encourage progress or change
Relevant- significant, important
Time-bound- about a specific period of time
Using these tools to help you structure your thinking around problems and formulate strong questions will help you ASK the right questions to get your data analysis projects off the ground.
PREPARE refers to the stage where you decide what data you need to collect to answer questions and organize your data in the most useful way. Questions you might ask yourself are:
1) What do I need in order to figure out how to solve this problem?
2) What research do I need to do?
Some things you will look at are:
I am going to expand on some of these aspects of data preparation here. I already talked a little about data itself, like quantitative vs. qualitative data, in the last post. There are way too many other types, subtypes, and aspects of data to cover them all, but there are a few I will address here.
The first is the source of your data and why it matters. First-party data is data collected by an entity using its own resources and it is typically the easiest to use and most reliable. Second-party data is data collected directly by an entity from its audience and then sold. Second-party data still tends to be pretty reliable but has a higher cost than first-party data. Finally, third-party data is data from outside sources that did not collect it directly. This type of data is the most unreliable, and special caution should be used to verify its credibility and accuracy before utilizing it.
Another aspect of data that deserves special mention here is data reliability. The acronym used in my Google course for this general assessment of credibility is ROCCC:
Reliable- data that is not obviously inaccurate, incomplete, or biased
Original- does not rely on second- or third-party reports
Comprehensive- not missing important info
Current- relevant and up to date
Cited- properly cited for accountability
Data like this "ROCCCs" and is good data for analysis. Other essential aspects to consider at this stage are how data privacy can be respected and data security achieved. It isn't enough to find the data; you must protect it and the people it represents with good data management. All these things and more are addressed at the PREPARE stage.
PROCESS refers to the stage dedicated to cleaning and formatting your data, because clean data is the best data. At this stage, you get rid of possible errors, inaccuracies, and inconsistencies.
Key questions you are asking in this phase are:
1) What data errors or inaccuracies might get in the way of getting the best possible answer to my problem?
2) How can I clean my data so the info I have is more consistent?
Cleaning is a process that is unique to each dataset, but there are some common issues you can look out for, as well as some best practices for documentation during cleaning. The most common things to look for are:
There are amazing tools built into most spreadsheet programs like Google Sheets and Excel to help you accomplish all these cleaning tasks. These processes can also be performed on databases through SQL queries. My final note on this phase has to do with documentation. When you are cleaning data, it's a best practice to keep a changelog: a document that tracks the changes you made, when you made them, and why. Such logs can be crucial when you need to go back and understand these changes, revert to previous versions of the data, or collaborate with others who need to understand them. This is just a taste of the PROCESS stage for ensuring any data you use is clean and reliable for analysis.
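As a small illustration of what a few of these cleaning steps can look like in code rather than a spreadsheet, here's a minimal pandas sketch. The table and its column names are hypothetical, invented just for this example.

import pandas as pd

# Hypothetical raw data with a few typical problems
raw = pd.DataFrame({
    "customer": [" Ann ", "Bob", "Bob", None, "Dana"],
    "signup_date": ["2024-01-05", "2024-01-07", "2024-01-07", "2024-02-10", "2024-02-11"],
    "amount": ["100", "250", "250", "75", "abc"],
})

clean = raw.copy()
clean["customer"] = clean["customer"].str.strip()                  # trim stray whitespace
clean = clean.drop_duplicates()                                    # remove duplicate rows
clean = clean.dropna(subset=["customer"])                          # drop rows missing a key field
clean["signup_date"] = pd.to_datetime(clean["signup_date"])        # standardize dates
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # non-numeric values become NaN for review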
ANALYZE is the phase we’ve all been waiting for. I’ve ASKED the right questions to make sure my project is focused and relevant. I’ve PREPARED my data by assessing what kind of data ROCCCs, and deciding how to manage and secure it. I’ve PROCESSED the dataset to make sure I’m working with clean, well-formatted data that is ready to go. Here is where I actually:
Here is where we answer the questions:
1) What story is my data telling me?
2) How will my data help me solve this problem?
You may do this with formulas, functions, and pivot tables in spreadsheets or with query languages like SQL in larger databases. Either way, this is where you get to really explore the data for insights, answers, and results. So ANALYZE that data!
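To make that concrete, here's a tiny pandas sketch of the kind of summary you might build at this stage. The sales table, its columns, and its numbers are all hypothetical, not from the course.

import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "revenue": [120, 80, 95, 110, 130, 90],
})

# Pivot table: total revenue by month and product
summary = sales.pivot_table(index="month", columns="product",
                            values="revenue", aggfunc="sum")
print(summary)

# Quick answers to "what story is my data telling me?"
print(sales.groupby("product")["revenue"].sum())  # which product earns more overall
print(sales.groupby("month")["revenue"].mean())   # how revenue varies by month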
SHARE refers to the phase where you present your findings. You’ve made it. You did your analysis, and you have insights to communicate. At this stage, you are ready to:
Key questions to answer at this phase are:
1) How can I make what I present to the stakeholders engaging and easy to understand?
2) What would help me understand this if I was the listener?
Here is your time to shine. All that work and perseverance adds up to this chance to SHARE your insights with the stakeholders.
ACT refers to the final phase where you’re ready to act on your data. This stage takes the insights you have learned and puts them to use. Take any feedback from the SHARE phase and provide stakeholders with recommendations based on your findings. Ask yourself:
1) How can I use the feedback I received during the share phase to fully meet the stakeholders’ needs and expectations?
In this phase, you give your final recommendation, based on your analysis, for how the stakeholders can ACT to resolve the problems they stated at the very beginning of the project.
Like I said, I’m sure there could be books written about each phase and the tools used in each one. But this is a summary of the overall process and some of the interesting aspects and tools I learned about so far in my Adventures in Analysis. Stay tuned as I continue to climb!
It's not like I was oblivious before. Of course we use data to help make decisions! What else are facts for? But honestly, I never really stopped to think about the sheer amount of data being produced these days in our hyper-digital age. Everything I do leaves a footprint of data somewhere. Did I open a website? Search for an item I need? Look up a restaurant I want to go to? Odds are I left data behind me, and it all adds up.
Besides the data I generate, I also never truly stopped to think about how many things I use on a daily basis are built on data. Do I want to check the temperature on my weather app when I get up in the morning so I dress for the weather? Do I want to check the nutrition facts on my cereal box as I eat my breakfast? Do I want to pull up Google Maps and check the traffic so I can choose the best path for my morning commute? So much data went into each and every one of those platforms. And honestly, I rarely think about it.
But now, I'm thinking. My course informed me that data is the world's most valuable resource. How much could I enrich my life by learning to use it? What could I do if I learned to transform data into insights, to do analysis?
During my certification course, I have been introduced or reintroduced to important aspects of data itself. Those topics include the life cycle of data, qualitative vs quantitative data, and fairness in data analysis. Each of these topics helped me expand my understanding of data and its uses.
The Life Cycle includes the following stages: plan, capture, manage, analyze, archive, destroy.
The data lifecycle is described in several other systems with a varying number of stages, but the key concepts are fairly universal.
Another big topic when discussing data is whether it is qualitative or quantitative. Quantitative data, or things you can measure, usually answers questions of what, how many, how often, and so on. Quantitative data is easier to mathematically manipulate and represent in charts and graphs. Common tools to gather quantitative data are structured interviews, surveys, and polls. Qualitative data is subjective or explanatory. It measures qualities and characteristics and often answers questions about why. This can help explain why the numbers are the way they are and provide bigger context. Common tools to gather qualitative data are focus groups, social media text analysis, and in-person interviews. Each type of data provides a different type of insight, and together they can help drive great decisions.
The final topic I wanted to include in this short exploration of data is fairness. While fairness is not exactly a quality of the data itself in raw form, it is an absolutely essential aspect of anything we do with it. Promoting fairness in data analysis involves ensuring that your analysis doesn’t create or reinforce bias. Some best practices in fairness are:
Following these practices will help promote fair and unbiased use of data. Another way of looking at data fairness evaluates any data you use according to the following questions:
Knowing these things about your data can help you understand the context it exists in and evaluate or reduce any bias that is present.
So, what does this add up to? Data is everywhere. If I want to make it work for me and any future employer, I need to understand the lifecycle of data moving through our systems so that I can manage any data I use properly. I need to understand the different types of data because they are analyzed differently and answer different questions. Finally, fairness is an essential aspect of any analysis and needs to be addressed from the beginning to the end of every project, and even factored into the assessment of which data can be included at all. I can't wait to see how understanding these aspects of data and its analysis will help me move forward in my analysis adventures.