From the course: Data Cleaning and Manipulating with Python in Excel
Cleaning text data
From the course: Data Cleaning and Manipulating with Python in Excel
Cleaning text data
- [Instructor] Cleaning up text data has got to be one of the most frustrating parts in my job. I'm finally able to get my hands on some data I've been needing all week, just for it to be completely messed up. With more time and experience, handling text data can still be difficult, but knowing how to think smarter, not harder is key. This is where Python can really help transform our data into being clean and usable. In this video, we will learn how to handle text data using Python in Excel. We will also dive into the different ways to handle this, for example, trimming white spaces. Lastly, we will understand how the Python code conducts our data cleaning. We have a bit to cover in this video, so let's dive right in. So, let's open up the Exercise Files for this video. We'll use the Chapter 2, 02_01 tab. So, looking at our data here, we have a few columns. We have the Company, their City and the State that they're located in. Now, there's going to be some things that we probably notice right away. We look in the Company column and some of them are appropriately titled and some of them aren't. The City has a little bit of white space in some of the cells, and the State is all lowercased, so we're going to go to each column and clean it up the way that we want. So, we're going to do for Company, we're going to go over here to G2, go to Formulas and Insert Python. So, what we want to do in this one is make sure every single string in here is capitalized or titled. For example, look at metro candles in A3. Metro Candles, the m and c, the m in metro, c in candles are lowercased. We want to make sure that they are uppercase to make sure it's appropriately titled. For this, for my data frame here, I'm going to be typing in company equals, and I'm going to name the data frame I'm pulling from, which is df, and for this one in my brackets, I have to tell Python what is the column I'm going to look at. I'm looking at Company, end off my brackets, and I'm going to type in .str. And then from here, I got to tell Python, okay, with this text, what I want to do with it. I want to use the title function to do this, end off with a couple parentheses. Let's Control Enter this and take a look at our results. Okay, so let's take a look and see what we got, and let's see, like, that's a perfect example with Metro Candles, you can see right there, the M is now capitalized, so is the C, and we can see all the companies on here are appropriately titled for their name. What I want to do next is go to the City column, and like I said before, there's a ton of white spaces in there. We could see clearly with Lancaster and Trenton that there's a bunch of white space before it, and we got to finally clean it up so we could actually use it in our analysis. So, over here, going into H2, going to Insert Python, and for this one, I'm just going to call this one city is equal to, and just like before, the beginning's very similar to the title one. So, I have to say the data frame I'm pulling from, which is df, and in my brackets, going to put in City end it off with a period. And now, I'm going to type in my str, and then from here, we have to think, okay, so to take care of the white spaces, AKA, trimming the text data, what do I have to use? And for that, we're using the strip function for that, end off with our parentheses, and let's commit our code. Okay, now let's take a look, see what we got, and perfect. Everything is all good to go, everything's in order, and this allows us to actually use the city column in our analysis now that we have trimmed out all the white spaces within. Next, for the last thing we want to do is look at the State column, and these are all the different dates that the cities are in, however, I want to make sure that they're uppercase. I don't want them in lowercase, that's not going to look too great when I'm trying to put together any kind of analysis I want to do. So, go over to I2, go to Insert Python, and for this one, I'm just call this one state is equal to, and just like the others, I want to make sure I'm specifying the data frame, so df. In my brackets, I have to say what column I want to look at, which would be the State column, do a period, and then type in my str. And then for this one, I have to think once again, I want to look at the State column but what do I want to do with it? I want them all to be uppercase. So, I'm going to type in the word upper, do my parentheses at the end, and commit my code and take a look. Okay, perfect. Let's take a look, see what we got here, and perfect. Literally perfect. Every single state is in uppercase, everything looks great, and now we finally could actually use all this data within our analysis that'll be great to impress any of our bosses or stakeholders. This is fantastic work, and I'm really looking forward to seeing you on the next video.
Practice while you learn with exercise files
Download the files the instructor uses to teach the course. Follow along and learn by watching, listening and practicing.