If you have ever felt like your actual job description is just “professional copy-paster,” you are not alone. Spending hours manually moving numbers from PDFs or websites into Excel doesn’t just drain your energy; it turns you into a data janitor rather than an analyst.
This post breaks down why manual entry is a trap and introduces Python as the ultimate tool to escape it. We look at simple, non-technical ways to automate scraping web tables, combining massive folders of CSVs, and pulling data directly from PDFs.
This transforms you from a bicycle-pedaling spreadsheet user to a motorcycle-riding automation pro. By mastering these streamlined techniques, you can eliminate manual data entry and significantly accelerate your workflow.
It is 4:45 PM on a Friday. You are staring at a folder called “Weekly_Invoices_To_Process”. Inside, there are 50 PDF files.
Your job? Open PDF #1. Find the “Total Amount” at the bottom right. Copy it. Switch to Excel. Paste it into cell B2. Close the PDF. Open PDF #2. Find the “Total Amount”…
Your wrist hurts. Your eyes are dry. And deep down, you have a sinking feeling that you might have pasted the number from PDF #14 into the row for PDF #15.
If you are “good at Excel,” this is your nightmare. You are an analyst, a strategist, a thinker. But you are spending 80% of your week acting like a robot. You are a Data Janitor.
There is a way out. It’s time to hire a digital intern who works for free, never makes a typo, and finishes that 4-hour job in 4 seconds.
That intern is called Python.
The “Extract Data” Trap
Most Excel users think their job is to analyze data. But the reality is that we spend most of our time extracting it. We are stuck in the “Manual Data Entry” tax bracket. It costs businesses billions, but it costs you your career growth because you are too busy copy-pasting to actually think.
Consider the three main traps:
- The Web Trap: You need prices from a competitor’s website. You spend hours highlighting tables, right-clicking “Copy,” and praying the formatting doesn’t explode when you hit “Paste.”
- The PDF Trap: You get data locked inside “read-only” documents. You can’t just copy the table, so you re-type the numbers manually.
- The “Dirty Data” Trap: This is the silent killer. You copy a date like 02/05/2024 from a website, and Excel decides it is May 2nd instead of February 5th. Or you copy a product ID like 00123, and Excel “helpfully” deletes the zeros and makes it 123, ruining your VLOOKUPs.
Why Python is Your “Universal Can Opener”
When you hear “Python,” you probably think of software engineers building apps like Instagram or Spotify.
Forget that.
For us—the Excel people—Python is just a Universal Can Opener.
It can open any file, any website, and any document, extract exactly what you want, and dump it into Excel for you.
Here is why it beats the “Ctrl+C” method every time:
- It Doesn’t Use the Mouse: When you copy-paste, you are using a visual interface. It’s slow. Python reads the raw code behind the file. It’s instant.
- It’s “Audit-Proof”: If you make a mistake manually, you have to start over. If your Python script has a mistake, you fix one line of code and re-run it in 3 seconds.
- It Scales: Copying 10 files takes 10 times longer than copying 1 file. For Python, looping through 10 files or 10,000 files takes almost the same amount of effort to write.
How to Extract Data Using Python (The “Non-Technical” Guide)
I am going to show you four “spells” that will save you hours of work. You don’t need to understand the computer science behind them. You just need to know they exist.
Scenario 1: The “Wikipedia Table” Nightmare
The Task: You need a list of the “Top 500 Companies” from a Wikipedia page or a financial website.
The Excel Way: You highlight the table with your mouse. You scroll down… and down… and your mouse slips. You start over. You paste it into Excel, and the flags turn into giant images that break your row heights.
The Python Way:
There is a specific tool in Python called pandas. It has a function that feels like magic. It is called read_html.
You literally just give it the URL.
import pandas as pd

tables = pd.read_html("https://website.com/page_with_tables")
That’s it. Python visits the site, finds EVERY table on the page, scrapes the data, cleans the formatting, removes the images, and hands them to you as neat little DataFrames. No mouse required.
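Because read_html returns a plain Python list (one DataFrame per table on the page), you pick the one you want by index. Here is a minimal sketch of that step, using a hand-built DataFrame to stand in for a live scrape, since the URL and company data are just placeholders:

```python
import pandas as pd

# Stand-in for what read_html would return: a list with one DataFrame per table.
# On a real page you would do:
#   tables = pd.read_html("https://website.com/page_with_tables")
tables = [pd.DataFrame({"Company": ["Acme", "Globex"], "Revenue": [100, 250]})]

df = tables[0]  # pick the table you want by position
# df.to_excel("companies.xlsx", index=False)  # hand it straight back to Excel
```

The commented-out to_excel line is how you close the loop and land the scraped table back in a spreadsheet.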
Scenario 2: The “Folder of 100 CSVs”
The Task: Your ERP system exports a separate sales report for every single day. You need to combine 30 of them into one Monthly Report.
The Excel Way: Open. Copy. Paste. Close. Repeat x30. (And pray you didn’t paste Tuesday twice).
The Python Way:
Python has a library called glob (think “Global Search”). It works like a vacuum cleaner.
import glob
import pandas as pd

# 1. Find all files that end in .csv
files = glob.glob("*.csv")
# 2. Suck them all into one big table
master_table = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
Two lines of code. It doesn’t matter if the folder has 5 files or 5,000 files. You press “Run,” and one second later, you have a Master Sheet ready to export.
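If you want to try the pattern without touching your real data, the sketch below builds three small daily CSVs in a temporary folder (stand-ins for your ERP exports) and then vacuums them into one master table:

```python
import glob
import os
import tempfile

import pandas as pd

# Create a demo folder with three daily exports (stand-ins for real ERP files)
folder = tempfile.mkdtemp()
for day in ["mon", "tue", "wed"]:
    pd.DataFrame({"day": [day], "sales": [100]}).to_csv(
        os.path.join(folder, f"sales_{day}.csv"), index=False
    )

# The vacuum-cleaner pattern, pointed at that folder
files = glob.glob(os.path.join(folder, "*.csv"))
master_table = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```

Swap the temporary folder for your real one and the same two working lines handle 5 files or 5,000.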
Scenario 3: The “PDF Prison”
The Task: You have a PDF invoice. You need the “Invoice Number” and the “Total”.
The Excel Way: You look at the PDF on one monitor. You type the number into Excel on the other monitor. You transpose two digits (typing 98 instead of 89). The audit fails next month.
The Python Way:
There are libraries like pdfplumber that treat PDFs like spreadsheets.
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    table = pdf.pages[0].extract_table()
It looks at the PDF, “sees” the lines that make up the table, and pulls the text out into structured rows and columns. It’s like having X-ray vision.
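What extract_table hands back is a plain list of lists: the header row first, then the data rows. Here is a sketch of turning that into a DataFrame, with invoice rows made up purely for illustration:

```python
import pandas as pd

# What extract_table() might return: header row first, then data rows
rows = [
    ["Invoice Number", "Total"],
    ["INV-001", "89.00"],
    ["INV-002", "120.50"],
]

# First row becomes the column headers, the rest become the data
df = pd.DataFrame(rows[1:], columns=rows[0])
```

From there it is one to_excel call away from your spreadsheet, with no transposed digits.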
Scenario 4: The “Bad Formatting” Disaster
The Task: Your colleague sends you an Excel file. It’s a mess. The first 5 rows are blank. The header is on row 6. The dates are formatted as text. Column A has “Merged Cells.”
The Excel Way: You spend 20 minutes “Cleaning” the file. You delete rows, unmerge cells, and use “Text to Columns” to fix the dates.
The Python Way:
You don’t clean the file. You just tell Python to ignore the garbage.
import pandas as pd

df = pd.read_excel("messy_file.xlsx", skiprows=5, dtype={'ID': str})
- skiprows=5: Automatically ignores the blank top rows.
- dtype={'ID': str}: Forces the ID column to keep its leading zeros (so 007 stays 007 and doesn’t become 7).
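You can watch the leading-zero fix work without a real file: the snippet below reads the same tiny CSV twice from a string, once letting pandas guess the types and once forcing the ID column to stay text.

```python
import io
import pandas as pd

csv_text = "ID,Amount\n00123,10\n00007,25\n"

# Without dtype, pandas guesses ID is a number and strips the zeros
bad = pd.read_csv(io.StringIO(csv_text))

# With dtype={'ID': str}, the IDs stay exactly as written
good = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})
```

bad['ID'] comes back as the numbers 123 and 7; good['ID'] keeps "00123" and "00007", so your VLOOKUP-style matches still work.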
The Secret Tunnel: APIs (No More Scraping)
There is an even better way to extract data that most Excel users don’t know about. It’s called an API.
Imagine you are at a restaurant.
- Web Scraping is like walking into the kitchen and grabbing food off the stove yourself. It’s messy and the chef might yell at you.
- An API is the waiter. You give the waiter a specific order (“Give me the sales data for January”), and they bring it to you on a clean plate.
Many websites (like Google Analytics, Facebook Ads, or your CRM) have these “waiters” waiting for you. You don’t need to log in and click “Export.” You just ask Python to talk to the waiter.
# Illustrative pseudocode (the real call depends on the service's own library)
# "Hey Facebook, give me the ad spend for yesterday"
data = facebook_api.get("ad_spend", date="yesterday")
No CSVs. No downloads. Just data appearing in your analysis.
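Whichever service the "waiter" works for, the plate usually arrives as JSON, which Python converts straight into a dictionary. A sketch using a canned response (the field names here are invented for illustration):

```python
import json

# A canned API response, standing in for what the waiter brings back
response_text = '{"date": "2024-02-05", "ad_spend": 412.75, "currency": "USD"}'

data = json.loads(response_text)  # JSON text -> Python dictionary
ad_spend = data["ad_spend"]       # grab just the number you asked for
```

Once the response is a dictionary, it drops into a DataFrame or a spreadsheet cell like any other Python value.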
“But… Where Do I Type This?”
This is the number one reason people quit before they start. They think they need to install “Matrix” software or use a command line.
You don’t.
You can use Google Colab.
If you have a Google account (Gmail), you already have Python.
- Go to Google.
- Search “Google Colab.”
- Click “New Notebook.”
It looks just like a Google Doc. You type the code in a box, press “Play,” and it runs on Google’s powerful computers, not your laptop. You can’t break anything. It’s free.
The “Ctrl+C” Retirement Party
Every time you manually copy and paste data, you are wasting your potential. You are a human being with a creative brain, and you are doing the work of a script.
The keyword here is Leverage.
- Excel is a bicycle. It’s faster than walking, but you have to pedal every inch of the way. If you stop pedaling, you stop moving.
- Python is a motorcycle. You twist the throttle, and the engine does the work. You can travel 100 miles in the time it takes the bicycle to travel 5.
Next time you are staring at a folder of 50 files, or a website with 100 pages of tables, don’t sigh and start clicking.
Stop. Take a breath. And ask yourself:
“How can I extract data using Python instead?”
Your wrists (and your career) will thank you.
