Dan Sebastianelli

Automating the Tracking of U.S. Politicians' Stock Trades: A Tech Dive

Updated: Oct 19

Step 1: Scraping U.S. Senate Financial Data


Background: The first phase of the project was straightforward: scraping stock transactions from U.S. Senate financial disclosure reports. Since several Python scrapers for Senate data already existed, I decided to start there.

Initial Attempt: The Senate data was already very clean, so I had immediate success scraping it and organizing it into a structured format ready for analysis. I thought, "This will be a breeze."


How I Solved It: After experimenting with different scrapers and parsing methods, I figured out that focusing on specific patterns in the HTML structure allowed me to cleanly extract relevant data and format it properly into CSV files. Success was quick here, and I moved on to the next step, confident in the process.
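The pattern-matching approach can be sketched with Python's standard library alone. This is a minimal sketch, not the scraper I actually used, and the table markup it assumes is hypothetical — the real disclosure pages differ:

```python
import csv
from html.parser import HTMLParser

class DisclosureTableParser(HTMLParser):
    """Collect rows of cell text from <table> markup (hypothetical structure)."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []                      # start a fresh row
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")                # start a fresh cell

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()       # accumulate cell text

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)         # commit the finished row

def scrape_table(html: str) -> list[list[str]]:
    parser = DisclosureTableParser()
    parser.feed(html)
    return parser.rows

def rows_to_csv(rows: list[list[str]], path: str) -> None:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

In practice a library like BeautifulSoup makes the extraction terser, but the idea is the same: walk the row and cell tags, collect the text, and write it straight out as CSV.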

Step 2: Extracting Data from the U.S. House

Background: The next phase involved tackling the U.S. House of Representatives' financial disclosures. Unlike the Senate, no Python scrapers were readily available for House data. Initially, I assumed I could pull the data via the OpenSecrets API.

Initial Attempt: Unfortunately, this option was off the table: OpenSecrets had recently shut down their API, leaving me no easy way to access the data. So I went directly to the source: the U.S. House Financial Disclosure Reports. However, these reports are available only as PDFs with poorly formatted tables, which makes the information difficult to extract. Some PDFs have complex tables with rows that span more than one column. Here you see the 'Asset' column rolling into the next two columns:

I tried PyPDF2, PDFMiner, Tabula-py, Camelot, PyMuPDF (fitz), pdfplumber, and others, and found none of them up to the task of parsing complex tables. Pytesseract performs well on high-contrast images for certain tasks, but it falters on complex tables like the one above.

How I Solved It:


After some research, I leveraged Microsoft's Table Transformer, a machine learning model pretrained on the PubTables-1M dataset, which is available for download from Microsoft Research Open Data.


The key to solving this puzzle lies in first using the pretrained Table Transformer to identify and extract only the tables from the PDFs as images, and then applying smart OCR to convert those images into CSV files. Despite the messy format, I successfully automated this extraction process step by step. You can find a detailed explanation here. The area in pink is identified by the pretrained model and saved as an image file for OCR processing:


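The detect-then-crop step can be sketched as follows. This is a minimal sketch under stated assumptions: the detection itself (not shown) uses a Table Transformer checkpoint such as `microsoft/table-transformer-detection` on the Hugging Face Hub, which returns pixel bounding boxes for each table found on a rendered PDF page; the padding value is illustrative, not the setting I used.

```python
def pad_and_clamp(box, img_w, img_h, pad=20):
    """Expand a detected table box by `pad` pixels on every side,
    clamped to the page image's bounds, so OCR gets a little margin."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(img_w, x1 + pad), min(img_h, y1 + pad))

# With Pillow, the padded box can then be cropped out and saved for OCR:
#   page_image.crop(pad_and_clamp(box, *page_image.size)).save("table.png")
```

Cropping to just the detected table region is what lets the OCR stage ignore headers, footers, and everything else on the page.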
Step 3: Automating the Bulk Processing

Background: Once I successfully extracted table data from individual PDF files, the next logical step was to automate the bulk processing of all the reports. Downloading all the reports in PDF format is relatively straightforward and doesn't warrant detailed discussion here. As outlined in the detailed process of cleaning the table data into a CSV, the final step leverages OpenAI's GPT-4 to refine cells containing garbled data after extraction and merging, ensuring the output is accurate and clean for further analysis.

Initial Attempt: I figured it would be relatively easy to scale up from processing a single file to automating the extraction across all House and Senate PDFs. But I quickly ran into an issue: API rate limits. After a weekend of running test after test, I received an error: "OpenAI API error 429: You exceeded your current quota." My initial subscription plan simply couldn't handle the volume of requests.

How I Solved It: I upgraded my OpenAI API plan and restructured the code to batch the requests, staying well within quota limits. This allowed me to process the files in bulk without hitting errors, successfully automating this phase.
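The batching-and-backoff pattern can be sketched like this. It is a sketch, not my production code: `RateLimitError` is a stand-in for whatever exception the client library raises on HTTP 429, and the batch size and delays are illustrative.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the client library's 429 exception."""

def chunked(items, size):
    """Split the work into fixed-size batches so each burst of requests stays small."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def call_with_backoff(fn, *args, retries=5, base_delay=1.0):
    """Retry a callable on rate-limit errors, doubling the wait each time
    and adding jitter so parallel workers don't retry in lockstep."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except RateLimitError:
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
    raise RuntimeError("gave up after repeated 429s")
```

Wrapping each per-file API call in `call_with_backoff` and iterating over `chunked(pdf_paths, batch_size)` keeps the request rate under the plan's quota without manual babysitting.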

Step 4: Handling JSON Formatting Issues

Background: I tried GPT-4's image-processing features to compare them against easyOCR. GPT-4's handling of table images is on par with easyOCR, so I tried sending the table images to GPT-4 via the API. Even after getting the bulk processing running smoothly, I found myself stuck in what I started calling "JSON hell": the model couldn't handle more than one image at a time, often returning garbled responses when I attempted to send multiple images.

Initial Attempt: At first, I manually sent each image to the API in separate requests using two different browsers. It worked, but it completely defeated the purpose of automation.


How I Solved It: After debugging and refining my API calls, I limited the input to one image per request and wrote logic to process them sequentially. I also implemented more robust error-handling to manage JSON issues. This approach allowed me to continue using GPT-4's image processing features effectively, turning the PDF data into clean tables and ensuring consistent output.
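The sequential one-image-per-request loop with defensive JSON parsing can be sketched as follows. This is an assumption-laden sketch: `send_one` stands in for the real single-image API call, and the fence-stripping handles a failure mode I'm describing generically (models sometimes wrap JSON in markdown code fences).

```python
import json

def safe_parse(raw: str):
    """Parse a model response as JSON, stripping markdown fences
    (e.g. ```json ... ```) that the model sometimes adds."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):     # drop an optional language tag
            text = text[4:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None                     # signal an unusable response

def process_images(paths, send_one):
    """Send each image in its own request, in order; skip unparseable replies.
    `send_one` is a stand-in for the single-image API call."""
    results = []
    for path in paths:
        parsed = safe_parse(send_one(path))
        if parsed is None:
            print(f"skipping {path}: response was not valid JSON")
            continue
        results.append(parsed)
    return results
```

Keeping exactly one image per request is the crucial constraint; the error handling just makes sure one bad reply doesn't poison the whole batch.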

Step 5: Merging CSV Files into a Master Excel Sheet

Background: The final step was to combine all the CSVs generated from Senate and House financial reports into a single Excel file that could be regularly updated.

Initial Attempt: At first, I tried merging all CSVs in one go, but the variations in formatting across different reports caused alignment issues, leading to mismatched data in the final file.

How I Solved It: I resolved this by building a custom data-merging function that cleaned and standardized the CSVs before combining them. The final output was a well-organized master file that I could easily update with new data as more reports are filed. The file, Merged_US_Representatives_Stock_Transactions_2024.xlsx, was ready for analysis.
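The standardize-then-merge idea can be sketched with the standard library. The canonical column names and aliases below are purely illustrative, not the actual schema of the disclosure reports:

```python
CANONICAL = ["asset", "transaction_type", "date", "amount"]

# Hypothetical header variants seen across reports, mapped to the canonical names.
ALIASES = {"asset name": "asset", "type": "transaction_type",
           "transaction date": "date", "amount range": "amount"}

def standardize(header):
    """Normalize one CSV's header row onto the canonical column names."""
    return [ALIASES.get(h.strip().lower(), h.strip().lower()) for h in header]

def merge_rows(files_rows):
    """Merge row lists from several CSVs, realigning columns by name so
    differently ordered or differently labeled reports line up."""
    merged = [CANONICAL]
    for rows in files_rows:
        index = {name: i for i, name in enumerate(standardize(rows[0]))}
        for row in rows[1:]:
            merged.append([row[index[c]] if c in index and index[c] < len(row)
                           else "" for c in CANONICAL])
    return merged
```

The merged rows can then be written to a worksheet with a library such as openpyxl or pandas; aligning by column name first is what prevents the mismatched-data problem from the naive concatenation.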


In summary, this project was a series of technical challenges and roadblocks, each providing a unique learning experience. From scraping data to automating bulk processes and finally merging the results, the project required persistence and creativity, but the outcome was a solid, automated system for tracking U.S. politicians' stock trades.


If you're interested in the data or have projects requiring advanced data processing, feel free to reach out. Below are some of the tables for the 118th U.S. House of Representatives in 2024, generated effortlessly with Python in just a few clicks:








