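【Project Goal】
Identify and extract a keyword from a large number of company annual reports (PDF files): check whether each file contains "增值税留抵税额:XXXX" (the VAT credit carried forward), and write the file name together with that content to a table.
【Implementation】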
1. Import the Python libraries for processing PDFs
import pdfplumber
import PyPDF2
import re
import os
import csv
import json
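2. Define a function that returns the number of pages in a PDF file
def get_pages(filename):
    with open(filename, 'rb') as fb:
        # PdfReader is the current PyPDF2 class; older releases used
        # PdfFileReader(fb).getNumPages() instead.
        pages = len(PyPDF2.PdfReader(fb).pages)
    return pages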
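3. The VAT credit carried forward usually appears in the second half of the file, so the loop starts searching at page 100, uses a regular expression to find the keyword, and extracts the values that follow it
def get_text(filename, pages):
    with pdfplumber.open(filename) as pdf:
        # Scan from page index 100 and skip the last 10 pages
        for i in range(100, pages - 10):
            # Guard against pages where extract_text() yields no text
            text = pdf.pages[i].extract_text() or ''
            find = re.findall('增值税留抵税额(.*)', text)
            if find:
                return find[0].strip().split(" ")
    # Keyword not found in the scanned pages
    return None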
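To see what the pattern in step 3 captures, here is a small sketch run on a made-up line of text. The sample line and its numbers are invented for illustration, and the helper name parse_balances is hypothetical. Real extracted text may use different spacing, so the whitespace handling here is an assumption, not the original author's method.

```python
import re

# Hypothetical line as it might come out of extract_text(); the numbers are invented.
sample = "增值税留抵税额 1,234.56 789.00"

def parse_balances(text):
    """Return the whitespace-separated values after the keyword, or None."""
    found = re.findall('增值税留抵税额(.*)', text)
    if not found:
        return None
    # split() without an argument collapses repeated whitespace,
    # whereas split(" ") would produce empty strings between double spaces.
    return found[0].strip().split()

print(parse_balances(sample))
print(parse_balances("no keyword here"))
```

If a line yields exactly two values, they can be unpacked as the end and start balances. Any other count makes the unpacking fail, which is why step 5 wraps it in a try block.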
4. Save the results to a table
def save(company_name, report_date, end_balance, start_balance):
    # Append mode, so each call adds one row to annual_report.csv
    with open('annual_report.csv', 'a', newline="", encoding='utf-8') as f_csv:
        writer = csv.writer(f_csv)
        writer.writerow([company_name, report_date, end_balance, start_balance])
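Because save opens the file in append mode, the resulting CSV has no header row. One possible way to add a header only when the file is new or empty is sketched below. The function name save_with_header, the path parameter, and the column names are assumptions made for this example.

```python
import csv
import os

# Assumed column names, mirroring the arguments of save()
HEADER = ['company_name', 'report_date', 'end_balance', 'start_balance']

def save_with_header(path, row):
    """Append one row to a CSV file, writing the header first if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f_csv:
        writer = csv.writer(f_csv)
        if need_header:
            writer.writerow(HEADER)
        writer.writerow(row)
```

Calling save_with_header repeatedly with the same path gives a file with a single header line followed by one line per call.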
5. Run the code
if __name__ == '__main__':
    file_list = os.listdir()
    # Drop the entries in the working directory that are not annual reports
    for skip in ('.idea', 'pdf6.py', 'annual_report.csv'):
        if skip in file_list:
            file_list.remove(skip)
    file_list_copy = file_list[::]
    for file in file_list_copy:
        # Company name and report year are parsed from the file name
        name = re.findall(r'\d+(.*?):', file)[0]
        date = re.findall(r'(\d+年)年度报告', file)[0]
        pages_num = get_pages(file)
        # Extract once and reuse the result
        result = get_text(file, pages_num)
        if result is not None:
            try:
                end, start = result
                save(name, date, end, start)
                file_list.remove(file)
            except Exception as e:
                print(e)
    # Record the files that could not be processed
    with open('rest.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(file_list, ensure_ascii=False))
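The main block relies on file names that carry a numeric code, the company name, a colon, and the report year. The sketch below applies the same two patterns to a made-up file name and selects PDFs by extension rather than removing known non-PDF entries one by one. The sample file name and the helper parse_filename are hypothetical, and the real naming scheme may differ.

```python
import re

def parse_filename(filename):
    """Return (company_name, report_year), or None if the name does not match."""
    name = re.findall(r'\d+(.*?):', filename)
    date = re.findall(r'(\d+年)年度报告', filename)
    if not name or not date:
        return None
    return name[0], date[0]

# Invented directory listing for illustration
files = ['600000示例公司:2019年年度报告.pdf', 'pdf6.py', 'annual_report.csv']

# Keep only PDF files, regardless of what else is in the directory
pdf_files = [f for f in files if f.lower().endswith('.pdf')]

for f in pdf_files:
    print(parse_filename(f))
```

Returning None for names that do not match avoids the IndexError that indexing an empty findall result with [0] would raise.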