Greetings!
My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in turning complex business ideas into scalable, secure, and efficient AI-driven products. My expertise spans the areas below, which lets me deliver AI solutions that drive business growth and improve efficiency.
Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR): 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using technologies such as Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.
Big Data Processing with Apache Spark: 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals and Spark Structured Streaming. Creator of and contributor to the open-source spark-pdf data source project, written in Scala, which extends Spark's capabilities.
Data De-identification & Anonymization: Expert in anonymizing sensitive data from text, images, PDFs, and DICOM files. I ensure privacy, security, and compliance with GDPR and HIPAA standards using NLP, OCR, and computer vision to remove or mask personal information, safeguarding data confidentiality.
Healthcare, Pharma, MedTech, BioTech Expertise: Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats like DICOM, and expertise in delivering solutions specifically tailored to meet the unique needs of these industries.
- End-to-End Expertise
- Complex Problem-Solving Ability
- Timely Delivery
- Transparent Communication
- Scalable Solutions
Programming Languages: Python, Scala
Data Science & Machine Learning: NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)
LLMs and Related Tools: OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio
Big Data & Distributed Systems: Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks
Cloud Computing & Infrastructure: Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana
Databases: PostgreSQL, MongoDB, Redis, DynamoDB
CRMs: HubSpot, Zoho CRM
Committed to long-term collaborations. Available full-time for your next project.
Source Code: https://github.com/StabRise/spark-pdf
Home page: https://stabrise.com/spark-pdf/
Quick Start Jupyter Notebook: PdfDataSource.ipynb
The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
- Reads PDF documents into a Spark DataFrame
- Reads PDF files lazily, page by page
- Supports large files, up to 10k pages
- Supports scanned PDF files (runs OCR)
- No need to install Tesseract OCR; it is bundled with the package
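As a rough illustration of how such a data source is typically used from PySpark, here is a minimal sketch. The format name `pdf` and the option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions based on my recollection of the project's documentation and should be checked against the spark-pdf README; the helper functions `pdf_read_options` and `read_pdf` are hypothetical and not part of the package.

```python
from typing import Dict


def pdf_read_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build reader options as strings.

    The option names are assumed, not confirmed; verify them against
    the spark-pdf README before relying on them.
    """
    return {
        "imageType": "BINARY",                         # assumed option name
        "resolution": str(resolution),                 # assumed: render DPI for page images
        "pagePerPartition": str(pages_per_partition),  # assumed: pages per Spark partition
        "reader": "pdfBox",                            # assumed reader backend name
    }


def read_pdf(spark, path: str, resolution: int = 200, pages_per_partition: int = 2):
    """Read PDF files into a DataFrame through the generic DataFrameReader API.

    `spark` is an existing SparkSession that was started with the
    spark-pdf package on its classpath.
    """
    options = pdf_read_options(resolution, pages_per_partition)
    return spark.read.format("pdf").options(**options).load(path)
```

With a session that has the package available, `read_pdf(spark, "path/to/files/*.pdf")` would return a DataFrame to inspect with `printSchema()` and `show()`. Keeping the options in a plain dictionary makes it easy to adjust them once the actual option names are confirmed.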
Source Code: https://github.com/StabRise/scaledp
Home page: https://stabrise.com/scaledp/
Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb
ScaleDP is an open-source library for processing documents with Apache Spark.
- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results