mykolamelnykml/README.md

Greetings! 👋

My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in transforming complex business ideas into scalable, secure, and efficient AI-driven products that drive business growth and improve efficiency.

Key Areas of My Specialization:

📄 Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR): 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using technologies such as Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.

⚑ Big Data Processing with Apache Spark: 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals, Spark Structured Streaming, and creator/contributor to the open-source spark-pdf datasource project written in Scala, enhancing Spark’s capabilities.

🔒 Data De-identification & Anonymization: Expert in anonymizing sensitive data in text, images, PDFs, and DICOM files. I use NLP, OCR, and computer vision to remove or mask personal information, supporting privacy, security, and compliance with GDPR and HIPAA.

🧬 Healthcare, Pharma, MedTech, BioTech Expertise: Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats like DICOM and experience delivering solutions tailored to these industries.

Top 5 Reasons to Work With Me

✅ End-to-End Expertise

✅ Complex Problem-Solving Ability

✅ Timely Delivery

✅ Transparent Communication

✅ Scalable Solutions

Professional Skills

🛠️ Programming Languages: Python, Scala

📊 Data Science & Machine Learning: NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionalization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)

💡 LLMs and Related Tools: OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio

⚑ Big Data & Distributed Systems: Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks

🚀 Cloud Computing & Infrastructure: Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana

⚙️ Databases: PostgreSQL, MongoDB, Redis, DynamoDB

💼 CRMs: HubSpot, Zoho CRM

Availability

Committed to long-term collaborations. Available full-time for your next project.

My Projects

Spark PDF DataSource


Source Code: https://github.com/StabRise/spark-pdf

Home page: https://stabrise.com/spark-pdf/

Quick Start Jupyter Notebook: PdfDataSource.ipynb


The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

Key features:

  • Reads PDF documents into a Spark DataFrame
  • Reads PDF files lazily, page by page
  • Supports large files, up to 10k pages
  • Supports scanned PDF files (via OCR)
  • No need to install Tesseract OCR; it is included in the package
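
A minimal PySpark sketch of reading PDFs through this data source. It assumes the data source is registered under the short name `pdf` and accepts options named `imageType`, `resolution`, `pagePerPartition`, and `reader`; please verify these names and values against the home page above. The helper `read_pdf` and the example path are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Optional

# Assumed option names and values; check them against the spark-pdf documentation.
DEFAULT_PDF_OPTIONS: Dict[str, str] = {
    "imageType": "BINARY",    # assumed: how rendered page images are stored
    "resolution": "200",      # assumed: DPI used when rendering pages
    "pagePerPartition": "2",  # assumed: pages processed per Spark partition
    "reader": "pdfBox",       # assumed: backend used to parse the PDF
}


def read_pdf(spark: Any, path: str, options: Optional[Dict[str, str]] = None) -> Any:
    """Load PDF files via the (assumed) `pdf` data source.

    `spark` is expected to be a SparkSession with the spark-pdf package
    on its classpath. Caller-supplied options override the defaults.
    """
    merged = {**DEFAULT_PDF_OPTIONS, **(options or {})}
    return spark.read.format("pdf").options(**merged).load(path)


# Usage with a real session (not executed here):
#   df = read_pdf(spark, "path/to/pdfs/*.pdf", {"resolution": "300"})
#   df.printSchema()
```

Keeping the options in a dictionary makes it easy to override a single setting per job without repeating the whole reader chain.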

ScaleDP


Source Code: https://github.com/StabRise/scaledp

Home page: https://stabrise.com/scaledp/

Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb


ScaleDP is an open-source library for processing documents using Apache Spark.

Key features:

  • Load PDF documents/Images
  • Extract text from PDF documents/Images
  • Extract images from PDF documents
  • OCR Images/PDF documents
  • Run NER on text extracted from PDF documents/Images
  • Visualize NER results
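
The order of the stages listed above (extract text, run NER, visualize) can be illustrated with a plain-Python sketch. This is not the ScaleDP API: every name here (`Document`, `extract_text`, `run_ner`, `visualize`, `run_pipeline`) is hypothetical, and two toy regular expressions stand in for a real NER model. See the Quick Start notebook for the actual Spark-based usage.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Entity:
    label: str
    start: int
    end: int


@dataclass
class Document:
    raw: bytes
    text: str = ""
    entities: List[Entity] = field(default_factory=list)


Stage = Callable[[Document], Document]

# Toy patterns standing in for a real NER model (assumption for illustration).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_text(doc: Document) -> Document:
    """Stand-in for text extraction/OCR: decode the raw bytes as UTF-8."""
    return Document(doc.raw, doc.raw.decode("utf-8"), doc.entities)


def run_ner(doc: Document) -> Document:
    """Tag every pattern match as an entity, ordered by position."""
    found = [
        Entity(label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(doc.text)
    ]
    return Document(doc.raw, doc.text, sorted(found, key=lambda e: e.start))


def visualize(doc: Document) -> str:
    """Render the text with each entity wrapped as [LABEL: value].

    Assumes entities are sorted and do not overlap.
    """
    parts, pos = [], 0
    for e in doc.entities:
        parts.append(doc.text[pos:e.start])
        parts.append(f"[{e.label}: {doc.text[e.start:e.end]}]")
        pos = e.end
    parts.append(doc.text[pos:])
    return "".join(parts)


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, passing the result to the next one."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    Document(b"Contact jane@example.com before 2024-05-01."),
    [extract_text, run_ner],
)
print(visualize(result))
```

Modeling each step as a function from document to document mirrors how pipeline stages are chained, so a stage can be swapped (for example, a real OCR engine for `extract_text`) without touching the others.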

GitHub

Mykola's GitHub stats
