🧬 DNA Sequence Classification using Machine Learning

🔹 Overview

This project classifies DNA sequences with machine learning. DNA sequence classification is a fundamental task in bioinformatics, with applications in genomics, proteomics, and personalized medicine. The approach here is deliberately simple: each sequence is converted into k-mer counts, and a Multinomial Naive Bayes classifier is trained on those counts.

📂 Dataset

The dataset used in this project is sourced from the DNA-Dataset repository. It contains DNA sequences and their corresponding classes (for example human, chimpanzee, and dog). The human_data.txt file is used for training and testing the model.

🔬 K-mer Analysis

K-mer analysis breaks a DNA sequence down into smaller, overlapping substrings of length k (called k-mers). A sequence of length n yields n - k + 1 k-mers.

Example:
DNA sequence: ATCGTAC
6-mers: ATCGTA, TCGTAC

By counting the frequency of different k-mers in a sequence, we can create a feature vector that captures the composition of the sequence.
This approach is effective for DNA sequence classification because different classes of DNA may have distinct k-mer frequency profiles.
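
The extraction and counting step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the function names get_kmers and kmer_counts are hypothetical, and k=6 is chosen only to match the example above.

```python
from collections import Counter


def get_kmers(sequence, k=6):
    """Return all overlapping substrings of length k, in order of appearance."""
    sequence = sequence.upper()
    # A sequence of length n has n - k + 1 k-mers; shorter inputs give an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence, k=6):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(get_kmers(sequence, k))


# The README example: a 7-base sequence split into 6-mers.
example_kmers = get_kmers("ATCGTAC", k=6)
print(example_kmers)

# A shorter k makes repeated k-mers easier to see.
print(kmer_counts("ATATAT", k=2))
```

The resulting counts, laid out over a fixed vocabulary of k-mers, form the feature vector passed to the classifier.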

⚙️ Dependencies

pandas
numpy
scikit-learn
matplotlib
seaborn

You can install these dependencies using pip:

pip install pandas numpy scikit-learn matplotlib seaborn

🚀 Usage

Clone this repository (project code):

git clone https://github.com/ZahraSahranavard/DNA-Sequence-ML

cd DNA-Sequence-ML

Clone the dataset repository:

git clone https://github.com/ZahraSahranavard/DNA-Dataset.git

Run the notebook:

Open DNA-Data-Analysis.ipynb in Google Colab or Jupyter Notebook.
Run all cells to train and evaluate the model.

📊 Results

The Multinomial Naive Bayes classifier scores 0.984 on all four reported metrics:

Accuracy: 0.984
Precision: 0.984
Recall: 0.984
F1-score: 0.984

These results suggest that k-mer-based feature extraction combined with Naive Bayes is an effective approach for DNA sequence classification on this dataset.
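
For readers who want to see how k-mer counts, Multinomial Naive Bayes, and the four metrics fit together, here is a minimal sketch using scikit-learn. It is not the notebook's code: the toy sequences and labels, the choice of k=3, the use of character n-grams in CountVectorizer as the k-mer counter, and the weighted averaging of the metrics are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumption): class 0 is AT-rich, class 1 is GC-rich.
train_seqs = [
    "ATATATTATAATAT", "TTATAATATATTAA", "AATATTATATAATT",
    "GCGCGGCGCCGCGG", "CGGCGCGCCGGCGC", "GGCGCCGCGGCGCG",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_seqs = ["TATAATATTATA", "CGCGGCGCGCCG"]
test_labels = [0, 1]

K = 3  # illustrative k-mer length, not the value used in the notebook

# Character n-grams of fixed length K are exactly the overlapping k-mers.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(K, K), lowercase=False),
    MultinomialNB(),
)
model.fit(train_seqs, train_labels)
predictions = model.predict(test_seqs)

# Compute the same four metrics reported above, averaged with class weights.
accuracy = accuracy_score(test_labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, predictions, average="weighted", zero_division=0
)
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

On real data the sequences and labels would come from human_data.txt instead of the hard-coded lists, and the data would be split into training and test sets before fitting.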

🔮 Future Work

Explore alternative classifiers such as Support Vector Machines (SVM), Random Forest, and Gradient Boosting to compare their performance with Naive Bayes.
Incorporate DNA-specific embeddings (e.g., DNA2Vec or k-mer embedding models) to capture deeper biological patterns in nucleotide sequences.
Leverage deep learning architectures like Convolutional Neural Networks (CNNs) for motif detection, Recurrent Neural Networks (RNNs) for sequence modeling, and Transformers for capturing long-range dependencies in DNA.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
