#U1 transformers intro

#What is NLP?

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. Common NLP tasks include:

  • Classifying whole sentences
  • Classifying each word in a sentence
  • Generating text content
  • Extracting an answer from a text
  • Generating a new sentence from an input text

#Transformers

Transformer models are used to solve all kinds of NLP tasks.

The pipeline() function connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer. By default, each pipeline selects a particular pretrained model for its task.

There are three main steps involved when you pass some text to a pipeline:

  • The text is preprocessed into a format the model can understand.
  • The preprocessed inputs are passed to the model.
  • The predictions of the model are post-processed, so you can make sense of them.
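
As a rough sketch of what those three steps look like when done by hand for the sentiment-analysis case (assuming its default distilbert-base-uncased-finetuned-sst-2-english checkpoint; this manual version is not part of the original notes):

```python
# Rough sketch of the three pipeline steps done manually (assumes the default
# sentiment-analysis checkpoint; not taken from the notes above).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Preprocess: turn raw text into token-id tensors the model understands
inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.",
                   return_tensors="pt")

# 2. Model: forward pass producing raw logits
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Postprocess: convert logits to probabilities and map the best one to a label
probs = torch.softmax(logits, dim=-1)[0]
best = int(probs.argmax())
print(model.config.id2label[best], float(probs[best]))  # e.g. POSITIVE 0.96
```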

Some of the available pipelines (a translation example is sketched after this list):

  • feature-extraction (get the vector representation of a text)
  • fill-mask
  • ner (named entity recognition)
  • question-answering
  • sentiment-analysis
  • summarization
  • text-generation
  • translation
  • zero-shot-classification
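
Not every pipeline in this list is exercised below. As one extra illustration, a translation pipeline is created the same way; the checkpoint name here (Helsinki-NLP/opus-mt-fr-en, a French-to-English model from the Hub) is an assumption, not something run in these notes:

```python
# Hedged example: translation pipeline with an explicitly chosen Hub checkpoint
# (Helsinki-NLP/opus-mt-fr-en is assumed here; any translation model would do).
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
# Expected shape of the result: [{'translation_text': '...'}]
```

The examples below follow the same pattern for the remaining pipelines.
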
```python
# Sentiment analysis
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "This sucks so bad",
])
```

No model was supplied, so the pipeline defaulted to distilbert-base-uncased-finetuned-sst-2-english (revision af0f99b); using a pipeline without specifying a model name and revision is not recommended in production.

```
[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9997201561927795}]
```

```python
# Zero-shot classification: classify texts that haven't been labelled.
# This is called zero-shot because you don't need to fine-tune the model
# on your candidate labels.
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
```

Defaulted to roberta-large-mnli (revision 130fb28):

```
{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9562344551086426, 0.02697218768298626, 0.01679336279630661]}
```

```python
# Text generation
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
```

Defaulted to gpt2 (revision 6c0e608); `pad_token_id` was set to `eos_token_id` (50256) for open-ended generation.

```
[{'generated_text': 'In this course, we will teach you how to: Improve your language skills Learn to use English and English vocabulary\n\nRead and understand the basic grammar rules\n\nBuild grammar in your everyday life for the benefit of all\n\nLearn about:'}]
```

generator = pipeline("text-generation", max_length=15, num_return_sequences=20) generator("In this course, we will teach you how to")
No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2). Using a pipeline without specifying a model name and revision in production is not recommended. All PyTorch model weights were used when initializing TFGPT2LMHeadModel. All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation. [{'generated_text': 'In this course, we will teach you how to configure your own custom data'}, {'generated_text': 'In this course, we will teach you how to create and publish a website'}, {'generated_text': 'In this course, we will teach you how to use the NIST standard'}, {'generated_text': 'In this course, we will teach you how to navigate with the same ease'}, {'generated_text': 'In this course, we will teach you how to create an email based email'}, {'generated_text': 'In this course, we will teach you how to use web developers to build'}, {'generated_text': 'In this course, we will teach you how to develop an emotional intelligence that'}, {'generated_text': 'In this course, we will teach you how to build an app for a'}, {'generated_text': 'In this course, we will teach you how to use virtual machines to produce'}, {'generated_text': 'In this course, we will teach you how to create your own apps within'}, {'generated_text': 'In this course, we will teach you how to navigate the most powerful platform'}, {'generated_text': 'In this course, we will teach you how to: 1. Create an'}, {'generated_text': 'In this course, we will teach you how to create beautiful images and make'}, {'generated_text': 'In this course, we will teach you how to set up the program in'}, {'generated_text': 'In this course, we will teach you how to create multi-color screens'}, {'generated_text': 'In this course, we will teach you how to design smart web applications for'}, {'generated_text': 'In this course, we will teach you how to take part in the testing'}, {'generated_text': 'In this course, we will teach you how to create your own custom code'}, {'generated_text': 'In this course, we will teach you how to use the OpenSSL library'}, {'generated_text': 'In this course, we will teach you how to use Python and Java to'}]
generator = pipeline("text-generation", model="distilgpt2") generator( "In this course, we will teach you how to", max_length=30, num_return_sequences=2, )
All PyTorch model weights were used when initializing TFGPT2LMHeadModel. All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation. [{'generated_text': 'In this course, we will teach you how to handle, and how to get rid of (your) scented scents and fragrance. For more'}, {'generated_text': 'In this course, we will teach you how to create, maintain, manage, and manage the digital world in your everyday life.\n\nThe main'}]
```python
# Fill mask: predict the masked word in a sentence
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
```

Defaulted to distilroberta-base (revision ec58a5b):

```
[{'score': 0.19619633257389069, 'token': 30412, 'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052688181400299, 'token': 38163, 'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
```

```python
# Named entity recognition
# grouped_entities=True merges tokens belonging to the same entity,
# e.g. "Hugging" + "Face" -> "Hugging Face"
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
```

Defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (revision f2482bf):

```
[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
```

# Q&A question_answerer = pipeline("question-answering") question_answerer( question="Where do I work?", context="My name is Sylvain and I work at Hugging Face in Brooklyn", )
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad). Using a pipeline without specifying a model name and revision in production is not recommended. All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering. All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training. {'score': 0.6949762105941772, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
```python
# Summarization
summarizer = pipeline("summarization")
summarizer(
    """
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of the
premier American universities engineering curricula now concentrate on and
encourage largely the study of engineering science. As a result, there are
declining offerings in engineering subjects dealing with infrastructure, the
environment, and related issues, and greater concentration on high technology
subjects, largely supporting increasingly complex scientific developments. While
the latter is important, it should not be at the expense of more traditional
engineering. Rapidly developing economies such as China and India, as well as
other industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate six and
eight times as many traditional engineers as does the United States. Other
industrial countries at minimum maintain their output, while America suffers an
increasingly serious decline in the number of engineering graduates and a lack of
well-educated engineers.
"""
)
```

Defaulted to t5-small (revision d769bba):

```
[{'summary_text': 'the number of graduates in traditional engineering disciplines has declined . in most of the premier american universities engineering curricula now concentrate on and encourage largely the study of engineering science . rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
```